Design Real-Time Chat and Messaging

A staff-level design for real-time chat at the scale of WhatsApp, Slack, and Discord — sub-second delivery to hundreds of millions of devices, no message loss, and stable per-conversation ordering — built around three deliberate trade-offs: WebSocket transport with stateful gateways, at-least-once delivery with client deduplication, and a hybrid fan-out that picks direct-push or queue-backed delivery per conversation size.

Real-time chat high-level architecture: clients connect to stateful WebSocket gateways through an edge tier; the message service writes a per-conversation log with sequence numbers and triggers either direct push (small groups) or Kafka-backed fan-out (large groups); presence and sync run as separate services backed by Redis, ScyllaDB, and PostgreSQL. — Stateful WebSocket gateways front a message service that writes a per-conversation log; small groups go through direct push, large groups through a Kafka-backed fan-out worker pool, and offline devices catch up through a sync service.

Abstract

A real-time chat system has to satisfy three constraints simultaneously:

Low latency — messages appear within a few hundred milliseconds, even across regions.
Reliable delivery — no message is lost across crashes, network partitions, or client offline windows.
Stable ordering — every participant sees the same sequence per conversation; no time travel, no rewind on reconnect.

The deliberate trade-offs that make those constraints affordable:

Decision	Choice	Why
Transport	WebSocket over TLS	Full-duplex, low-overhead frames after a single HTTP/1.1 upgrade (RFC 6455).
Delivery semantics	At-least-once + client dedup	Exactly-once is achievable but expensive; client-side idempotency on a stable `messageId` is simpler and faster.
Ordering	Server-assigned sequence numbers per conversation	Single source of truth per partition avoids cross-node clock-skew problems described by Lamport.
Fan-out	Hybrid push/pull	Direct push for small groups; queue-backed fan-out for large channels and celebrity feeds.
Presence	Heartbeat + Redis pub/sub	Ephemeral; reconstructed on connect; tolerates lossy updates.
Offline sync	Per-device sequence cursor	Replay from message store on reconnect; no server-side per-device queue to manage.

Trade-offs accepted in exchange:

A central server is on the critical path for ordering; pure peer-to-peer is excluded.
Clients must deduplicate on messageId.
Presence is eventually consistent (a few seconds of staleness).
Durable queues and replicated stores cost more than fire-and-forget delivery.

What this design optimizes for:

p99 message delivery under 500ms within a region.
Zero message loss across reconnects, gateway restarts, and Kafka rebalances.
Seamless offline → online transitions on flaky mobile networks.
Linear horizontal scaling to billions of messages per day.

Requirements

Functional Requirements

Requirement	Priority	Notes
1:1 direct messaging	Core	Private conversations between two users
Group messaging	Core	Up to ~1000 members per group
Sent / delivered receipts	Core	Per-recipient state on each message
Typing indicators	Core	Bursty, ephemeral, lossy-by-design
Online/offline presence	Core	Show user availability status
Offline message delivery	Core	Queue and deliver when user reconnects
Multi-device sync	Core	Same account on phone + tablet + web simultaneously
Message history sync	Core	Retrieve past messages across devices
Read receipts	Extended	Per-conversation watermark
Media attachments	Extended	Images, videos, files (out of detailed scope)
End-to-end encryption	Extended	Signal Double Ratchet for 1:1; Sender Keys or MLS for groups

Non-Functional Requirements

Requirement	Target	Rationale
Availability	99.99% (≤52 min/year)	Communication is critical; failover must be seamless.
Message delivery latency	p99 < 500ms intra-region	Required for “real-time feel”.
Message durability	99.9999% (six nines)	A lost message is a product-killing bug.
Offline sync time	< 5s for 1000 missed messages	Fast reconnection on flaky mobile networks.
Concurrent connections	10M per region	Mobile-scale.
Message retention	30 days default, configurable	Storage cost vs. user expectations.

Scale estimation

Users:

Monthly Active Users (MAU): 500M
Daily Active Users (DAU): 200M (40% of MAU)
Peak concurrent connections: 50M (25% of DAU)

Traffic:

Messages per user per day: 50 (mix of 1:1 and group)
Daily messages: 200M × 50 = 10B/day
Peak messages per second: 10B / 86,400 × 3 (peak multiplier) ≈ 350K msgs/sec

Storage:

Average message size: 500 bytes (text + metadata)
Daily storage: 10B × 500B = 5TB/day
30-day retention: 150TB
With 3× replication: 450TB

Connections:

WebSocket connections per gateway pod: ~500K (memory-bound; the kernel can do more — WhatsApp reportedly ran ~2M concurrent TCP connections per server on tuned FreeBSD/Erlang, then dialed back to ~1M for headroom).
Gateway pods needed: 50M / 500K = 100 minimum
With 2× redundancy and burst headroom: 200 pods

Mental model: what makes chat hard

Two ideas explain almost every design choice in this system.

1. The message service is the truth, not the network. The send is not “complete” when the WebSocket frame leaves the sender; it’s complete when the message has a server-assigned sequence number and is durable. Everything before that is best-effort UI. Everything after is replayable.

2. Conversations, not users, are the unit of consistency. Ordering is per-conversation. Two unrelated conversations don’t need a global clock. Two devices for the same user can disagree on what they’ve seen; they reconcile against the per-conversation cursor on reconnect.

A senior engineer reading the rest of the article should be able to map every component back to one of these ideas.

Design paths

Three viable architectures sit at different points on the latency-vs-fan-out spectrum.

Path A — Connection-centric (server-routed)

Every gateway holds a user_id → connection_id map; the message service does direct RPC into the recipient’s gateway.

Direct-push delivery sequence: sender's gateway forwards to the message service, which dedups, persists with a sequence number, acks the sender, and pushes to the recipient's gateway over an internal RPC. The recipient acks back through the same chain to drive the delivered receipt. — Direct push: sender → gateway → message service → recipient's gateway → recipient. Sequence numbers and dedup happen in the message service before the ack returns.

Best when:

Latency is the dominant requirement.
Group sizes are bounded (say, < 500 members).
Strong per-conversation ordering is required and cross-region traffic is moderate.

Trade-offs:

Lowest possible latency; the path is one RPC hop.
Simple mental model.
Strong per-conversation ordering — the message service serializes writes.
Gateway state management complexity: graceful drain, connection migration on shutdown.
Limited group size — fan-out cost is borne synchronously by the message service.

Real-world parallel: WhatsApp scaled long-running TCP connections on FreeBSD/Erlang to ~1–2M connections per box, with messages routed directly between connection-holding nodes.

Path B — Queue-centric (async fan-out)

Messages get persisted, then published to a partitioned log; fan-out workers consume and deliver.

Kafka fan-out architecture: senders publish through a gateway and message service into a Kafka topic partitioned by conversation_id. A pool of fan-out workers consumes the partitions and pushes to gateway pods and the push gateway in parallel. — Queue-centric fan-out: messages land in a Kafka topic partitioned by conversation_id; workers consume partitions and push to gateways and APNs/FCM. Decouples ingest from delivery for replay and back-pressure.

Best when:

Very large groups or channels (1,000+ members).
Geographic distribution across regions.
100–500ms extra latency is acceptable.
You need replay and audit (e.g., compliance, analytics).

Trade-offs:

Handles large fan-out efficiently and back-pressure flows naturally through partition lag.
Built-in replay capability; consumers can rewind.
Better failure isolation — a slow consumer affects one partition, not the whole gateway pool.
Higher latency: the queue hop adds tens to hundreds of milliseconds.
More moving parts (Kafka itself is a system to operate).
Ordering only holds within a partition, so the partition key matters.

Real-world parallel: Slack runs a self-driving Kafka platform at ~6.5 Gbps peak across 10 clusters and ~700 TB — but Kafka is the backbone for jobs, logs, billing, and analytics rather than the chat fan-out itself. Their per-channel fan-out is handled by Channel Servers using consistent hashing, with edge Gateway Servers terminating WebSocket sessions.

Path C — Hybrid (push for small, pull for large)

Pick per message based on conversation size.

Hybrid fan-out decision flowchart: incoming send branches on conversation member count. Small (<100 members) takes the synchronous push path through gateways; large (>=100 members) publishes to Kafka and lets workers fan out in batches. Both paths fall back to push notifications for offline recipients. — Hybrid fan-out: a single threshold (around 100 members) routes between direct push and queue-backed fan-out so most 1:1 and small-group traffic stays on the fast path.

Best when:

Mix of 1:1, small groups, and large channels (most consumer messengers).
You need to balance latency for the common case against back-pressure for the long tail.
Some conversations have celebrity-level fan-out.

Trade-offs:

Optimal latency for the common case (1:1 and small groups stay on direct push).
Scales to large channels without overwhelming the message service.
Two code paths to operate and observe.
The threshold (~50–100 members) needs tuning against actual gateway CPU and tail latency.
Slightly inconsistent delivery timing across conversation sizes.

Real-world parallel: Discord uses a different mechanism but the same idea — every guild runs as a single Elixir GenServer that fans out via BEAM message passing through the Manifold library. For huge guilds Discord layers in Relays (intermediate fan-out processes) and Passive Sessions, which together cut fan-out cost by ~90% by skipping inactive viewers. Notably, Kafka is not on the chat-message hot path at Discord.

Path comparison

Factor	Connection-Centric	Queue-Centric	Hybrid
Latency (p50)	50–100ms	100–300ms	50–300ms
Max group size	~500	unbounded	unbounded
Ordering guarantee	per-conversation	per-partition	per-conversation
Failure isolation	gateway-level	partition-level	mixed
Replay	weak	native	mixed
Production parallel	WhatsApp	Slack (jobs/logs); LinkedIn messaging	Discord (BEAM-native); WhatsApp/Telegram for some flows

Why this article picks Path C

The hybrid approach:

Covers the full spectrum of use cases (1:1 to large channels).
Mirrors how production messengers actually behave — even when their plumbing differs (Discord’s GenServer + Manifold is conceptually a hybrid).
Forces the most interesting trade-offs to surface — threshold tuning, ordering inside vs across paths, hot-shard mitigation.

High-Level Design

The overview diagram at the top of the article is the system’s component map; this section drills into the four service boundaries that matter.

WebSocket Gateway

The gateway terminates client connections and routes everything else.

Responsibilities:

WebSocket connection lifecycle (upgrade, heartbeat, drain, close).
Authentication and session validation.
Routing client → service and service → client.
Presence event broadcasting on connect/disconnect.
Graceful drain on rolling deploys (notify clients to reconnect with a hint of which gateway to try next).

Design decisions:

Decision	Choice	Rationale
Protocol	WebSocket over TLS (RFC 6455)	Full-duplex; minimum 2-byte frame header server→client, 6 bytes client→server (mandatory mask, §5.3).
Session affinity	Consistent hashing by `user_id`, sticky at the L7 LB by JWT `sub` (or a `srv_id` cookie for browser fallbacks)	Predictable routing simplifies presence and per-user state. Cookie-based stickiness avoids the NAT/VPN clumping that pure source-IP hashing causes.
Cross-pod delivery	Redis pub/sub (or gRPC) keyed by `connection_id`	A message for a user on a different pod looks up the gateway in Redis and is routed there in one hop, without losing per-conversation ordering (which is enforced upstream at the message service).
Heartbeat	30s ping/pong (RFC 6455 §5.5.2)	Balance between detection speed and battery/cell-radio drain on mobile.
Connection timeout	90s (3 missed heartbeats)	Survives short cell handovers without dropping the session.
Max connections / pod	~500K	Memory-bound (~few KB per connection); kernel can do more after `sysctl` tuning.
Resume token	Opaque, short-TTL	Allows fast reconnect to the same logical session without re-establishing presence subscriptions.
Fallback transport	Long-poll receive + HTTP POST send	Some corporate proxies/firewalls block the WebSocket upgrade. Slack documents this “degraded mode” path.

Scaling approach:

Horizontal scaling with consistent hashing.
User → gateway mapping stored in Redis with short TTL.
Graceful drain: send connection.drain with a 30s deadline; clients reconnect during the window.

Important

Gateway pods are stateful in memory but stateless on disk. Treat the in-memory connection_id → user_id map as a cache of Redis. If a pod dies, clients reconnect, repopulate the map, and the system keeps moving — there is no on-disk state to recover.

Note

The capability-negotiation pattern from IRCv3 CAP LS 302 and labeled responses is worth borrowing for the handshake: advertise feature flags up front, let clients opt into typing tags / read receipts / batched chat history, and correlate every request with a server response via an explicit label. It avoids version-skew bugs as the protocol grows.

Message Service

The message service is the linearization point per conversation. Everything else can be eventually consistent; this cannot.

State per message:

1interface Message {2  messageId: string         // UUID, client-generated for idempotency3  conversationId: string    // 1:1 or group conversation4  senderId: string5  content: MessageContent6  timestamp: number         // server-assigned epoch millis7  sequenceNumber: bigint    // per-conversation monotonic8  status: MessageStatus     // PENDING | SENT | DELIVERED | READ9  expiresAt?: number        // optional TTL10}1112interface MessageContent {13  type: "text" | "image" | "file" | "location"14  text?: string15  mediaUrl?: string16  metadata?: Record<string, unknown>17}1819type MessageStatus = "PENDING" | "SENT" | "DELIVERED" | "READ"

Message flow on send:

Receive — gateway forwards message.send with the client-generated messageId.
Deduplicate — check messageId in the recent-message Redis cache (idempotent retry).
Validate — sender membership, rate limits, content checks.
Sequence + persist — assign sequence number atomically; write to ScyllaDB with server timestamp.
Acknowledge — return (sequenceNumber, timestamp) to the sender.
Route — direct push or Kafka, based on member count.
Fan-out — distribute to recipient gateways and the push gateway.

Presence Service

Presence is intentionally low-fidelity. The service holds heartbeat-derived TTL state in Redis and broadcasts deltas via pub/sub. The conceptual model — “online” as a TTL-bounded subscription, with deltas pushed to subscribers — is the same one XMPP RFC 6121 §3 defines for presence subscriptions, just realised with Redis instead of XML stanzas.

Presence pipeline sequence: a client sends a heartbeat to its gateway every 30s; the gateway HSETs presence:user with status=online and refreshes a 120s TTL in Redis. State changes (typing.start, going offline) write a SETEX or DEL and PUBLISH to presence:changes; subscriber gateways receive the delta, filter by their subscribed users, and push presence.update / typing.update to clients. When a client misses heartbeats, the Redis TTL expires and a keyspace notification triggers an offline broadcast. — Presence pipeline: heartbeat → Redis TTL → pub/sub fan-out to subscriber gateways. Last-seen falls out of the TTL automatically; no explicit "offline" message is required when the client just goes away.

Design decisions:

No persistence — presence is reconstructed from heartbeats; nothing on disk.
TTL-based — keys auto-expire; “last seen” is the negative space.
Pub/sub distribution — Redis pub/sub for delta updates between gateways.
Throttled — at most one presence update per user per second; typing indicators capped at one event per second.

Data structures:

1interface UserPresence {2  userId: string3  status: "online" | "away" | "offline"4  lastSeen: number          // epoch millis5  deviceType: "mobile" | "web" | "desktop"6  typingIn?: string         // conversationId if typing7  typingExpires?: number    // auto-clear after 5s8}

Presence subscription:

Users subscribe to presence of their contacts on connect.
Changes broadcast via Redis pub/sub channels keyed by user_id.
Gateways filter and forward to the relevant connected clients.

Note

LinkedIn’s presence platform sits at the same conceptual point — persistent connections (in their case Server-Sent Events, not WebSocket), an Akka actor per session, and a target end-to-end publish latency under 100ms. They explicitly smooth out connection flapping on lossy mobile networks; design for that early.

Sync Service

The sync service replays missed messages on reconnect. It is the boring but load-bearing service nobody talks about.

Sync protocol:

Each device maintains (conversationId → lastSequenceNumber) locally.
On reconnect, the device sends the list to the sync service.
The service returns all messages with sequenceNumber > lastSequenceNumber per conversation.
The client merges into its local store, deduplicating by messageId.

Pagination:

Default page size: 50 messages.
Cursor: (conversationId, sequenceNumber).
Bidirectional — forward (newer) and backward (older) fetching for history scrubbing.

API Design

WebSocket protocol

Connection handshake

1wss://chat.example.com/ws?token={jwt}&device_id={uuid}

Initial server → client message:

1{2  "type": "connected",3  "connectionId": "conn_abc123",4  "serverTime": 1706886400000,5  "heartbeatInterval": 30000,6  "resumeToken": "resume_xyz789"7}

Client → server messages

Send Message:

1{2  "type": "message.send",3  "id": "req_001",4  "payload": {5    "messageId": "msg_uuid_client_generated",6    "conversationId": "conv_abc123",7    "content": { "type": "text", "text": "Hello, world!" }8  }9}

Typing Indicator:

1{2  "type": "typing.start",3  "payload": { "conversationId": "conv_abc123" }4}

Mark Read:

1{2  "type": "message.read",3  "payload": {4    "conversationId": "conv_abc123",5    "upToSequence": 15426  }7}

Heartbeat:

1{2  "type": "heartbeat",3  "timestamp": 17068864000004}

Server → client messages

Acknowledgment:

1{2  "type": "message.ack",3  "id": "req_001",4  "payload": {5    "messageId": "msg_uuid_client_generated",6    "sequenceNumber": 1543,7    "timestamp": 17068864001238  }9}

New message:

1{2  "type": "message.new",3  "payload": {4    "messageId": "msg_xyz789",5    "conversationId": "conv_abc123",6    "senderId": "user_456",7    "content": { "type": "text", "text": "Hi there!" },8    "sequenceNumber": 1544,9    "timestamp": 170688640100010  }11}

Delivery receipt:

1{2  "type": "message.delivered",3  "payload": {4    "conversationId": "conv_abc123",5    "messageIds": ["msg_uuid_1", "msg_uuid_2"],6    "deliveredTo": "user_456",7    "timestamp": 17068864020008  }9}

Read receipt:

1{2  "type": "message.read_receipt",3  "payload": {4    "conversationId": "conv_abc123",5    "readUpToSequence": 1544,6    "readBy": "user_456",7    "timestamp": 17068864030008  }9}

Presence update:

1{2  "type": "presence.update",3  "payload": { "userId": "user_456", "status": "online", "typingIn": null }4}

Typing update:

1{2  "type": "typing.update",3  "payload": {4    "conversationId": "conv_abc123",5    "userId": "user_456",6    "isTyping": true7  }8}

REST API

Sync messages (offline recovery)

POST /api/v1/sync

1{2  "conversations": [3    { "conversationId": "conv_abc", "lastSequence": 1500 },4    { "conversationId": "conv_xyz", "lastSequence": 2300 }5  ],6  "limit": 1007}

1{2  "conversations": [3    {4      "conversationId": "conv_abc",5      "messages": [6        {7          "messageId": "msg_001",8          "senderId": "user_123",9          "content": { "type": "text", "text": "Hello" },10          "sequenceNumber": 1501,11          "timestamp": 1706886400000,12          "status": "DELIVERED"13        }14      ],15      "hasMore": false16    }17  ],18  "serverTime": 170688650000019}

Create conversation

POST /api/v1/conversations

1{ "type": "direct", "participantIds": ["user_456"] }

Create group

POST /api/v1/groups

1{2  "name": "Project Team",3  "participantIds": ["user_456", "user_789"],4  "settings": {5    "onlyAdminsCanPost": false,6    "allowMemberInvites": true7  }8}

Message history

GET /api/v1/conversations/{id}/messages?before={sequence}&limit=50

Error responses

Code	Error	When
400	`INVALID_MESSAGE`	Message format invalid
401	`UNAUTHORIZED`	Invalid or expired token
403	`FORBIDDEN`	Not a member of conversation
404	`CONVERSATION_NOT_FOUND`	Conversation doesn’t exist
409	`DUPLICATE_MESSAGE`	`messageId` already processed
429	`RATE_LIMITED`	Too many messages

1{2  "error": "RATE_LIMITED",3  "message": "Message rate limit exceeded",4  "retryAfter": 5,5  "limit": "100 messages per minute"6}

Data Modeling

Message storage (ScyllaDB)

Per-conversation time-series access dominates the workload, so we partition by conversation_id and cluster by sequence_number.

1CREATE TABLE messages (2    conversation_id UUID,3    sequence_number BIGINT,4    message_id UUID,5    sender_id UUID,6    content_type TEXT,7    content_text TEXT,8    content_media_url TEXT,9    timestamp TIMESTAMP,10    status TEXT,11    expires_at TIMESTAMP,12    PRIMARY KEY ((conversation_id), sequence_number)13) WITH CLUSTERING ORDER BY (sequence_number DESC);

Why ScyllaDB:

Time-series-friendly partitioning matches the access pattern.
Shard-per-core C++ runtime — no JVM GC pauses.
Linear horizontal scaling.
Discord migrated from Cassandra to ScyllaDB and reported p99 historical-fetch latency dropping from 40–125ms to a steady 15ms, with inserts dropping from 5–70ms to a steady 5ms — and the cluster collapsing from 177 Cassandra nodes to 72 ScyllaDB nodes.

Partitioning rules of thumb:

ScyllaDB defaults to a 1000MB partition warning threshold, but the practical “large partition” line is around 100MB — about 200K messages per conversation at 500B each.
For very active conversations, add a time bucket to the partition key: ((conversation_id, time_bucket), sequence_number) with daily buckets for active chats and monthly for archives.

Message deduplication cache (Redis)

1SETEX msg:dedup:{message_id} 86400 12EXISTS msg:dedup:{message_id}

24h TTL covers the longest realistic retry window from a flaky mobile client.

User and conversation metadata (PostgreSQL)

1CREATE TABLE users (2    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),3    username VARCHAR(50) UNIQUE NOT NULL,4    display_name VARCHAR(100),5    avatar_url TEXT,6    phone_hash VARCHAR(64) UNIQUE,7    created_at TIMESTAMPTZ DEFAULT NOW(),8    last_active_at TIMESTAMPTZ9);1011CREATE TABLE conversations (12    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),13    type VARCHAR(20) NOT NULL,        -- 'direct' or 'group'14    name VARCHAR(100),                -- NULL for direct15    avatar_url TEXT,16    created_by UUID REFERENCES users(id),17    created_at TIMESTAMPTZ DEFAULT NOW(),18    updated_at TIMESTAMPTZ DEFAULT NOW(),19    last_message_at TIMESTAMPTZ,20    last_sequence BIGINT DEFAULT 0,21    member_count INT DEFAULT 0,22    settings JSONB DEFAULT '{}'23);2425CREATE TABLE conversation_members (26    conversation_id UUID REFERENCES conversations(id) ON DELETE CASCADE,27    user_id UUID REFERENCES users(id) ON DELETE CASCADE,28    role VARCHAR(20) DEFAULT 'member', -- 'admin' or 'member'29    joined_at TIMESTAMPTZ DEFAULT NOW(),30    last_read_sequence BIGINT DEFAULT 0,31    muted_until TIMESTAMPTZ,32    PRIMARY KEY (conversation_id, user_id)33);3435CREATE INDEX idx_members_user ON conversation_members(user_id);36CREATE INDEX idx_conversations_updated ON conversations(updated_at DESC);

Session and connection state (Redis)

1SADD user:conn:{user_id} {connection_id}2SREM user:conn:{user_id} {connection_id}34HSET conn:{connection_id}5    gateway "gateway-1.us-east-1"6    user_id "user_123"7    device_id "device_abc"8    connected_at 1706886400000910HSET presence:{user_id}11    status "online"12    last_seen 170688640000013    device_type "mobile"14EXPIRE presence:{user_id} 1201516SETEX typing:{conversation_id}:{user_id} 5 1

Database selection matrix

Data	Store	Why
Messages	ScyllaDB	Time-series partitioning, low latency, horizontal scale
User profiles	PostgreSQL	ACID, complex queries, moderate scale
Conversation metadata	PostgreSQL	Relational queries, ACL management
Sessions, presence	Redis Cluster	Sub-ms latency, TTL, pub/sub
Message dedup cache	Redis	Fast lookups, automatic expiry
Media files	S3	Object storage with CDN integration
Analytics events	Kafka → ClickHouse	High-volume time-series analytics

Low-Level Design

Message delivery pipeline

Direct push (small groups)

For conversations with < 100 members:

1class DirectPushHandler {2  private readonly redis: RedisCluster3  private readonly messageStore: MessageStore45  async deliverMessage(message: Message): Promise<void> {6    const members = await this.getConversationMembers(message.conversationId)78    const deliveryTasks = members9      .filter((m) => m.userId !== message.senderId)10      .map(async (member) => {11        const connections = await this.redis.smembers(`user:conn:${member.userId}`)1213        if (connections.length > 0) {14          await Promise.all(connections.map((connId) => this.pushToConnection(connId, message)))15          return { userId: member.userId, status: "pushed" as const }16        } else {17          await this.queuePushNotification(member.userId, message)18          return { userId: member.userId, status: "queued" as const }19        }20      })2122    const results = await Promise.all(deliveryTasks)2324    const deliveredTo = results.filter((r) => r.status === "pushed").map((r) => r.userId)2526    if (deliveredTo.length > 0) {27      await this.notifyDeliveryReceipt(message, deliveredTo)28    }29  }3031  private async pushToConnection(connId: string, message: Message): Promise<void> {32    const connInfo = await this.redis.hgetall(`conn:${connId}`)33    const gateway = connInfo.gateway3435    await this.gatewayClient.send(gateway, {36      type: "deliver",37      connectionId: connId,38      message,39    })40  }41}

Kafka fan-out (large groups)

For conversations with ≥ 100 members:

1class KafkaFanoutHandler {2  private readonly kafka: KafkaProducer3  private readonly FANOUT_TOPIC = "messages.fanout"45  async publishForFanout(message: Message, memberCount: number): Promise<void> {6    await this.kafka.send({7      topic: this.FANOUT_TOPIC,8      messages: [9        {10          key: message.conversationId, // partition for ordering11          value: JSON.stringify({12            message,13            memberCount,14            publishedAt: Date.now(),15          }),16        },17      ],18    })19  }20}2122class FanoutConsumer {23  private readonly BATCH_SIZE = 1002425  async processMessage(record: KafkaRecord): Promise<void> {26    const { message, memberCount } = JSON.parse(record.value)2728    let offset = 029    while (offset < memberCount) {30      const memberBatch = await this.getMemberBatch(message.conversationId, offset, this.BATCH_SIZE)3132      await Promise.all(memberBatch.map((member) => this.deliverToMember(member, message)))3334      offset += this.BATCH_SIZE35    }36  }37}

Message ordering and sequencing

Every message walks the same state machine, regardless of fan-out path. Treat the diagram as the contract between the client UI and the server: anything the UI shows has to map to one of these states.

Message lifecycle state machine: a message moves from COMPOSING to PENDING when the client emits send with a UUID, to SENT once the server assigns a sequence number and persists, then to DELIVERED on recipient gateway ack or QUEUED_PUSH when the recipient is offline. QUEUED_PUSH transitions to DELIVERED on APNs/FCM ack or EXPIRED on TTL elapse. DELIVERED advances to READ on a read receipt. PENDING can fall back to FAILED, which the user can retry into PENDING again. — Message lifecycle: PENDING → SENT → DELIVERED/QUEUED_PUSH → READ. The same idempotent messageId is reused on retry, so PENDING ↔ FAILED loops do not create duplicates.

Sequence number assignment

1class SequenceGenerator {2  private readonly redis: RedisCluster34  async getNextSequence(conversationId: string): Promise<bigint> {5    const sequence = await this.redis.incr(`seq:${conversationId}`)6    return BigInt(sequence)7  }8}910class MessageProcessor {11  async processIncoming(conversationId: string, message: IncomingMessage): Promise<ProcessedMessage> {12    const lock = await this.acquireLock(`lock:msg:${conversationId}`, 5000)1314    try {15      const sequenceNumber = await this.sequenceGenerator.getNextSequence(conversationId)16      const timestamp = Date.now()1718      const processed: ProcessedMessage = {19        ...message,20        sequenceNumber,21        timestamp,22        status: "SENT",23      }2425      await this.messageStore.insert(processed)26      return processed27    } finally {28      await lock.release()29    }30  }31}

Caution

Per-conversation Redis INCR is the single most contended operation in this system. For very hot conversations, shard sequence assignment by routing all writes to one logical owner (the Channel Server pattern Slack uses) so the increment becomes a local memory op rather than a distributed lock.

Client-side ordering

1class ClientMessageBuffer {2  private pendingMessages: Map<string, Message[]> = new Map()3  private lastSequence: Map<string, bigint> = new Map()45  onMessageReceived(message: Message): void {6    const expected = (this.lastSequence.get(message.conversationId) || 0n) + 1n78    if (message.sequenceNumber === expected) {9      this.deliverToUI(message)10      this.lastSequence.set(message.conversationId, message.sequenceNumber)11      this.flushBuffer(message.conversationId)12    } else if (message.sequenceNumber > expected) {13      this.bufferMessage(message)14      this.requestMissing(message.conversationId, expected, message.sequenceNumber)15    }16    // sequence < expected → duplicate → ignore17  }1819  private flushBuffer(conversationId: string): void {20    const buffer = this.pendingMessages.get(conversationId) || []21    buffer.sort((a, b) => Number(a.sequenceNumber - b.sequenceNumber))2223    let expected = (this.lastSequence.get(conversationId) || 0n) + 1n24    while (buffer.length > 0 && buffer[0].sequenceNumber === expected) {25      const msg = buffer.shift()!26      this.deliverToUI(msg)27      this.lastSequence.set(conversationId, msg.sequenceNumber)28      expected++29    }3031    this.pendingMessages.set(conversationId, buffer)32  }33}

Multi-device sync

A user with phone, tablet, and laptop logged in is the common case, not the edge case. The model:

Each linked device is its own connection with a stable device_id.
The message service maintains a per-device cursor: (user_id, device_id, conversation_id) → lastSeq.
Messages are fanned out to all online devices for the user, including the sender’s own non-active devices (“self-echo”).
Read receipts collapse across devices: any device marking upToSequence = N advances the user’s per-conversation watermark; the other devices receive read_receipt events to dim notifications.

Multi-device sync sequence: a user sends from their phone; the message service persists once and fans out to the user's own web client and to other recipients in real time. The user's tablet was offline, so on reconnect it calls the sync service with its last seen sequence number and backfills the gap. A read action from any device advances a per-device cursor and broadcasts a read receipt to the user's other devices. — Multi-device sync: messages fan out to all online devices of the same user; offline devices catch up via the sync service on reconnect; read state propagates across the user's devices to keep notification badges consistent.

Note

WhatsApp’s multi-device design is the reference for E2EE messengers — each device has its own identity key and the sender encrypts once per device. For non-E2EE messengers like Telegram cloud chats, the model collapses to “the server-stored conversation is the truth and every device pulls from it” — simpler at the cost of giving up E2EE guarantees.

Presence system

Heartbeat processing

1class PresenceManager {2  private readonly PRESENCE_TTL = 1203  private readonly TYPING_TTL = 545  async handleHeartbeat(userId: string, deviceType: string): Promise<void> {6    const now = Date.now()78    await this.redis9      .multi()10      .hset(`presence:${userId}`, {11        status: "online",12        last_seen: now,13        device_type: deviceType,14      })15      .expire(`presence:${userId}`, this.PRESENCE_TTL)16      .exec()1718    await this.redis.publish(19      `presence:changes`,20      JSON.stringify({ userId, status: "online", timestamp: now }),21    )22  }2324  async handleDisconnect(userId: string): Promise<void> {25    const connections = await this.redis.smembers(`user:conn:${userId}`)2627    if (connections.length === 0) {28      const now = Date.now()2930      await this.redis.hset(`presence:${userId}`, { status: "offline", last_seen: now })3132      await this.redis.publish(33        `presence:changes`,34        JSON.stringify({ userId, status: "offline", lastSeen: now }),35      )36    }37  }3839  async setTyping(userId: string, conversationId: string): Promise<void> {40    await this.redis.setex(`typing:${conversationId}:${userId}`, this.TYPING_TTL, "1")4142    await this.redis.publish(43      `typing:${conversationId}`,44      JSON.stringify({ userId, isTyping: true }),45    )46  }47}

Presence subscription

1class PresenceSubscriber {2  private subscribedUsers: Set<string> = new Set()34  async subscribeToContacts(userId: string, contactIds: string[]): Promise<void> {5    const pipeline = this.redis.pipeline()6    contactIds.forEach((id) => pipeline.hgetall(`presence:${id}`))7    const results = await pipeline.exec()89    const presences = contactIds.map((id, i) => ({10      userId: id,11      ...(results[i][1] || { status: "offline" }),12    }))1314    this.sendToClient({ type: "presence.bulk", payload: { presences } })1516    contactIds.forEach((id) => this.subscribedUsers.add(id))17  }1819  onPresenceChange(change: PresenceChange): void {20    if (this.subscribedUsers.has(change.userId)) {21      this.sendToClient({ type: "presence.update", payload: change })22    }23  }24}

Push notifications for offline delivery

When a recipient has no live WebSocket, the gateway hands the message envelope to the push gateway, which bridges to APNs (iOS) and FCM (Android, Web). The boundary matters — these vendors have their own delivery semantics that the chat system has to design around:

Best-effort delivery, not guaranteed. APNs and FCM coalesce queued notifications when a device has been offline for too long; the push is a hint that the device should reconnect and replay via the sync service, not the durable copy.
Collapse keys / apns-collapse-id. Use a per-conversation collapse key so a flurry of messages while the device is offline shows up as one badge, not 50.
Silent / data-only pushes (content-available: 1, FCM data payload) wake the app to fetch the actual message over the chat channel, which avoids leaking message text into the OS notification payload — essential for E2EE messengers, where the server has no plaintext to put in the visible body.
TTL. Set a short TTL (apns-expiration / FCM time_to_live) so a queue that has aged past usefulness drops cleanly instead of floods the user on reconnect.
Throttling. APNs and FCM both throttle abusive senders; back off on 429 / Too many requests and surface the throttle to product (don’t retry blindly).

This layer is the most common source of “the system says delivered but the user never saw it” bug reports — design for the push to be advisory and the sync replay on reconnect to be authoritative.

Offline sync protocol

1class SyncService {2  async syncConversations(3    userId: string,4    syncState: Array<{ conversationId: string; lastSequence: bigint }>,5  ): Promise<SyncResponse> {6    const results: ConversationSync[] = []78    for (const { conversationId, lastSequence } of syncState) {9      const isMember = await this.checkMembership(userId, conversationId)10      if (!isMember) continue1112      const messages = await this.messageStore.getMessagesAfter(conversationId, lastSequence, 100)13      const conversation = await this.conversationStore.get(conversationId)1415      results.push({16        conversationId,17        messages,18        hasMore: messages.length === 100,19        lastSequence:20          messages.length > 0 ? messages[messages.length - 1].sequenceNumber : lastSequence,21        unreadCount: await this.getUnreadCount(userId, conversationId),22      })23    }2425    const newConversations = await this.getNewConversations(userId, syncState)2627    return { conversations: results, newConversations, serverTime: Date.now() }28  }29}

Frontend Considerations

Connection management

Reconnect with capped exponential backoff and jitter:

1class WebSocketManager {2  private ws: WebSocket | null = null3  private reconnectAttempt = 04  private readonly MAX_RECONNECT_DELAY = 300005  private readonly BASE_DELAY = 100067  connect(): void {8    this.ws = new WebSocket(this.buildUrl())910    this.ws.onopen = () => {11      this.reconnectAttempt = 012      this.onConnected()13    }1415    this.ws.onclose = (event) => {16      if (!event.wasClean) this.scheduleReconnect()17    }1819    this.ws.onerror = () => {20      // surfaces as onclose21    }22  }2324  private scheduleReconnect(): void {25    const delay = Math.min(26      this.BASE_DELAY * Math.pow(2, this.reconnectAttempt) + Math.random() * 1000,27      this.MAX_RECONNECT_DELAY,28    )29    this.reconnectAttempt++30    setTimeout(() => this.connect(), delay)31  }32}

Local message storage

IndexedDB schema for offline support:

1interface LocalDBSchema {2  messages: {3    key: [string, number]      // [conversationId, sequenceNumber]4    value: Message5    indexes: {6      "by-conversation": string7      "by-timestamp": number8      "by-status": string9    }10  }11  conversations: {12    key: string                 // conversationId13    value: ConversationMeta14    indexes: { "by-updated": number }15  }16  syncState: {17    key: string                 // conversationId18    value: { lastSequence: number; lastSync: number }19  }20}2122class LocalMessageStore {23  private db: IDBDatabase2425  async saveMessage(message: Message): Promise<void> {26    const tx = this.db.transaction("messages", "readwrite")27    await tx.objectStore("messages").put(message)28  }2930  async getMessages(31    conversationId: string,32    options: { before?: number; limit: number },33  ): Promise<Message[]> {34    const tx = this.db.transaction("messages", "readonly")35    const index = tx.objectStore("messages").index("by-conversation")3637    const range = IDBKeyRange.bound(38      [conversationId, 0],39      [conversationId, options.before || Number.MAX_SAFE_INTEGER],40    )4142    const messages: Message[] = []43    let cursor = await index.openCursor(range, "prev")4445    while (cursor && messages.length < options.limit) {46      messages.push(cursor.value)47      cursor = await cursor.continue()48    }4950    return messages51  }52}

Optimistic updates

1class MessageSender {2  async sendMessage(conversationId: string, content: MessageContent): Promise<void> {3    const clientMessageId = crypto.randomUUID()4    const optimisticMessage: Message = {5      messageId: clientMessageId,6      conversationId,7      senderId: this.currentUserId,8      content,9      timestamp: Date.now(),10      sequenceNumber: -1n,11      status: "PENDING",12    }1314    this.messageStore.addOptimistic(optimisticMessage)15    this.ui.appendMessage(optimisticMessage)16    await this.localDb.saveMessage(optimisticMessage)1718    try {19      const ack = await this.ws.sendAndWait({20        type: "message.send",21        payload: { messageId: clientMessageId, conversationId, content },22      })2324      const confirmedMessage = {25        ...optimisticMessage,26        sequenceNumber: ack.sequenceNumber,27        timestamp: ack.timestamp,28        status: "SENT" as const,29      }3031      this.messageStore.updateOptimistic(clientMessageId, confirmedMessage)32      await this.localDb.saveMessage(confirmedMessage)33    } catch (error) {34      this.messageStore.markFailed(clientMessageId)35      this.ui.showRetryOption(clientMessageId)36    }37  }38}

Virtual list for message history

1interface VirtualListConfig {2  containerHeight: number3  itemHeight: number  // estimated; variable heights supported4  overscan: number5}67class VirtualMessageList {8  private visibleRange = { start: 0, end: 0 }9  private heightCache = new Map<string, number>()1011  calculateVisibleRange(scrollTop: number): { start: number; end: number } {12    const messages = this.getMessages()13    let accumulatedHeight = 014    let start = 015    let end = messages.length1617    for (let i = 0; i < messages.length; i++) {18      const height = this.getItemHeight(messages[i])19      if (accumulatedHeight + height > scrollTop - this.config.overscan * 50) {20        start = i21        break22      }23      accumulatedHeight += height24    }2526    accumulatedHeight = 027    for (let i = start; i < messages.length; i++) {28      accumulatedHeight += this.getItemHeight(messages[i])29      if (accumulatedHeight > this.config.containerHeight + this.config.overscan * 50) {30        end = i + 131        break32      }33    }3435    return { start, end }36  }3738  render(): MessageItem[] {39    const { start, end } = this.visibleRange40    return this.getMessages().slice(start, end)41  }42}

Failure Modes and Operational Behavior

A senior reviewer will ask the same five questions about every distributed system. Here are the answers for this design.

Failure	Detection	Recovery	User-visible impact
Single gateway pod dies	Health check fails; Redis presence keys for that pod expire	Clients reconnect via the load balancer; consistent-hash ring rebalances; sync service backfills missed messages	A few seconds of stale presence; in-flight sends retried via dedup
Gateway rolling deploy	Drain signal	Pod sends `connection.drain` to its clients with a 30s deadline; clients reconnect to a new pod during the window	Brief reconnect blip; no message loss
Message service node crash	Heartbeat to control plane fails	Stateless service; new pod takes over; in-flight requests time out and are retried by the gateway with the same `messageId`	Slightly higher tail latency for the few seconds of failover
Redis pub/sub partition	Subscriber lag metric	Presence updates degrade gracefully — TTL-based fallback still works; typing indicators may be lost	Stale presence and missing typing indicators; messages unaffected
Kafka broker loss	ISR shrinks; producer ack timeouts	Producer retries with idempotence; partition leadership fails over within seconds; consumers resume from committed offsets	Brief delivery latency spike for large groups; small groups unaffected
ScyllaDB node loss	Coordinator notices unavailable replicas	RF=3 quorum still satisfies reads/writes; node rejoins; repair restores data	None at the user level
Hot conversation (celebrity, megagroup)	Partition CPU saturation, sequence-INCR contention, fan-out lag	Pin to dedicated channel-server affinity; coalesce writes; promote to queue-backed fan-out; downsample passive viewers (Discord-style)	Non-active viewers see slightly stale state; active conversation stays responsive
Network partition between regions	Inter-region link health checks	Each region serves its local users; cross-region messages queue and replay on heal; conversations split-brained for the partition window	Cross-region delivery degraded; intra-region unaffected
Total WebSocket failure (firewall, proxy)	Client cannot upgrade	Fall back to long-polling for receive and HTTP `POST` for send (Slack’s “degraded mode”)	Higher latency; receipts delayed; basic functionality preserved

Warning

The single most common production-killer in chat systems is the thundering herd on reconnect after a regional gateway restart. Always randomize reconnect delay with jitter, cap concurrent connection attempts per region, and pre-warm the gateway pool before draining the old fleet.

Hot-conversation mitigation

A tiny number of conversations carry an outsized fraction of the load. Three patterns work in practice:

Channel-server affinity — pin a conversation’s writes to one in-memory owner. Slack’s Channel Server design uses consistent hashing so a host owns ~16M channels at peak. This makes per-conversation INCR a local op.
Active vs passive sessions — only fan out real-time updates to viewers actively reading. Discord’s Maxjourney post reports a ~90% reduction in fan-out cost from this single change.
Relay tier — insert intermediate fan-out processes between the conversation owner and the connected sessions so the owner only has to send to a handful of relays (which then fan out further). Discord’s Relays make 1M+ online users in a single guild tractable.

End-to-end encryption

E2EE is structurally different from server-side encryption: the server never holds plaintext, so it cannot search, moderate, or transcode messages, and the work of key management is pushed onto the clients. The protocol choice splits cleanly along conversation cardinality.

1:1 — X3DH + Double Ratchet

The Signal Protocol is the de facto baseline. It composes two pieces:

X3DH (Extended Triple Diffie-Hellman) is an asynchronous key-agreement protocol. The recipient publishes a prekey bundle — long-term identity key, signed prekey, and one-time prekeys — to the server. The sender can derive a shared secret without the recipient being online, by combining four DH operations: identity↔signed-prekey, ephemeral↔identity, ephemeral↔signed-prekey, and (when available) ephemeral↔one-time-prekey, fed through HKDF. This gives mutual authentication, forward secrecy, and cryptographic deniability.
Double Ratchet drives the per-message keys after X3DH. A symmetric KDF chain produces a unique message key for every message (forward secrecy: compromising the current chain key does not reveal past keys), and a fresh DH ratchet step on every reply rotates the root secret (post-compromise security: an attacker who steals state at time t loses access once both sides have ratcheted past t).

Operationally this means the server stores ciphertext only, plus an opaque envelope (recipient_device_id, ciphertext, iv, mac, prekey_id) per recipient.

Group — Sender Keys vs MLS

Two designs dominate, with very different cost curves.

Sender Keys vs MLS group key flow: the left pane shows Signal-style Sender Keys, where each member runs a 1:1 X3DH/Double Ratchet session with every other member to distribute their per-group sender chain key, with O(N) cost per join/leave and an O(N) rekey on remove. The right pane shows the MLS TreeKEM ratchet tree, with leaves for each member rolling up through internal nodes to a root secret; a path update touches log2(N) nodes, a Commit advances the epoch atomically, and a Welcome onboards new joiners. — Sender Keys vs MLS: Sender Keys keep message-time cheap (one symmetric encrypt + broadcast) but pay O(N) on every membership change; MLS pays O(log N) per change and gets post-compromise security across the whole group as a side effect.

Sender Keys (Signal-style group): every member generates a per-group symmetric sender chain key and signature key, and distributes it to every other member over a pairwise Double Ratchet channel. Sending a message is then one symmetric encrypt + one broadcast. The cost is membership churn: adding or removing a member requires re-keying for everyone (O(N) ciphertexts), and post-compromise security is awkward at large group sizes.

MLS / RFC 9420: members sit at the leaves of a left-balanced binary ratchet tree (TreeKEM). Each interior node holds a secret known only to its subtree; the root secret keys the group. A Commit applies a batch of Add/Remove/Update Proposals, and the committer encrypts the new path keys to the copath, so every member can derive the new root in O(log N) work. A Welcome message bootstraps a new joiner with the current group state. Each Commit advances an epoch, providing built-in forward secrecy and post-compromise security across the whole group, with transcript continuity guaranteed by the epoch chain.

Property	Sender Keys	MLS (RFC 9420)
Send cost	O(1) symmetric encrypt + broadcast	O(1) symmetric encrypt + broadcast
Membership change	O(N) pairwise re-key	O(log N) path update via TreeKEM
Post-compromise security	Hard at scale; needs O(N²)-ish rotation	Built in; every Commit refreshes the tree
Async joiners	Pairwise Sesame-style negotiation	Welcome message + ratchet-tree extension
Practical sweet spot	Small/medium groups (≤ ~50)	Medium-to-large groups (hundreds to thousands)
Production examples	WhatsApp groups, Signal groups	Wire, Cisco Webex (Secure Messaging), RingCentral. WhatsApp’s EU DMA interop bridge keeps its own chats on the Signal Protocol and exposes a Signal-based gateway to third parties.

Multi-device E2EE

Multi-device complicates everything because the server can’t read the message but still has to fan it out to the right device set. WhatsApp’s multi-device design treats every device as its own client with its own identity key; the sender encrypts the message once per recipient device, and the server treats the encrypted bundle as opaque payloads keyed by (recipient_user_id, recipient_device_id). History sync to a newly linked device is its own protocol — the primary device encrypts a recent-history blob and ships it over an E2EE channel.

Operational implications

E2EE constrains the rest of the design in concrete ways:

No server-side search. Search is entirely client-side, against the local message store.
No server-side spam/abuse heuristics on content. Defences move to metadata (rate limits, sender reputation, reported messages, on-device classifiers) and to user-driven reporting that ships the offending ciphertext + key for review.
Backups need their own key derivation. End-to-end encrypted backups (Signal’s, WhatsApp’s recently rolled-out E2EE backups) require an HSM or user-passphrase-derived key, not the messaging keys.
Key transparency (Keybase-style logs, or WhatsApp’s auditable directory) is the practical answer to “is the key the server gave me really the recipient’s?”
Push notification content must be a placeholder (“New message from Alice”); the actual message decrypts on-device.

Caution

Do not bolt E2EE onto an existing chat system as an afterthought. The control-plane operations — group membership, device addition, history transfer, lost-device recovery — are where E2EE designs fail in production, not the cipher. Pick the protocol (Signal vs MLS) before fixing the rest of the architecture in stone.

Infrastructure

Cloud-agnostic component map

Component	Purpose	Options
WebSocket Gateway	Persistent connections	Envoy, HAProxy, Nginx (ws)
Message Queue	Async delivery, ordering	Kafka, Pulsar, NATS JetStream
KV Store	Sessions, presence	Redis, KeyDB, Dragonfly
Message Store	Message persistence	ScyllaDB, Cassandra, DynamoDB
Relational DB	User/group metadata	PostgreSQL, CockroachDB
Object Store	Media files	MinIO, Ceph, S3-compatible
Push Gateway	Mobile notifications	Self-hosted bridge to APNs / FCM

AWS reference architecture

Service configurations:

Service	Configuration	Rationale
WebSocket (Fargate)	4 vCPU, 8GB, ~500K conn/pod	Memory-bound; ~500K connections/pod
Message Service	2 vCPU, 4GB	Stateless, CPU-bound
Fan-out Workers	2 vCPU, 4GB, Spot	Cost optimization for async
ElastiCache Redis	r6g.2xlarge cluster mode	Sub-ms presence lookups
Keyspaces	On-demand	Serverless Cassandra-compatible
RDS PostgreSQL	db.r6g.xlarge Multi-AZ	Metadata, moderate write load
MSK	kafka.m5.large × 3 brokers	Fan-out throughput

Scaling considerations

WebSocket connection limits:

Linux default: 1024 file descriptors per process — raise via sysctl (fs.file-max, nofile ulimit) and per-process RLIMIT_NOFILE.
Practical per pod: ~500K (memory, not FDs, is the bottleneck).
50M concurrent users → 100 gateway pods minimum.

Kafka partitioning:

Partition by conversation_id for ordering.
Minimum partitions: enough to allow N parallel consumers without head-of-line blocking; start at 100.
Hot-partition mitigation: extract very active conversations into their own dedicated topic with finer-grained partition keys.

Message storage partitioning:

ScyllaDB partition key: conversation_id.
Practical “large partition” line: ~100MB; warn at 1000MB.
For very active conversations: add a time bucket to the partition key.

Presence fan-out:

Redis pub/sub scales well for low-fan-out subscriptions; for users with thousands of contacts, prefer batched delta channels or a dedicated presence service over per-user pub/sub channels.

Conclusion

What this design buys you:

Sub-500ms message delivery via stateful WebSocket gateways and a hybrid push/pull fan-out.
At-least-once delivery with client-side deduplication on stable messageIds.
Strong per-conversation ordering through server-assigned sequence numbers.
Seamless offline support via per-device cursors and a sync service.
Multi-device coherence through self-echo fan-out and per-device read cursors.
Hot-shard tolerance through channel-server affinity, active/passive sessions, and a relay tier.
E2EE-ready surface with Double Ratchet for 1:1 and a clear Sender Keys → MLS migration path for groups.

What you accept in return:

A central server is on the critical path for ordering — pure P2P is excluded.
At-least-once delivery requires client-side dedup logic.
Presence is eventually consistent; expect tens of seconds of staleness on flaky networks.
Large-group delivery latency is structurally higher than 1:1.

What’s deliberately out of scope and worth its own follow-up:

Reactions, threaded replies, edits with tombstones, and scheduled messages.
Voice and video calling integration over WebRTC, including SFU/MCU and the signalling channel.
Federated chat across organisations (the XMPP RFC 6120 S2S model is still the cleanest reference) and the metadata-privacy implications of running a federation gateway.

Appendix

Prerequisites

Distributed systems fundamentals (consistency models, partitioning).
Real-time communication patterns (WebSocket, pub/sub).
Message-queue concepts (Kafka partitions, consumer groups).
Database selection trade-offs (SQL vs. NoSQL).

Terminology

Term	Definition
Fan-out	Distributing a message to multiple recipients.
Sequence number	Monotonically increasing identifier for ordering messages within a conversation.
Presence	User’s online/offline status and activity indicators.
Idempotency	Property ensuring duplicate requests produce the same result.
Heartbeat	Periodic signal from client to server indicating the connection is alive.
ACK	Acknowledgment message confirming receipt.
TTL	Time-to-live; automatic expiration of data after a specified duration.
Channel server	An in-memory owner of a conversation that serializes writes for ordering.
Self-echo	Delivering a sent message to the sender’s own other devices.

Summary

Real-time chat needs persistent connections (WebSocket), reliable delivery (at-least-once with dedup), and ordering guarantees (server-assigned sequence numbers per conversation).
Hybrid fan-out keeps small conversations on a low-latency direct push path and routes large ones through a queue-backed fan-out.
ScyllaDB is a strong fit for time-series message storage; Discord’s migration produced concrete latency wins worth referencing.
Redis holds the ephemeral state — sessions, presence, typing — with TTLs and pub/sub.
Multi-device sync is solved by per-device cursors plus self-echo fan-out, not per-device server-side queues.
Failure modes drive the design as much as the happy path; plan for thundering herds, hot conversations, and Kafka rebalances on day one.

References

Real-world implementations:

How Discord Stores Trillions of Messages — Cassandra → ScyllaDB migration, Rust data services, latency numbers.
How Discord Scaled Elixir to 5M Concurrent Users — GenServer per guild, Manifold for distributed message passing.
Maxjourney: Pushing Discord’s Limits with a Million+ Online Users in a Single Server — Relays, Passive Sessions, hot-guild mitigation.
Real-Time Messaging at Slack — Channel Servers, Gateway Servers, consistent hashing.
Building Self-Driving Kafka Clusters at Slack — 6.5 Gbps, 10 clusters, ~700 TB.
Migrating Millions of Concurrent WebSockets to Envoy — terminating WebSocket at Envoy.
LinkedIn’s Real-Time Presence Platform — SSE, Akka actors, Couchbase.
How WhatsApp Grew to ~500M Users on FreeBSD/Erlang — ~2M concurrent connections per server, kernel tuning.
How WhatsApp Enables Multi-Device Capability — per-device identity keys, client fan-out for E2EE.

Protocol specifications:

RFC 6455 — The WebSocket Protocol — frame format, masking, ping/pong.
RFC 6120 — XMPP Core — XML stream/stanza model, S2S federation.
RFC 6121 — XMPP Instant Messaging and Presence — roster, presence subscription model.
RFC 9420 — The Messaging Layer Security (MLS) Protocol — TreeKEM, epochs, Commits, Welcomes.
Signal — The Double Ratchet Algorithm — per-message keys, forward secrecy, post-compromise security.
Signal — The X3DH Key Agreement Protocol — asynchronous key agreement.
IRCv3 — Capability Negotiation and Message Tags — +typing, msgid, labeled-response, echo-message.
MQTT 5.0 (OASIS) — IoT/messaging protocol.

Distributed systems theory:

Lamport — Time, Clocks, and the Ordering of Events in a Distributed System — foundational paper on logical ordering.
Exactly-Once Semantics in Apache Kafka — why client dedup beats exactly-once for chat.
ScyllaDB — Making Effective Partitions for Data Modeling — partition-size guidance.

Related articles:

Design Collaborative Document Editing — real-time sync with OT/CRDT.
Design a Notification System — multi-channel push delivery.