Real-Time Sync Client

Client-side architecture for real-time data synchronization: transport protocols, connection management, conflict resolution, and state reconciliation patterns used by Figma, Notion, Discord, and Linear.

Abstract

Real-time sync is bidirectional state convergence between client and server under network uncertainty. The core tension: users expect instant feedback, but the server is the source of truth.

Mental model:

Optimistic local state — Apply mutations immediately; rollback on server rejection
Transport selection — WebSocket for bidirectional low-latency; SSE for server-push simplicity; polling as fallback
Connection resilience — Exponential backoff with jitter; heartbeats detect stale connections; reconnect without data loss
Conflict resolution — Last-write-wins at property level (Figma), Operational Transform for text (Google Docs), CRDTs for offline-first (Notion, Linear)
State reconciliation — Server acknowledgments; re-apply pending local mutations on top of authoritative state
Presence — Ephemeral metadata (cursors, typing indicators) via heartbeat or CRDT broadcast

The approach depends on data shape, offline requirements, and acceptable consistency latency.

The Challenge

Why Real-Time Sync Is Hard

Real-time sync violates the request-response model that HTTP assumes. Three constraints create tension:

Latency vs. Consistency — Users expect instant feedback, but network round-trips are unavoidable
Offline vs. Conflicts — Offline edits require local storage, which creates divergent state that must merge
Scale vs. Complexity — Millions of connections require infrastructure that single-server models cannot provide

Browser Constraints

Resource	Limit	Impact
WebSocket connections per profile	~255 in Chrome, ~200 in Firefox	Shared across all tabs and origins; the HTTP/1.1 6/origin cap does not apply to WebSocket sockets
SSE connections per origin (HTTP/1.1)	6	Shared across all tabs; HTTP/2 multiplexes streams over a single TCP connection so the cap effectively disappears
Main thread budget	16 ms/frame	Long sync operations cause UI jank
IndexedDB quota	Up to ~60% of disk per origin (Chrome), ~10% (Firefox)	Large local caches can hit `QuotaExceededError`

The 6-per-origin connection cap that frustrates HTTP/1.1 polling does not apply to WebSocket — Chromium counts WebSocket sockets against a separate global pool (source). Multiple tabs sharing the same origin are still capped, just much higher than naive readings of the WebSocket / EventSource specs imply.

Note

WebSockets historically blocked the back/forward cache (bfcache). As of 2024 Chromium will evict the page from bfcache and close the WebSocket on entry instead of refusing to cache. Best practice is still to close connections explicitly on pagehide (event.persisted === true) and reopen on pageshow, so other tabs do not accumulate idle sockets and so reconnection logic owns the lifecycle.

Device and Network Profiles

Scenario	Latency	Constraints
Desktop + fiber	10–50ms	Full WebSocket, generous caching
Mobile + LTE	50–200ms	Battery drain from radio wake; bundle updates
Mobile + 3G	200–500ms	Aggressive local caching; defer non-critical sync
Offline	N/A	Queue mutations locally; replay on reconnect

Cellular radio “tail time” keeps the antenna active for seconds after each transmission. Batch messages to minimize battery impact.

Transport Protocols

WebSocket (RFC 6455)

WebSocket provides full-duplex communication over a single TCP connection. After an HTTP handshake, frames flow bidirectionally with minimal overhead.

Protocol characteristics:

Property	Value
Direction	Bidirectional
Frame overhead	2–14 bytes (after handshake)
Data types	Text (UTF-8), binary
Auto-reconnect	No (must implement)
Compression	Per-message DEFLATE (RFC 7692)

Handshake:

1GET /chat HTTP/1.12Host: server.example.com3Upgrade: websocket4Connection: Upgrade5Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==6Sec-WebSocket-Version: 13

The server responds with 101 Switching Protocols, and the connection upgrades to WebSocket framing.

Design reasoning: RFC 6455 chose HTTP-compatible handshake so WebSocket traffic traverses proxies and firewalls that allow HTTP. The Sec-WebSocket-Key prevents cross-protocol attacks by requiring the server to prove it understands WebSocket.

Limitations:

No backpressure — If the server sends faster than the client processes, messages buffer in memory until the browser crashes or discards data. The non-standard, Chromium-only WebSocketStream (shipped to stable in Chrome 124) integrates with the Streams API to expose per-message backpressure; cross-browser, the only path to backpressure today is WebTransport over HTTP/3.
No multiplexing — Each logical channel requires a separate connection or application-level multiplexing. RFC 8441 lets WebSocket ride a single HTTP/2 stream, which sidesteps the per-origin TCP cost but does not provide multiplexing across logical channels.
Connection limits — A single origin can open dozens of WebSocket sockets in modern browsers, but they share a global per-profile pool (~255 in Chrome, ~200 in Firefox). Application-level fan-in (one socket multiplexed across components) is almost always the right design.

When to use: Chat, collaborative editing, gaming, financial tickers — any scenario requiring low-latency bidirectional communication.

Server-Sent Events (SSE)

SSE is server-to-client push over HTTP. The server holds an open connection and streams events as text/event-stream.

Event format:

1event: message2data: {"user":"alice","text":"hello"}3id: 123454retry: 300056: keep-alive comment

Field	Purpose
`event`	Event type name (triggers named listeners)
`data`	Payload (UTF-8 only)
`id`	Resume token for reconnection
`retry`	Reconnection delay in milliseconds

Design reasoning: SSE uses HTTP, so it works through any proxy or firewall that allows HTTP. The id field enables resumption: on reconnect, the browser sends Last-Event-ID, and the server can replay missed events.

Advantages over WebSocket:

Built-in reconnection with configurable delay
Works through restrictive corporate proxies
Simpler server implementation (any HTTP server)

Limitations:

Server-to-client only — the client uses an ordinary fetch/XMLHttpRequest POST for uplink.
UTF-8 text only — no binary frames. Encode with base64 or switch to WebSocket / WebTransport if you need bytes.
Over HTTP/1.1 the 6-per-origin TCP cap applies; HTTP/2 multiplexes streams over one TCP connection so the cap effectively disappears.

When to use: Live feeds, notifications, dashboards — scenarios where the server pushes and the client occasionally posts.

Long Polling

Long polling simulates push by having the client hold an open HTTP request until the server has data or timeout.

Flow:

Client sends GET /events?since=12345
Server holds connection until new data exists or 30s timeout
Server responds with data (or empty on timeout)
Client immediately sends next request

Design reasoning: Long polling works everywhere HTTP works. Before WebSocket existed, this was the only reliable cross-browser push mechanism. It remains useful for restrictive networks or legacy browser support.

Trade-offs:

Aspect	Long Polling	WebSocket	SSE
Latency	Medium (one RTT per message)	Low	Low
Overhead	Full HTTP headers per response	2 bytes/frame	~5 bytes/event
Firewall compatibility	Excellent	Sometimes blocked	Excellent
Implementation complexity	Low	Medium	Low

When to use: Fallback when WebSocket and SSE fail; infrequent updates where latency is acceptable.

WebTransport (Future-Forward)

WebTransport is the modern successor to WebSocket, layered on HTTP/3 (QUIC). It provides multiple unidirectional/bidirectional streams over a single connection, optional unreliable datagrams, and proper per-stream backpressure via the Streams API.

Property	WebTransport
Transport	HTTP/3 (QUIC over UDP)
Streams per session	Many; reliable + unreliable
Backpressure	Native (Streams API)
Head-of-line blocking	None at transport (per-stream HOLB)
Browser availability	Chromium stable; Firefox behind a flag

For greenfield real-time apps that need binary, multiplexed, backpressured channels and can tolerate Chromium-leading availability, WebTransport is the strategically right choice. WebSocket remains the safe default for cross-browser parity.

Decision Framework

Decision tree for choosing between WebSocket, SSE, and long polling based on directionality, latency needs, and offline support. — Pick a transport by directionality, latency tolerance, and whether the client needs to upstream data.

Connection Management

A long-lived sync connection is a small state machine: open, heartbeat, drop, back off, retry. The same loop drives WebSocket, SSE, and long-polling clients — only the transport-specific signals (handshake, error, Last-Event-ID) differ.

State diagram for a sync connection: connecting transitions to open on handshake, open emits periodic heartbeats and reverts to closing on pong timeout, closing routes through a backoff state that retries with jittered exponential delay until success or max attempts. — Connection lifecycle: a Connecting state moves to Open on handshake, sends periodic heartbeats, and on drop or pong timeout enters a Backoff state that schedules the next attempt with jittered exponential delay (or terminates on auth failure or max attempts).

Reconnection with Exponential Backoff

Network disconnections are inevitable. Naive immediate retry causes thundering herd problems when servers recover. Exponential backoff with jitter spreads reconnection attempts.

1// Types and constants2type ReconnectionOptions = {3  initialDelay?: number4  maxDelay?: number5  multiplier?: number6  jitter?: number7  maxAttempts?: number8}910class ReconnectionManager {11  private currentDelay: number12  private attempts = 01314  constructor(private options: Required<ReconnectionOptions>) {15    this.currentDelay = options.initialDelay16  }1718  // Core logic: calculate next delay with jitter19  getNextDelay(): number | null {20    if (this.attempts >= this.options.maxAttempts) {21      return null // Give up22    }2324    const jitterRange = this.currentDelay * this.options.jitter25    const jitter = Math.random() * jitterRange * 2 - jitterRange26    const delay = Math.min(this.currentDelay + jitter, this.options.maxDelay)2728    this.currentDelay = Math.min(this.currentDelay * this.options.multiplier, this.options.maxDelay)29    this.attempts++3031    return delay32  }3334  reset(): void {35    this.currentDelay = this.options.initialDelay36    this.attempts = 037  }38}3940// Usage41const manager = new ReconnectionManager({42  initialDelay: 1000,43  maxDelay: 30000,44  multiplier: 2,45  jitter: 0.1,46  maxAttempts: 10,47})

Design reasoning:

Exponential growth caps at a maximum (30s typical) to prevent infinite waits
Jitter (±10%) desynchronizes clients that disconnected at the same time
Attempt limit prevents infinite loops against a permanently down service

Production considerations:

Error Type	Retry Strategy
Network timeout	Full backoff
5xx server error	Full backoff
401 Unauthorized	Do not retry; re-authenticate
429 Rate Limited	Honor `Retry-After` header
DNS failure	Backoff with longer max delay

Heartbeat and Connection Health

TCP keep-alives are insufficient for application-level connection health. A dead connection may not trigger a TCP reset for minutes. Application heartbeats detect stale connections faster.

1// WebSocket wrapper with heartbeat2type HeartbeatOptions = {3  interval: number // How often to send ping4  timeout: number // How long to wait for pong5}67class HeartbeatConnection {8  private pingTimer: number | null = null9  private pongTimer: number | null = null10  private ws: WebSocket1112  constructor(13    url: string,14    private options: HeartbeatOptions,15  ) {16    this.ws = new WebSocket(url)17    this.ws.onopen = () => this.startHeartbeat()18    this.ws.onmessage = (e) => this.handleMessage(e)19    this.ws.onclose = () => this.stopHeartbeat()20  }2122  // Core heartbeat logic23  private startHeartbeat(): void {24    this.pingTimer = window.setInterval(() => {25      this.ws.send(JSON.stringify({ type: "ping", ts: Date.now() }))2627      this.pongTimer = window.setTimeout(() => {28        // No pong received - connection is dead29        this.ws.close(4000, "Heartbeat timeout")30      }, this.options.timeout)31    }, this.options.interval)32  }3334  private handleMessage(event: MessageEvent): void {35    const data = JSON.parse(event.data)36    if (data.type === "pong") {37      if (this.pongTimer) clearTimeout(this.pongTimer)38      return39    }40    // Handle application messages...41  }4243  private stopHeartbeat(): void {44    if (this.pingTimer) clearInterval(this.pingTimer)45    if (this.pongTimer) clearTimeout(this.pongTimer)46  }47}

Typical values:

Setting	Value	Reasoning
Ping interval	15–30s	Balance between detection speed and bandwidth
Pong timeout	5–10s	Allow for network jitter
Missed pings before disconnect	2–3	Avoid false positives from single dropped packet

State Recovery on Reconnect

Reconnection without state recovery loses messages sent during disconnection. Two patterns handle this:

1. Sequence-based recovery:

1// Track last acknowledged sequence2interface Message {3  seq: number4  payload: unknown5}67class SequenceRecovery {8  private lastAckedSeq = 09  private pendingMessages: Message[] = []1011  send(payload: unknown): void {12    const msg: Message = {13      seq: this.pendingMessages.length + this.lastAckedSeq + 1,14      payload,15    }16    this.pendingMessages.push(msg)17    this.ws.send(JSON.stringify(msg))18  }1920  // On reconnect, request replay from last ack21  onReconnect(): void {22    this.ws.send(23      JSON.stringify({24        type: "resume",25        lastSeq: this.lastAckedSeq,26      }),27    )28  }2930  onAck(seq: number): void {31    this.lastAckedSeq = seq32    this.pendingMessages = this.pendingMessages.filter((m) => m.seq > seq)33  }3435  onServerReplay(messages: Message[]): void {36    // Server sends missed messages since lastSeq37    for (const msg of messages) {38      this.handleMessage(msg)39    }40  }41}

2. Event sourcing recovery (SSE):

SSE’s Last-Event-ID header automatically requests replay:

1const eventSource = new EventSource("/events")23eventSource.onmessage = (event) => {4  // event.lastEventId contains the id from the server5  // On reconnect, browser automatically sends Last-Event-ID header6  processEvent(JSON.parse(event.data))7}

Design reasoning: Sequence numbers are simpler but require server-side storage of recent messages. Event sourcing naturally fits event logs but requires the server to support replay from arbitrary points.

Message Handling

Ordering Guarantees

Out-of-order delivery happens when:

Multiple WebSocket connections exist (load balancing)
Server processes messages in parallel
Network path changes mid-stream

Strategies:

Strategy	Complexity	Guarantee
Single connection, FIFO	Low	Total order
Sequence numbers per sender	Medium	Per-sender order
Vector clocks	High	Causal order
Accept disorder	None	Eventual consistency

For most applications, per-sender ordering suffices:

1// Per-sender message ordering2type SenderId = string34class OrderedMessageHandler {5  private lastSeq = new Map<SenderId, number>()6  private pending = new Map<SenderId, Map<number, unknown>>()78  handle(senderId: SenderId, seq: number, payload: unknown): void {9    const expected = (this.lastSeq.get(senderId) ?? 0) + 11011    if (seq === expected) {12      // In order - process and check pending13      this.process(payload)14      this.lastSeq.set(senderId, seq)15      this.processPending(senderId)16    } else if (seq > expected) {17      // Out of order - buffer18      const senderPending = this.pending.get(senderId) ?? new Map()19      senderPending.set(seq, payload)20      this.pending.set(senderId, senderPending)21    }22    // seq < expected means duplicate - ignore23  }2425  private processPending(senderId: SenderId): void {26    const senderPending = this.pending.get(senderId)27    if (!senderPending) return2829    let next = (this.lastSeq.get(senderId) ?? 0) + 130    while (senderPending.has(next)) {31      this.process(senderPending.get(next))32      senderPending.delete(next)33      this.lastSeq.set(senderId, next)34      next++35    }36  }3738  private process(payload: unknown): void {39    // Application-specific processing40  }41}

Deduplication

Retries and reconnection can deliver the same message multiple times. Idempotency keys prevent duplicate processing:

1// Time-bounded deduplication2const DEDUP_WINDOW_MS = 5 * 60 * 1000 // 5 minutes34class Deduplicator {5  private processed = new Map<string, number>() // id -> timestamp67  isDuplicate(messageId: string): boolean {8    this.cleanup()910    if (this.processed.has(messageId)) {11      return true12    }1314    this.processed.set(messageId, Date.now())15    return false16  }1718  private cleanup(): void {19    const cutoff = Date.now() - DEDUP_WINDOW_MS20    for (const [id, ts] of this.processed) {21      if (ts < cutoff) {22        this.processed.delete(id)23      }24    }25  }26}2728// Usage29const dedup = new Deduplicator()3031function handleMessage(msg: { id: string; payload: unknown }): void {32  if (dedup.isDuplicate(msg.id)) {33    return // Already processed34  }35  processPayload(msg.payload)36}

Trade-offs:

Window Size	Memory	Risk
1 minute	Low	May miss slow retries
5 minutes	Medium	Covers most retry scenarios
1 hour	High	Handles extended outages

Optimistic Updates

Optimistic updates show changes immediately while the server processes asynchronously. If the server rejects, rollback to previous state. The lifecycle has three branches: success (clear pending), validation failure (restore snapshot, surface error), and network failure (keep pending, retry with backoff).

Sequence diagram showing optimistic update flow: snapshot, apply locally, enqueue, send, then either ack-and-clear, restore-on-error, or keep-and-retry on network failure. — Optimistic update lifecycle: apply locally, enqueue with idempotency key, then resolve via ack, restore, or retry-with-backoff.

Pattern Implementation

1// Optimistic update with rollback2type Todo = { id: string; text: string; completed: boolean }3type TodoStore = {4  todos: Todo[]5  pendingUpdates: Map<string, { previous: Todo; current: Todo }>6}78const store: TodoStore = { todos: [], pendingUpdates: new Map() }910async function toggleTodo(id: string): Promise<void> {11  const todo = store.todos.find((t) => t.id === id)12  if (!todo) return1314  // 1. Save previous state15  const previous = { ...todo }1617  // 2. Apply optimistic update18  const updated = { ...todo, completed: !todo.completed }19  store.todos = store.todos.map((t) => (t.id === id ? updated : t))20  store.pendingUpdates.set(id, { previous, current: updated })2122  // 3. Notify UI (immediate feedback)23  renderTodos()2425  try {26    // 4. Send to server27    await fetch(`/api/todos/${id}`, {28      method: "PATCH",29      body: JSON.stringify({ completed: updated.completed }),30      headers: { "Idempotency-Key": `toggle-${id}-${Date.now()}` },31    })3233    // 5. Success - remove from pending34    store.pendingUpdates.delete(id)35  } catch (error) {36    // 6. Failure - rollback37    store.todos = store.todos.map((t) => (t.id === id ? store.pendingUpdates.get(id)!.previous : t))38    store.pendingUpdates.delete(id)39    renderTodos()40    showError("Failed to update todo")41  }42}4344// TanStack Query equivalent45const mutation = useMutation({46  mutationFn: (todo: Todo) => updateTodo(todo),47  onMutate: async (newTodo) => {48    await queryClient.cancelQueries({ queryKey: ["todos"] })49    const previous = queryClient.getQueryData(["todos"])50    queryClient.setQueryData(["todos"], (old: Todo[]) => old.map((t) => (t.id === newTodo.id ? newTodo : t)))51    return { previous }52  },53  onError: (err, newTodo, context) => {54    queryClient.setQueryData(["todos"], context?.previous)55  },56  onSettled: () => {57    queryClient.invalidateQueries({ queryKey: ["todos"] })58  },59})

When Optimistic Updates Break

Scenario	Problem	Mitigation
Concurrent edits	Two users edit same item	Server conflict resolution; merge or reject
Validation failure	Server rejects invalid data	Client-side validation before optimistic apply
Network partition	User thinks action succeeded	Queue mutations; replay on reconnect
Race conditions	Stale read before write	Version vectors; conditional updates

React 19 useOptimistic:

The hook signature is useOptimistic(state, updateFn) where updateFn(currentState, optimisticValue) returns the next optimistic state. The dispatch function must run inside an Action (startTransition or a form action); React automatically reverts to state once the transition settles (React docs).

1import { useOptimistic, startTransition } from 'react';23function TodoItem({ todo, onUpdate }: { todo: Todo; onUpdate: (t: Todo) => Promise<void> }) {4  const [optimisticTodo, applyOptimistic] = useOptimistic(5    todo,6    (current, completed: boolean) => ({ ...current, completed }),7  );89  function toggle() {10    const nextCompleted = !optimisticTodo.completed;11    startTransition(async () => {12      applyOptimistic(nextCompleted);13      await onUpdate({ ...todo, completed: nextCompleted });14    });15  }1617  const isPending = optimisticTodo.completed !== todo.completed;1819  return (20    <li style={{ opacity: isPending ? 0.5 : 1 }}>21      <input type="checkbox" checked={optimisticTodo.completed} onChange={toggle} />22      {todo.text}23    </li>24  );25}

Important

useOptimistic only stays in its optimistic state for the lifetime of the surrounding transition. If you need the optimistic value to survive across renders independent of an async action, manage it with useState plus your own rollback path.

Conflict Resolution

When multiple clients edit the same data, conflicts arise. The resolution strategy depends on data shape and acceptable complexity.

Last-Write-Wins (LWW)

Each write carries a timestamp; the latest timestamp wins. Simple but loses data from concurrent edits.

Used by: Cassandra, DynamoDB, Figma (at property level)

1type LWWRegister<T> = {2  value: T3  timestamp: number4  clientId: string5}67function merge<T>(local: LWWRegister<T>, remote: LWWRegister<T>): LWWRegister<T> {8  if (remote.timestamp > local.timestamp) {9    return remote10  }11  if (remote.timestamp === local.timestamp) {12    // Tiebreaker: lexicographic client ID comparison13    return remote.clientId > local.clientId ? remote : local14  }15  return local16}

Figma’s approach: LWW at property level, not object level. Two users editing different properties of the same shape (e.g., color vs. position) both succeed. Only same-property edits conflict.

Design reasoning: Figma rejected Operational Transform as “overkill”—their data model is a tree of objects with properties, not a linear text document. Property-level LWW is simpler and sufficient for design files.

Clock skew problem: LWW assumes synchronized clocks. NTP skew of 0.5+ seconds can cause “earlier” writes to win. Mitigations:

Use server timestamps (single source of truth)
Hybrid logical clocks (HLC) combining physical time with logical counters
Version vectors for causal ordering

Operational Transformation (OT)

OT transforms operations based on concurrent operations to preserve user intent. Designed for linear sequences (text documents).

Used by: Google Docs, Microsoft Office Online

Example:

1Initial: "abc"2User A: Insert "X" at position 2 → "abXc"3User B: Delete position 1 → "ac" (concurrent with A)45Without transformation:6  Apply A's insert to B's result: "aXc" (wrong - X at wrong position)78With transformation:9  Transform A's operation: Insert at 2 becomes Insert at 1 (because B deleted before position 2)10  Result: "aXc" becomes correct

1// Simplified OT for text2type Insert = { type: "insert"; pos: number; char: string }3type Delete = { type: "delete"; pos: number }4type Op = Insert | Delete56function transform(op1: Op, op2: Op): Op {7  // Transform op1 assuming op2 has been applied8  if (op1.type === "insert" && op2.type === "insert") {9    if (op1.pos <= op2.pos) {10      return op1 // op1 is before op2, no change11    }12    return { ...op1, pos: op1.pos + 1 } // Shift right13  }1415  if (op1.type === "insert" && op2.type === "delete") {16    if (op1.pos <= op2.pos) {17      return op118    }19    return { ...op1, pos: op1.pos - 1 } // Shift left20  }2122  if (op1.type === "delete" && op2.type === "insert") {23    if (op1.pos < op2.pos) {24      return op125    }26    return { ...op1, pos: op1.pos + 1 }27  }2829  if (op1.type === "delete" && op2.type === "delete") {30    if (op1.pos < op2.pos) {31      return op132    }33    if (op1.pos > op2.pos) {34      return { ...op1, pos: op1.pos - 1 }35    }36    // Same position - op1 is a no-op (already deleted)37    return { type: "insert", pos: -1, char: "" } // No-op sentinel38  }3940  return op141}4243// OT requires central server for operation ordering44class OTClient {45  private pending: Op[] = []46  private serverVersion = 04748  applyLocal(op: Op): void {49    this.pending.push(op)50    this.sendToServer(op, this.serverVersion)51  }5253  onServerAck(ackVersion: number): void {54    this.serverVersion = ackVersion55    this.pending.shift() // Remove acknowledged operation56  }5758  onServerOp(op: Op): void {59    // Transform all pending operations against server operation60    for (let i = 0; i < this.pending.length; i++) {61      this.pending[i] = transform(this.pending[i], op)62    }63    this.applyToDocument(op)64    this.serverVersion++65  }66}

Trade-offs:

Aspect	Advantage	Disadvantage
Consistency	Strong (immediate)	Requires central server
Complexity	Intent-preserving	O(n²) worst case; hard to implement correctly
Latency	Low (operations, not state)	Server bottleneck under load

CRDTs (Conflict-Free Replicated Data Types)

CRDTs are data structures whose merge operation is commutative, associative, and idempotent, so any two replicas that have seen the same set of operations — in any order, with arbitrary duplication — converge to the same state without coordination (Shapiro et al., 2011).

Used by: Notion (offline pages), Linear (issue descriptions via Yjs), Figma (loosely inspired, not strictly), Automerge, Yjs.

Common CRDT types:

Type	Use Case	Example
G-Counter	Grow-only counter	Page view counts
PN-Counter	Increment/decrement	Inventory stock
LWW-Register	Single value	User status
OR-Set	Add/remove set	Tags, members
RGA/LSEQ	Ordered text	Collaborative documents

1// G-Counter CRDT2type NodeId = string34class GCounter {5  private counts = new Map<NodeId, number>()67  constructor(private nodeId: NodeId) {}89  increment(): void {10    const current = this.counts.get(this.nodeId) ?? 011    this.counts.set(this.nodeId, current + 1)12  }1314  value(): number {15    let sum = 016    for (const count of this.counts.values()) {17      sum += count18    }19    return sum20  }2122  // Merge is commutative, associative, idempotent23  merge(other: GCounter): void {24    for (const [nodeId, count] of other.counts) {25      const current = this.counts.get(nodeId) ?? 026      this.counts.set(nodeId, Math.max(current, count))27    }28  }2930  serialize(): Record<NodeId, number> {31    return Object.fromEntries(this.counts)32  }33}

Tombstone problem: Deletes in many CRDTs leave a tombstone (metadata indicating “this id was removed”) that must be retained until every replica has observed it. In long-lived offline scenarios tombstones can accumulate without bound. Mitigations:

Coordinated garbage collection — only safe to drop a tombstone once every replica has observed it (hard in peer-to-peer; tractable behind a central server).
Time-based expiry — simple, but a long-disconnected replica can resurrect deleted items.
Periodic snapshot compaction — periodically rebuild a snapshot that omits acknowledged tombstones, then ship the snapshot as the new baseline.

Design reasoning: CRDTs trade memory for simplicity. No conflict resolution logic needed—the math guarantees convergence. This makes them ideal for offline-first apps where merge timing is unpredictable.

Decision Matrix

Factor	LWW	OT	CRDT
Complexity	Low	High	Medium
Offline support	Poor	None	Excellent
Memory overhead	Low	Low	High (tombstones)
Central server required	No	Yes	No
Best for	Key-value, properties	Text documents	Offline-first, P2P

State Reconciliation

When client and server state diverge, reconciliation brings them back in sync without losing pending local changes.

Client-Side Prediction with Server Reconciliation

Originally from game development (Gambetta’s series is the canonical reference), this pattern applies local changes immediately but treats server state as authoritative. When the server acknowledges input #N, the client resets to the server snapshot and replays inputs N+1, N+2, … on top.

Sequence diagram of client-side prediction and server reconciliation: pending input queue, server snapshot with lastAckedId, drop acknowledged inputs, replay remaining pending inputs. — Client-side prediction with server reconciliation: server snapshot is authoritative; remaining pending inputs are replayed on top to keep responsiveness.

1// State reconciliation with pending input replay2type Input = { id: string; action: string; params: unknown }3type State = {4  /* application state */5}67class ReconciliationEngine {8  private pendingInputs: Input[] = []9  private localState: State10  private lastAckedInputId: string | null = null1112  applyInput(input: Input): void {13    // 1. Apply locally for immediate feedback14    this.localState = this.applyToState(this.localState, input)1516    // 2. Track as pending17    this.pendingInputs.push(input)1819    // 3. Send to server20    this.sendToServer(input)21  }2223  onServerUpdate(serverState: State, lastAckedId: string): void {24    // 1. Remove acknowledged inputs25    const ackIndex = this.pendingInputs.findIndex((i) => i.id === lastAckedId)26    if (ackIndex !== -1) {27      this.pendingInputs = this.pendingInputs.slice(ackIndex + 1)28    }2930    // 2. Start from authoritative server state31    let reconciledState = serverState3233    // 3. Re-apply pending (unacknowledged) inputs34    for (const input of this.pendingInputs) {35      reconciledState = this.applyToState(reconciledState, input)36    }3738    // 4. Update local state39    this.localState = reconciledState4041    // 5. Re-render42    this.render()43  }4445  private applyToState(state: State, input: Input): State {46    // Application-specific state transition47    return state48  }4950  private sendToServer(input: Input): void {51    // WebSocket send52  }5354  private render(): void {55    // Update UI56  }57}

Why this works: The server is always right. Local state is a prediction that will be corrected. By re-applying pending inputs on top of server state, we maintain responsiveness while converging to truth.

Smooth vs. snap reconciliation:

Snap: Immediately apply server correction (can cause visual jitter)
Smooth: Interpolate toward server state over multiple frames (better UX for games/animations, more complex)

Full State Sync

For simpler applications, fetch full state on reconnect and discard local state:

1// Simple full state sync on reconnect2async function reconnect(): Promise<void> {3  const state = await fetch("/api/state").then((r) => r.json())4  store.setState(state)5  render()6}78// With local mutation queue9async function reconnectWithQueue(): Promise<void> {10  // 1. Fetch server state11  const serverState = await fetch("/api/state").then((r) => r.json())1213  // 2. Replay queued mutations14  for (const mutation of localQueue) {15    try {16      await fetch("/api/mutate", {17        method: "POST",18        body: JSON.stringify(mutation),19      })20    } catch {21      // Handle permanent failure22    }23  }2425  // 3. Fetch final state (includes replayed mutations)26  const finalState = await fetch("/api/state").then((r) => r.json())27  store.setState(finalState)28  localQueue.clear()29  render()30}

Presence

Presence tracks ephemeral user state: who’s online, cursor positions, typing indicators.

Design Considerations

Aspect	Consideration
Persistence	Not needed—presence is ephemeral
Update frequency	Cursors: 20–60 Hz; typing: on change; online: 15–30s heartbeat
Bandwidth	Throttle cursor updates; batch presence changes
Cleanup	Automatic on disconnect; timeout for crashed clients

Implementation Patterns

Heartbeat-based online status:

1// Heartbeat presence2const HEARTBEAT_INTERVAL = 15000 // 15 seconds3const TIMEOUT_THRESHOLD = 45000 // 3 missed heartbeats45class PresenceManager {6  private users = new Map<string, { lastSeen: number; metadata: unknown }>()7  private heartbeatTimer: number | null = null89  start(userId: string, metadata: unknown): void {10    this.heartbeatTimer = window.setInterval(() => {11      this.ws.send(12        JSON.stringify({13          type: "presence",14          userId,15          metadata,16          timestamp: Date.now(),17        }),18      )19    }, HEARTBEAT_INTERVAL)20  }2122  onPresenceUpdate(userId: string, metadata: unknown): void {23    this.users.set(userId, { lastSeen: Date.now(), metadata })24    this.pruneStale()25    this.render()26  }2728  private pruneStale(): void {29    const now = Date.now()30    for (const [userId, data] of this.users) {31      if (now - data.lastSeen > TIMEOUT_THRESHOLD) {32        this.users.delete(userId)33      }34    }35  }3637  stop(): void {38    if (this.heartbeatTimer) clearInterval(this.heartbeatTimer)39  }4041  getOnlineUsers(): string[] {42    return Array.from(this.users.keys())43  }44}

Cursor sharing:

1// Throttled cursor sharing2type CursorPosition = { x: number; y: number; userId: string }34class CursorPresence {5  private cursors = new Map<string, CursorPosition>()6  private throttledSend: (pos: CursorPosition) => void78  constructor(9    private ws: WebSocket,10    private userId: string,11  ) {12    // Throttle to 30 updates per second max13    this.throttledSend = throttle((pos: CursorPosition) => {14      ws.send(JSON.stringify({ type: "cursor", ...pos }))15    }, 33)16  }1718  onLocalMove(x: number, y: number): void {19    this.throttledSend({ x, y, userId: this.userId })20  }2122  onRemoteCursor(cursor: CursorPosition): void {23    this.cursors.set(cursor.userId, cursor)24    this.renderCursors()25  }2627  onUserLeave(userId: string): void {28    this.cursors.delete(userId)29    this.renderCursors()30  }3132  private renderCursors(): void {33    // Render remote cursors with user labels34  }35}3637function throttle<T extends (...args: unknown[]) => void>(fn: T, ms: number): T {38  let lastCall = 039  return ((...args: Parameters<T>) => {40    const now = Date.now()41    if (now - lastCall >= ms) {42      lastCall = now43      fn(...args)44    }45  }) as T46}

Phoenix Presence (CRDT-based)

Phoenix Channels uses a CRDT-based presence system that automatically syncs across cluster nodes:

Each presence update is a CRDT merge operation
No single point of failure
Automatic cleanup when connections close
Built-in conflict resolution for presence metadata

Design reasoning: Presence is inherently distributed (users connect to different servers). CRDT semantics guarantee all nodes converge to the same view without coordination.

Real-World Implementations

Figma: Multiplayer Design

Scale: Millions of concurrent editors

Architecture:

Client/server with WebSocket; one centralized multiplayer service per file is the authority.
File state lives in memory on that service for speed; periodic checkpoints to durable storage capture full snapshots.
A persistent journal (transaction log) records every mutation between checkpoints, so a crash can replay forward from the last snapshot without losing accepted edits.

Conflict resolution: Property-level LWW. Two simultaneous changes to different properties of the same object both succeed; only same-property conflicts compare timestamps. Figma describes the design as “inspired by CRDTs” rather than a true CRDT — the central server is the source of truth, which sidesteps the offline-merge cases CRDTs were invented to handle.

Why not OT? Figma’s data model is a tree of objects with properties, not linear text. OT is optimized for character-level text operations. Property-level LWW is simpler and sufficient.

Fractional indexing for ordered sequences: Children in a tree are positioned by an arbitrary-precision fraction (encoded as a string) between their two neighbors. Inserting between A ("a0") and B ("a1") is just allocating a new fraction in between ("a0V"). No global reindexing required.

Sources: How Figma’s multiplayer technology works, Realtime editing of ordered sequences, and Making multiplayer more reliable.

Notion: Offline-First

Challenge: Block-based architecture where pages reference blocks that reference other blocks. Opening a page requires the full transitive closure of referenced blocks; a missing block means broken content.

Architecture:

SQLite is used as a persistent local store on native clients (and IndexedDB on the web), evolved from a best-effort cache into a durable backing store specifically for Offline Mode.
Pages explicitly marked Available offline are migrated to a CRDT-backed representation so concurrent edits merge automatically; non-text properties (e.g. database select) can still surface merge conflicts.
Each client tracks a lastDownloadedTimestamp per offline page; on reconnect it compares with the server’s lastUpdatedTime and pulls only pages that have moved.
Local mutations queue in an offline_action table; the sync layer reconciles those actions with authoritative state when the client comes back online.

Design decision: If a page might be missing transitively required data, Notion refuses to render it rather than show partial content. Missing data is worse UX than “unavailable offline.”

Source: How we made Notion available offline and The data model behind Notion’s flexibility.

Linear: Local-First Sync

Architecture:

A normalized object graph lives in memory, modeled with MobX for fine-grained reactivity, and is mirrored to IndexedDB as the local source of truth.
Mutations are wrapped in client transactions, persisted locally, then sent to the server. The server assigns a total order of transactions and broadcasts deltas — Linear is centralized, not peer-to-peer, and does not run Operational Transformation.
For most fields (status, assignee, priority, comments) the server’s totally ordered LWW write is the resolution. Linear adopted CRDTs (Yjs) specifically for issue descriptions because that is the rich-text surface where two writers genuinely overlap on the same characters.
Search is instant because every issue in the workspace is already in JavaScript memory; the search box is a filter over an array, not an API call.

Why it works: Issue trackers have bounded data per workspace (unlike open-ended documents). Full client-side data plus a centralized total order enables instant interactions without paying the conceptual tax of full peer-to-peer CRDT.

Competitive advantage: “Snappiness” is Linear’s primary differentiator from Jira. Local-first makes every interaction feel instant.

Sources: Scaling the Linear Sync Engine (Linear) and the reverse-engineered architecture write-up endorsed by Linear’s CTO.

Discord: Message Delivery at Scale

Scale: Trillions of messages, millions of concurrent connections

Architecture:

Gateway infrastructure in Elixir on the BEAM VM, designed around lightweight processes and message passing.
A single GenServer per guild (server) acts as the central router for that guild; one session process per connected user receives the fan-out.
For very large guilds, intermediate relay processes split the fan-out tree, and the Manifold library batches PIDs by remote node to keep cross-node message passing scalable.
Hot reference data (member rosters) lives in ETS so multiple processes can read without copying into the heap and without piling pressure on the GC.
Storage evolution: MongoDB → Cassandra (2017) → ScyllaDB (2022). The migration to ScyllaDB was paired with a Rust-based “data services” layer that does request coalescing and consistent-hash routing keyed by channel_id, shielding the database from hot partitions.

Message fanout:

User sends message.
Guild process receives it and resolves the recipient set.
Guild process (optionally via relays) hands off to every connected member’s session process.
Each session process forwards over its WebSocket to that device.

Sources: How Discord stores trillions of messages and How Discord scaled Elixir to 5,000,000 concurrent users.

Slack: Real-Time at Enterprise Scale

Scale: Tens of millions of channels per host, 500ms message delivery worldwide

Architecture:

Channel Servers (CS): Stateful, in-memory servers holding channel history (~16M channels per host)
Gateway Servers (GS): Maintain WebSocket connections, deployed across regions
CHARMs: Consistent hash ring managers ensuring CS replacement within 20 seconds

Reliability guarantees:

Messages have strong guarantees around arrival
Ordered and delivered exactly once
All messages are persisted
Idempotency keys prevent duplicates
Kafka for durable queuing + Redis for fast in-flight job data

Source: Slack Engineering Blog

Browser Constraints Deep Dive

Main Thread Budget

The main thread has 16ms per frame for 60fps. Real-time sync operations compete with rendering:

Operation	Typical Cost	Mitigation
JSON.parse (1KB)	0.1–0.5ms	Stream parsing for large payloads
JSON.parse (100KB)	5–50ms	Web Worker
IndexedDB write	1–10ms	Batch writes; requestIdleCallback
DOM update (100 items)	5–20ms	Virtual lists; batched updates

Offload to Web Workers:

1// Main thread2const worker = new Worker("sync-worker.js")34worker.postMessage({ type: "parse", data: rawJson })56worker.onmessage = (e) => {7  // Parsed data ready8  updateState(e.data)9}1011// sync-worker.js12self.onmessage = (e) => {13  if (e.data.type === "parse") {14    const parsed = JSON.parse(e.data.data)15    self.postMessage(parsed)16  }17}

Memory Management

Browser-imposed quotas vary by vendor and shift between releases. Treat the numbers below as orders-of-magnitude — always probe at runtime via navigator.storage.estimate() before you trust a budget.

Browser	WebSocket send-buffer (`bufferedAmount`) practical ceiling	IndexedDB quota (per origin)
Chrome	~1 GB before tab eviction	Up to ~60 % of disk; pool capped per origin
Firefox	~500 MB	~10 % of disk per group
Safari	~256 MB	~1 GB before user prompt

WebSocket backpressure (experimental):

1// Chrome 124+ WebSocketStream with backpressure2const ws = new WebSocketStream("wss://example.com")3const { readable, writable } = await ws.opened45const reader = readable.getReader()6while (true) {7  const { value, done } = await reader.read()8  if (done) break910  // Backpressure: read() naturally pauses when we can't keep up11  await processMessage(value)12}

Storage Quotas

1// Handle quota exceeded gracefully2async function cacheData(key: string, value: unknown): Promise<void> {3  const db = await openDB("cache", 1)45  try {6    await db.put("data", { key, value, timestamp: Date.now() })7  } catch (e) {8    if (e.name === "QuotaExceededError") {9      // Evict oldest entries10      const all = await db.getAll("data")11      all.sort((a, b) => a.timestamp - b.timestamp)1213      // Delete oldest 20%14      const toDelete = all.slice(0, Math.ceil(all.length * 0.2))15      for (const item of toDelete) {16        await db.delete("data", item.key)17      }1819      // Retry20      await db.put("data", { key, value, timestamp: Date.now() })21    } else {22      throw e23    }24  }25}

Mobile and Offline Considerations

Battery Optimization

Strategy	Impact	Implementation
Batch updates	High	Buffer messages for 1–5s before send
Adaptive polling	Medium	Increase interval on cellular
Binary protocols	Medium	MessagePack, Protocol Buffers
Sync on Wi-Fi only	High	Defer large sync until Wi-Fi detected

1// Network-aware sync strategy2type ConnectionType = "wifi" | "cellular" | "none"34function getConnectionType(): ConnectionType {5  const connection = (navigator as unknown as { connection?: { type: string } }).connection6  if (!connection) return "wifi" // Assume best7  return connection.type === "wifi" ? "wifi" : "cellular"8}910class NetworkAwareSync {11  private batchBuffer: unknown[] = []12  private flushTimer: number | null = null1314  send(message: unknown): void {15    this.batchBuffer.push(message)1617    const delay = getConnectionType() === "cellular" ? 2000 : 1001819    if (!this.flushTimer) {20      this.flushTimer = window.setTimeout(() => {21        this.flush()22      }, delay)23    }24  }2526  private flush(): void {27    if (this.batchBuffer.length === 0) return2829    this.ws.send(30      JSON.stringify({31        type: "batch",32        messages: this.batchBuffer,33      }),34    )3536    this.batchBuffer = []37    this.flushTimer = null38  }39}4041// Detect network type changes42navigator.connection?.addEventListener("change", () => {43  adjustSyncStrategy(getConnectionType())44})

Offline Queue

1// Persistent offline mutation queue2import { openDB, IDBPDatabase } from "idb"34type Mutation = {5  id: string6  action: string7  payload: unknown8  timestamp: number9}1011class OfflineQueue {12  private db: IDBPDatabase | null = null1314  async init(): Promise<void> {15    this.db = await openDB("offline-queue", 1, {16      upgrade(db) {17        db.createObjectStore("mutations", { keyPath: "id" })18      },19    })20  }2122  async enqueue(mutation: Omit<Mutation, "id" | "timestamp">): Promise<void> {23    const item: Mutation = {24      ...mutation,25      id: crypto.randomUUID(),26      timestamp: Date.now(),27    }28    await this.db!.add("mutations", item)29  }3031  async flush(): Promise<void> {32    const mutations = await this.db!.getAll("mutations")33    mutations.sort((a, b) => a.timestamp - b.timestamp)3435    for (const mutation of mutations) {36      try {37        await this.sendToServer(mutation)38        await this.db!.delete("mutations", mutation.id)39      } catch (e) {40        // Stop on first failure; retry later41        break42      }43    }44  }4546  private async sendToServer(mutation: Mutation): Promise<void> {47    const response = await fetch("/api/mutate", {48      method: "POST",49      headers: {50        "Content-Type": "application/json",51        "Idempotency-Key": mutation.id,52      },53      body: JSON.stringify(mutation),54    })5556    if (!response.ok) {57      throw new Error(`Server error: ${response.status}`)58    }59  }6061  async getPendingCount(): Promise<number> {62    return (await this.db!.getAll("mutations")).length63  }64}6566// Usage with online/offline detection67const queue = new OfflineQueue()68await queue.init()6970window.addEventListener("online", () => queue.flush())

Background Sync (Service Worker)

1// Service Worker background sync2self.addEventListener("sync", (event: SyncEvent) => {3  if (event.tag === "sync-mutations") {4    event.waitUntil(syncMutations())5  }6})78async function syncMutations(): Promise<void> {9  const db = await openDB("offline-queue", 1)10  const mutations = await db.getAll("mutations")1112  for (const mutation of mutations) {13    try {14      await fetch("/api/mutate", {15        method: "POST",16        headers: {17          "Content-Type": "application/json",18          "Idempotency-Key": mutation.id,19        },20        body: JSON.stringify(mutation),21      })22      await db.delete("mutations", mutation.id)23    } catch {24      // Will retry on next sync event25      return26    }27  }28}2930// Register from main thread31navigator.serviceWorker.ready.then((registration) => {32  return registration.sync.register("sync-mutations")33})

Failure Modes and Edge Cases

Common Failure Scenarios

Scenario	Symptom	Detection	Recovery
Network partition	Messages queued, no acks	Heartbeat timeout	Reconnect with sequence recovery
Server restart	WebSocket close event	`close` event handler	Exponential backoff reconnect
Message loss	Missing sequence numbers	Gap detection	Request replay from server
Duplicate delivery	Same message twice	Idempotency key check	Skip processing
Clock skew	LWW picks wrong winner	N/A (hard to detect)	Use server timestamps or HLC
Thundering herd	Server overload on recovery	Server-side monitoring	Jittered backoff
Split brain	Divergent state	Consensus protocol	CRDT convergence or manual resolve

Testing Checklist

Rapid connect/disconnect cycles (10x in 1 second)
Slow network simulation (3G, 500ms latency)
Large payloads (10MB+ messages)
Many concurrent tabs (hit connection limits)
Device sleep/wake cycles
Offline for extended periods (1+ hours)
Server restart during active session
Network type change (Wi-Fi to cellular)
Clock adjustment during session

Conclusion

Real-time sync client architecture balances three tensions: latency vs. consistency, offline support vs. conflict complexity, and simplicity vs. reliability. The right approach depends on your data model and user expectations:

Property-level LWW (Figma) for structured objects where concurrent edits to different properties should both succeed.
OT (Google Docs) for text documents requiring strong consistency and intent preservation through a central server.
CRDTs (Notion offline pages, Linear issue descriptions, Yjs/Automerge) for surfaces where two writers can edit the same characters concurrently or while offline.
Centralized total-order LWW (Linear core, Slack messages) for everything else where a single coordinator can assign order cheaply.
Full state sync for simpler applications where complexity isn’t justified.

Connection management is non-negotiable: exponential backoff with jitter, heartbeats for health detection, and sequence-based recovery for message continuity. Optimistic updates with rollback provide the instant feedback users expect while maintaining server authority.

The production systems that get this right—Figma, Linear, Discord, Slack—invest heavily in the sync layer because it defines the core user experience. A 100ms delay feels instant; a 500ms delay feels sluggish; anything over 1s feels broken.

Appendix

Prerequisites

WebSocket API fundamentals
Async JavaScript (Promises, async/await)
State management patterns
Basic distributed systems concepts (consensus, eventual consistency)

Terminology

Term	Definition
CRDT	Conflict-Free Replicated Data Type—data structures that merge without coordination
OT	Operational Transformation—algorithm that transforms concurrent operations to preserve intent
LWW	Last-Write-Wins—conflict resolution using timestamps
Optimistic update	Applying changes locally before server confirmation
Reconciliation	Process of merging divergent client and server state
Presence	Ephemeral user state (online status, cursor position)
Heartbeat	Periodic signal to detect connection health
Tombstone	Marker indicating deleted item (in CRDTs)
Vector clock	Logical timestamp tracking causal ordering across nodes
Idempotency	Property where repeated operations have same effect as single execution

Summary

Transport selection: WebSocket for bidirectional low-latency; SSE for server-push simplicity; polling as fallback
Connection resilience: Exponential backoff with jitter prevents thundering herd; heartbeats detect stale connections
Message handling: Sequence numbers for ordering; idempotency keys for deduplication
Optimistic updates: Apply locally, rollback on server rejection—users expect instant feedback
Conflict resolution: LWW for simple cases; OT for text; CRDTs for offline-first
State reconciliation: Server is authoritative; re-apply pending local mutations on top of server state
Presence: Ephemeral, doesn’t need persistence; throttle high-frequency updates (cursors)

References

Specifications:

RFC 6455: The WebSocket Protocol - Authoritative WebSocket specification
RFC 7692: Compression Extensions for WebSocket - Per-message DEFLATE compression
RFC 8441: Bootstrapping WebSockets with HTTP/2 - WebSocket multiplexing over HTTP/2
WHATWG HTML Living Standard: Server-sent events - SSE specification

Official Documentation:

MDN: WebSocket API - Browser WebSocket implementation
MDN: Server-Sent Events - SSE usage guide
MDN: IndexedDB - Client-side storage for offline support

Production Engineering Blogs:

Figma: How multiplayer technology works - Property-level LWW, fractional indexing
Figma: Making multiplayer more reliable - Checkpointing, write-ahead log
Notion: How we made Notion available offline - CRDT-based offline sync
Linear: Scaling the sync engine - Local-first architecture
Discord: How Discord stores trillions of messages - Gateway architecture, storage evolution
Slack: Real-time messaging - Channel servers, consistency guarantees

Research and Theory:

CRDT.tech - Comprehensive CRDT resources
Martin Kleppmann: CRDTs and the Quest for Distributed Consistency - CRDT fundamentals
Gabriel Gambetta: Client-Side Prediction and Server Reconciliation - Game networking patterns applicable to real-time apps