Design a Rich Text Editor

A web rich text editor is the place where browser quirks, IME composition, accessibility, and distributed systems all collide inside a single <div contenteditable>. This article unpacks the four design axes that decide everything else — document model, rendering surface, input pipeline, and collaboration architecture — and grounds them in how Google Docs, Figma, Notion, Linear, and Meta’s own editors actually work today.

Figure: Anatomy of a modern editor. Input events fan into a transaction engine, the engine produces immutable state, the reconciler syncs the DOM, and the collaboration layer rides on the same transaction stream.

Mental model

Every modern editor reduces to the same five-piece pipeline:

Input surface — contentEditable, hidden <textarea>, or synthetic event capture. Owns IME, paste, drag-and-drop.
Transaction engine — turns input events into atomic, reversible operations against a typed schema.
Document model — a flat operation log (Quill Delta) or a hierarchical tree of nodes with inline marks (ProseMirror, Slate, Lexical).
Reconciler — projects model state to the DOM, then maps DOM selection back to model positions.
Collaboration layer — rebases local transactions against remote ones using Operational Transform (OT) or a Conflict-free Replicated Data Type (CRDT).

The interesting design decisions live at the boundaries: how much of the input pipeline you delegate to the browser, whether the model is linear or hierarchical, and which collaboration model your transaction shape can support.

Why rich text on the web is hard

contentEditable was designed for forms, not document authoring. The four sharp edges every editor has to handle:

contentEditable inconsistency. Identical user gestures produce different DOM mutations across Chrome, Firefox, and Safari (e.g., the markup the browser inserts when you press Enter in an empty heading varies between <div>, <p>, and <br>).
IME composition. Input Method Editors for CJK and many other scripts compose characters across many keystrokes; the editor must allow composition to commit through the browser, not block it.
Selection edge cases. Cursor placement at zero-width block boundaries, triple-click semantics, and selection across nested blocks have no fully spec-defined behavior.
Undo granularity. What counts as one undoable action — a keystroke, a word boundary, a formatting change, a transaction batch — is a product decision the browser will not make for you.
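As a rough illustration of the undo-granularity decision, the sketch below coalesces consecutive edits of the same kind into one undo group when they arrive within a time window. The Edit type, the UndoHistory class, and the 500 ms window are assumptions made up for this example; they are not taken from any editor framework.

```typescript
// Hypothetical edit record; a real editor would store invertible transactions.
type Edit = { kind: "insertText" | "deleteText" | "format"; at: number; time: number }

class UndoHistory {
  private groups: Edit[][] = []

  constructor(private readonly mergeWindowMs = 500) {}

  // Coalesce with the previous group only for same-kind typing within the window.
  record(edit: Edit): void {
    const lastGroup = this.groups[this.groups.length - 1]
    const lastEdit = lastGroup ? lastGroup[lastGroup.length - 1] : undefined
    if (
      lastGroup &&
      lastEdit &&
      lastEdit.kind === edit.kind &&
      edit.kind !== "format" && // formatting changes always get their own group
      edit.time - lastEdit.time <= this.mergeWindowMs
    ) {
      lastGroup.push(edit)
    } else {
      this.groups.push([edit])
    }
  }

  // Pops the newest group; the caller would invert its edits in reverse order.
  undo(): Edit[] {
    return this.groups.pop() ?? []
  }

  get depth(): number {
    return this.groups.length
  }
}

const undoLog = new UndoHistory(500)
undoLog.record({ kind: "insertText", at: 0, time: 0 })
undoLog.record({ kind: "insertText", at: 1, time: 120 }) // merges with the first
undoLog.record({ kind: "format", at: 0, time: 200 }) // own group
undoLog.record({ kind: "insertText", at: 2, time: 2000 }) // pause: new group
```

Whether the boundary is a time window, a word boundary, or a transaction batch is the product decision; the grouping mechanism stays the same.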

document.execCommand — the only standardized rich-text formatting API — is now formally deprecated; the W3C Editing Working Group’s draft explicitly tells authors to use a JavaScript editor library instead and that the spec “is not expected to advance beyond [its] draft status”¹. The intended successors are split: Input Events Level 2 for intercepting edits (§5 of the W3C WD), and the EditContext API (W3C WD, shipped in Chromium 121 in January 2024) for decoupling editing state from the DOM and giving custom-rendered editors (canvas, virtualized lists) first-class IME support². Cross-browser EditContext support is still partial — Firefox and Safari do not implement it as of 2026-Q2 — so every modern editor still ships its own command layer on top of beforeinput.

Tip

If you only need a plain-text input that grows with its content, <div contenteditable="plaintext-only"> reached Baseline (Newly available) in March 2025. It strips formatting from paste, disables execCommand style commands, and keeps you out of the rich-text rabbit hole entirely.

Browser constraints worth budgeting for

Constraint	Impact	Mitigation
16ms frame budget at 60fps	Heavy DOM mutation blocks typing	Transaction batching, incremental rendering
DOM mutation overhead	Frequent updates cause layout thrashing	Virtual DOM or custom reconciler
Selection API gaps	Cannot programmatically place cursor in some empty states	Shadow / virtual selection tracking
`execCommand` deprecation	No standard formatting API	Custom command system per editor framework
`beforeinput` IME exception	`insertCompositionText` events are not cancelable	Allow composition; commit at `compositionend`

Scale factors that change the architecture

Factor	Small Scale	Large Scale
Document size	< 1,000 words	> 100,000 words
Concurrent editors	1–2	50+
Nesting depth	2–3 levels	Unlimited (outliners, nested tables)
Update frequency	< 1 / sec	> 10 / sec per user

Large documents force virtualization (rendering only visible blocks) and efficient position mapping. High concurrency demands robust conflict resolution — and that is the point at which your model choice locks you in.

Document model representations

The model is the highest-leverage decision in the architecture. Three shapes dominate; everything else is a variation on them.

Figure: Three document model shapes for the same content. Delta is a linear op log; ProseMirror is a node tree with marks held flat on inline runs; Lexical and Slate are typed node trees with decorator nodes for arbitrary embedded UI.

The model dictates everything downstream. Linear ops compose under OT but make non-linear structure (nested tables, columns, pivot blocks) painful. Tree+marks make structural edits cheap but require a normalization or schema layer to stay valid. Typed node trees with decorators (Lexical, Slate) give you the most freedom — and the most rope.

The two axes that pick your editor

The first decision is the document model: a linear sequence of operations, or a hierarchical tree of typed nodes with marks. The second decision is the rendering surface: lean on contentEditable, control it tightly via a reconciler, or replace it with a synthetic input pipeline. The grid below has four cells, but only three are practical paths; a custom renderer over a linear model is rare and usually ends up upgraded to a tree.

Axis	Linear (Delta)	Hierarchical (Node tree + marks)
`contentEditable`	Quill	Tiptap-flavored ProseMirror
Custom render	(rare; usually upgraded to a tree)	Lexical, modern Slate

Path 1 — `contentEditable` over a linear delta (Quill)

How it works. Quill renders into a contentEditable element, observes mutations, and translates them into Delta operations — a sequence of insert, retain, and delete ops with attribute payloads. The Delta itself is your document.

    import Quill from "quill"

    const doc = {
      ops: [
        { insert: "Hello " },
        { insert: "World", attributes: { bold: true } },
        { insert: "\n" },
      ],
    }

    const formatBold = {
      ops: [
        { retain: 6 },
        { retain: 5, attributes: { bold: true } },
      ],
    }

Best for. Comment composers, chat boxes, blog post editors. Slack historically used Quill for its message composer³; the format is OT-friendly and the browser gives you native typing feel and IME for free.

Trade-offs.

✅ Fast time to ship, native typing feel, native spell check and IME.
✅ Linear ops compose cleanly under OT.
❌ Browser inconsistencies leak through anywhere the model deviates from “paragraph of styled inline text”.
❌ Tables, nested lists, and arbitrary block embeds are awkward — you are continually fighting the browser to enforce structure.

Path 2 — Hierarchical model with controlled `contentEditable` (ProseMirror, Tiptap)

How it works. Define a strict Schema that names every node and mark and constrains valid nesting. All edits flow through Transaction objects against an immutable EditorState; a view keeps contentEditable in sync with state and parses unexpected mutations back into transactions⁴.

    import { Schema, NodeSpec, MarkSpec } from "prosemirror-model"

    const nodes: Record<string, NodeSpec> = {
      doc: { content: "block+" },
      paragraph: {
        content: "inline*",
        group: "block",
        parseDOM: [{ tag: "p" }],
        toDOM: () => ["p", 0],
      },
      heading: {
        attrs: { level: { default: 1 } },
        content: "inline*",
        group: "block",
        parseDOM: [1, 2, 3, 4, 5, 6].map((level) => ({
          tag: `h${level}`,
          attrs: { level },
        })),
        toDOM: (node) => [`h${node.attrs.level}`, 0],
      },
      text: { group: "inline" },
    }

    const marks: Record<string, MarkSpec> = {
      bold: {
        parseDOM: [{ tag: "strong" }, { tag: "b" }],
        toDOM: () => ["strong", 0],
      },
      link: {
        attrs: { href: {} },
        parseDOM: [{ tag: "a[href]", getAttrs: (dom) => ({ href: (dom as HTMLElement).getAttribute("href") }) }],
        toDOM: (node) => ["a", { href: node.attrs.href }, 0],
      },
    }

    const schema = new Schema({ nodes, marks })

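    import { EditorState, Transaction } from "prosemirror-state"

    // Builds a transaction that adds the bold mark to the current selection.
    function applyBold(state: EditorState): Transaction {
      const { from, to } = state.selection
      return state.tr.addMark(from, to, schema.marks.bold.create())
    }

    // `state` is an existing EditorState created with the schema above.
    // apply() returns a new state; the old one is left untouched.
    const newState = state.apply(applyBold(state))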

Best for. CMS authoring, documentation tools, and anywhere editorial structure must be enforced. The New York Times’ newsroom CMS, Oak, is built on ProseMirror with React and Redux⁵; the team has since open-sourced @nytimes/react-prosemirror for React-first rendering.

Trade-offs.

✅ Schema rejects invalid documents at the boundary; you cannot nest a heading inside a paragraph by accident.
✅ Transactions are atomic and serializable; collab, undo, and time-travel debugging fall out of the same primitive.
✅ Mature collaboration via prosemirror-collab.
❌ Steep learning curve; the API is spread across several deliberately low-level modules (model, state, view, transform), so even simple features take real assembly work.
❌ Still relies on contentEditable, so you inherit the browser’s behavior in the corners ProseMirror does not patch.

Path 3 — Fully custom rendering (Lexical, modern Slate)

How it works. Lexical attaches to a single contentEditable element but reconciles the DOM itself rather than letting React (or any UI framework) drive updates — Meta’s stated goal is lower latency and full control over keystroke-to-pixel time⁶. Mutations only happen inside editor.update(() => …) callbacks; reads use editor.read(). State is a tree of typed nodes (paragraph, text, custom decorator) with their own createDOM / updateDOM / decorate methods.

    import { $getRoot, $createParagraphNode, $createTextNode } from "lexical"

    editor.update(() => {
      const root = $getRoot()
      const paragraph = $createParagraphNode()
      const text = $createTextNode("Hello World")
      paragraph.append(text)
      root.append(paragraph)
    })

    editor.getEditorState().read(() => {
      const root = $getRoot()
      console.log(root.getTextContent())
    })

The $-prefix convention is deliberate: $getRoot(), $getSelection(), and the rest only work inside an update or read context. The rule is analogous to React Hooks: calling them anywhere else throws.

Best for. Pixel-perfect rendering, complex inline embeds (mentions, polls, attachments), and anywhere you need a small, tree-shakable core with a plugin-driven feature set. Meta uses Lexical for the composers across Facebook, Workplace, Messenger, WhatsApp Web, and Instagram Messenger⁶.

Trade-offs.

✅ Complete control; framework-agnostic core; explicit reconciliation.
✅ Headless: bring your own React / Vue / Svelte / vanilla bindings.
✅ Small core (≈22 KB min+gzip) — but realistic editors pull in selection, list, link, history, rich-text plugins on top.
❌ IME composition is the hardest part of any custom editor; because Lexical reconciles the DOM itself, expect to own more of the composition edge cases than with Quill or a stock ProseMirror setup.
❌ Accessibility is your responsibility — contentEditable carries some ARIA defaults; the more you replace, the more you must re-implement.

Choosing among them

Use the decision tree below as a starting point. The collaboration question (offline-first versus server-coordinated real-time) tends to dominate any other factor.

Figure: A first-cut decision tree for picking among Quill, ProseMirror, Slate, and Lexical, then layering Yjs for offline-first collaboration.

Factor	Quill (`contentEditable`)	ProseMirror (controlled)	Lexical (custom)	Slate (custom + plugins)
Bundle size	~57 KB gzip (Quill 2.x)	~50 KB+ gzip (model+state+view core)	~22 KB gzip core; plugins extra	~50 KB+ gzip (with `slate-react`)
Browser consistency	Poor	Good	Excellent	Good
Complex structures	Limited	Full (schema-enforced)	Full (custom nodes)	Full (schemaless, normalized)
IME handling	Native	Native	Mostly manual	Mostly manual
Accessibility	Native	Native + ARIA augmentation	Manual	Manual
Collaboration	OT-friendly (Delta)	OT (`prosemirror-collab`) or Yjs	Yjs via `@lexical/yjs`	Yjs via `slate-yjs`
Extensibility	Plugins	Schema + plugins	Nodes + commands + plugins	Plugins as editor decorators

Note

Bundle sizes are indicative of the core package only. Real editors layer toolbars, history, code blocks, lists, links, tables, and serializers on top — assume a production editor lands closer to 80–250 KB gzip regardless of framework.

Document models, in depth

ProseMirror — flat marks over a node tree

ProseMirror’s document is a tree of block-level nodes, but inline content inside each block is a flat sequence of text spans with a set of Mark annotations, not a nested DOM-style tree⁷. Bold + italic text is one text run with marks: [bold, italic], not <strong><em>…</em></strong>. This is the design choice that makes everything else easier:

Toggling a format never restructures the tree.
Overlapping ranges (partial bold, partial italic) are trivially representable.
Document positions are plain integer offsets into a flattened token sequence over the whole document (not tree paths), which makes position mapping cheap.
Marks are merged when applied twice; they do not nest in the model even though toDOM may render them as nested elements.
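To make the flat-marks idea concrete, here is a minimal sketch of adding a mark to a range of inline runs. The Run type and the addMark helper are hypothetical stand-ins written for this example; ProseMirror's real classes and method signatures differ.

```typescript
// Hypothetical inline run: a string plus a sorted set of mark names.
type Run = { text: string; marks: string[] }

// Merge neighbours that ended up with identical mark sets.
function mergeAdjacent(runs: Run[]): Run[] {
  const out: Run[] = []
  for (const run of runs) {
    const prev = out[out.length - 1]
    if (prev && prev.marks.join(",") === run.marks.join(",")) {
      out[out.length - 1] = { text: prev.text + run.text, marks: prev.marks }
    } else {
      out.push(run)
    }
  }
  return out
}

// Adds `mark` to characters in [from, to), splitting runs only at the boundaries.
function addMark(runs: Run[], from: number, to: number, mark: string): Run[] {
  const out: Run[] = []
  let pos = 0
  for (const run of runs) {
    const start = pos
    const end = pos + run.text.length
    pos = end
    const a = Math.max(from, start)
    const b = Math.min(to, end)
    if (a >= b) {
      out.push(run) // no overlap with the range
      continue
    }
    if (a > start) out.push({ text: run.text.slice(0, a - start), marks: run.marks })
    // Applying a mark twice is a no-op: marks form a set, they never nest.
    const marks = run.marks.includes(mark) ? run.marks : [...run.marks, mark].sort()
    out.push({ text: run.text.slice(a - start, b - start), marks })
    if (b < end) out.push({ text: run.text.slice(b - start), marks: run.marks })
  }
  return mergeAdjacent(out)
}

const plain: Run[] = [{ text: "Hello World", marks: [] }]
const bolded = addMark(plain, 0, 8, "bold") // "Hello Wo" is bold
const both = addMark(bolded, 6, 11, "italic") // "World" is italic, overlapping the bold range
```

The overlapping bold and italic ranges end up as three sibling runs ("Hello ", "Wo", "rld"); nothing in the structure had to be nested or re-parented.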

Position mapping is the foundation of prosemirror-collab:

    import { Mapping, StepMap } from "prosemirror-transform"

    // StepMap ranges come in triples: [oldStart, oldSize, newSize].
    // "At position 3, replace 0 chars with 5 chars."
    const map = new StepMap([3, 0, 5])
    const newPos = map.map(10) // 15
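For readers who want to see the arithmetic behind that mapping, the following dependency-free sketch maps a position through a list of [start, oldSize, newSize] triples. It is modeled on the behaviour described above but is not ProseMirror's code; mapPos and StepRange are names invented here, and the ranges are assumed to be sorted by start in old-document coordinates.

```typescript
// One replaced range: [start, oldSize, newSize].
type StepRange = [number, number, number]

// Maps `pos` from the old document into the new one.
// assoc = 1 keeps a position at an insertion point after the inserted content,
// assoc = -1 keeps it before.
function mapPos(ranges: StepRange[], pos: number, assoc: 1 | -1 = 1): number {
  let diff = 0 // net size change contributed by ranges that end before `pos`
  for (const [start, oldSize, newSize] of ranges) {
    if (start > pos) break
    const end = start + oldSize
    if (pos <= end) {
      // The position touches the replaced range: snap to one edge of the new content.
      const side = oldSize === 0 ? assoc : pos === start ? -1 : pos === end ? 1 : assoc
      return start + diff + (side < 0 ? 0 : newSize)
    }
    diff += newSize - oldSize
  }
  return pos + diff
}

const shifted = mapPos([[3, 0, 5]], 10) // 15: five characters were inserted before it
const untouched = mapPos([[3, 0, 5]], 2) // 2: the insertion happened after it
const insideDeletion = mapPos([[2, 4, 0]], 4) // 2: collapses to where the deleted range was
```

The assoc parameter is the part that matters for collaboration: it decides whether a cursor sitting exactly at an insertion point stays before or moves after the remote text.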

Slate — schemaless, normalized

Slate inverts ProseMirror’s stance: there is no schema. The document is a recursive tree of Element and Text nodes, and you enforce structure through normalization — a function that runs after every change and either fixes one invariant or no-ops⁸.

    import { Editor, Transforms, Element, Node } from "slate"

    const withParagraphsNormalized = (editor: Editor): Editor => {
      const { normalizeNode } = editor

      editor.normalizeNode = ([node, path]) => {
        if (Element.isElement(node) && node.type === "paragraph") {
          for (const [child, childPath] of Node.children(editor, path)) {
            if (Element.isElement(child) && !editor.isInline(child)) {
              Transforms.unwrapNodes(editor, { at: childPath })
              return // one fix per pass
            }
          }
        }
        normalizeNode([node, path])
      }

      return editor
    }

The “return after one fix” idiom matters: Slate re-runs normalizeNode after each change until the document stabilizes. Returning early keeps each fix atomic and avoids infinite loops. For multi-step edits where intermediate states would fail normalization, wrap with Editor.withoutNormalizing(editor, () => …) and pay one normalization at the end.

Quill Delta — linear, OT-native

Delta represents a document as the result of insert / retain / delete operations applied to an empty document. The same data structure represents both documents and edits, which is why Delta is a natural fit for OT.

    const doc = {
      ops: [
        { insert: "Hello " },
        { insert: "World", attributes: { bold: true } },
        { insert: "\n" },
      ],
    }

    const edit = {
      ops: [
        { retain: 6 },
        { insert: "Beautiful " },
      ],
    }

    // Composed: "Hello Beautiful World\n"
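The composition step can be sketched without the quill-delta package. The applyToText helper below is a hypothetical, plain-text-only illustration: it walks an edit's insert / retain / delete ops over the document's text and ignores attributes entirely, which the real library does not.

```typescript
type Attrs = Record<string, unknown>
type Op =
  | { insert: string; attributes?: Attrs }
  | { retain: number; attributes?: Attrs }
  | { delete: number }

// Applies `edit` to a document made only of inserts and returns the resulting text.
function applyToText(doc: Op[], edit: Op[]): string {
  const text = doc.map((op) => ("insert" in op ? op.insert : "")).join("")
  let cursor = 0 // read position in the old text
  let out = ""
  for (const op of edit) {
    if ("insert" in op) {
      out += op.insert // new content, cursor does not move
    } else if ("retain" in op) {
      out += text.slice(cursor, cursor + op.retain)
      cursor += op.retain
    } else {
      cursor += op.delete // skip over deleted characters
    }
  }
  return out + text.slice(cursor) // anything after the last op is implicitly retained
}

const doc: Op[] = [
  { insert: "Hello " },
  { insert: "World", attributes: { bold: true } },
  { insert: "\n" },
]
const edit: Op[] = [{ retain: 6 }, { insert: "Beautiful " }]

const composedText = applyToText(doc, edit) // "Hello Beautiful World\n"
```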

Concurrent edits are reconciled through Delta’s transform function, with explicit tie-break rules so that transform(A, B, "left") and transform(B, A, "right") converge (quill-delta itself expresses the side as a boolean priority argument). Without those rules, two clients applying insert "X" and insert "Y" at the same offset would diverge.
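The tie-break is easiest to see on plain string inserts. The sketch below is an assumption-laden miniature, not Delta's implementation: Insert, transformInsert, and the opWinsTies flag are invented for this example, with opWinsTies playing the role of the priority side.

```typescript
type Insert = { pos: number; text: string }

// Rewrites `op` so it can be applied after `other` has already been applied.
// On equal positions exactly one side must yield, otherwise replicas diverge.
function transformInsert(op: Insert, other: Insert, opWinsTies: boolean): Insert {
  const mustShift = other.pos < op.pos || (other.pos === op.pos && !opWinsTies)
  return mustShift ? { ...op, pos: op.pos + other.text.length } : op
}

function applyInsert(docText: string, op: Insert): string {
  return docText.slice(0, op.pos) + op.text + docText.slice(op.pos)
}

const base = "ab"
const a: Insert = { pos: 1, text: "X" }
const b: Insert = { pos: 1, text: "Y" }

// Client A applied `a` first and receives `b`; client B did the reverse.
// `a` is given priority on both sides, so both replicas agree.
const onClientA = applyInsert(applyInsert(base, a), transformInsert(b, a, false))
const onClientB = applyInsert(applyInsert(base, b), transformInsert(a, b, true))

// If both sides claimed priority, neither op would shift and the texts would differ.
const brokenA = applyInsert(applyInsert(base, a), transformInsert(b, a, true))
const brokenB = applyInsert(applyInsert(base, b), transformInsert(a, b, true))
```

With the tie-break in place both clients end at "aXYb"; with it removed, client A ends at "aYXb" while client B ends at "aXYb".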

Lexical — nodes plus commands

Lexical splits the model surface into three concerns⁹:

Nodes. RootNode, ParagraphNode, TextNode, plus app-defined ones extending ElementNode, TextNode, or DecoratorNode.
Commands. Typed command objects created with createCommand() and dispatched via editor.dispatchCommand(COMMAND, payload); built-ins include FORMAT_TEXT_COMMAND, INSERT_PARAGRAPH_COMMAND, and the cursor / selection commands.
Listeners. Registered via editor.registerCommand(COMMAND, listener, priority). Returning true stops propagation; priorities (COMMAND_PRIORITY_EDITOR through COMMAND_PRIORITY_CRITICAL) let plugins layer behavior cleanly.
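The priority-and-propagation behaviour described in the list above can be sketched with a stripped-down command bus. This is an illustration of the pattern only: CommandBus, its register and dispatch methods, and the numeric priorities are hypothetical and are not Lexical's implementation.

```typescript
// A command is an identity token that carries its payload type.
type Command<P> = { readonly name: string; readonly __payload?: P }
type Listener<P> = (payload: P) => boolean
type Entry = { priority: number; listener: Listener<any> }

function createCommand<P>(name: string): Command<P> {
  return { name }
}

class CommandBus {
  private readonly entries = new Map<Command<any>, Entry[]>()

  // Returns an unregister function, mirroring the usual plugin cleanup pattern.
  register<P>(command: Command<P>, listener: Listener<P>, priority: number): () => void {
    const list = this.entries.get(command) ?? []
    list.push({ priority, listener })
    list.sort((x, y) => y.priority - x.priority) // highest priority runs first
    this.entries.set(command, list)
    return () => {
      const i = list.findIndex((e) => e.listener === listener)
      if (i >= 0) list.splice(i, 1)
    }
  }

  // Returns true as soon as one listener reports the command as handled.
  dispatch<P>(command: Command<P>, payload: P): boolean {
    for (const { listener } of this.entries.get(command) ?? []) {
      if (listener(payload)) return true // handled: stop propagation
    }
    return false
  }
}

const FORMAT_TEXT = createCommand<"bold" | "italic">("FORMAT_TEXT")
const log: string[] = []
const bus = new CommandBus()

bus.register(FORMAT_TEXT, (format) => { log.push(`editor:${format}`); return true }, 0)
bus.register(FORMAT_TEXT, (format) => { log.push(`plugin:${format}`); return format === "italic" }, 4)

bus.dispatch(FORMAT_TEXT, "bold") // plugin declines, the editor-level listener handles it
bus.dispatch(FORMAT_TEXT, "italic") // plugin handles it, the editor-level listener never runs
```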

DecoratorNode is the escape hatch for embedding arbitrary UI (mentions, polls, image attachments, code blocks rendered with Shiki) — Lexical hands you a host element via createDOM() and renders whatever your decorate() returns into it via React.createPortal.

    import {
      DecoratorNode,
      NodeKey,
      SerializedLexicalNode,
      Spread,
    } from "lexical"

    type SerializedMentionNode = Spread<
      { userId: string; displayName: string },
      SerializedLexicalNode
    >

    export class MentionNode extends DecoratorNode<JSX.Element> {
      __userId: string
      __displayName: string

      static getType(): string {
        return "mention"
      }

      static clone(node: MentionNode): MentionNode {
        return new MentionNode(node.__userId, node.__displayName, node.__key)
      }

      constructor(userId: string, displayName: string, key?: NodeKey) {
        super(key)
        this.__userId = userId
        this.__displayName = displayName
      }

      createDOM(): HTMLElement {
        const span = document.createElement("span")
        span.className = "mention"
        return span
      }

      updateDOM(): false {
        return false
      }

      decorate(): JSX.Element {
        return <MentionComponent userId={this.__userId} name={this.__displayName} />
      }

      static importJSON(serialized: SerializedMentionNode): MentionNode {
        return new MentionNode(serialized.userId, serialized.displayName)
      }

      exportJSON(): SerializedMentionNode {
        return {
          type: "mention",
          version: 1,
          userId: this.__userId,
          displayName: this.__displayName,
        }
      }
    }

Input handling

A keystroke is the start of a longer pipeline. The browser fires beforeinput; the editor maps it to a command, the command produces a transaction, the transaction lands on a new immutable state, and the reconciler diffs that state into the DOM. The collaboration plugin and the history plugin both observe the same transaction stream.

Figure: Every keystroke flows through the same pipeline: beforeinput → command → transaction → schema-validated state.apply → reconciler. History and collab are plugins that ride on the transaction stream, not separate code paths.

Input Events Level 2 and `beforeinput`

The W3C Input Events Level 2 Working Draft defines beforeinput, fired before the browser modifies the DOM and (for almost every inputType) cancelable. This is the single best interception point for a custom editor.

    element.addEventListener("beforeinput", (e: InputEvent) => {
      switch (e.inputType) {
        case "insertText":
          e.preventDefault()
          insertText(e.data!)
          break

        case "insertParagraph":
          e.preventDefault()
          insertParagraph()
          break

        case "deleteContentBackward":
          e.preventDefault()
          deleteBackward()
          break

        case "insertFromPaste": {
          e.preventDefault()
          const html = e.dataTransfer?.getData("text/html")
          const text = e.dataTransfer?.getData("text/plain")
          // getData returns "" (not null) when a format is missing, so fall back with ||.
          handlePaste(html || text || "")
          break
        }

        case "insertCompositionText":
          // Spec: not cancelable during IME composition. Sync at compositionend.
          break
      }
    })

The most useful inputType values (full enum in §5.1 of the spec):

`inputType`	Trigger	Cancelable
`insertText`	Typing	Yes
`insertParagraph`	Enter	Yes
`insertLineBreak`	Shift+Enter	Yes
`deleteContentBackward`	Backspace	Yes
`deleteContentForward`	Delete	Yes
`insertFromPaste`	Paste	Yes
`insertCompositionText`	IME	No (during composition)
`historyUndo`	Ctrl/Cmd+Z	Yes
`historyRedo`	Ctrl+Y / Cmd+Shift+Z	Yes

IME composition — the rule that breaks naive editors

Input Method Editors compose characters across many keystrokes. Chinese pinyin-to-hanzi, Japanese kana-to-kanji, Korean hangul jamo composition, and many Indic scripts all flow through this state machine. The single rule the spec is unambiguous about: beforeinput events with inputType="insertCompositionText" are not cancelable¹⁰. Calling preventDefault on them is silently ignored, and an editor that then rewrites the DOM itself mid-composition corrupts the composition UI.

Figure: The IME composition sequence. Track isComposing between compositionstart and compositionend; never preventDefault the insertCompositionText beforeinput events; commit text on compositionend.

    let isComposing = false

    element.addEventListener("compositionstart", () => {
      isComposing = true
      // Snapshot selection so we can restore on cancel
    })

    element.addEventListener("compositionupdate", (e: CompositionEvent) => {
      showCompositionPreview(e.data) // provisional only
    })

    element.addEventListener("compositionend", (e: CompositionEvent) => {
      isComposing = false
      commitText(e.data) // final committed text
      clearCompositionPreview()
    })

    element.addEventListener("beforeinput", (e: InputEvent) => {
      if (isComposing || e.inputType === "insertCompositionText") {
        return // let the browser drive composition
      }
      // handle other input types as above
    })

The pattern: let the browser own the composition lifecycle, sync your model only at compositionend, and treat your model as out-of-sync with the DOM between compositionstart and compositionend. ProseMirror documents this drift explicitly; Lexical isolates it inside the reconciler.

Selection and Range

The Selection API is your bidirectional bridge between DOM coordinates (containers + offsets) and model positions.

    const selection = window.getSelection()
    if (!selection || selection.rangeCount === 0) return
    const range = selection.getRangeAt(0)

    const { startContainer, startOffset, endContainer, endOffset } = range
    const isCollapsed = range.collapsed

    const rect = range.getBoundingClientRect()
    positionToolbar(rect.left, rect.top - 40)

    const newRange = document.createRange()
    newRange.setStart(textNode, 5)
    newRange.setEnd(textNode, 10)
    selection.removeAllRanges()
    selection.addRange(newRange)

Selections that span block boundaries make range.commonAncestorContainer resolve to the nearest shared parent — usually the editor root. To enumerate selected nodes you walk the tree between startContainer and endContainer, which is one of the more bug-prone parts of any editor.
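The walk itself is a document-order traversal with two flags. The sketch below runs over a hypothetical TNode tree rather than real DOM nodes so that it stays self-contained; in a browser the same logic would typically be driven by a TreeWalker rooted at commonAncestorContainer.

```typescript
// Hypothetical minimal tree: containers have children, leaves stand in for text nodes.
type TNode = { id: string; children?: TNode[] }

// Collects leaf ids in document order from `startId` to `endId`, inclusive.
function leavesBetween(root: TNode, startId: string, endId: string): string[] {
  const out: string[] = []
  let started = false
  let finished = false

  const visit = (node: TNode): void => {
    if (finished) return
    if (node.children) {
      for (const child of node.children) visit(child)
      return
    }
    if (node.id === startId) started = true
    if (started) out.push(node.id)
    if (started && node.id === endId) finished = true
  }

  visit(root)
  return out
}

const editorRoot: TNode = {
  id: "root",
  children: [
    { id: "p1", children: [{ id: "t1" }, { id: "t2" }] },
    { id: "p2", children: [{ id: "t3" }] },
    { id: "p3", children: [{ id: "t4" }] },
  ],
}

// A selection from the end of the first paragraph into the second one.
const selected = leavesBetween(editorRoot, "t2", "t3") // ["t2", "t3"]
```

The bugs tend to live in what this sketch leaves out: partial offsets inside the first and last leaf, and empty blocks that contain no text node at all.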

Paste sanitization

Paste is the editor’s largest XSS surface. The clipboard payload arrives as DataTransfer on beforeinput with inputType: "insertFromPaste" (or as a paste event when beforeinput is preempted), and text/html from a hostile site is unrestricted HTML — <script>, <iframe>, <img onerror=…>, javascript: URLs, all of it. Inserting that into the editor DOM via innerHTML would be a stored XSS the moment another viewer loads the document.

Figure: The minimum-viable paste pipeline. Parse in a detached DOMParser, strip vendor cruft, run through DOMPurify, then funnel through the editor's own schema parser before the transaction lands.

The minimum-viable pipeline:

Read the best payload from DataTransfer. Prefer text/html for rich sources, fall back to text/plain, special-case files for image upload.
Parse in a detached document. new DOMParser().parseFromString(html, "text/html") — this never executes scripts, never loads images, and never runs CSS the way innerHTML would.
Strip vendor cruft. Microsoft Word injects mso-* style declarations, Mso* class names, and conditional comments; Google Docs wraps content in extra styled spans. Both can add hundreds of bytes of markup per pasted character. Drop it before the schema sees it.
Run DOMPurify¹¹ (or the in-development HTML Sanitizer API) to strip <script>, <iframe>, on* event handlers, javascript: / data: URLs in attributes that take URLs, and SVG event handlers. Use a default-deny allowlist; do not try to blocklist tags.
Pass through the editor schema parser. ProseMirror exposes transformPastedHTML, clipboardParser, and transformPasted to inject sanitization at three points in its pipeline; Lexical exposes $generateNodesFromDOM. The schema parser drops any node that does not match the allowed node spec, which is your second line of defense.
Apply as a transaction, not a direct DOM insert. The transaction goes through the same reconciler, undo, and collab paths as a keystroke.
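The allowlist step in the pipeline above can be illustrated on a tiny in-memory tree. This is a teaching sketch under stated assumptions, not a replacement for DOMPurify: HtmlNode is a hypothetical stand-in for the detached DOM, and the tag, attribute, and URL-scheme lists are deliberately small examples.

```typescript
// Hypothetical parsed-node shape standing in for the detached document.
type HtmlNode =
  | { kind: "text"; value: string }
  | { kind: "element"; tag: string; attrs: Record<string, string>; children: HtmlNode[] }

const ALLOWED_TAGS = new Set(["p", "strong", "em", "a", "ul", "ol", "li", "br"])
const ALLOWED_ATTRS: Record<string, string[]> = { a: ["href"] }
// Elements whose text content must not leak into the document when they are removed.
const DROP_WITH_CONTENT = new Set(["script", "style", "iframe", "object"])
const SAFE_URL = /^(https?:|mailto:)/i

function sanitize(node: HtmlNode): HtmlNode[] {
  if (node.kind === "text") return [node]
  if (DROP_WITH_CONTENT.has(node.tag)) return []

  const children: HtmlNode[] = []
  for (const child of node.children) children.push(...sanitize(child))

  // Default deny: any tag not on the allowlist is unwrapped, keeping only its safe children.
  if (!ALLOWED_TAGS.has(node.tag)) return children

  // Attributes are copied only if allowlisted for this tag and the URL scheme is safe.
  const attrs: Record<string, string> = {}
  for (const name of ALLOWED_ATTRS[node.tag] ?? []) {
    const value = node.attrs[name]
    if (value !== undefined && SAFE_URL.test(value.trim())) attrs[name] = value.trim()
  }
  return [{ kind: "element", tag: node.tag, attrs, children }]
}

const pasted: HtmlNode = {
  kind: "element", tag: "div", attrs: { class: "MsoNormal" }, children: [
    { kind: "element", tag: "script", attrs: {}, children: [{ kind: "text", value: "alert(1)" }] },
    { kind: "element", tag: "p", attrs: { onclick: "steal()" }, children: [
      { kind: "element", tag: "a", attrs: { href: "javascript:alert(1)" }, children: [{ kind: "text", value: "link" }] },
      { kind: "text", value: " ok" },
    ] },
  ],
}

const clean = sanitize(pasted) // one <p> holding a bare <a> and the text " ok"
```

The wrapper div is unwrapped, the script disappears together with its text, and both the onclick handler and the javascript: href are gone because nothing allowlisted them.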

Caution

A restrictive Content-Security-Policy (e.g., style-src 'self') on the parent document is inherited by DOMParser in Chromium, so inline styles from Google Docs and Word paste are silently dropped — Lexical issue #4051 tracks the resulting formatting loss. If your editor depends on round-tripping inline styles from Office paste, allow 'unsafe-inline' for style-src only on the editor route, or strip-and-rebuild styles inside the sanitizer.

Important

Client-side sanitization is defense in depth, not a substitute for server-side validation. Re-sanitize on the server before persisting and again before rendering to other viewers. The text/html paste payload is attacker-controlled the moment a hostile page is the source.

Collaboration architectures

The choice between Operational Transform and CRDT is rarely about which is “better” in the abstract; it is about whether you can guarantee a central authority for ordering. The trade is exactly that: server authority versus offline correctness.

Figure: Two users insert at the same position. OT depends on the server picking an order and then transforming the second op; CRDT uses unique IDs and a deterministic tie-break, so peer-to-peer merge converges with no server.

Operational Transform

OT lets two clients apply local operations optimistically. The server fixes a total order, and each operation is transformed against the concurrent operations ordered before it, so that it applies as if it had been issued after them. A central server is the source of truth for ordering.

    type Op =
      | { type: "insert"; pos: number; text: string }
      | { type: "delete"; pos: number; len: number }

    function transform(op1: Op, op2: Op): Op {
      if (op1.type === "insert" && op2.type === "insert") {
        if (op1.pos < op2.pos) return op1
        if (op1.pos > op2.pos) return { ...op1, pos: op1.pos + op2.text.length }
        // Same position: tie-break by clientID elsewhere
        return op1
      }

      if (op1.type === "insert" && op2.type === "delete") {
        if (op1.pos <= op2.pos) return op1
        if (op1.pos >= op2.pos + op2.len) return { ...op1, pos: op1.pos - op2.len }
        return { ...op1, pos: op2.pos }
      }

      // ... and the rest of the matrix
      return op1
    }

ProseMirror sidesteps the combinatorial explosion of “transform every op type against every op type” by transforming positions through StepMaps rather than transforming ops against ops directly:

    import { sendableSteps, receiveTransaction } from "prosemirror-collab"

    // Apply remote steps (server is authority on ordering).
    // receiveTransaction also rebases any unconfirmed local steps over them.
    const tr = receiveTransaction(state, remoteSteps, remoteClientIDs)
    const newState = state.apply(tr)

    // Send unconfirmed local steps. The authority only accepts them if `version`
    // is current; otherwise the client pulls the missing steps and retries.
    const sendable = sendableSteps(newState)
    if (sendable) {
      // Assumes `socket` is a WebSocket, which needs a string payload.
      socket.send(JSON.stringify({
        version: sendable.version,
        steps: sendable.steps.map((s) => s.toJSON()),
        clientID: sendable.clientID,
      }))
    }
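The server half of that exchange can be very small. The following in-memory sketch follows the central-authority idea from the ProseMirror collaboration guide, but the Authority class, its method names, and the opaque Step type are assumptions for illustration; a real server would also apply the steps to a document and persist them.

```typescript
// Steps are opaque here; a real server would hold serialized ProseMirror steps.
type Step = { json: unknown }
type StepBatch = { version: number; steps: Step[]; clientIDs: string[] }

class Authority {
  private steps: Step[] = []
  private clientIDs: string[] = []

  // The version is simply the number of steps accepted so far.
  get version(): number {
    return this.steps.length
  }

  // Accept a batch only if the sender had seen every previously accepted step.
  receive(version: number, steps: Step[], clientID: string): boolean {
    if (version !== this.version) return false // stale: client must catch up and resend
    for (const step of steps) {
      this.steps.push(step)
      this.clientIDs.push(clientID)
    }
    return true
  }

  // What a client at `version` is missing, in the server's order.
  stepsSince(version: number): StepBatch {
    return {
      version: this.version,
      steps: this.steps.slice(version),
      clientIDs: this.clientIDs.slice(version),
    }
  }
}

const authority = new Authority()
const acceptedA = authority.receive(0, [{ json: "a1" }, { json: "a2" }], "client-A") // true
const rejectedB = authority.receive(0, [{ json: "b1" }], "client-B") // false: B is behind
const missing = authority.stepsSince(0) // B applies these, rebases, then retries
const acceptedB = authority.receive(missing.version, [{ json: "b1-rebased" }], "client-B") // true
```

Note that the server never transforms anything in this model; rejecting stale batches pushes the rebase work onto the client.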

Operational profile.

✅ Battle-tested at Google Docs scale.
✅ Server is the single source of truth — easy to reason about consistency, undo, history, and access control.
❌ The server is a hard dependency and a coordination bottleneck.
❌ Long offline sessions need careful rebase logic; in some OT models, very stale clients have to be force-reset.

CRDTs (Yjs / YATA)

CRDTs embed enough metadata in the data structure that concurrent operations always converge without a coordinator. Yjs implements an optimized variant of YATA (Yet Another Transformation Approach), assigning each item a unique (clientID, clock) ID and using tombstones for deletions¹².

    import * as Y from "yjs"
    import { WebsocketProvider } from "y-websocket"
    import { ySyncPlugin, yCursorPlugin, yUndoPlugin } from "y-prosemirror"

    const ydoc = new Y.Doc()
    const provider = new WebsocketProvider("wss://collab.example.com", "doc-room", ydoc)
    const yXmlFragment = ydoc.getXmlFragment("prosemirror")

    const plugins = [
      ySyncPlugin(yXmlFragment),
      yCursorPlugin(provider.awareness),
      yUndoPlugin(),
    ]

Conceptually, inserting 'X' after the item with id id3:

    "Hello" → [(id1,'H'), (id2,'e'), (id3,'l'), (id4,'l'), (id5,'o')]

    A: insert (idA, 'X', after: id3)
    B: delete id4 (mark as tombstone)

    Merge: append (idA, 'X') after id3, hide id4 → "HelXo"
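The same scenario can be run end to end with a small sequence CRDT. The sketch below is an RGA-style simplification written for this article, not Yjs or YATA: Replica, Item, and the ordering rule (skip right-hand neighbours with a newer ID) are assumptions, and it presumes operations arrive in causal order with no duplicates.

```typescript
type Id = { client: string; clock: number }
type Item = { id: Id; origin: Id | null; char: string; deleted: boolean }

const sameId = (a: Id, b: Id): boolean => a.client === b.client && a.clock === b.clock
// Total order on ids: higher clock first, client name breaks ties.
const newer = (a: Id, b: Id): boolean =>
  a.clock > b.clock || (a.clock === b.clock && a.client > b.client)

class Replica {
  items: Item[] = []
  private clock = 0

  constructor(readonly client: string) {}

  // Creates an item after `origin` (null = start of document) and integrates it locally.
  localInsert(char: string, origin: Id | null): Item {
    const item: Item = { id: { client: this.client, clock: ++this.clock }, origin, char, deleted: false }
    this.integrate(item)
    return item
  }

  // Places a local or remote item deterministically.
  integrate(item: Item): void {
    this.clock = Math.max(this.clock, item.id.clock) // Lamport clock update
    let i = 0
    if (item.origin !== null) {
      const origin = item.origin
      i = this.items.findIndex((x) => sameId(x.id, origin)) + 1
    }
    // Skip concurrent items with a newer id so every replica picks the same slot.
    for (; i < this.items.length; i++) {
      const next = this.items[i]
      if (!next || !newer(next.id, item.id)) break
    }
    this.items.splice(i, 0, item)
  }

  // Deletion only marks a tombstone; the item stays as an anchor for other inserts.
  remove(id: Id): void {
    const target = this.items.find((x) => sameId(x.id, id))
    if (target) target.deleted = true
  }

  text(): string {
    return this.items.filter((x) => !x.deleted).map((x) => x.char).join("")
  }
}

const replicaA = new Replica("A")
const replicaB = new Replica("B")
const h = replicaA.localInsert("H", null)
const e = replicaA.localInsert("e", h.id)
const l1 = replicaA.localInsert("l", e.id)
const l2 = replicaA.localInsert("l", l1.id)
const o = replicaA.localInsert("o", l2.id)
for (const item of [h, e, l1, l2, o]) replicaB.integrate({ ...item })

// Concurrent edits: A inserts 'X' after the first 'l', B deletes the second 'l'.
const insX = replicaA.localInsert("X", l1.id)
replicaB.remove(l2.id)
replicaB.integrate({ ...insX })
replicaA.remove(l2.id)

// Concurrent inserts at the same position: the id order decides, on both replicas.
const bang = replicaA.localInsert("!", o.id)
const question = replicaB.localInsert("?", o.id)
replicaA.integrate({ ...question })
replicaB.integrate({ ...bang })
```

Both replicas end at "HelXo?!". The order of "?" and "!" comes from comparing IDs, which is exactly the "determined by ID order, not user intent" behaviour that CRDTs trade for coordinator-free merging.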

Operational profile.

✅ No coordinator — peers can sync directly or through a relay; offline-first comes for free.
✅ Simpler mental model than OT for arbitrary structures (maps, arrays, text, XML).
❌ Per-character metadata costs; a multi-megabyte document can balloon into tens of megabytes of CRDT state without compaction.
❌ Tombstones accumulate. Yjs amortizes this with garbage collection of fully-deleted runs and binary encoding, but it is still an operational concern.
❌ Some “concurrent insert at same position” outcomes are determined by ID order, not user intent — usually fine for prose, occasionally surprising for tabular data.

How the production systems actually do it

The marketing copy for OT vs CRDT is much cleaner than what production systems ship. Almost every large editor is hybrid in some way.

Product	Real architecture	Source
Google Docs	Server-authoritative OT; optimistic local apply; server transforms and broadcasts	Widely cited; see Figma’s blog summary below
Figma	Centralized, CRDT-inspired — server orders all events; LWW for many properties; not pure OT, not pure CRDT	How Figma’s multiplayer technology works (Evan Wallace, 2019)
Notion	Block-based; transactions queued client-side, persisted to IndexedDB / SQLite, sent via `/saveTransactions`; server pushes version updates over WebSocket	The data model behind Notion
Linear	Custom sync engine, not Yjs; in-memory MobX object graph + IndexedDB; last-write-wins by default, with CRDTs only for specific text fields like issue descriptions	Scaling the Linear Sync Engine, and the reverse-engineered architecture writeup

Caution

The popular framing that “Figma migrated from OT to CRDTs” is wrong. Figma evaluated both during prototyping, judged OT “overkill” for design objects, and built a centralized, server-authoritative system inspired by CRDTs — never adopting either as-is. The “Linear is built on Yjs” framing is also wrong; Linear’s core sync engine is proprietary, with Yjs / CRDTs reserved for specific collaborative text fields.

Performance for large documents

Virtualization and chunking

Documents with tens of thousands of blocks need virtualization — render only what is on screen plus a small buffer. The simplest pattern uses IntersectionObserver to track which block sentinels are in view and renders a window of [start - buffer, end + buffer].

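```typescript
import { useEffect, useRef, useState, type RefObject } from "react"

interface Block {
  id: string
  type: string
  content: string
}

const BUFFER = 5

function useVirtualBlocks(blocks: Block[], containerRef: RefObject<HTMLElement>) {
  const [visibleRange, setVisibleRange] = useState({ start: 0, end: 20 })
  // Indices currently on screen. Observer callbacks only report entries whose
  // state changed, so the full set has to be tracked across callbacks.
  const visibleIndices = useRef(new Set<number>())

  useEffect(() => {
    const container = containerRef.current
    if (!container) return
    visibleIndices.current.clear()

    const observer = new IntersectionObserver(
      (entries) => {
        for (const entry of entries) {
          const index = parseInt((entry.target as HTMLElement).dataset.index ?? "", 10)
          if (Number.isNaN(index)) continue
          if (entry.isIntersecting) visibleIndices.current.add(index)
          else visibleIndices.current.delete(index)
        }
        if (visibleIndices.current.size > 0) {
          const visible = [...visibleIndices.current]
          setVisibleRange({
            start: Math.max(0, Math.min(...visible) - BUFFER),
            // +1 because slice() treats the end index as exclusive.
            end: Math.min(blocks.length, Math.max(...visible) + BUFFER + 1),
          })
        }
      },
      { root: container, threshold: 0 },
    )

    // Observe every rendered block. Each block element must carry a
    // data-index attribute holding its absolute index in `blocks`.
    container
      .querySelectorAll<HTMLElement>("[data-index]")
      .forEach((el) => observer.observe(el))

    return () => observer.disconnect()
    // Re-run when the window moves so newly mounted blocks get observed.
  }, [blocks.length, visibleRange.start, visibleRange.end])

  return blocks.slice(visibleRange.start, visibleRange.end)
}
```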

Slate exposes an experimental chunking API in slate-react for the same purpose: assign editor.getChunkSize and pass renderChunk to <Editable />¹³. Chunking only works when a node’s children are all blocks (not inline runs), so it is best applied at the editor root.

```typescript
import { Editor, Node } from "slate"

const withChunking = (editor: Editor): Editor => {
  editor.getChunkSize = (node: Node) => (Editor.isEditor(node) ? 1000 : null)
  return editor
}

// In your component:
// <Editable renderChunk={renderChunk} ... />
```

CSS content-visibility: auto gives a complementary layer of paint-skipping for off-screen blocks; combined with a virtualized window, it helps keep scrolling and typing within a 60 fps frame budget on long documents.

Memory budget

For very large documents the dominant techniques and their trade-offs:

Technique	Use case	Trade-off
Piece tables	Text-heavy documents	Complex range queries; tricky to serialize
Lazy node loading	Outline / tree views	Latency on expand
`WeakMap` caches	Computed values per node	GC unpredictability
LRU eviction	Undo history, decoration cache	Lost undo steps
CRDT GC + binary	Long-lived collaborative docs	Periodic pause for compaction
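
As one concrete instance of the table, LRU eviction for a decoration cache fits in a few lines. This is a minimal sketch that leans on `Map` iterating in insertion order; the `LruCache` class and the block IDs are illustrative assumptions, not taken from any of the editors above.

```typescript
// Minimal LRU cache relying on Map's insertion-order iteration.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>()
  private readonly capacity: number

  constructor(capacity: number) {
    this.capacity = capacity
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined
    const value = this.entries.get(key) as V
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key)
    this.entries.set(key, value)
    return value
  }

  set(key: K, value: V): void {
    this.entries.delete(key)
    this.entries.set(key, value)
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as K
      this.entries.delete(oldest)
    }
  }

  has(key: K): boolean {
    return this.entries.has(key)
  }
}

// Hypothetical usage: cache computed decorations per block ID.
const decorations = new LruCache<string, string[]>(2)
decorations.set("block-a", ["spellcheck"])
decorations.set("block-b", ["mention"])
decorations.get("block-a") // touch block-a so it is no longer the oldest
decorations.set("block-c", ["link"]) // over capacity: evicts block-b
```

The same structure applied to undo history is where the "lost undo steps" trade-off shows up: whatever falls off the end is gone for good.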

Accessibility floor

ARIA contracts

A contentEditable element gives you implicit text-input semantics, but assistive tech still benefits from explicit role and labeling. For a multiline editor:

```html
<div
  role="textbox"
  aria-multiline="true"
  aria-label="Document editor"
  contenteditable="true"
></div>
```

When you replace contentEditable with a fully synthetic surface (Lexical with custom rendering, large parts of modern Slate), role="application" plus a role="document" inner region is the common pattern, but you must reimplement keyboard semantics that the browser otherwise gave you for free, including arrow-key cursor movement and selection extension. The ARIA Authoring Practices Guide is the authoritative reference for the role/state combinations.

Keyboard contract

Minimum keyboard support a senior engineer should expect:

Key	Action
Arrow keys	Move cursor
Shift + Arrow	Extend selection
Ctrl/Cmd + A	Select all
Ctrl/Cmd + Z / Y / Shift+Z	Undo / redo
Ctrl/Cmd + B / I / U	Bold / italic / underline
Tab	Indent (in lists) or focus next element
Escape	Exit nested block / dismiss popover
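
A rough sketch of how the modifier shortcuts in this table might be dispatched. `KeyLike` is a structural subset of `KeyboardEvent` so the function runs without a DOM; the command names and the `commandFor` helper are assumptions made for illustration.

```typescript
// Structural subset of KeyboardEvent, so the function is testable without a DOM.
interface KeyLike {
  key: string
  ctrlKey: boolean
  metaKey: boolean
  shiftKey: boolean
  altKey: boolean
}

type Command = "selectAll" | "undo" | "redo" | "bold" | "italic" | "underline"

const SHORTCUTS = new Map<string, Command>([
  ["a", "selectAll"],
  ["z", "undo"],
  ["shift+z", "redo"],
  ["y", "redo"],
  ["b", "bold"],
  ["i", "italic"],
  ["u", "underline"],
])

function commandFor(event: KeyLike, isMac: boolean): Command | null {
  // Cmd is the primary modifier on macOS, Ctrl elsewhere.
  const primary = isMac ? event.metaKey : event.ctrlKey
  // Ignore Alt combinations: AltGr reports ctrlKey + altKey on Windows.
  if (!primary || event.altKey) return null
  // Lowercase the key because Shift turns "z" into "Z".
  const chord = (event.shiftKey ? "shift+" : "") + event.key.toLowerCase()
  return SHORTCUTS.get(chord) ?? null
}
```

In a real editor the returned command would feed the transaction engine, and the handler would call `preventDefault()` only when a command matched, so unhandled chords still reach the browser.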

Test against the actual combinations users run, not just one screen reader on one OS.

Screen reader	Browser	Platform
NVDA	Firefox, Chrome	Windows
JAWS	Chrome, Edge	Windows
VoiceOver	Safari	macOS, iOS
TalkBack	Chrome	Android

Custom-rendered editors are particularly fragile here: any change to the DOM the screen reader is reading from can be reported as a stream of insert / delete events instead of a coherent edit. Test before you ship.

Common footguns

Fighting `contentEditable`

Symptom. Pressing Enter in a heading spawns a <div> instead of a <p>. Backspace at the start of a list item deletes random structure. Fix. Stop trying to enforce structure inside the browser’s editing handler. Use ProseMirror or Tiptap to control the structure, or Lexical to replace the editing handler entirely.

Ignoring IME

Symptom. Editor works in English but corrupts CJK input — characters double, composition UI flickers, the document state diverges from the screen. Fix. Track isComposing between compositionstart and compositionend. Never preventDefault on inputType: "insertCompositionText". Sync your model on compositionend.
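
The composition rule can be sketched as a small state machine. The event shapes below mirror the DOM `compositionstart`, `compositionend`, and `beforeinput` events, but `EditorEvent` and `CompositionGuard` are hypothetical names, the model is an append-only string, and the exact event order differs between browsers, so treat this as a starting point rather than a complete handler.

```typescript
type EditorEvent =
  | { type: "compositionstart" }
  | { type: "compositionend"; data: string }
  | { type: "beforeinput"; inputType: string; data: string | null }

class CompositionGuard {
  isComposing = false
  model = "" // stand-in for the real document model (append-only here)

  // Returns true when the caller should call event.preventDefault().
  handle(event: EditorEvent): boolean {
    switch (event.type) {
      case "compositionstart":
        this.isComposing = true
        return false
      case "compositionend":
        // Sync the model once, with the text the IME committed.
        this.isComposing = false
        this.model += event.data
        return false
      case "beforeinput":
        if (this.isComposing || event.inputType === "insertCompositionText") {
          return false // let the browser and the IME own the DOM for now
        }
        if (event.inputType === "insertText" && event.data !== null) {
          this.model += event.data
          return true // the editor applies this change itself
        }
        return false
    }
  }
}
```

The important property is that nothing is prevented and the model is not touched between `compositionstart` and `compositionend`; intermediate candidates never reach the document state.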

Single-keystroke undo granularity

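Symptom. Each character is its own undo step; a one-line edit takes 40 undos to revert. Fix. Apply each change immediately, but commit a single undo entry only after typing pauses:

```typescript
// Hypothetical hooks into your editor's history; the names are illustrative.
declare function applyChangeWithoutUndoEntry(): void
declare function createUndoEntry(): void

let undoTimeout: number | null = null

function onInput() {
  // Restart the idle timer on every keystroke.
  if (undoTimeout !== null) clearTimeout(undoTimeout)
  applyChangeWithoutUndoEntry()
  // Commit one undo entry once typing has paused for 500 ms.
  undoTimeout = window.setTimeout(() => {
    createUndoEntry()
    undoTimeout = null
  }, 500)
}
```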

ProseMirror’s history plugin and Lexical’s history plugin both batch by time and op type out of the box; if you are rolling your own, debounce by ~500 ms and break the batch on word boundaries, formatting changes, and selection-only events.
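
For a hand-rolled history, the batching rules in that paragraph might look like the sketch below. It is not how ProseMirror or Lexical implement their history plugins; `UndoStack`, `Change`, and the 500 ms constant are assumptions, and timestamps are passed in explicitly so the grouping logic stays deterministic.

```typescript
type ChangeKind = "insert" | "delete" | "format"
interface Change { kind: ChangeKind; text: string }

const IDLE_MS = 500
const isWhitespace = (s: string): boolean => s.trim() === ""

class UndoStack {
  readonly groups: Change[][] = []
  private lastChangeAt = Number.NEGATIVE_INFINITY

  record(change: Change, now: number): void {
    const group = this.groups[this.groups.length - 1]
    const previous = group?.[group.length - 1]
    const startsNewGroup =
      group === undefined ||
      previous === undefined ||
      now - this.lastChangeAt > IDLE_MS || // typing paused
      change.kind !== previous.kind || // switched between insert / delete / format
      change.kind === "format" || // formatting is always its own step
      (isWhitespace(previous.text) && !isWhitespace(change.text)) // a new word begins

    if (startsNewGroup) this.groups.push([change])
    else group.push(change)
    this.lastChangeAt = now
  }

  // Break the current batch, e.g. after a selection-only event.
  seal(): void {
    this.lastChangeAt = Number.NEGATIVE_INFINITY
  }
}
```

Typing "hi there" at a steady pace produces two undo groups ("hi " and "there"); a pause longer than 500 ms, or a call to `seal()`, starts a third.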

Memory growth under long collab sessions

Symptom. Memory creeps up over an 8-hour editing session. Fix. Garbage-collect CRDT tombstones periodically, cap undo history depth, and disconnect inactive presence cursors. Yjs documents the GC controls explicitly; OT-based systems must implement equivalent cleanup themselves.

Selection lost across async work

Symptom. After a paste or async lookup, the cursor jumps to the start of the document. Fix. Snapshot selection before the async call, restore before applying changes:

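```typescript
import { Editor, Node, Transforms } from "slate"

// Assumed to exist in the surrounding module: the Slate editor instance and
// a hypothetical async helper that turns pasted HTML into Slate nodes.
declare const editor: Editor
declare function processHtml(html: string): Promise<Node[]>

async function handlePaste(html: string) {
  // Snapshot the selection before yielding to the event loop.
  const savedSelection = editor.selection
  const processed = await processHtml(html)
  Editor.withoutNormalizing(editor, () => {
    // Restore the caret first so the fragment lands where the user pasted.
    if (savedSelection) {
      Transforms.select(editor, savedSelection)
    }
    Transforms.insertFragment(editor, processed)
  })
}
```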

Production case studies

Notion — block tree, transaction queue, push-based sync

Every piece of content in Notion is a block — paragraph, image, page, database, embed — keyed by a UUIDv4 and arranged in a tree via parent and content pointers¹⁴. Edits become operations against the local RecordCache, get queued in a TransactionQueue (persisted in IndexedDB or SQLite), and flush to the server via /saveTransactions. The server commits and notifies a MessageStore, which pushes version updates over WebSocket to subscribed clients; clients reconcile their local cache with syncRecordValues.
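
A minimal sketch of that shape, assuming a much simpler operation set than Notion's real one. `LocalStore`, the two operation kinds, and the fixed IDs are illustrative; only the block tree with parent and content pointers and the idea of a queued transaction come from the description above.

```typescript
interface Block {
  id: string
  type: string
  parentId: string | null
  content: string[] // ordered child block IDs
  text: string
}

type Operation =
  | { kind: "set"; block: Block }
  | { kind: "listAfter"; parentId: string; childId: string; afterId: string | null }

type Transaction = { id: string; operations: Operation[] }

class LocalStore {
  readonly records = new Map<string, Block>()
  readonly pending: Transaction[] = [] // a real client would persist this queue

  // Apply optimistically to the local cache, then queue for the server.
  submit(transaction: Transaction): void {
    for (const op of transaction.operations) this.applyOperation(op)
    this.pending.push(transaction)
  }

  // Hand the queued batch to the network layer and clear the queue.
  drain(): Transaction[] {
    return this.pending.splice(0, this.pending.length)
  }

  private applyOperation(op: Operation): void {
    if (op.kind === "set") {
      this.records.set(op.block.id, op.block)
      return
    }
    const parent = this.records.get(op.parentId)
    if (!parent) throw new Error(`unknown parent ${op.parentId}`)
    // A null or unknown afterId inserts at the start of the child list.
    const at = op.afterId === null ? 0 : parent.content.indexOf(op.afterId) + 1
    parent.content.splice(at, 0, op.childId)
  }
}

// Depth-first walk along the content pointers.
function outline(store: LocalStore, id: string, depth = 0): string[] {
  const block = store.records.get(id)
  if (!block) return []
  const line = "  ".repeat(depth) + block.type + ": " + block.text
  return [line, ...block.content.flatMap((child) => outline(store, child, depth + 1))]
}
```

Because every edit is expressed as operations on records, the same transaction can update the local cache immediately and be replayed against the server later.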

The block model makes virtualization easy (each block has a known type with an estimated height) and gives Notion permission inheritance for free via the parent chain.

Linear — proprietary sync engine, MobX object graph, LWW

Linear does not use Yjs for its core sync engine. Instead the client maintains a normalized object graph in memory (powered by MobX), persists it to IndexedDB, and replays changes as SyncActions tagged with monotonic IDs. Conflict resolution defaults to last-write-wins, with CRDTs only for specific collaborative text fields (e.g., issue descriptions)¹⁵. WebSockets carry incremental deltas; full bootstrap happens on first load.

Figma — centralized, CRDT-inspired

Figma’s blog post is the clearest single read on this trade space. They evaluated OT, judged it overkill for design objects (where most properties are last-writer-wins-friendly), and built a centralized server that borrows ideas from CRDT literature without committing to any one implementation¹⁶. Different parts of the document use different conflict-resolution strategies — text uses a different model from objects — because the constraints differ.
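
The property-level last-writer-wins idea can be sketched in a few lines. This is an illustration of the general technique under the assumption that the server simply applies writes in arrival order; `OrderingServer` and `PropertyWrite` are hypothetical names and this is not Figma's code.

```typescript
type PropertyValue = string | number | boolean

interface PropertyWrite {
  clientId: string
  objectId: string
  property: string
  value: PropertyValue
}

class OrderingServer {
  private readonly objects = new Map<string, Map<string, PropertyValue>>()
  private sequence = 0

  // Writes are applied in arrival order; the latest write to a property wins.
  receive(write: PropertyWrite): number {
    let properties = this.objects.get(write.objectId)
    if (!properties) {
      properties = new Map()
      this.objects.set(write.objectId, properties)
    }
    properties.set(write.property, write.value)
    // A real server would broadcast the write tagged with this number.
    return ++this.sequence
  }

  read(objectId: string, property: string): PropertyValue | undefined {
    return this.objects.get(objectId)?.get(property)
  }
}

const multiplayer = new OrderingServer()
multiplayer.receive({ clientId: "alice", objectId: "rect-1", property: "fill", value: "red" })
multiplayer.receive({ clientId: "bob", objectId: "rect-1", property: "x", value: 120 })
multiplayer.receive({ clientId: "bob", objectId: "rect-1", property: "fill", value: "blue" })
```

Writes to different properties of the same object never conflict, and writes to the same property resolve by server order. That is enough for most design-object edits, and it is also why text needs a different model.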

Google Docs — OT, optimistic local apply

Google Docs uses Operational Transform, with optimistic local application so the editing user sees changes instantly while the server transforms and broadcasts the canonical operation order. The team’s preference for OT is partly historical (the technology was mature when Docs was built) and partly a function of scale — at hundreds of millions of users, the per-character metadata cost of CRDTs is non-trivial.
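
To make "the server transforms" concrete, here is the textbook transform for two concurrent inserts. It is a sketch of the general OT technique, not Google's implementation; the `Insert` shape, the `site` tiebreaker, and the function names are assumptions.

```typescript
interface Insert {
  pos: number
  text: string
  site: number // unique per client, used only to break position ties
}

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos)
}

// Rewrite `op` so it can be applied after `other` has already been applied.
function transformInsert(op: Insert, other: Insert): Insert {
  const otherGoesFirst =
    other.pos < op.pos || (other.pos === op.pos && other.site < op.site)
  // Text inserted before us shifts our position right by its length.
  return otherGoesFirst ? { ...op, pos: op.pos + other.text.length } : op
}

const start = "ac"
const opA: Insert = { pos: 1, text: "b", site: 1 }
const opB: Insert = { pos: 1, text: "X", site: 2 }

// One replica applies A first and then the transformed B; the other does the reverse.
const viaA = applyInsert(applyInsert(start, opA), transformInsert(opB, opA))
const viaB = applyInsert(applyInsert(start, opB), transformInsert(opA, opB))
console.log(viaA, viaB) // prints "abXc abXc"
```

A production system needs the full matrix of transforms (insert/delete, delete/delete, formatting) plus a server that fixes the canonical order, which is where most of the complexity in OT lives.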

Practical takeaways

For comment composers and chat input — Quill. Accept the browser inconsistencies; lean on Delta and OT.
For CMS / documentation editors with strong editorial structure — ProseMirror (or Tiptap on top of it). The schema pays for itself the first time it rejects an invalid document.
For pixel-perfect rendering, custom embeds, and a tight bundle — Lexical. Budget for IME and accessibility work.
For schemaless, plugin-first authoring on top of React — Slate. Use normalization aggressively and Editor.withoutNormalizing for batched edits.
For offline-first or peer-to-peer — layer Yjs on whichever editor framework gives you the right authoring shape (y-prosemirror, slate-yjs, @lexical/yjs).
Do not roll your own. If you are building an editor framework from scratch in 2026, you are signing up for a multi-year IME, accessibility, and selection bug backlog. Start from one of the four above.

Appendix

Prerequisites

DOM APIs: Selection, Range, MutationObserver, InputEvent, CompositionEvent.
Event handling: beforeinput, the composition event lifecycle.
React (or comparable) framework fundamentals for ProseMirror / Lexical / Slate bindings.
Working knowledge of OT / CRDT trade-offs at the distributed-systems level.

Terminology

Term	Definition
`contentEditable`	HTML attribute that makes any element host native browser text editing
IME	Input Method Editor — system component for composing characters in CJK and many other scripts
OT	Operational Transform — algorithm for transforming concurrent operations to maintain consistency
CRDT	Conflict-free Replicated Data Type — data structure with metadata that enables coordinator-free merge
Transaction	Atomic, reversible bundle of changes against editor state (ProseMirror / Lexical)
Schema	Definition of the valid node and mark types and their nesting (ProseMirror)
Mark	Inline annotation (bold, italic, link) applied to text without changing the structural tree
Node	Structural element in the document tree (paragraph, heading, list item, …)
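Step / StepMap	ProseMirror’s atomic change primitive (Step) and the position-mapping object each step produces (StepMap), used to map positions across a change and invertible for the reverse direction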
Tombstone	Marker for a deleted item in a CRDT — preserves intent but accumulates memory

Summary

Models. Linear (Quill Delta) is OT-friendly but limits structure; hierarchical (ProseMirror, Slate, Lexical) enables real document semantics.
Rendering. contentEditable gives native behavior with all its inconsistencies; controlled contentEditable (ProseMirror) trades schema enforcement against a steeper API; custom rendering (Lexical) trades total control against IME and a11y work.
Input. beforeinput is your single best interception point. Composition events are not optional.
Collaboration. OT needs a server; CRDTs scale offline and peer-to-peer at the cost of metadata. Real systems are usually hybrid.
Performance. Virtualize blocks once a document reaches tens of thousands of blocks; chunk in Slate; use content-visibility: auto.
Accessibility. Test the NVDA + JAWS + VoiceOver + TalkBack matrix; do not assume contentEditable defaults will carry you through a custom render.

References

W3C Input Events Level 2 — authoritative beforeinput reference, including the inputType enum and IME cancelability rules.
W3C EditContext API and Chrome for Developers — Introducing the EditContext API — the long-term replacement for contentEditable-only editing on Chromium.
W3C Editing — execCommand draft — the explicit “do not use this; we will not advance this spec” notice.
MDN: beforeinput event and MDN: Document.execCommand().
Cure53 — DOMPurify and WICG HTML Sanitizer API — paste sanitization references.
ProseMirror Guide — the Documents and Collab sections in particular.
Marijn Haverbeke — Collaborative Editing in ProseMirror — design rationale for the Step/StepMap rebasing model.
Lexical — Introduction and Lexical — Commands.
Slate — Normalizing and Slate — Improving Performance.
Quill — Delta format.
Yjs — Internals.
Notion — The data model behind Notion’s flexibility.
Figma — How Figma’s multiplayer technology works.
Linear — Scaling the Linear Sync Engine.
NYT Open — Building a Text Editor for a Digital-First Newsroom.
web.dev — content-visibility: the new CSS property that boosts your rendering performance.

¹ W3C Editing Working Group — execCommand draft (“This specification is incomplete… not expected to be advanced”) and MDN: Document.execCommand().
² W3C EditContext API Working Draft and Chrome for Developers — Introducing the EditContext API.
³ Slack’s WYSIWYG composer launched on Quill in 2019; the rich-text wire format is now Slack’s own rich_text block format, but the historical attribution to Quill is widely repeated by the editor community (e.g., Liveblocks’ 2025 framework comparison).
⁴ ProseMirror Guide — Documents and ProseMirror Guide — Collab.
⁵ NYT Open — Building a Text Editor for a Digital-First Newsroom (Sophia Ciocca, NYT engineering blog).
⁶ Lexical — Introduction. The page reports a 22 KB min+gzip core, lists the Meta surfaces it powers (Facebook, Workplace, Messenger, WhatsApp Web, Instagram Messenger), and documents the $-prefixed API.
⁷ ProseMirror Guide — Documents and the Rationale for marks thread by Marijn Haverbeke.
⁸ Slate — Normalizing.
⁹ Lexical — Commands and Lexical — Nodes.
¹⁰ W3C Input Events Level 2 — inputType attribute. The cancelability table explicitly lists composition events as non-cancelable.
¹¹ Cure53 — DOMPurify and DOMPurify — How sanitization works.
¹² Yjs Docs — Internals.
¹³ Slate — Improving Performance.
¹⁴ The data model behind Notion’s flexibility.
¹⁵ Scaling the Linear Sync Engine and reverse-engineering writeup (community).
¹⁶ How Figma’s multiplayer technology works.