JavaScript String Length: Graphemes, UTF-16, and Unicode

JavaScript’s string.length returns the number of UTF-16 code units, a 1995 design decision that predates Unicode’s expansion beyond 65,536 characters. This causes '👨‍👩‍👧‍👦'.length to return 11 instead of 1, breaking character counting, truncation, and cursor positioning for any text containing emoji, combining marks, or supplementary plane characters. Understanding the three abstraction layers (grapheme clusters, code points, and code units) is essential for correct Unicode handling.

Image with different non-text icons — Photo by Maria Cappelli on Unsplash

Text exists at multiple abstraction layers: grapheme clusters (what users see), code points (Unicode characters), and code units (what JavaScript counts)

Abstract

Text exists at three abstraction layers, and JavaScript operates at the wrong one for user-facing operations:

Layer	What It Represents	JavaScript API	`'👨‍👩‍👧‍👦'`
Grapheme clusters	User-perceived characters	`Intl.Segmenter`	1
Code points	Unicode abstract characters	`[...str]`, `codePointAt()`	7
Code units	UTF-16 encoding units	`string.length`, `charAt()`	11

The design constraint: JavaScript adopted Java’s UCS-2 encoding in 1995 when Unicode fit in 16 bits. When Unicode expanded (UTF-16, 1996), JavaScript preserved backward compatibility by exposing surrogate pairs as separate characters rather than breaking existing code.

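The solution: Use Intl.Segmenter (Baseline April 2024) for grapheme-aware operations. For code point operations, use spread [...str] or the u regex flag. Reserve string.length for code-unit work such as indexing and offsets into the string. It is not a byte count: UTF-8 byte length comes from the encoded form (for example Buffer.byteLength).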

Critical gotchas:

string.slice(0, n) can corrupt emoji by splitting surrogate pairs
split('') breaks supplementary plane characters into invalid halves
ES2024 added isWellFormed() and toWellFormed() to detect and fix lone surrogates

The Problem: What You See vs. What You Get

Per ECMA-262 §6.1.4, a String is “a finite ordered sequence of zero or more 16-bit unsigned integer values.” The length property returns the number of these 16-bit code units—not characters, not code points, not graphemes.

This design predates Unicode’s expansion beyond the Basic Multilingual Plane (BMP). When emoji and supplementary characters were added, they required surrogate pairs (two code units), but JavaScript couldn’t change length semantics without breaking the web.

```js
const logLengths = (...items) => console.log(items.map((item) => `${item}: ${item.length}`))

// BMP characters (U+0000 to U+FFFF): 1 code unit each
logLengths("A", "a", "À", "⇐", "⇟")
// ['A: 1', 'a: 1', 'À: 1', '⇐: 1', '⇟: 1']

// Supplementary plane emoji (U+10000+): 2 code units (surrogate pair)
logLengths("🧘", "🌦", "😂", "😃", "🥖", "🚗")
// ['🧘: 2', '🌦: 2', '😂: 2', '😃: 2', '🥖: 2', '🚗: 2']

// ZWJ sequences: multiple code points joined by U+200D
logLengths("🧘", "🧘🏻‍♂️", "👨‍👩‍👧‍👦")
// ['🧘: 2', '🧘🏻‍♂️: 7', '👨‍👩‍👧‍👦: 11']
```

The family emoji 👨‍👩‍👧‍👦 consists of 7 code points (4 emoji + 3 ZWJ characters), but 11 code units because each base emoji requires a surrogate pair.
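To make the 7-versus-11 arithmetic concrete, the sketch below lists each code point of the family emoji together with the number of UTF-16 code units it occupies. The helper name `describeCodePoints` is illustrative and not part of any API; the emoji is written with escapes so the sequence is unambiguous.

```typescript
// Hypothetical helper: one entry per code point, formatted as "U+XXXX (n)",
// where n is the number of UTF-16 code units that code point occupies.
function describeCodePoints(text: string): string[] {
  // Array.from iterates a string by code point, like [...text]
  return Array.from(text, (ch) => {
    const hex = ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0");
    return `U+${hex} (${ch.length})`;
  });
}

// man, ZWJ, woman, ZWJ, girl, ZWJ, boy
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const parts = describeCodePoints(family);

console.log(parts);
// ['U+1F468 (2)', 'U+200D (1)', 'U+1F469 (2)', 'U+200D (1)',
//  'U+1F467 (2)', 'U+200D (1)', 'U+1F466 (2)']
console.log(parts.length, family.length); // 7 11
```

Four surrogate pairs (8 code units) plus three single-unit joiners give the 11 that length reports.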

The Solution: Intl.Segmenter

Intl.Segmenter implements Unicode Standard Annex #29 (UAX #29), which defines grapheme cluster boundaries. It became Baseline in April 2024 and is defined in ECMA-402 (ECMAScript Internationalization API).

Why a separate API? Grapheme cluster rules are complex and can be tailored per locale. UAX #29 includes 13+ rules covering Hangul syllables, Indic conjuncts, emoji ZWJ sequences, and regional indicators. This logic doesn’t belong in core string methods.

```ts
function getGraphemeLength(str: string, locale = "en"): number {
  return [...new Intl.Segmenter(locale, { granularity: "grapheme" }).segment(str)].length
}

console.log("👨‍👩‍👧‍👦".length) // 11 (code units)
console.log(getGraphemeLength("👨‍👩‍👧‍👦")) // 1 (grapheme)

// Iterate over grapheme clusters
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
for (const { segment, index } of segmenter.segment("👨‍👩‍👧‍👦🌦️🧘🏻‍♂️")) {
  console.log(`'${segment}' at code unit index ${index}`)
}
// '👨‍👩‍👧‍👦' at code unit index 0
// '🌦️' at code unit index 11
// '🧘🏻‍♂️' at code unit index 14
```
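Grapheme segmentation is not only about emoji. As a small sketch, assuming a runtime that ships Intl.Segmenter, the same call also groups regional-indicator flags, conjoining Hangul jamo, and combining marks. `countGraphemes` is an illustrative name, and the inputs are written with escapes to make the underlying code points visible.

```typescript
// Illustrative helper: number of user-perceived characters in a string.
function countGraphemes(text: string): number {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}

// Two flags: regional indicators J+P followed by U+S (4 code points, 8 code units)
const flags = "\u{1F1EF}\u{1F1F5}\u{1F1FA}\u{1F1F8}";
// One Hangul syllable spelled as leading consonant + vowel + trailing consonant
const hangul = "\u1112\u1161\u11AB";
// Base letter followed by a combining acute accent
const accented = "e\u0301";

console.log(flags.length, countGraphemes(flags)); // 8 2
console.log(hangul.length, countGraphemes(hangul)); // 3 1
console.log(accented.length, countGraphemes(accented)); // 2 1
```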

The granularity option supports "grapheme" (default), "word", and "sentence". Word segmentation is particularly valuable for languages like Thai, Japanese, and Chinese that don’t use whitespace between words.
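A minimal sketch of word granularity, under the same assumption that Intl.Segmenter is available. `splitWords` is a hypothetical helper; it relies on the `isWordLike` flag that word segments carry to drop whitespace and punctuation. The Japanese call is shown without an expected result because the exact split depends on the dictionary data bundled with the runtime.

```typescript
// Hypothetical helper: return only the word-like segments of a string.
function splitWords(text: string, locale: string): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text))
    .filter((part) => part.isWordLike) // false for spaces and punctuation
    .map((part) => part.segment);
}

const englishWords = splitWords("Hello, world! It works.", "en");
console.log(englishWords); // ['Hello', 'world', 'It', 'works']

// No whitespace to split on; boundaries come from the runtime's dictionary data
console.log(splitWords("今日は良い天気です", "ja"));
```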

Why UTF-16? The Historical Constraints

The UCS-2 Era (1991-1996)

Unicode 1.0 (1991) assumed 65,536 characters would suffice for all human writing systems. UCS-2 (Universal Character Set, 2-byte) encoded each code point directly as a 16-bit integer. Java adopted UCS-2 in 1995, and JavaScript—designed to “look like Java”—followed suit.

The critical assumption: 16 bits = all of Unicode. This held for 5 years.

Unicode Expands (1996)

Unicode 2.0 (1996) introduced the surrogate mechanism, opening 16 supplementary planes that would later hold historic scripts, mathematical symbols, and eventually emoji. The code space expanded from 65,536 to 1,114,112 possible code points (17 planes × 65,536 each).

UTF-16 extended UCS-2 with surrogate pairs: code points above U+FFFF are encoded as two 16-bit code units. The BMP range D800-DFFF was reserved for surrogates, ensuring backward compatibility—existing UCS-2 text remained valid UTF-16.

JavaScript’s Compatibility Constraint

JavaScript couldn’t break existing code by changing length semantics. The solution: expose UTF-16 code units directly. Per Mathias Bynens’ analysis, JavaScript engines use “UCS-2 with surrogates”—they allow lone surrogates that strict UTF-16 would reject.

```js
// Lone surrogate (invalid UTF-16, valid JavaScript string)
const invalid = "\uD800" // High surrogate without low surrogate
invalid.length // 1
invalid.isWellFormed() // false (ES2024)
```

Design trade-off: Backward compatibility over correctness. Existing string manipulation code continued to work, but character counting became unreliable for supplementary characters.

Unicode Architecture: Planes and Surrogate Pairs

The 17 Unicode Planes

Unicode allocates 1,114,112 code points across 17 planes. Only planes 0-3 and 14-16 are currently assigned:

Plane	Range	Name	JavaScript Implications
0	U+0000–U+FFFF	Basic Multilingual Plane (BMP)	1 code unit per character
1	U+10000–U+1FFFF	Supplementary Multilingual Plane (SMP)	2 code units; contains most emoji
2-3	U+20000–U+3FFFF	Ideographic Extension Planes	2 code units; rare CJK characters
14	U+E0000–U+EFFFF	Special-purpose Plane	Variation selectors, language tags
15-16	U+F0000–U+10FFFF	Private Use Areas	Application-defined; 2 code units
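Because every plane spans exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. The helpers below are illustrative (the names `planeOf` and `codeUnitsFor` are not from any library) and reproduce the "JavaScript Implications" column of the table.

```typescript
// Illustrative helper: which of the 17 planes a code point belongs to.
function planeOf(codePoint: number): number {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("Not a Unicode code point");
  }
  return codePoint >> 16; // each plane covers 0x10000 (65,536) code points
}

// Plane 0 fits in one UTF-16 code unit; every other plane needs a surrogate pair.
function codeUnitsFor(codePoint: number): 1 | 2 {
  return planeOf(codePoint) === 0 ? 1 : 2;
}

console.log(planeOf(0x41), codeUnitsFor(0x41)); // 0 1
console.log(planeOf(0x1f4a9), codeUnitsFor(0x1f4a9)); // 1 2
console.log(planeOf(0x20000)); // 2
console.log(planeOf(0x10ffff)); // 16
console.log(17 * 0x10000); // 1114112
```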

Surrogate Pair Encoding

Supplementary characters (U+10000+) are encoded using a mathematical transformation that maps them to surrogate pairs:

```ts
// Encoding algorithm for code point → surrogate pair
function toSurrogatePair(codePoint: number): [number, number] {
  const temp = codePoint - 0x10000
  const high = Math.floor(temp / 0x400) + 0xd800 // D800-DBFF
  const low = (temp % 0x400) + 0xdc00 // DC00-DFFF
  return [high, low]
}

// U+1F4A9 (💩) → [0xD83D, 0xDCA9]
toSurrogatePair(0x1f4a9) // [55357, 56489]

// Verification
"💩".charCodeAt(0) // 55357 (0xD83D) - high surrogate
"💩".charCodeAt(1) // 56489 (0xDCA9) - low surrogate
"💩".codePointAt(0) // 128169 (0x1F4A9) - full code point
```

The ranges D800-DBFF (high surrogates) and DC00-DFFF (low surrogates) are permanently reserved in the BMP—they never represent characters directly.
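Decoding is the same arithmetic run backwards. A minimal sketch, with `fromSurrogatePair` as an illustrative name: it takes the 10 bits carried by each surrogate, recombines them, and adds back the 0x10000 offset.

```typescript
// Illustrative inverse of the encoding step: surrogate pair → code point.
function fromSurrogatePair(high: number, low: number): number {
  if (high < 0xd800 || high > 0xdbff || low < 0xdc00 || low > 0xdfff) {
    throw new RangeError("Expected a high surrogate followed by a low surrogate");
  }
  // high carries the top 10 bits, low the bottom 10 bits
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

const decoded = fromSurrogatePair(0xd83d, 0xdca9);
console.log(decoded.toString(16)); // "1f4a9"
console.log(String.fromCodePoint(decoded) === "\uD83D\uDCA9"); // true
```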

Unsafe String Operations

JavaScript’s pre-ES6 string methods operate on code units, creating several failure modes:

Corruption via Slicing

```js
const text = "Hello 💩 World"

// Dangerous: arbitrary index may split surrogate pair
text.substring(0, 7) // "Hello �" - corrupted high surrogate

// Safe: slice at code point boundaries
const chars = [...text]
chars.slice(0, 7).join("") // "Hello 💩"
```
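When a limit really is expressed in code units (a fixed-width UTF-16 field, for example), spreading the whole string is not required. One possible approach, sketched here with the hypothetical helper `sliceWithoutSplittingPairs`, is to check whether the cut would land between a high and a low surrogate and back off by one unit if so. This only protects surrogate pairs; it does not keep grapheme clusters intact.

```typescript
// Hypothetical helper: like text.slice(0, end), but never ends on a lone high surrogate.
function sliceWithoutSplittingPairs(text: string, end: number): string {
  let cut = Math.max(0, Math.min(end, text.length));
  if (cut > 0 && cut < text.length) {
    const before = text.charCodeAt(cut - 1);
    const after = text.charCodeAt(cut);
    const isHigh = before >= 0xd800 && before <= 0xdbff;
    const isLow = after >= 0xdc00 && after <= 0xdfff;
    if (isHigh && isLow) cut -= 1; // the cut would fall inside a pair: drop the whole pair
  }
  return text.slice(0, cut);
}

const sample = "Hello 💩 World";
console.log(sliceWithoutSplittingPairs(sample, 7)); // "Hello " (pair dropped, nothing corrupted)
console.log(sliceWithoutSplittingPairs(sample, 8)); // "Hello 💩"
```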

Invalid Split Results

```js
// Dangerous: splits surrogate pairs into invalid characters
const units = "💩".split("") // ['\uD83D', '\uDCA9'] - two invalid strings

// Safe: spread preserves code points
const points = [..."💩"] // ['💩']
```

charAt and charCodeAt Limitations

```js
const emoji = "💩"

emoji.length // 2 (code units, not characters)
emoji.charAt(0) // '\uD83D' (lone high surrogate - invalid)
emoji.charAt(1) // '\uDCA9' (lone low surrogate - invalid)
emoji.charCodeAt(0) // 55357 (high surrogate value)
emoji.at(0) // '\uD83D' (ES2022, but still code unit based)

// Safe alternatives
emoji.codePointAt(0) // 128169 (full code point)
const first = [...emoji][0] // '💩' (code point)
```

ES2024: Well-Formed String Detection

ES2024 added isWellFormed() and toWellFormed() to detect and fix lone surrogates—a common source of bugs when interfacing with APIs that require valid UTF-16:

```js
const wellFormed = "Hello 💩"
const illFormed = "Hello \uD800" // lone high surrogate

wellFormed.isWellFormed() // true
illFormed.isWellFormed() // false

// Fix by replacing lone surrogates with U+FFFD (replacement character)
illFormed.toWellFormed() // "Hello �"

// Use case: encodeURI throws on ill-formed strings
encodeURI(illFormed) // URIError
encodeURI(illFormed.toWellFormed()) // "Hello%20%EF%BF%BD"
```
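On runtimes that predate ES2024, the same check can be approximated by walking the code units. The sketch below is not the specification algorithm, just one straightforward way to locate the first lone surrogate; `indexOfLoneSurrogate` is an illustrative name. Unlike isWellFormed(), it also reports where the problem is.

```typescript
// Illustrative helper: code unit index of the first lone surrogate, or -1 if well-formed.
function indexOfLoneSurrogate(text: string): number {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: valid only when immediately followed by a low surrogate
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return i;
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) return i; // low surrogate with no preceding high
  }
  return -1;
}

console.log(indexOfLoneSurrogate("Hello 💩")); // -1
console.log(indexOfLoneSurrogate("Hello \uD800")); // 6
console.log(indexOfLoneSurrogate("\uDCA9\uD83D")); // 0 (pair in the wrong order)
```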

Safe Unicode Operations

Code Point Iteration (ES6+)

The for...of loop and spread operator iterate over code points, not code units:

```js
const text = "A💩Z"

// Code unit iteration (broken)
for (let i = 0; i < text.length; i++) {
  console.log(text[i]) // 'A', '\uD83D', '\uDCA9', 'Z' - 4 items
}

// Code point iteration (correct for supplementary characters)
for (const char of text) {
  console.log(char) // 'A', '💩', 'Z' - 3 items
}

// Spread also iterates code points
const codePoints = [...text] // ['A', '💩', 'Z']
codePoints.length // 3 (code points)
text.length // 4 (code units)
```

Limitation: Code point iteration still doesn’t handle grapheme clusters. [..."👨‍👩‍👧‍👦"] returns 7 elements (4 emoji + 3 ZWJ), not 1.
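Putting the three layers side by side makes the gap easy to see. This is a sketch assuming Intl.Segmenter is available; the `TextMeasure` shape and the `measure` function are illustrative names.

```typescript
interface TextMeasure {
  codeUnits: number; // what string.length reports
  codePoints: number; // what [...str].length reports
  graphemes: number; // what a grapheme segmenter reports
}

// Illustrative helper: measure a string at all three abstraction layers.
function measure(text: string, locale = "en"): TextMeasure {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return {
    codeUnits: text.length,
    codePoints: Array.from(text).length,
    graphemes: Array.from(segmenter.segment(text)).length,
  };
}

const familyEmoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(measure(familyEmoji)); // { codeUnits: 11, codePoints: 7, graphemes: 1 }
console.log(measure("A\u{1F4A9}Z")); // { codeUnits: 4, codePoints: 3, graphemes: 3 }
```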

Unicode-Aware Regular Expressions

The u flag (ES6) enables code point matching and Unicode property escapes:

```js
// Without u flag: . matches one code unit
console.log(/^.$/.test("💩")) // false (💩 is 2 code units)

// With u flag: . matches one code point
console.log(/^.$/u.test("💩")) // true

// Unicode property escapes (ES2018)
console.log(/\p{Emoji}/u.test("💩")) // true
console.log(/\p{Script=Latin}/u.test("A")) // true
console.log(/\p{Script=Cyrillic}/u.test("а")) // true (Cyrillic 'а', not Latin 'a')

// Properties of strings (ES2024 v flag): RGI_Emoji matches a whole emoji sequence
console.log(/^\p{RGI_Emoji}$/v.test("👨‍👩‍👧‍👦")) // true - matches full ZWJ sequence
```

String Normalization

Unicode allows multiple representations of the same visual character. Normalize before comparison:

```js
const e1 = "é" // U+00E9 (precomposed: single code point)
const e2 = "e\u0301" // U+0065 + U+0301 (decomposed: base + combining acute)

e1 === e2 // false (different code points)
e1.normalize() === e2.normalize() // true (NFC: canonical composition)
e1.length // 1
e2.length // 2

// Normalization forms
"é".normalize("NFC") // Canonical Composition (default)
"é".normalize("NFD") // Canonical Decomposition
"é".normalize("NFKC") // Compatibility Composition
"é".normalize("NFKD") // Compatibility Decomposition
```

When to normalize: Always normalize user input before storage or comparison. Use NFC (default) for general text; NFKC for search/matching where compatibility equivalents should match.
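To illustrate the NFC versus NFKC distinction, here is a minimal sketch of a search key. `searchKey` is a hypothetical helper, and combining NFKC with lower-casing is an assumption about what "matching" should mean in a given application. Because NFKC discards distinctions such as ligatures and fullwidth forms, one reasonable arrangement is to store the NFC form and use the NFKC key only for lookup.

```typescript
// Hypothetical helper: fold compatibility variants and case for matching purposes.
function searchKey(text: string): string {
  return text.normalize("NFKC").toLowerCase();
}

const ligature = "\uFB01le"; // "ﬁle", starting with U+FB01 LATIN SMALL LIGATURE FI
const fullwidth = "\uFF26\uFF49\uFF4C\uFF45"; // "Ｆｉｌｅ" in fullwidth letters

console.log(searchKey(ligature)); // "file"
console.log(searchKey(fullwidth)); // "file"
console.log(searchKey(ligature) === searchKey("File")); // true

// NFC leaves compatibility characters alone, so it is safe for storage
console.log(ligature.normalize("NFC") === ligature); // true
```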

Full-Stack Unicode Considerations

Database Storage: The MySQL utf8 Trap

MySQL’s utf8 charset is actually utf8mb3—a 3-byte encoding that cannot store supplementary plane characters:

```sql
-- Dangerous: silently truncates emoji or throws errors
CREATE TABLE users (name VARCHAR(255) CHARACTER SET utf8);
INSERT INTO users VALUES ('Hello 💩'); -- Error or truncation

-- Safe: full UTF-8 support (4 bytes per character max)
CREATE TABLE users (name VARCHAR(255) CHARACTER SET utf8mb4);
```

As of MySQL 8.0, utf8 is deprecated and utf8mb4 is the default. PostgreSQL’s UTF8 encoding has always supported the full Unicode range.
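On the application side, the characters a 3-byte charset cannot hold are exactly those outside the BMP, since any code point above U+FFFF takes 4 bytes in UTF-8. A rough pre-flight check might look like the sketch below; `needsFourByteUtf8` is an illustrative name, and when to call it (before writing to a legacy column, for instance) is left as an assumption.

```typescript
// Illustrative helper: true if any character needs 4 bytes in UTF-8,
// i.e. the text cannot be stored in a 3-byte-per-character charset.
function needsFourByteUtf8(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    // codePointAt reads a full surrogate pair when i points at a high surrogate
    if (text.codePointAt(i)! > 0xffff) return true;
  }
  return false;
}

console.log(needsFourByteUtf8("Hello 💩")); // true
console.log(needsFourByteUtf8("Héllo ⇐ 日本語")); // false (all BMP, at most 3 bytes each)
```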

API Design

```ts
// Always specify encoding in Content-Type
res.setHeader("Content-Type", "application/json; charset=utf-8")

// Normalize input at system boundary
// Note: \p{Cc} is control characters only (tabs and newlines included).
// The broader \p{C} would also strip format characters such as ZWJ (U+200D)
// and break emoji sequences.
const sanitized = userInput
  .normalize("NFC") // Canonical composition
  .replace(/\p{Cc}/gu, "") // Remove control characters

// Validate well-formedness before external APIs
if (!sanitized.isWellFormed()) {
  throw new Error("Invalid Unicode input")
}
```

Character Limits in APIs

When enforcing length limits, decide which layer you’re measuring:

```ts
const MAX_GRAPHEMES = 280 // Twitter-style limit

function validateLength(text: string): boolean {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
  const graphemeCount = [...segmenter.segment(text)].length
  return graphemeCount <= MAX_GRAPHEMES
}

// "👨‍👩‍👧‍👦".repeat(280) passes - 280 graphemes
// But it's 11 * 280 = 3080 code units for storage
```

Edge Cases and Failure Modes

Cursor Positioning

Text editors must position cursors between grapheme clusters, not code units or code points:

```ts
// User sees: "👨‍👩‍👧‍👦" (1 character)
// Code units: 11 positions
// Code points: 7 positions
// Valid cursor positions: 2 (before and after)

function getCursorPositions(text: string): number[] {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
  const positions = [0]
  for (const { index, segment } of segmenter.segment(text)) {
    positions.push(index + segment.length)
  }
  return positions
}

getCursorPositions("👨‍👩‍👧‍👦") // [0, 11] - only 2 valid positions
```
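An editor rarely needs every position at once; it usually needs "one step right" or "one step left" from the current offset. A possible sketch uses the `containing()` method of the segments object, which returns the segment covering a given code unit index. The function names are illustrative, and the test string is written with escapes.

```typescript
// Illustrative helpers: move a caret by one grapheme cluster.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

function nextCursorPosition(text: string, offset: number): number {
  if (offset >= text.length) return text.length;
  // containing() finds the cluster that covers this code unit, even mid-cluster
  const current = graphemeSegmenter.segment(text).containing(Math.max(0, offset));
  return current ? current.index + current.segment.length : text.length;
}

function previousCursorPosition(text: string, offset: number): number {
  if (offset <= 0) return 0;
  const previous = graphemeSegmenter.segment(text).containing(Math.min(offset, text.length) - 1);
  return previous ? previous.index : 0;
}

// "a" + family emoji + "b": 1 + 11 + 1 code units
const line = "a\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}b";

console.log(nextCursorPosition(line, 1)); // 12 (jumps over the whole family emoji)
console.log(nextCursorPosition(line, 5)); // 12 (an offset inside the cluster snaps to its end)
console.log(previousCursorPosition(line, 12)); // 1
```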

Visual Spoofing (Homograph Attacks)

Different scripts contain visually identical characters:

```ts
const cyrillicA = "а" // U+0430 (Cyrillic Small Letter A)
const latinA = "a" // U+0061 (Latin Small Letter A)

cyrillicA === latinA // false
cyrillicA.normalize() === latinA.normalize() // false (different scripts)

// Detecting mixed scripts
function detectMixedScripts(text: string): boolean {
  const scripts = new Set<string>()
  for (const char of text) {
    // Check script property (simplified)
    if (/\p{Script=Latin}/u.test(char)) scripts.add("Latin")
    if (/\p{Script=Cyrillic}/u.test(char)) scripts.add("Cyrillic")
    // Add more scripts as needed
  }
  return scripts.size > 1
}

detectMixedScripts("pаypal.com") // true - mixed Latin and Cyrillic
```

Buffer Size Calculations

UTF-8 byte length differs from code unit count:

```ts
// Code units vs UTF-8 bytes
const text = "Hello 💩"
text.length // 8 code units
Buffer.byteLength(text, "utf8") // 10 bytes (💩 = 4 bytes in UTF-8)

// Safe buffer allocation
function safeBuffer(text: string): Buffer {
  return Buffer.from(text, "utf8") // Auto-sizes correctly
}

// Dangerous: assumes code units = bytes
const buf = Buffer.alloc(text.length) // Only 8 bytes
buf.write(text) // Truncation risk
```
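The 10-byte figure follows directly from UTF-8's size classes: 1 byte below U+0080, 2 below U+0800, 3 for the rest of the BMP, and 4 for supplementary code points. The sketch below computes the length by hand to show where the number comes from; `utf8ByteLength` is an illustrative name. In practice Buffer.byteLength in Node, or the standard TextEncoder elsewhere, is the simpler choice.

```typescript
// Illustrative helper: UTF-8 byte length derived from code point ranges.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const cp = text.codePointAt(i)!;
    if (cp < 0x80) bytes += 1;
    else if (cp < 0x800) bytes += 2;
    else if (cp < 0x10000) bytes += 3; // a lone surrogate lands here, matching the 3-byte U+FFFD encoders substitute
    else {
      bytes += 4;
      i++; // skip the low surrogate of the pair just counted
    }
  }
  return bytes;
}

console.log(utf8ByteLength("Hello 💩")); // 10 (6 ASCII bytes + 4 for the emoji)
console.log(utf8ByteLength("é")); // 2
console.log(utf8ByteLength("€")); // 3
```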

Combining Characters and Length

Combining marks attach to preceding base characters:

```js
const composed = "é" // U+00E9 (1 code point)
const decomposed = "e\u0301" // U+0065 + U+0301 (2 code points)

composed.length // 1
decomposed.length // 2

// Both render identically, but different lengths
// Always normalize before length checks
```

Practical Patterns

Safe String Truncation

```ts
function truncateGraphemes(text: string, maxGraphemes: number, ellipsis = "…"): string {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
  const segments = [...segmenter.segment(text)]

  if (segments.length <= maxGraphemes) return text

  // Reserve one grapheme for the ellipsis; never pass a negative end to slice
  return (
    segments
      .slice(0, Math.max(0, maxGraphemes - 1))
      .map((s) => s.segment)
      .join("") + ellipsis
  )
}

truncateGraphemes("Hello 👨‍👩‍👧‍👦 World", 8) // "Hello 👨‍👩‍👧‍👦…"
// Code unit truncation breaks the cluster:
// "Hello 👨‍👩‍👧‍👦 World".slice(0, 8) === "Hello 👨" (the family reduced to one person)
// "Hello 👨‍👩‍👧‍👦 World".slice(0, 7) ends in a lone high surrogate: "Hello �"
```

Username Validation

```ts
function validateUsername(username: string): { valid: boolean; error?: string } {
  const normalized = username.normalize("NFC")

  // Check well-formedness
  if (!normalized.isWellFormed()) {
    return { valid: false, error: "Invalid Unicode" }
  }

  // Count graphemes
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
  const graphemeCount = [...segmenter.segment(normalized)].length

  if (graphemeCount < 3 || graphemeCount > 20) {
    return { valid: false, error: "Username must be 3-20 characters" }
  }

  // Detect mixed scripts (homograph defense)
  const hasLatin = /\p{Script=Latin}/u.test(normalized)
  const hasCyrillic = /\p{Script=Cyrillic}/u.test(normalized)
  if (hasLatin && hasCyrillic) {
    return { valid: false, error: "Mixed scripts not allowed" }
  }

  return { valid: true }
}
```

Normalize-on-Boundary Pattern

Normalize text at system boundaries (input/output), process internally with consistent encoding:

```ts
class TextProcessor {
  // Normalize and validate on input
  static ingest(raw: string): string {
    const normalized = raw.normalize("NFC")
    if (!normalized.isWellFormed()) {
      throw new Error("Ill-formed Unicode input")
    }
    return normalized
  }

  // Internal processing on normalized text
  static process(text: string): string {
    // Safe to use code point iteration here
    return [...text].map((c) => c.toUpperCase()).join("")
  }

  // Ensure well-formed output
  static emit(text: string): string {
    return text.toWellFormed() // Replace any lone surrogates
  }
}
```

Conclusion

JavaScript’s string.length returning UTF-16 code units is a 1995 design decision that predates Unicode’s expansion. The language preserved backward compatibility at the cost of intuitive character counting.

The fix isn’t to “repair” length—it’s to use the right abstraction layer for each task:

User-facing operations (counting, truncation, cursor positioning): Intl.Segmenter
Protocol/encoding work (surrogate handling, regex matching): Code point APIs ([...str], /u flag)
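Code-unit-level operations (string indexing, offsets, UTF-16 interop): Code units (length, charCodeAt). Byte sizes for buffers and wire formats come from the encoded form (for example Buffer.byteLength), not from length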

ES2024’s isWellFormed() and toWellFormed() close the gap for interoperability with systems that require valid UTF-16. Combined with Intl.Segmenter, JavaScript now has a complete Unicode story—you just need to know which tool to reach for.

Appendix

Prerequisites

Basic JavaScript string operations
Understanding of what Unicode is (character sets vs encodings)

Terminology

BMP (Basic Multilingual Plane): Unicode code points U+0000 to U+FFFF; require 1 UTF-16 code unit
Code point: A Unicode character’s numerical value (U+0000 to U+10FFFF)
Code unit: A fixed-size encoding unit (16 bits for UTF-16, 8 bits for UTF-8)
Grapheme cluster: A user-perceived character, potentially composed of multiple code points (defined by UAX #29)
Surrogate pair: Two UTF-16 code units encoding a supplementary plane character (U+10000+)
ZWJ (Zero-Width Joiner): U+200D; combines emoji into sequences like 👨‍👩‍👧‍👦

Summary

string.length returns UTF-16 code units, not characters or graphemes
Supplementary characters (emoji, rare CJK) require 2 code units (surrogate pairs)
ZWJ sequences combine multiple code points into single visual characters
Use Intl.Segmenter for grapheme-aware operations (Baseline April 2024)
Use [...str] or /u flag for code point-level operations
ES2024 added isWellFormed() and toWellFormed() for surrogate validation
Always normalize user input with normalize("NFC") before storage/comparison

References

ECMA-262 §6.1.4: The String Type - Specification defining strings as 16-bit code unit sequences
ECMA-402: Intl.Segmenter - Internationalization API specification for text segmentation
UAX #29: Unicode Text Segmentation - Unicode Standard Annex defining grapheme cluster boundaries
JavaScript’s internal character encoding: UCS-2 or UTF-16? - Mathias Bynens - Historical analysis of JavaScript’s encoding choice
String.prototype.isWellFormed - MDN - ES2024 well-formed string detection
Intl.Segmenter - MDN - API documentation and browser support
Intl.Segmenter is now part of Baseline - web.dev - Browser support announcement (April 2024)
MySQL 8.0: When to use utf8mb3 over utf8mb4 - Database charset guidance
