7 min read
Part of Series: JavaScript Deep Dive

JavaScript String Length and Unicode

Understand why '👨‍👩‍👧‍👦'.length returns 11 instead of 1, and learn how to properly handle Unicode characters, grapheme clusters, and international text in JavaScript applications.

Cover photo by Maria Cappelli on Unsplash

JavaScript's string.length property counts UTF-16 code units, not user-perceived characters. Modern Unicode text, especially emoji and combining characters, often requires multiple code units per visual character. Use Intl.Segmenter for grapheme-aware operations.

console.log("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦".length) // 11 - UTF-16 code units
console.log(getGraphemeLength("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦")) // 1 - User-perceived characters

The JavaScript string .length property operates at the lowest level of text abstraction: UTF-16 code units. What developers perceive as a single character is often a complex composition of multiple code units.

const logLengths = (...items) => console.log(items.map((item) => `${item}: ${item.length}`))
// Basic characters work as expected
logLengths("A", "a", "Γ€", "⇐", "β‡Ÿ")
// ['A: 1', 'a: 1', 'Γ€: 1', '⇐: 1', 'β‡Ÿ: 1']
// Emoji require multiple code units
logLengths("🧘", "🌦", "πŸ˜‚", "πŸ˜ƒ", "πŸ₯–", "πŸš—")
// ['🧘: 2', '🌦: 2', 'πŸ˜‚: 2', 'πŸ˜ƒ: 2', 'πŸ₯–: 2', 'πŸš—: 2']
// Complex emoji sequences are even longer
logLengths("🧘", "πŸ§˜πŸ»β€β™‚οΈ", "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦")
// ['🧘: 2', 'πŸ§˜πŸ»β€β™‚οΈ: 7', 'πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦: 11']

The Intl.Segmenter API provides the correct abstraction for user-perceived characters (grapheme clusters):

function getGraphemeLength(str, locale = "en") {
  return [...new Intl.Segmenter(locale, { granularity: "grapheme" }).segment(str)].length
}
console.log("👨‍👩‍👧‍👦".length) // 11
console.log(getGraphemeLength("👨‍👩‍👧‍👦")) // 1
// Iterate over grapheme clusters
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
for (const grapheme of segmenter.segment("👨‍👩‍👧‍👦🌦️🧘🏻‍♂️")) {
  console.log(`'${grapheme.segment}' at index ${grapheme.index}`)
}
// '👨‍👩‍👧‍👦' at index 0
// '🌦️' at index 11
// '🧘🏻‍♂️' at index 14

ASCII emerged from the economic constraints of 1960s computing. Teleprinters were expensive, and data transmission costs were significant. The 7-bit design (128 characters) was a deliberate trade-off:

  • 95 printable characters: English letters, digits, punctuation
  • 33 control characters: Device instructions (carriage return, line feed)
  • Economic constraint: 8-bit would double transmission costs

// ASCII characters (U+0000 to U+007F) are single UTF-16 code units
"A".charCodeAt(0) // 65 (U+0041)
"a".charCodeAt(0) // 97 (U+0061)

ASCII's 128-character limit proved inadequate for global use. This led to hundreds of incompatible 8-bit "Extended ASCII" encodings:

  • IBM Code Pages: CP437 (North America), CP850 (Western Europe)
  • ISO 8859 series: ISO-8859-1 (Latin-1), ISO-8859-5 (Cyrillic)
  • Vendor-specific: Windows-1252, Mac OS Roman

The result was mojibake: garbled text when documents crossed encoding boundaries.
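
A minimal sketch of how this happens, assuming a Node.js environment: decoding UTF-8 bytes with a Latin-1 decoder reproduces the garbling.

// UTF-8 bytes for 'café': 63 61 66 c3 a9
const bytes = Buffer.from("café", "utf8")
console.log(bytes.toString("latin1")) // 'cafÃ©' - mojibake
console.log(bytes.toString("utf8")) // 'café' - decoded with the matching charset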

Unicode introduced a fundamental separation between abstract characters and their byte representations:

  • Character Set: Abstract code points (U+0000 to U+10FFFF)
  • Encoding: Concrete byte representations (UTF-8, UTF-16, UTF-32)

// Unicode code points vs. encoding
"€".codePointAt(0) // 8364 (U+20AC)
"💩".codePointAt(0) // 128169 (U+1F4A9)

Unicode organizes its 1,114,112 code points into 17 planes:

  • Plane 0 (U+0000–U+FFFF), Basic Multilingual Plane (BMP): most modern scripts (Latin, Cyrillic, Greek, Arabic, CJK), symbols, punctuation
  • Plane 1 (U+10000–U+1FFFF), Supplementary Multilingual Plane (SMP): historic scripts (Linear B, Egyptian Hieroglyphs), musical notation, mathematical symbols, and most emoji
  • Plane 2 (U+20000–U+2FFFF), Supplementary Ideographic Plane (SIP): additional, less common, and historic CJK Unified Ideographs
  • Plane 3 (U+30000–U+3FFFF), Tertiary Ideographic Plane (TIP): additional historic CJK Unified Ideographs, Oracle Bone script
  • Planes 4–13 (U+40000–U+DFFFF), unassigned: reserved for future use
  • Plane 14 (U+E0000–U+EFFFF), Supplementary Special-purpose Plane (SSP): non-graphical characters, such as language tags and variation selectors
  • Planes 15–16 (U+F0000–U+10FFFF), Supplementary Private Use Areas (SPUA-A/B): available for private use by applications and vendors; not standardized
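
Because each plane spans 0x10000 code points, a character's plane can be derived by integer division. planeOf below is a hypothetical helper for illustration, not a standard API:

const planeOf = (char) => Math.floor(char.codePointAt(0) / 0x10000)
planeOf("A") // 0 - Basic Multilingual Plane
planeOf("💩") // 1 - Supplementary Multilingual Plane
planeOf("𠀀") // 2 - Supplementary Ideographic Plane (U+20000)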

Each encoding uses fixed-size code units:

  • UTF-8: 8-bit code units (1-4 bytes per character)
  • UTF-16: 16-bit code units (1-2 code units per character)
  • UTF-32: 32-bit code units (1 code unit per character)

// UTF-16 encoding examples
"€".length // 1 (BMP character)
"💩".length // 2 (supplementary plane - surrogate pair)
"👨‍👩‍👧‍👦".length // 11 (complex grapheme cluster)

JavaScript's string representation is a historical artifact from the UCS-2 era (1995). When Unicode expanded beyond 16 bits, JavaScript maintained backward compatibility by adopting UTF-16's surrogate pair mechanism.

Supplementary plane characters (U+10000 to U+10FFFF) are encoded using surrogate pairs:

// Surrogate pair encoding for U+1F4A9 (💩)
const highSurrogate = 0xd83d // U+D800 to U+DBFF
const lowSurrogate = 0xdca9 // U+DC00 to U+DFFF
// Mathematical transformation
const codePoint = 0x1f4a9
const temp = codePoint - 0x10000
const high = Math.floor(temp / 0x400) + 0xd800
const low = (temp % 0x400) + 0xdc00
console.log(high.toString(16), low.toString(16)) // 'd83d', 'dca9'
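
In practice this arithmetic is rarely written by hand; the standard built-ins perform the same conversion:

// Built-in conversions between code points and surrogate pairs
String.fromCodePoint(0x1f4a9) // '💩'
"💩".codePointAt(0).toString(16) // '1f4a9'
"\u{1F4A9}" === "\uD83D\uDCA9" // true - code point escape vs. explicit surrogates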

JavaScript's core string methods operate on code units, not code points:

const emoji = "πŸ’©"
// Unsafe operations
emoji.length // 2 (code units)
emoji.charAt(0) // '\uD83D' (incomplete surrogate)
emoji.charCodeAt(0) // 55357 (high surrogate only)
// Safe operations
emoji.codePointAt(0) // 128169 (full code point)
[...emoji].length // 1 (code points)

ES6+ provides code point-aware iteration:

const text = "AπŸ’©Z"
// Unsafe: iterates over code units
for (let i = 0; i < text.length; i++) {
console.log(text[i]) // 'A', '\uD83D', '\uDCA9', 'Z'
}
// Safe: iterates over code points
for (const char of text) {
console.log(char) // 'A', 'πŸ’©', 'Z'
}
// Spread operator also works
console.log([...text]) // ['A', 'πŸ’©', 'Z']

For user-perceived characters, use Intl.Segmenter:

const family = "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦"
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
// Count grapheme clusters
console.log([...segmenter.segment(family)].length) // 1
// Iterate over grapheme clusters
for (const grapheme of segmenter.segment(family)) {
console.log(grapheme.segment) // 'πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦'
}
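
One practical use is truncating text without tearing a cluster apart; truncateGraphemes below is an illustrative helper built on the same API, not part of the standard library:

function truncateGraphemes(str, maxGraphemes, locale = "en") {
  const seg = new Intl.Segmenter(locale, { granularity: "grapheme" })
  const graphemes = [...seg.segment(str)]
  return graphemes.slice(0, maxGraphemes).map((g) => g.segment).join("")
}
truncateGraphemes("👨‍👩‍👧‍👦🌦️🧘🏻‍♂️", 2) // '👨‍👩‍👧‍👦🌦️'
truncateGraphemes("👨‍👩‍👧‍👦🌦️🧘🏻‍♂️", 2).length // 14 - still many code units, but two visible characters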

The u flag enables Unicode-aware regex:

// Without u flag: matches code units
/^.$/.test("πŸ’©") // false (2 code units)
// With u flag: matches code points
/^.$/u.test("πŸ’©") // true (1 code point)
// Unicode property escapes
/\p{Emoji}/u.test("πŸ’©") // true
/\p{Script=Latin}/u.test("A") // true

Handle different representations of the same character:

const e1 = "Γ©" // U+00E9 (precomposed)
const e2 = "e\u0301" // U+0065 + U+0301 (decomposed)
console.log(e1 === e2) // false
console.log(e1.normalize() === e2.normalize()) // true
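
normalize() defaults to NFC; for completeness, a brief look at the other normalization forms:

"é".normalize("NFD").length // 2 - canonical decomposition
"é".normalize("NFC").length // 1 - canonical composition
"①".normalize("NFC") // '①' - unchanged
"①".normalize("NFKC") // '1' - compatibility mapping folds presentation variants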

MySQL's legacy utf8 charset stores at most 3 bytes per character, which excludes supplementary plane characters such as most emoji:

-- Legacy (incomplete UTF-8)
CREATE TABLE users (name VARCHAR(255) CHARACTER SET utf8);
-- Modern (complete UTF-8)
CREATE TABLE users (name VARCHAR(255) CHARACTER SET utf8mb4);

At the API layer, three practices keep Unicode handling predictable:

  1. Explicit Encoding: Always specify UTF-8 in Content-Type headers
  2. Server-Side Normalization: Normalize all input to canonical form
  3. Opaque Strings: Don't expose internal character representations

// API response with explicit encoding
res.setHeader("Content-Type", "application/json; charset=utf-8")
// Input normalization
const normalizedInput = userInput.normalize("NFC")
const emoji = "πŸ’©"
// Dangerous: splits surrogate pair
const corrupted = emoji.substring(0, 1) // '\uD83D' (invalid)
// Safe: use code point-aware methods
const safe = [...emoji][0] // 'πŸ’©'
// Dangerous: assumes 1 byte per character
const buffer = Buffer.alloc(100)
buffer.write(text.slice(0, 100)) // May overflow with emoji
// Safe: use proper encoding
const safeBuffer = Buffer.from(text, "utf8")

Homoglyphs, visually identical characters from different scripts, are a separate trap; normalization does not fold them together:

// Cyrillic 'а' vs Latin 'a'
const cyrillicA = "а" // U+0430
const latinA = "a" // U+0061
console.log(cyrillicA === latinA) // false
console.log(cyrillicA.normalize() === latinA.normalize()) // false

A robust pattern is to decode bytes into normalized Unicode at the input boundary, operate only on strings internally, and encode back to bytes on output:

class UnicodeSanctuary {
  // Decode on input
  static decode(input: Buffer, encoding: BufferEncoding = "utf8"): string {
    return input.toString(encoding).normalize("NFC")
  }
  // Process internally (always Unicode)
  static process(text: string): string {
    // All operations on normalized Unicode
    return text.toUpperCase()
  }
  // Encode on output
  static encode(text: string, encoding: BufferEncoding = "utf8"): Buffer {
    return Buffer.from(text, encoding)
  }
}
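
A usage sketch of the class above (the byte values are simply the UTF-8 encoding of 'café'):

const rawBytes = Buffer.from([0x63, 0x61, 0x66, 0xc3, 0xa9]) // UTF-8 bytes for 'café'
const decoded = UnicodeSanctuary.decode(rawBytes) // 'café', NFC-normalized
const output = UnicodeSanctuary.encode(UnicodeSanctuary.process(decoded)) // Buffer containing 'CAFÉ'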

The same approach applies to validation: normalize first, then check for homographs and count graphemes rather than code units:

function validateUsername(username: string): boolean {
  // Normalize first
  const normalized = username.normalize("NFC")
  // Check for homographs
  const hasHomographs = /[\u0430-\u044F]/.test(normalized) // Cyrillic lowercase range
  // Validate grapheme length
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
  const graphemeCount = [...segmenter.segment(normalized)].length
  return !hasHomographs && graphemeCount >= 3 && graphemeCount <= 20
}
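
A few illustrative calls (the second username begins with Cyrillic 'а', U+0430):

validateUsername("alice") // true - 5 graphemes, Latin only
validateUsername("аlice") // false - contains a Cyrillic homograph
validateUsername("ab") // false - fewer than 3 graphemes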

JavaScript's string length behavior isn't a flaw; it's a historical artifact reflecting the evolution of character encoding standards. Understanding this history is essential for building robust, globally compatible applications.

The key insights:

  1. Abstraction Layers: Characters exist at multiple levels (grapheme clusters, code points, code units)
  2. Historical Context: JavaScript's UTF-16 choice reflects 1990s industry assumptions
  3. Modern Solutions: Use Intl.Segmenter for grapheme-aware operations
  4. Full-Stack Awareness: Unicode considerations extend beyond the browser

For expert developers, mastery of Unicode is no longer optional. In a globalized world, the ability to handle every script, symbol, and emoji correctly is fundamental to building secure, reliable software.
