
Deconstructing JavaScript String Length: Unicode, UTF-16, and the Grapheme Cluster

When '👨‍👩‍👧‍👦'.length returns 11 instead of 1, it reveals a fundamental misalignment between developer intuition and the computer’s representation of text. This isn’t a JavaScript quirk; it’s a window into the complex history of character encoding, from the economic constraints of 1960s teleprinters to the global demands of modern computing.


JavaScript’s string.length property counts UTF-16 code units, not user-perceived characters. Modern Unicode text, especially emoji and combining characters, often requires multiple code units per visual character. Use Intl.Segmenter for grapheme-aware operations.

console.log("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦".length) // 11 - UTF-16 code units
console.log(getGraphemeLength("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦")) // 1 - User-perceived characters

The JavaScript string .length property operates at the lowest level of text abstraction: UTF-16 code units. What developers perceive as a single character is often a complex composition of multiple code units.

const logLengths = (...items) => console.log(items.map((item) => `${item}: ${item.length}`))
// Basic characters work as expected
logLengths("A", "a", "À", "⇐", "⇟")
// ['A: 1', 'a: 1', 'À: 1', '⇐: 1', '⇟: 1']
// Emoji require multiple code units
logLengths("🧘", "🌦", "😂", "😃", "🥖", "🚗")
// ['🧘: 2', '🌦: 2', '😂: 2', '😃: 2', '🥖: 2', '🚗: 2']
// Complex emoji sequences are even longer
logLengths("🧘", "🧘🏻‍♂️", "👨‍👩‍👧‍👦")
// ['🧘: 2', '🧘🏻‍♂️: 7', '👨‍👩‍👧‍👦: 11']

The Intl.Segmenter API provides the correct abstraction for user-perceived characters (grapheme clusters):

function getGraphemeLength(str, locale = "en") {
  return [...new Intl.Segmenter(locale, { granularity: "grapheme" }).segment(str)].length
}
console.log("👨‍👩‍👧‍👦".length) // 11
console.log(getGraphemeLength("👨‍👩‍👧‍👦")) // 1
// Iterate over grapheme clusters
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
for (const grapheme of segmenter.segment("👨‍👩‍👧‍👦🌦️🧘🏻‍♂️")) {
  console.log(`'${grapheme.segment}' at index ${grapheme.index}`)
}
// '👨‍👩‍👧‍👦' at index 0
// '🌦️' at index 11
// '🧘🏻‍♂️' at index 14

ASCII emerged from the economic constraints of 1960s computing. Teleprinters were expensive, and data transmission costs were significant. The 7-bit design (128 characters) was a deliberate trade-off:

  • 95 printable characters: English letters, digits, punctuation
  • 33 control characters: Device instructions (carriage return, line feed)
  • Economic constraint: every additional bit increased storage and transmission costs

// ASCII characters (U+0000 to U+007F) are single UTF-16 code units
"A".charCodeAt(0) // 65 (U+0041)
"a".charCodeAt(0) // 97 (U+0061)

ASCII’s 128-character limit proved inadequate for global use. This led to hundreds of incompatible 8-bit “Extended ASCII” encodings:

  • IBM Code Pages: CP437 (North America), CP850 (Western Europe)
  • ISO 8859 series: ISO-8859-1 (Latin-1), ISO-8859-5 (Cyrillic)
  • Vendor-specific: Windows-1252, Mac OS Roman

The result was mojibake: garbled text when documents crossed encoding boundaries.
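
You can reproduce mojibake directly by decoding UTF-8 bytes with the wrong encoding; here is a quick Node.js sketch (TextDecoder offers the same in browsers):

// 'é' is two bytes in UTF-8: 0xC3 0xA9
const bytes = Buffer.from("é", "utf8")
// Misread as Latin-1, each byte becomes its own (wrong) character
console.log(bytes.toString("latin1")) // 'é'
// Decoded correctly, the text round-trips
console.log(bytes.toString("utf8")) // 'é'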

Unicode introduced a fundamental separation between abstract characters and their byte representations:

  • Character Set: Abstract code points (U+0000 to U+10FFFF)
  • Encoding: Concrete byte representations (UTF-8, UTF-16, UTF-32)

// Unicode code points vs. encoding
"€".codePointAt(0) // 8364 (U+20AC)
"💩".codePointAt(0) // 128169 (U+1F4A9)

Unicode organizes its 1,114,112 code points into 17 planes:

| Plane | Range | Name | Contents |
|---|---|---|---|
| 0 | U+0000–U+FFFF | Basic Multilingual Plane (BMP) | Most modern scripts (Latin, Cyrillic, Greek, Arabic, CJK), symbols, punctuation |
| 1 | U+10000–U+1FFFF | Supplementary Multilingual Plane (SMP) | Historic scripts (Linear B, Egyptian Hieroglyphs), musical notation, mathematical symbols, and most emoji |
| 2 | U+20000–U+2FFFF | Supplementary Ideographic Plane (SIP) | Additional, less common, and historic CJK Unified Ideographs |
| 3 | U+30000–U+3FFFF | Tertiary Ideographic Plane (TIP) | Additional historic CJK Unified Ideographs, Oracle Bone script |
| 4–13 | U+40000–U+DFFFF | Unassigned | Reserved for future use |
| 14 | U+E0000–U+EFFFF | Supplementary Special-purpose Plane (SSP) | Non-graphical characters, such as language tags and variation selectors |
| 15–16 | U+F0000–U+10FFFF | Supplementary Private Use Area (SPUA-A/B) | Available for private use by applications and vendors; not standardized |
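
Because each plane spans exactly 0x10000 code points, a code point’s plane is simply its value divided by 0x10000. A tiny illustrative helper:

// Plane index = codePoint >> 16
const planeOf = (char) => Math.floor(char.codePointAt(0) / 0x10000)
console.log(planeOf("A")) // 0 (BMP)
console.log(planeOf("💩")) // 1 (SMP)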

Each encoding uses fixed-size code units:

  • UTF-8: 8-bit code units (1-4 bytes per code point)
  • UTF-16: 16-bit code units (1-2 code units per code point)
  • UTF-32: 32-bit code units (1 code unit per code point)

// UTF-16 encoding examples
"€".length // 1 (BMP character)
"💩".length // 2 (supplementary plane - surrogate pair)
"👨‍👩‍👧‍👦".length // 11 (complex grapheme cluster)

JavaScript’s string representation is a historical artifact from the UCS-2 era (1995). When Unicode expanded beyond 16 bits, JavaScript maintained backward compatibility by adopting UTF-16’s surrogate pair mechanism.

Supplementary plane characters (U+10000 to U+10FFFF) are encoded using surrogate pairs:

// Surrogate pair encoding for U+1F4A9 (💩)
const highSurrogate = 0xd83d // U+D800 to U+DBFF
const lowSurrogate = 0xdca9 // U+DC00 to U+DFFF
// Mathematical transformation
const codePoint = 0x1f4a9
const temp = codePoint - 0x10000
const high = Math.floor(temp / 0x400) + 0xd800
const low = (temp % 0x400) + 0xdc00
console.log(high.toString(16), low.toString(16)) // 'd83d', 'dca9'
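
The reverse transformation recovers the code point from the pair, which is effectively what codePointAt does under the hood:

// Decode the surrogate pair from above back into a code point
const codePointBack = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000
console.log(codePointBack.toString(16)) // '1f4a9'
console.log(String.fromCharCode(high, low)) // '💩'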

JavaScript’s core string methods operate on code units, not code points:

const emoji = "πŸ’©"
// Unsafe operations
emoji.length // 2 (code units)
emoji.charAt(0) // '\uD83D' (incomplete surrogate)
emoji.charCodeAt(0) // 55357 (high surrogate only)
// Safe operations
emoji.codePointAt(0) // 128169 (full code point)
[...emoji].length // 1 (code points)
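
Note that spreading a string yields code points, not grapheme clusters, so ZWJ sequences still come apart:

console.log([..."👨‍👩‍👧‍👦"].length) // 7 - four emoji code points plus three zero-width joiners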

ES6+ provides code point-aware iteration:

const text = "AπŸ’©Z"
// Unsafe: iterates over code units
for (let i = 0; i < text.length; i++) {
console.log(text[i]) // 'A', '\uD83D', '\uDCA9', 'Z'
}
// Safe: iterates over code points
for (const char of text) {
console.log(char) // 'A', 'πŸ’©', 'Z'
}
// Spread operator also works
console.log([...text]) // ['A', 'πŸ’©', 'Z']

For user-perceived characters, use Intl.Segmenter:

const family = "👨‍👩‍👧‍👦"
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
// Count grapheme clusters
console.log([...segmenter.segment(family)].length) // 1
// Iterate over grapheme clusters
for (const grapheme of segmenter.segment(family)) {
  console.log(grapheme.segment) // '👨‍👩‍👧‍👦'
}

The u flag enables Unicode-aware regex:

// Without u flag: matches code units
/^.$/.test("💩") // false (2 code units)
// With u flag: matches code points
/^.$/u.test("💩") // true (1 code point)
// Unicode property escapes
/\p{Emoji}/u.test("💩") // true
/\p{Script=Latin}/u.test("A") // true

Handle different representations of the same character:

const e1 = "é" // U+00E9 (precomposed)
const e2 = "e\u0301" // U+0065 + U+0301 (decomposed)
console.log(e1 === e2) // false
console.log(e1.normalize() === e2.normalize()) // true
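
normalize() defaults to NFC (composed form); requesting NFD decomposes instead, which changes the code-unit length:

console.log("é".normalize("NFC").length) // 1 - single precomposed code point
console.log("é".normalize("NFD").length) // 2 - base letter plus combining accent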

MySQL’s legacy utf8 charset only supports 3 bytes per character, excluding supplementary plane characters:

-- Legacy (incomplete UTF-8)
CREATE TABLE users (name VARCHAR(255) CHARACTER SET utf8);
-- Modern (complete UTF-8)
CREATE TABLE users (name VARCHAR(255) CHARACTER SET utf8mb4);

Beyond the database, three practices prevent most encoding bugs at the API boundary:

  1. Explicit Encoding: Always specify UTF-8 in Content-Type headers
  2. Server-Side Normalization: Normalize all input to canonical form
  3. Opaque Strings: Don’t expose internal character representations

// API response with explicit encoding
res.setHeader("Content-Type", "application/json; charset=utf-8")
// Input normalization
const normalizedInput = userInput.normalize("NFC")
const emoji = "πŸ’©"
// Dangerous: splits surrogate pair
const corrupted = emoji.substring(0, 1) // '\uD83D' (invalid)
// Safe: use code point-aware methods
const safe = [...emoji][0] // 'πŸ’©'
// Dangerous: assumes 1 byte per character
const buffer = Buffer.alloc(100)
buffer.write(text.slice(0, 100)) // May overflow with emoji
// Safe: use proper encoding
const safeBuffer = Buffer.from(text, "utf8")
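
For display truncation, the same grapheme-aware idea applies; here is a minimal sketch of a hypothetical truncateGraphemes helper built on Intl.Segmenter:

function truncateGraphemes(str, maxGraphemes, locale = "en") {
  // Segment into grapheme clusters, keep the first maxGraphemes, and rejoin
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" })
  return [...segmenter.segment(str)]
    .slice(0, maxGraphemes)
    .map((g) => g.segment)
    .join("")
}
console.log(truncateGraphemes("👨‍👩‍👧‍👦🌦️🧘🏻‍♂️", 2)) // '👨‍👩‍👧‍👦🌦️'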

Homograph attacks exploit visually identical characters from different scripts, and normalization does not fold them together:

// Cyrillic 'а' vs Latin 'a'
const cyrillicA = "а" // U+0430
const latinA = "a" // U+0061
console.log(cyrillicA === latinA) // false
console.log(cyrillicA.normalize() === latinA.normalize()) // false

A robust pattern is to treat your application core as a Unicode sanctuary: decode bytes at the boundary, operate only on normalized strings internally, and encode again on output:

class UnicodeSanctuary {
  // Decode on input
  static decode(input: Buffer, encoding: BufferEncoding = "utf8"): string {
    return input.toString(encoding).normalize("NFC")
  }
  // Process internally (always Unicode)
  static process(text: string): string {
    // All operations on normalized Unicode
    return text.toUpperCase()
  }
  // Encode on output
  static encode(text: string, encoding: BufferEncoding = "utf8"): Buffer {
    return Buffer.from(text, encoding)
  }
}
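
Usage is then a pipeline from bytes in to bytes out:

const raw = Buffer.from([0x63, 0x61, 0x66, 0xc3, 0xa9]) // UTF-8 bytes for 'café'
const upper = UnicodeSanctuary.process(UnicodeSanctuary.decode(raw))
console.log(UnicodeSanctuary.encode(upper).toString("utf8")) // 'CAFÉ'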

The same building blocks combine naturally in input validation:

function validateUsername(username: string): boolean {
  // Normalize first
  const normalized = username.normalize("NFC")
  // Check for homographs (simplified: rejects lowercase Cyrillic)
  const hasHomographs = /[\u0430-\u044F]/.test(normalized)
  // Validate grapheme length, not code-unit length
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
  const graphemeCount = [...segmenter.segment(normalized)].length
  return !hasHomographs && graphemeCount >= 3 && graphemeCount <= 20
}

JavaScript’s string length behavior isn’t a flaw; it’s a historical artifact reflecting the evolution of character encoding standards. Understanding this history is essential for building robust, globally compatible applications.

The key insights:

  1. Abstraction Layers: Characters exist at multiple levels (grapheme clusters, code points, code units)
  2. Historical Context: JavaScript’s UTF-16 choice reflects 1990s industry assumptions
  3. Modern Solutions: Use Intl.Segmenter for grapheme-aware operations
  4. Full-Stack Awareness: Unicode considerations extend beyond the browser

For expert developers, mastery of Unicode is no longer optional. In a globalized world, the ability to handle every script, symbol, and emoji correctly is fundamental to building secure, reliable software.
