
JavaScript String Length and Unicode

Understand why 'πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦'.length returns 11 instead of 1, and learn how to properly handle Unicode characters, grapheme clusters, and international text in JavaScript applications.

Photo by Maria Cappelli on Unsplash

JavaScript’s string.length property counts UTF-16 code units, not user-perceived characters. Modern Unicode textβ€”especially emoji and combining charactersβ€”requires multiple code units per visual character. Use Intl.Segmenter for grapheme-aware operations.

console.log("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦".length) // 11 - UTF-16 code units
console.log(getGraphemeLength("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦")) // 1 - User-perceived characters

The JavaScript string .length property operates at the lowest level of text abstractionβ€”UTF-16 code units. What developers perceive as a single character is often a complex composition of multiple code units.

const logLengths = (...items) => console.log(items.map((item) => `${item}: ${item.length}`))
// Basic characters work as expected
logLengths("A", "a", "Γ€", "⇐", "β‡Ÿ")
// ['A: 1', 'a: 1', 'Γ€: 1', '⇐: 1', 'β‡Ÿ: 1']
// Emoji require multiple code units
logLengths("🧘", "🌦", "πŸ˜‚", "πŸ˜ƒ", "πŸ₯–", "πŸš—")
// ['🧘: 2', '🌦: 2', 'πŸ˜‚: 2', 'πŸ˜ƒ: 2', 'πŸ₯–: 2', 'πŸš—: 2']
// Complex emoji sequences are even longer
logLengths("🧘", "πŸ§˜πŸ»β€β™‚οΈ", "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦")
// ['🧘: 2', 'πŸ§˜πŸ»β€β™‚οΈ: 7', 'πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦: 11']
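
The counts come from how these emoji are composed. Spreading a string iterates over code points, which makes the composition visible; as a quick sketch, the yoga emoji with skin tone and gender modifiers is a ZWJ sequence of five code points:

// Five code points: 2 + 2 + 1 + 1 + 1 = 7 UTF-16 code units
console.log([..."πŸ§˜πŸ»β€β™‚οΈ"].map((cp) => "U+" + cp.codePointAt(0).toString(16).toUpperCase()))
// ['U+1F9D8', 'U+1F3FB', 'U+200D', 'U+2642', 'U+FE0F']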

The Intl.Segmenter API provides the correct abstraction for user-perceived characters (grapheme clusters):

function getGraphemeLength(str, locale = "en") {
  return [...new Intl.Segmenter(locale, { granularity: "grapheme" }).segment(str)].length
}
console.log("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦".length) // 11
console.log(getGraphemeLength("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦")) // 1
// Iterate over grapheme clusters
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
for (const grapheme of segmenter.segment("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦πŸŒ¦οΈπŸ§˜πŸ»β€β™‚οΈ")) {
  console.log(`'${grapheme.segment}' at index ${grapheme.index}`)
}
// 'πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦' at index 0
// '🌦️' at index 11
// 'πŸ§˜πŸ»β€β™‚οΈ' at index 14

ASCII emerged from the economic constraints of 1960s computing. Teleprinters were expensive, and data transmission costs were significant. The 7-bit design (128 characters) was a deliberate trade-off:

  • 95 printable characters: English letters, digits, punctuation
  • 33 control characters: Device instructions (carriage return, line feed)
  • Economic constraint: 8-bit would double transmission costs

// ASCII characters (U+0000 to U+007F) are single UTF-16 code units
"A".charCodeAt(0) // 65 (U+0041)
"a".charCodeAt(0) // 97 (U+0061)

ASCII’s 128-character limit proved inadequate for global use. This led to hundreds of incompatible 8-bit β€œExtended ASCII” encodings:

  • IBM Code Pages: CP437 (North America), CP850 (Western Europe)
  • ISO 8859 series: ISO-8859-1 (Latin-1), ISO-8859-5 (Cyrillic)
  • Vendor-specific: Windows-1252, Mac OS Roman

The result was mojibakeβ€”garbled text when documents crossed encoding boundaries.
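
You can reproduce mojibake directly by decoding UTF-8 bytes with a legacy single-byte encoding (a quick Node.js sketch; Buffer is Node-specific):

// U+00E9 (a single accented letter) takes two bytes in UTF-8
const utf8Bytes = Buffer.from("\u00e9", "utf8")
console.log(utf8Bytes) // <Buffer c3 a9>
// Decoding those bytes as Latin-1 yields two unrelated characters (U+00C3 and U+00A9)
console.log(utf8Bytes.toString("latin1").length) // 2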

Unicode introduced a fundamental separation between abstract characters and their byte representations:

  • Character Set: Abstract code points (U+0000 to U+10FFFF)
  • Encoding: Concrete byte representations (UTF-8, UTF-16, UTF-32)

// Unicode code points vs. encoding
"€".codePointAt(0) // 8364 (U+20AC)
"πŸ’©".codePointAt(0) // 128169 (U+1F4A9)

Unicode organizes its 1,114,112 code points into 17 planes:

  • Plane 0 (U+0000–U+FFFF), Basic Multilingual Plane (BMP): most modern scripts (Latin, Cyrillic, Greek, Arabic, CJK), symbols, punctuation
  • Plane 1 (U+10000–U+1FFFF), Supplementary Multilingual Plane (SMP): historic scripts (Linear B, Egyptian Hieroglyphs), musical notation, mathematical symbols, and most emoji
  • Plane 2 (U+20000–U+2FFFF), Supplementary Ideographic Plane (SIP): additional, less common, and historic CJK Unified Ideographs
  • Plane 3 (U+30000–U+3FFFF), Tertiary Ideographic Plane (TIP): additional historic CJK Unified Ideographs, Oracle Bone script
  • Planes 4–13 (U+40000–U+DFFFF), unassigned: reserved for future use
  • Plane 14 (U+E0000–U+EFFFF), Supplementary Special-purpose Plane (SSP): non-graphical characters, such as language tags and variation selectors
  • Planes 15–16 (U+F0000–U+10FFFF), Supplementary Private Use Areas (SPUA-A/B): available for private use by applications and vendors; not standardized
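
Because every plane spans exactly 0x10000 code points, the plane of any character is simply its code point divided by 0x10000 (a small sketch; planeOf is a hypothetical helper):

const planeOf = (char) => Math.floor(char.codePointAt(0) / 0x10000)
console.log(planeOf("A")) // 0 (BMP)
console.log(planeOf("€")) // 0 (BMP)
console.log(planeOf("πŸ’©")) // 1 (SMP)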

Each encoding uses fixed-size code units:

  • UTF-8: 8-bit code units (1-4 bytes per character)
  • UTF-16: 16-bit code units (1-2 code units per character)
  • UTF-32: 32-bit code units (1 code unit per character)

// UTF-16 encoding examples
"€".length // 1 (BMP character)
"πŸ’©".length // 2 (supplementary plane - surrogate pair)
"πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦".length // 11 (complex grapheme cluster)

JavaScript’s string representation is a historical artifact from the UCS-2 era (1995). When Unicode expanded beyond 16 bits, JavaScript maintained backward compatibility by adopting UTF-16’s surrogate pair mechanism.

Supplementary plane characters (U+10000 to U+10FFFF) are encoded using surrogate pairs:

// Surrogate pair encoding for U+1F4A9 (πŸ’©)
const highSurrogate = 0xd83d // U+D800 to U+DBFF
const lowSurrogate = 0xdca9 // U+DC00 to U+DFFF
// Mathematical transformation
const codePoint = 0x1f4a9
const temp = codePoint - 0x10000
const high = Math.floor(temp / 0x400) + 0xd800
const low = (temp % 0x400) + 0xdc00
console.log(high.toString(16), low.toString(16)) // 'd83d', 'dca9'
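
The transformation is reversible, and the built-in constructors agree with the arithmetic (continuing the snippet above):

// Recombine the surrogate pair into the original code point
const recombined = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000
console.log(recombined === codePoint) // true
// Both halves together render as the intended character
console.log(String.fromCharCode(highSurrogate, lowSurrogate)) // 'πŸ’©'
console.log(String.fromCodePoint(codePoint)) // 'πŸ’©'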

JavaScript’s core string methods operate on code units, not code points:

const emoji = "πŸ’©"
// Unsafe operations
emoji.length // 2 (code units)
emoji.charAt(0) // '\uD83D' (incomplete surrogate)
emoji.charCodeAt(0) // 55357 (high surrogate only)
// Safe operations
emoji.codePointAt(0) // 128169 (full code point)
;[...emoji].length // 1 (code points; the leading semicolon keeps this line from merging with the one above)

ES6+ provides code point-aware iteration:

const text = "AπŸ’©Z"
// Unsafe: iterates over code units
for (let i = 0; i < text.length; i++) {
  console.log(text[i]) // 'A', '\uD83D', '\uDCA9', 'Z'
}
// Safe: iterates over code points
for (const char of text) {
  console.log(char) // 'A', 'πŸ’©', 'Z'
}
// Spread operator also works
console.log([...text]) // ['A', 'πŸ’©', 'Z']

For user-perceived characters, use Intl.Segmenter:

const family = "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦"
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
// Count grapheme clusters
console.log([...segmenter.segment(family)].length) // 1
// Iterate over grapheme clusters
for (const grapheme of segmenter.segment(family)) {
  console.log(grapheme.segment) // 'πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦'
}
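
One practical application is grapheme-safe truncation, which never cuts a cluster in half (a small sketch; truncateGraphemes is a hypothetical helper name):

function truncateGraphemes(str, maxGraphemes, locale = "en") {
  const segments = new Intl.Segmenter(locale, { granularity: "grapheme" }).segment(str)
  return [...segments].slice(0, maxGraphemes).map((s) => s.segment).join("")
}
console.log(truncateGraphemes("Hi πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦!", 4)) // 'Hi πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦'
console.log("Hi πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦!".slice(0, 5)) // 'Hi πŸ‘¨' (slicing by code units cuts into the cluster)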

The u flag enables Unicode-aware regex:

// Without u flag: matches code units
/^.$/.test("πŸ’©") // false (2 code units)
// With u flag: matches code points
/^.$/u.test("πŸ’©") // true (1 code point)
// Unicode property escapes
/\p{Emoji}/u.test("πŸ’©") // true
/\p{Script=Latin}/u.test("A") // true
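
Property escapes also make script-aware validation less error-prone than hand-rolled character classes (a quick sketch; the sample word is arbitrary):

// "\u00C4ngstr\u00F6m" is 'Angstrom' spelled with umlauts
/^[a-zA-Z]+$/.test("\u00C4ngstr\u00F6m") // false - the ASCII-only class rejects accented letters
/^\p{L}+$/u.test("\u00C4ngstr\u00F6m") // true - \p{L} matches letters in any script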

Handle different representations of the same character:

const e1 = "Γ©" // U+00E9 (precomposed)
const e2 = "e\u0301" // U+0065 + U+0301 (decomposed)
console.log(e1 === e2) // false
console.log(e1.normalize() === e2.normalize()) // true
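
Counting code points in each normalization form shows what changes (continuing the snippet above):

console.log([...e1.normalize("NFD")].length) // 2 (base letter + combining accent)
console.log([...e1.normalize("NFC")].length) // 1 (single precomposed code point)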

MySQL’s legacy utf8 charset only supports 3 bytes per character, excluding supplementary plane characters:

-- Legacy (incomplete UTF-8)
CREATE TABLE users (name VARCHAR(255) CHARACTER SET utf8);
-- Modern (complete UTF-8)
CREATE TABLE users (name VARCHAR(255) CHARACTER SET utf8mb4);

Beyond the database, API and service boundaries need the same care:

  1. Explicit Encoding: Always specify UTF-8 in Content-Type headers
  2. Server-Side Normalization: Normalize all input to canonical form
  3. Opaque Strings: Don’t expose internal character representations

// API response with explicit encoding
res.setHeader("Content-Type", "application/json; charset=utf-8")
// Input normalization
const normalizedInput = userInput.normalize("NFC")

Several recurring pitfalls follow from ignoring these layers:

const emoji = "πŸ’©"
// Dangerous: splits the surrogate pair
const corrupted = emoji.substring(0, 1) // '\uD83D' (lone surrogate, invalid on its own)
// Safe: use code point-aware methods
const safe = [...emoji][0] // 'πŸ’©'

// Dangerous: assumes 1 byte per character
const text = "AπŸ’©Z"
const buffer = Buffer.alloc(100)
buffer.write(text.slice(0, 100)) // 100 code units can exceed 100 bytes; write() truncates, possibly mid-character
// Safe: let the encoder size the buffer
const safeBuffer = Buffer.from(text, "utf8")

// Dangerous: homographs look identical but are different characters
const cyrillicA = "Π°" // Cyrillic 'Π°' (U+0430)
const latinA = "a" // Latin 'a' (U+0061)
console.log(cyrillicA === latinA) // false
console.log(cyrillicA.normalize() === latinA.normalize()) // false (normalization does not unify scripts)

One way to keep these concerns at the edges of a system is to decode and normalize on input, operate only on normalized Unicode strings internally, and encode again on output:

class UnicodeSanctuary {
  // Decode on input
  static decode(input: Buffer, encoding: BufferEncoding = "utf8"): string {
    return input.toString(encoding).normalize("NFC")
  }
  // Process internally (always Unicode)
  static process(text: string): string {
    // All operations on normalized Unicode
    return text.toUpperCase()
  }
  // Encode on output
  static encode(text: string, encoding: BufferEncoding = "utf8"): Buffer {
    return Buffer.from(text, encoding)
  }
}

The same building blocks combine naturally for input validation, for example a username check:

function validateUsername(username: string): boolean {
  // Normalize first
  const normalized = username.normalize("NFC")
  // Check for homographs
  const hasHomographs = /[\u0430-\u044F]/.test(normalized) // Cyrillic
  // Validate grapheme length
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
  const graphemeCount = [...segmenter.segment(normalized)].length
  return !hasHomographs && graphemeCount >= 3 && graphemeCount <= 20
}

JavaScript’s string length behavior isn’t a flawβ€”it’s a historical artifact reflecting the evolution of character encoding standards. Understanding this history is essential for building robust, globally-compatible applications.

The key insights:

  1. Abstraction Layers: Characters exist at multiple levels (grapheme clusters, code points, code units)
  2. Historical Context: JavaScript’s UTF-16 choice reflects 1990s industry assumptions
  3. Modern Solutions: Use Intl.Segmenter for grapheme-aware operations
  4. Full-Stack Awareness: Unicode considerations extend beyond the browser

For expert developers, mastery of Unicode is no longer optional. In a globalized world, the ability to handle every script, symbol, and emoji correctly is fundamental to building secure, reliable software.
