Design Search Autocomplete: Prefix Matching at Scale
A system design for search autocomplete (typeahead) covering prefix data structures, ranking algorithms, distributed architecture, and sub-100ms latency requirements. This design addresses the challenge of returning relevant suggestions within the user’s typing cadence—typically under 100ms—while handling billions of queries daily.
Abstract
Search autocomplete is a prefix-completion problem where every millisecond matters—users expect suggestions before they finish typing. The core mental model:
- Data structure choice determines latency floor: Tries provide O(p) prefix lookup where p is prefix length, independent of corpus size. Finite State Transducers (FST) compress this further for in-memory serving.
- Pre-computed top-K eliminates runtime ranking: Store the top 5-10 suggestions at each trie node. Query becomes pure traversal, no sorting.
- Freshness requires a dual-path architecture: Weekly batch rebuilds for stable rankings + streaming hot updates for trending queries.
- Client-side debouncing is mandatory: Without 300ms debounce, typing "javascript" generates 10 requests in ~1.5 seconds. With debounce: 1 request.
- Personalization adds latency: Generic suggestions serve from cache in <5ms. Personalized suggestions require user context lookup, adding 10-50ms.
The fundamental tradeoff: latency vs. relevance. Pre-computed suggestions are fast but stale. Real-time ranking is relevant but slow. Production systems layer both.
Requirements
Functional Requirements
| Requirement | Priority | Notes |
|---|---|---|
| Return suggestions for partial query | Core | Primary feature |
| Rank by relevance (popularity, freshness) | Core | Not just alphabetical |
| Support trending/breaking queries | Core | News events, viral content |
| Personalized suggestions | Extended | Based on user history |
| Spell correction / fuzzy matching | Extended | Handle typos |
| Multi-language support | Extended | Unicode, RTL scripts |
Out of scope: Full document search (separate system), voice input, image search.
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Latency | P99 < 100ms | User typing cadence ~150ms between keystrokes |
| Availability | 99.99% | User-facing, affects search engagement |
| Throughput | 1.2M QPS peak (≈500K at origin after edge caching) | Based on scale estimation below |
| Suggestion freshness | < 1 hour for trending | Breaking news must surface quickly |
| Consistency | Eventual (< 5 min) | Acceptable for suggestions |
Scale Estimation
Assumptions (Google-scale reference):
- Daily Active Users (DAU): 1 billion
- Searches per user per day: 5
- Characters per search: 20 average
- Autocomplete triggers: Every 2-3 characters after minimum prefix
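A quick back-of-envelope script makes these assumptions concrete (a sketch; the constants mirror the list above):

```python
# Sanity check of the traffic and storage estimates derived below.
DAU = 1_000_000_000
SEARCHES_PER_USER = 5
CHARS_PER_SEARCH = 20
CHARS_PER_TRIGGER = 3  # autocomplete fires every ~3 characters

triggers_per_search = CHARS_PER_SEARCH // CHARS_PER_TRIGGER + 1   # ~7
requests_per_day = DAU * SEARCHES_PER_USER * triggers_per_search  # 35B

avg_qps = requests_per_day / 86_400  # ~405K
peak_qps = avg_qps * 3               # ~1.2M

unique_queries = 1_000_000_000
bytes_per_query = 25 + 16            # query text + metadata
raw_storage_gb = unique_queries * bytes_per_query / 1e9  # ~41 GB
trie_storage_gb = raw_storage_gb * 4  # 3-5x node/pointer overhead

print(f"{requests_per_day / 1e9:.0f}B requests/day, "
      f"{avg_qps / 1e3:.0f}K avg QPS, {peak_qps / 1e6:.1f}M peak QPS, "
      f"~{trie_storage_gb:.0f} GB trie per replica")
```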
Traffic calculation:
```
Queries/day = 1B users × 5 searches × (20 chars / 3 chars per trigger)
            = 1B × 5 × 7 triggers
            = 35 billion autocomplete requests/day

QPS average = 35B / 86,400s ≈ 405K QPS
QPS peak (3×) ≈ 1.2M QPS
```

Storage calculation:

```
Unique queries to index: ~1 billion (estimated from query logs)
Average query length: 25 bytes
Metadata per query (score, timestamp): 16 bytes
Raw storage: 1B × 41 bytes = 41 GB
Trie overhead (pointers, node structure): 3-5×
Trie storage: ~150-200 GB per replica
```

Bandwidth:

```
Request size: ~50 bytes (prefix + metadata)
Response size: ~500 bytes (10 suggestions with metadata)
Ingress: 1.2M QPS × 50 bytes = 60 MB/s
Egress: 1.2M QPS × 500 bytes = 600 MB/s
```

Design Paths
Path A: Trie-Based with Pre-computed Rankings
Best when:
- Suggestion corpus is bounded (millions, not billions of unique queries)
- Latency is critical (<50ms P99)
- Ranking signals are relatively stable (popularity-based)
Architecture:
- In-memory tries sharded by prefix range
- Top-K suggestions pre-computed at each node during indexing
- Weekly full rebuilds + hourly delta updates
Key characteristics:
- Query is pure traversal: O(p) where p = prefix length
- No runtime ranking computation
- Memory-intensive: entire trie must fit in RAM
Trade-offs:
- ✅ Sub-10ms query latency achievable
- ✅ Predictable, consistent performance
- ✅ Simple query path (no scoring logic)
- ❌ High memory footprint (~200GB per shard replica)
- ❌ Freshness limited by rebuild frequency
- ❌ Personalization requires separate lookup
Real-world example: LinkedIn’s Cleo serves generic typeahead in <1ms using pre-computed tries. Network-personalized suggestions take 15ms due to additional context lookups.
Path B: Inverted Index with Completion Suggester
Best when:
- Corpus is large and dynamic (e-commerce catalogs, document search)
- Need flexibility for different query types (prefix, fuzzy, phrase)
- Already running Elasticsearch/OpenSearch infrastructure
Architecture:
- Elasticsearch completion suggester using FST (Finite State Transducers)
- Edge n-gram tokenization for flexible matching
- Real-time indexing via Kafka
Key characteristics:
- FST provides compact in-memory representation
- Supports fuzzy matching and context filtering
- Index updates are near real-time
Trade-offs:
- ✅ Flexible query types (prefix, infix, fuzzy)
- ✅ Real-time updates without full rebuild
- ✅ Built-in sharding and replication
- ❌ Higher latency than pure trie (10-50ms typical)
- ❌ Index size 15-17x larger with edge n-gram analyzer
- ❌ Operational complexity of Elasticsearch cluster
Real-world example: Amazon product search uses inverted indexes with extensive metadata (ratings, sales rank) for ranking. Handles dynamic catalog updates in near real-time.
Path Comparison
| Factor | Path A: Trie | Path B: Inverted Index |
|---|---|---|
| Query latency | <10ms | 10-50ms |
| Memory efficiency | Lower (pointer overhead) | Higher (FST compression) |
| Update latency | Hours (batch) | Seconds (streaming) |
| Fuzzy matching | Requires separate structure | Native support |
| Sharding complexity | Manual prefix-based | Built-in (Elasticsearch) |
| Operational overhead | Custom infrastructure | Managed service available |
| Best for | High-QPS generic suggestions | Dynamic catalogs, flexible queries |
This Article’s Focus
This article focuses on Path A (Trie-Based) because:
- It represents the canonical autocomplete architecture used by Google, Twitter, and LinkedIn for their primary suggestion services
- It demonstrates fundamental prefix data structures that underpin even inverted-index implementations
- Sub-10ms latency is achievable, which Path B cannot match
Path B implementation details are covered in the “Elasticsearch Alternative” section under Low-Level Design.
High-Level Design
Component Overview
| Component | Responsibility | Technology |
|---|---|---|
| API Gateway | Rate limiting, authentication, routing | Kong, AWS API Gateway |
| Shard Router | Route prefix to correct trie shard | Custom service |
| Trie Service | Serve suggestions from in-memory trie | Custom service (Go/Rust) |
| Ranking Service | Re-rank with personalization signals | Custom service |
| Redis Cache | Cache hot prefixes and user history | Redis Cluster |
| Trie Builder | Build/update tries from query logs | Spark/Flink |
| Kafka | Stream query logs and trending signals | Apache Kafka |
| Object Storage | Store serialized trie snapshots | S3, HDFS |
Request Flow
A request travels: browser (debounced) → CDN → API Gateway → Shard Router → Trie Service, with the Redis cache consulted before trie traversal; personalized requests take an additional hop through the Ranking Service before the response returns.
Sharding Strategy
Prefix-based sharding routes queries by first character(s):
```
Shard 1: a-f (covers ~25% of queries)
Shard 2: g-m (covers ~20% of queries)
Shard 3: n-s (covers ~35% of queries - 's' is most common)
Shard 4: t-z (covers ~20% of queries)
```

Why prefix-based over hash-based:
- Data locality: Related queries (“system”, “systems”, “systematic”) on same shard
- Prefix routing: Query “sys” deterministically routes to shard 3
- Range queries: Can aggregate suggestions across prefix ranges
Handling hotspots:
English letter frequency is uneven. The prefix “s” alone accounts for ~10% of queries. Solutions:
- Finer granularity: Split "s" into "sa-se", "sf-sm", "sn-sz" (see the routing sketch after this list)
- Dynamic rebalancing: Shard Map Manager monitors load and adjusts ranges
- Replication: More replicas for hot shards
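A minimal routing sketch combining range lower bounds with the finer "s" splits above (the shard map, bounds, and shard names are hypothetical; in production the Shard Map Manager would serve this table dynamically):

```python
from bisect import bisect_right

# Hypothetical shard map: sorted range lower bounds -> shard ids.
# The hot n-s range is split so "sa-se", "sf-sm", and "sn-sz" each
# get their own shard, per the hotspot handling above.
LOWER_BOUNDS = ["a", "g", "n", "sa", "sf", "sn", "t"]
SHARDS = ["shard-1", "shard-2", "shard-3a", "shard-3b",
          "shard-3c", "shard-3d", "shard-4"]

def route(prefix: str) -> str:
    """Route a normalized prefix to its trie shard via range lookup."""
    p = prefix.lower()
    i = bisect_right(LOWER_BOUNDS, p) - 1
    return SHARDS[max(i, 0)]

assert route("sys") == "shard-3d"   # "sn" <= "sys" < "t"
assert route("prog") == "shard-3a"  # "n" <= "prog" < "sa"
```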
API Design
Suggestion Endpoint
```
GET /api/v1/suggestions?q={prefix}&limit={n}&lang={code}
```

Request Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| q | string | Yes | - | Query prefix (min 2 chars) |
| limit | int | No | 10 | Max suggestions (1-20) |
| lang | string | No | en | Language code (ISO 639-1) |
| user_id | string | No | - | For personalized suggestions |
| context | string | No | - | Search context (web, images, news) |
Response (200 OK):
{ "query": "prog", "suggestions": [ { "text": "programming", "score": 0.95, "type": "query", "metadata": { "category": "technology", "trending": false } }, { "text": "progress", "score": 0.87, "type": "query", "metadata": { "category": "general", "trending": false } }, { "text": "program download", "score": 0.82, "type": "query", "metadata": { "category": "technology", "trending": true } } ], "took_ms": 8, "cache_hit": false}Error Responses:
| Status | Condition | Response |
|---|---|---|
| 400 | Prefix too short (<2 chars) | {"error": "prefix_too_short", "min_length": 2} |
| 429 | Rate limit exceeded | {"error": "rate_limited", "retry_after": 60} |
| 503 | Service overloaded | {"error": "service_unavailable"} |
Rate Limits (an enforcement sketch follows this list):
- Anonymous: 100 requests/minute per IP
- Authenticated: 1000 requests/minute per user
- Burst: 20 requests/second max
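These tiers fit a token-bucket model: a steady refill rate enforces the sustained limit while the bucket capacity caps bursts. A minimal in-process sketch (the production limiter would live at the gateway, typically backed by Redis):

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    """Token bucket: sustained rate via refill, burst via capacity."""
    rate: float      # tokens added per second
    capacity: float  # maximum burst size
    tokens: float = 0.0
    last: float = 0.0

    def allow(self) -> bool:
        now = time.monotonic()
        if self.last:
            # Refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
        else:
            self.tokens = self.capacity  # start full
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Anonymous tier: 100 requests/minute sustained, 20 requests/second burst
anonymous_limiter = TokenBucket(rate=100 / 60, capacity=20)
```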
Response Optimization
Compression: Enable gzip. A typical 10-suggestion response compresses from ~500 bytes to ~200 bytes, and the savings grow with larger responses.
Cache Headers:
```http
Cache-Control: public, max-age=300
ETag: "a1b2c3d4"
Vary: Accept-Encoding, Accept-Language
```

Pagination: Not applicable—autocomplete returns a bounded set (≤20). If more results are needed, the user should submit a full search.
Data Modeling
Trie Node Structure
```typescript
interface TrieNode {
  children: Map<string, TrieNode>  // Character -> child node
  isEndOfWord: boolean
  topSuggestions: Suggestion[]     // Pre-computed top-K
  frequency: number                // Aggregate frequency for this prefix
}

interface Suggestion {
  text: string        // Full query text
  score: number       // Normalized relevance score [0, 1]
  frequency: number   // Raw query count
  lastUpdated: number // Unix timestamp
  trending: boolean   // Recently spiking
  metadata: {
    category?: string
    language: string
  }
}
```

Storage Schema
Query Log (Kafka topic: query-logs):
{ "query": "programming tutorials", "timestamp": 1706918400, "user_id": "u123", // Hashed, optional "session_id": "s456", "result_clicked": true, "position_clicked": 2, "locale": "en-US", "platform": "web"}Aggregated Query Stats (Redis Hash):
```
HSET "query:programming tutorials" frequency 1542389 last_seen 1706918400 trending 0 category technology
```

Trie Snapshot (S3/HDFS):
```
s3://autocomplete-data/tries/
└── 2024-02-03/
    ├── shard-a-f.trie.gz  (50 GB compressed)
    ├── shard-g-m.trie.gz  (40 GB compressed)
    ├── shard-n-s.trie.gz  (60 GB compressed)
    ├── shard-t-z.trie.gz  (40 GB compressed)
    └── manifest.json
```

Database Selection
| Data | Store | Rationale |
|---|---|---|
| Live trie | In-memory (custom) | Sub-ms traversal required |
| Hot prefix cache | Redis Cluster | <1ms lookups, TTL support |
| Query logs | Kafka → S3 | Streaming ingestion, durable storage |
| Trie snapshots | S3/HDFS | Large files, versioned, cross-region replication |
| User history | DynamoDB/Cassandra | Key-value access pattern, high write throughput |
| Trending signals | Redis Sorted Set | Real-time top-K with scores (see the sketch below) |
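For the trending-signals row, the Redis Sorted Set pattern looks roughly like this (key name and retention size are illustrative):

```python
import redis

r = redis.Redis()

def record_trending(query: str, spike_score: float) -> None:
    """Upsert a query's trending score into a sorted set."""
    r.zadd("trending:global", {query: spike_score})
    # Trim to the top 1000 members to bound memory
    r.zremrangebyrank("trending:global", 0, -1001)

def top_trending(k: int = 10) -> list:
    """Real-time top-K trending queries with scores."""
    return r.zrevrange("trending:global", 0, k - 1, withscores=True)
```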
Low-Level Design
Trie Implementation
Design decision: Hash map vs. array children
| Approach | Lookup | Memory | Best for |
|---|---|---|---|
| Array[26] | O(1) | 26 pointers/node | Dense tries, ASCII only |
| Array[128] | O(1) | 128 pointers/node | Full ASCII |
| HashMap | O(1) avg | Variable | Sparse tries, Unicode |
Chosen: HashMap because:
- Unicode support required (multi-language)
- Most nodes have <5 children (sparse)
- Modern hash maps have near-constant lookup
```go
package autocomplete

import (
    "sort"
    "sync"
)

const TopK = 10

type TrieNode struct {
    children       map[rune]*TrieNode // character -> child node
    isEnd          bool
    topSuggestions []Suggestion // pre-computed top-K for this prefix
    mu             sync.RWMutex
}

type Suggestion struct {
    Text      string
    Score     float64
    Frequency int64
    Trending  bool
}

func (t *TrieNode) Search(prefix string) []Suggestion {
    t.mu.RLock()
    defer t.mu.RUnlock()

    node := t
    for _, char := range prefix {
        child, exists := node.children[char]
        if !exists {
            return nil // No suggestions for this prefix
        }
        node = child
    }
    // Return pre-computed top-K at this node
    return node.topSuggestions
}

func (t *TrieNode) Insert(word string, score float64) {
    t.mu.Lock()
    defer t.mu.Unlock()
    // ... insertion logic with top-K update propagation
}

// BuildTopK propagates top-K suggestions up the trie.
// Called during index build, not at query time.
func (t *TrieNode) BuildTopK() {
    // Post-order traversal: build children first
    for _, child := range t.children {
        child.BuildTopK()
    }

    // Collect all suggestions from children + this node
    var candidates []Suggestion
    if t.isEnd {
        candidates = append(candidates, t.topSuggestions...)
    }
    for _, child := range t.children {
        candidates = append(candidates, child.topSuggestions...)
    }

    // Keep top K by score
    sort.Slice(candidates, func(i, j int) bool {
        return candidates[i].Score > candidates[j].Score
    })
    if len(candidates) > TopK {
        candidates = candidates[:TopK]
    }
    t.topSuggestions = candidates
}
```

Why pre-compute top-K at each node:
Without pre-computation, returning suggestions for prefix “p” requires traversing the entire subtree (potentially millions of nodes). With pre-computed top-K:
- Query time: O(p) where p = prefix length
- No subtree traversal
- Predictable latency regardless of prefix popularity
Trade-off: Index build time increases (top-K must propagate up the trie), but query time drops from O(p + n) to O(p), where n is the size of the matched subtree.
Ranking Algorithm
Scoring formula:
```
Score = w1 × Popularity + w2 × Freshness + w3 × Trending + w4 × Personalization
```

Default weights (generic suggestions), with a combined-calculation sketch after the table:
| Signal | Weight | Calculation |
|---|---|---|
| Popularity | 0.5 | log(frequency) / log(max_frequency) |
| Freshness | 0.2 | 1 - (days_since_last_search / 30) |
| Trending | 0.2 | 1.0 if spiking else 0.0 |
| Personalization | 0.1 | 1.0 if in user history else 0.0 |
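Translated into code, the weighted sum could look like the sketch below; the `calculate_score` referenced in the ranking snippet that follows would wrap something similar (signal inputs are assumed precomputed):

```python
import math
import time

WEIGHTS = {"popularity": 0.5, "freshness": 0.2, "trending": 0.2, "personal": 0.1}

def calculate_score(freq: int, max_freq: int, last_seen: float,
                    trending: bool, in_user_history: bool) -> float:
    """Weighted sum of the four ranking signals from the table above."""
    popularity = (math.log(freq) / math.log(max_freq)
                  if freq > 1 and max_freq > 1 else 0.0)
    days_since = (time.time() - last_seen) / 86_400
    freshness = max(0.0, 1 - days_since / 30)
    return (WEIGHTS["popularity"] * popularity
            + WEIGHTS["freshness"] * freshness
            + WEIGHTS["trending"] * (1.0 if trending else 0.0)
            + WEIGHTS["personal"] * (1.0 if in_user_history else 0.0))
```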
Trending detection:
A query is “trending” if its frequency in the last hour exceeds 3× its average hourly frequency over the past week.
```python
import redis
from datetime import datetime, timedelta

r = redis.Redis()

def is_trending(query: str) -> bool:
    now = datetime.utcnow()
    hour_key = f"freq:{query}:{now.strftime('%Y%m%d%H')}"

    # Current hour frequency
    current = int(r.get(hour_key) or 0)

    # Average hourly frequency over past week
    # (shown as a loop for clarity; production code would batch
    # these reads with MGET or a pipeline)
    total = 0
    for i in range(1, 169):  # 168 hours = 1 week
        past_hour = now - timedelta(hours=i)
        past_key = f"freq:{query}:{past_hour.strftime('%Y%m%d%H')}"
        total += int(r.get(past_key) or 0)

    avg = total / 168
    return current > 3 * avg if avg > 0 else current > 100

# Example usage in ranking service
def rank_suggestions(suggestions, user_id=None):
    for s in suggestions:
        s.trending = is_trending(s.text)
        s.score = calculate_score(s, user_id)
    return sorted(suggestions, key=lambda x: x.score, reverse=True)
```

Elasticsearch Alternative (Path B)
For teams preferring managed infrastructure, Elasticsearch’s completion suggester provides a viable alternative:
Index mapping:
```json
{
  "mappings": {
    "properties": {
      "suggest": {
        "type": "completion",
        "analyzer": "simple",
        "preserve_separators": true,
        "preserve_position_increments": true,
        "max_input_length": 50,
        "contexts": [
          { "name": "category", "type": "category" },
          { "name": "location", "type": "geo", "precision": 4 }
        ]
      },
      "query_text": { "type": "text" },
      "frequency": { "type": "long" }
    }
  }
}
```

Query:

```json
{
  "suggest": {
    "query-suggest": {
      "prefix": "prog",
      "completion": {
        "field": "suggest",
        "size": 10,
        "skip_duplicates": true,
        "fuzzy": { "fuzziness": 1 },
        "contexts": { "category": ["technology"] }
      }
    }
  }
}
```

Performance characteristics:
- Latency: 10-30ms typical (vs. <10ms for custom trie)
- Fuzzy matching: Built-in with configurable edit distance
- Index size: 15-17× larger with edge n-gram analyzer
- Operational: Managed service available (AWS OpenSearch, Elastic Cloud)
When to choose Elasticsearch:
- Already running ES for document search
- Need fuzzy matching without additional infrastructure
- Corpus changes frequently (near real-time indexing)
- Team lacks expertise for custom trie infrastructure
Frontend Considerations
Debouncing Strategy
Problem: Without debouncing, typing “javascript” at normal speed (150ms between keystrokes) generates 10 API requests in 1.5 seconds.
Solution: Debounce with 300ms delay—only send request after 300ms of no typing.
```tsx
import { useState, useCallback, useRef, useEffect } from "react"

const DEBOUNCE_MS = 300
const MIN_PREFIX_LENGTH = 2

export function useAutocomplete() {
  const [query, setQuery] = useState("")
  const [suggestions, setSuggestions] = useState<string[]>([])
  const [isLoading, setIsLoading] = useState(false)
  const abortControllerRef = useRef<AbortController | null>(null)
  const timeoutRef = useRef<number | null>(null)

  const fetchSuggestions = useCallback(async (prefix: string) => {
    // Cancel previous request
    abortControllerRef.current?.abort()
    abortControllerRef.current = new AbortController()

    setIsLoading(true)
    try {
      const response = await fetch(
        `/api/v1/suggestions?q=${encodeURIComponent(prefix)}&limit=10`,
        { signal: abortControllerRef.current.signal },
      )
      const data = await response.json()
      setSuggestions(data.suggestions.map((s: any) => s.text))
    } catch (error) {
      if ((error as Error).name !== "AbortError") {
        console.error("Autocomplete error:", error)
      }
    } finally {
      setIsLoading(false)
    }
  }, [])

  const handleInputChange = useCallback(
    (value: string) => {
      setQuery(value)

      // Clear pending debounce
      if (timeoutRef.current) {
        clearTimeout(timeoutRef.current)
      }

      // Don't fetch for short prefixes
      if (value.length < MIN_PREFIX_LENGTH) {
        setSuggestions([])
        return
      }

      // Debounce the API call
      timeoutRef.current = window.setTimeout(() => {
        fetchSuggestions(value)
      }, DEBOUNCE_MS)
    },
    [fetchSuggestions],
  )

  // Cleanup on unmount
  useEffect(() => {
    return () => {
      if (timeoutRef.current) clearTimeout(timeoutRef.current)
      abortControllerRef.current?.abort()
    }
  }, [])

  return { query, suggestions, isLoading, handleInputChange }
}
```

Key implementation details:
- AbortController: Cancel in-flight requests when user types more
- Minimum prefix: Don’t fetch for 1-character prefixes (too broad)
- Cleanup: Clear timeouts and abort requests on unmount
Keyboard Navigation
Autocomplete dropdowns must support keyboard navigation for accessibility (WCAG 2.1 compliance):
| Key | Action |
|---|---|
| ↓ / ↑ | Navigate suggestions |
| Enter | Select highlighted suggestion |
| Escape | Close dropdown |
| Tab | Select and move to next field |
ARIA attributes required:
<input role="combobox" aria-expanded="true" aria-controls="suggestions-list" aria-activedescendant="suggestion-2" /><ul id="suggestions-list" role="listbox"> <li id="suggestion-0" role="option">programming</li> <li id="suggestion-1" role="option">progress</li> <li id="suggestion-2" role="option" aria-selected="true">program</li></ul>Optimistic UI Updates
For frequently-typed prefixes, show cached suggestions immediately while fetching fresh results:
```tsx
const handleInputChange = (value: string) => {
  // Show cached results immediately (optimistic)
  const cached = localCache.get(value)
  if (cached) {
    setSuggestions(cached)
  }

  // Fetch fresh results in background
  fetchSuggestions(value).then((fresh) => {
    setSuggestions(fresh)
    localCache.set(value, fresh)
  })
}
```

Indexing Pipeline
Pipeline Architecture
At a high level: query logs stream through Kafka into two paths: a weekly Spark job that rebuilds tries from the full corpus, and a streaming aggregator that feeds hourly hot updates. Serialized trie snapshots land in object storage, from which serving instances load new versions.
Batch vs. Streaming Updates
| Aspect | Batch (Weekly) | Streaming (Real-time) |
|---|---|---|
| Latency | Hours | Seconds |
| Completeness | Full corpus | Incremental deltas |
| Compute cost | Higher | Lower |
| Use case | Stable rankings | Trending queries |
Dual-path approach:
- Weekly batch: Complete trie rebuild from all query logs
- Hourly hot updates: Merge trending queries into the existing trie (see the streaming sketch after this list)
- Real-time streaming: Update trending flags and frequency counters
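One way the streaming path could fold deltas into the serving trie without a rebuild (a sketch using kafka-python; the topic name, broker address, and `trie.upsert` method are illustrative stand-ins for the trie service's update API):

```python
import json

from kafka import KafkaConsumer  # kafka-python client

def apply_hot_updates(trie, topic: str = "trending-queries") -> None:
    """Consume trending deltas and merge them into the in-memory trie."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda v: json.loads(v),
    )
    for msg in consumer:
        update = msg.value
        # Re-insert with the boosted score; top-K re-propagates
        # along the affected path only, not the whole trie
        trie.upsert(update["query"], update["score"], trending=True)
```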
Aggregation Job
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, max as spark_max, when

spark = SparkSession.builder.appName("QueryAggregation").getOrCreate()

# Read query logs from S3
query_logs = spark.read.parquet("s3://query-logs/2024/02/")

# Hourly trend stats (recent_freq = last 24h, avg_freq = trailing weekly
# average) come from a separate job; the path is illustrative
trend_stats = spark.read.parquet("s3://autocomplete-data/trend-stats/")

# Filter and aggregate
aggregated = (
    query_logs
    .filter(col("query").isNotNull())
    .filter(col("query") != "")
    .filter(~col("query").rlike(r"[^\w\s]"))  # Remove special chars
    .groupBy("query")
    .agg(
        count("*").alias("frequency"),
        spark_max("timestamp").alias("last_seen"),
    )
    # Trending: frequency in last 24h > 3x weekly hourly average
    .join(trend_stats, "query", "left")
    .withColumn(
        "trending",
        when(col("recent_freq") > 3 * col("avg_freq"), True).otherwise(False),
    )
    .filter(col("frequency") >= 10)  # Min frequency threshold
    .orderBy(col("frequency").desc())
)

# Normalize scores
max_freq = aggregated.agg(spark_max("frequency")).collect()[0][0]
scored = aggregated.withColumn(
    "score",
    col("frequency") / max_freq * 0.5            # Popularity
    + when(col("trending"), 0.2).otherwise(0.0)  # Trending boost
)

# Write output for the trie builder
scored.write.parquet("s3://autocomplete-data/aggregated/2024-02-03/")
```

Trie Serialization
For efficient storage and loading, tries are serialized to a compact binary format:
```go
package autocomplete

import (
    "compress/gzip"
    "encoding/gob"
    "os"
)

// NOTE: encoding/gob serializes only exported fields, so the serialized
// trie type must expose its fields (or implement GobEncoder/GobDecoder).

func (t *TrieNode) Serialize(path string) error {
    file, err := os.Create(path)
    if err != nil {
        return err
    }
    defer file.Close()

    gzWriter := gzip.NewWriter(file)
    defer gzWriter.Close()

    encoder := gob.NewEncoder(gzWriter)
    return encoder.Encode(t)
}

func LoadTrie(path string) (*TrieNode, error) {
    file, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    gzReader, err := gzip.NewReader(file)
    if err != nil {
        return nil, err
    }
    defer gzReader.Close()

    var trie TrieNode
    decoder := gob.NewDecoder(gzReader)
    if err := decoder.Decode(&trie); err != nil {
        return nil, err
    }
    return &trie, nil
}
```

Compression ratios:
| Format | Size | Load Time |
|---|---|---|
| Raw (JSON) | 200 GB | 10 min |
| Gob | 80 GB | 4 min |
| Gob + Gzip | 30 GB | 6 min |
Chosen: Gob + Gzip for storage efficiency. Load time overhead acceptable for weekly rebuilds.
Caching Strategy
Multi-Layer Cache
| Layer | TTL | Hit Rate | Latency |
|---|---|---|---|
| Browser | 1 hour | 30-40% | 0ms |
| CDN/Edge | 5 min | 50-60% | <5ms |
| Redis (hot prefixes) | 10 min | 80-90% | <2ms |
| Trie (in-memory) | N/A | 100% | <10ms |
Browser Caching
```http
HTTP/1.1 200 OK
Cache-Control: public, max-age=3600
ETag: "v1-abc123"
Vary: Accept-Encoding
```

- 1-hour TTL balances freshness and hit rate
- ETag enables conditional requests for stale cache revalidation (see the server-side sketch below)
- `Vary` header ensures proper cache keying by encoding
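Server-side, the revalidation path can be sketched as follows (Flask is illustrative; `query_trie` stands in for the trie lookup described earlier):

```python
import hashlib
import json

from flask import Flask, request

app = Flask(__name__)

def query_trie(prefix: str) -> list:
    """Stand-in for the in-memory trie lookup."""
    ...

@app.get("/api/v1/suggestions")
def suggestions():
    prefix = request.args.get("q", "")
    body = json.dumps(query_trie(prefix), sort_keys=True)
    # Strong ETag derived from the response body
    etag = '"' + hashlib.md5(body.encode()).hexdigest() + '"'

    # Client's cached copy still current: skip the body entirely
    if request.headers.get("If-None-Match") == etag:
        return "", 304

    resp = app.response_class(body, mimetype="application/json")
    resp.headers["ETag"] = etag
    resp.headers["Cache-Control"] = "public, max-age=3600"
    resp.headers["Vary"] = "Accept-Encoding"
    return resp
```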
CDN Configuration
```yaml
# Cloudflare Page Rules
rules:
  - match: "/api/v1/suggestions*"
    actions:
      cache_level: cache_everything
      edge_cache_ttl: 300      # 5 minutes
      browser_cache_ttl: 3600  # 1 hour
      cache_key:
        include_query_string: true
```

Cache key design:
Include full query string in cache key. /suggestions?q=prog and /suggestions?q=progress must cache separately.
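The same idea as a hypothetical normalization helper (a real CDN expresses this in its own configuration, as above):

```python
from urllib.parse import parse_qsl, urlencode

def cdn_cache_key(path: str, query_string: str) -> str:
    """Normalize the query string so equivalent requests share one entry.

    Sorting parameters and lowercasing the prefix prevents cache
    fragmentation between ?q=prog&limit=10 and ?limit=10&q=Prog.
    """
    params = dict(parse_qsl(query_string))
    if "q" in params:
        params["q"] = params["q"].strip().lower()
    return f"{path}?{urlencode(sorted(params.items()))}"

assert (cdn_cache_key("/api/v1/suggestions", "limit=10&q=Prog")
        == cdn_cache_key("/api/v1/suggestions", "q=prog&limit=10"))
```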
Redis Hot Prefix Cache
```python
import json

import redis

# Cluster client (redis-py >= 4.1); host/port are illustrative
r = redis.RedisCluster(host="redis-cluster", port=6379)

def get_suggestions(prefix: str) -> list | None:
    """Check Redis cache first, fall back to trie."""
    cache_key = f"suggest:{prefix}"

    # Try cache
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss - query trie
    suggestions = query_trie(prefix)

    # Cache hot prefixes (high frequency)
    if is_hot_prefix(prefix):
        r.setex(cache_key, 600, json.dumps(suggestions))  # 10 min TTL

    return suggestions

def is_hot_prefix(prefix: str) -> bool:
    """Prefix is hot if queried >1000 times in last hour."""
    freq_key = f"freq:{prefix}:{current_hour()}"
    return int(r.get(freq_key) or 0) > 1000
```

Infrastructure
Cloud-Agnostic Architecture
| Component | Requirement | Open Source | Managed |
|---|---|---|---|
| Compute | Low-latency, auto-scaling | Kubernetes | EKS, GKE |
| Cache | Sub-ms reads, clustering | Redis | ElastiCache, MemoryStore |
| Message Queue | High throughput, durability | Kafka | MSK, Confluent Cloud |
| Object Storage | Durable, versioned | MinIO | S3, GCS |
| Stream Processing | Real-time aggregation | Flink, Spark | Kinesis, Dataflow |
AWS Reference Architecture
Service configuration:
| Service | Configuration | Cost Estimate |
|---|---|---|
| ECS Fargate | 10 tasks × 16 vCPU, 32GB RAM | $8,000/month |
| ElastiCache | r6g.xlarge × 6 nodes (cluster) | $2,500/month |
| CloudFront | 1TB egress/day | $1,500/month |
| S3 | 500GB storage | $12/month |
| EMR | m5.xlarge × 10 (weekly job) | $200/month |
Total estimated cost: ~$12,000/month for 500K QPS capacity
Deployment Strategy
Blue-green deployment for trie updates:
- Build new trie version in offline cluster
- Load into “green” service instances
- Smoke test green cluster
- Switch load balancer to green
- Keep blue running for 1 hour (rollback capability)
- Terminate blue
Rolling updates for code changes:
```yaml
# ECS Service Definition
apiVersion: ecs/v1
kind: Service
metadata:
  name: trie-service
spec:
  desiredCount: 10
  deploymentConfiguration:
    maximumPercent: 150
    minimumHealthyPercent: 100
  deploymentController:
    type: ECS
  healthCheckGracePeriodSeconds: 60
  loadBalancers:
    - containerName: trie
      containerPort: 8080
      targetGroupArn: !Ref TargetGroup
```

Monitoring and Evaluation
Key Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| P50 latency | <20ms | >50ms |
| P99 latency | <100ms | >200ms |
| Error rate | <0.01% | >0.1% |
| Cache hit rate (CDN) | >50% | <30% |
| Cache hit rate (Redis) | >80% | <60% |
| Trie memory usage | <80% | >90% |
Business Metrics
| Metric | Definition | Target |
|---|---|---|
| Suggestion CTR | Clicks on suggestions / Total suggestions shown | >30% |
| Mean Reciprocal Rank (MRR) | 1/position of clicked suggestion, averaged over sessions (see the sketch below) | >0.5 |
| Query completion rate | Searches using suggestion / Total searches | >40% |
| Keystrokes saved | Avg chars typed before selecting suggestion | >50% |
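A sketch of how the ranking-quality metrics could be computed from click logs (field semantics assumed: 1-based click position, `None` when no suggestion was clicked):

```python
def mean_reciprocal_rank(click_positions: list) -> float:
    """MRR across sessions; None means no suggestion was clicked."""
    if not click_positions:
        return 0.0
    return sum(1 / p for p in click_positions if p) / len(click_positions)

def suggestion_ctr(clicks: int, impressions: int) -> float:
    """Clicks on suggestions / total suggestion lists shown."""
    return clicks / impressions if impressions else 0.0

# Three sessions: clicked position 1, position 3, and no click
assert round(mean_reciprocal_rank([1, 3, None]), 3) == 0.444
```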
Observability Stack
```yaml
monitors:
  - name: Autocomplete P99 Latency
    type: metric alert
    query: "avg(last_5m):p99:autocomplete.latency{*} > 100"
    message: "P99 latency exceeded 100ms threshold"

  - name: Trie Service Error Rate
    type: metric alert
    query: "sum(last_5m):sum:autocomplete.errors{*}.as_rate() / sum:autocomplete.requests{*}.as_rate() > 0.001"
    message: "Error rate exceeded 0.1%"

  - name: Cache Hit Rate Drop
    type: metric alert
    query: "avg(last_15m):autocomplete.cache.hit_rate{layer:redis} < 0.6"
    message: "Redis cache hit rate below 60%"
```

Conclusion
This design delivers sub-100ms autocomplete at scale through:
- Pre-computed top-K suggestions at each trie node, eliminating runtime ranking
- Prefix-based sharding with dynamic rebalancing for even load distribution
- Multi-layer caching (browser → CDN → Redis → trie) achieving >90% cache hit rate
- Dual-path indexing combining weekly batch rebuilds with streaming hot updates
Key tradeoffs accepted:
- Memory over compute: ~200GB RAM per shard for O(p) query latency
- Staleness over freshness: 1-hour maximum for non-trending suggestions
- Generic over personalized: Personalization adds 10-50ms latency
Known limitations:
- Fuzzy matching requires additional infrastructure (not in core trie)
- Cross-language suggestions need separate tries per language
- Cold start after deployment requires pre-warming from snapshots
Alternative approaches not chosen:
- Pure inverted index: Higher latency (10-50ms vs <10ms)
- Machine learning ranking: Adds latency, diminishing returns for autocomplete
- Real-time personalization: Latency cost outweighs relevance benefit for most use cases
Appendix
Prerequisites
- Distributed systems fundamentals (sharding, replication, consistency)
- Data structures: tries, hash maps, sorted sets
- Basic understanding of caching strategies (TTL, cache invalidation)
- Familiarity with stream processing concepts (Kafka, event sourcing)
Terminology
| Term | Definition |
|---|---|
| Trie | Tree structure for prefix-based string storage and retrieval |
| FST | Finite State Transducer—compressed automaton for term dictionaries |
| Top-K | Pre-computed list of K highest-scoring suggestions at a trie node |
| QAC | Query Auto-Completion—suggesting queries, not documents |
| MRR | Mean Reciprocal Rank—evaluation metric for ranked results |
| Edge n-gram | Tokenization that generates prefixes at index time |
| Fan-out | Pattern of distributing data/computation across multiple nodes |
Summary
- Trie with pre-computed top-K provides O(p) query latency independent of corpus size
- Prefix-based sharding enables horizontal scaling with data locality benefits
- 300ms client debouncing reduces API calls by 90%+ without perceived latency
- Multi-layer caching (browser, CDN, Redis) handles >90% of traffic before hitting trie services
- Weekly batch + streaming hot updates balances freshness with stability
- P99 < 100ms is achievable with proper caching and pre-computation
References
- LinkedIn Engineering: Cleo - Open Source Typeahead - Production architecture serving <1ms generic suggestions
- Twitter Engineering: Typeahead.js - Client-side typeahead library design
- Microsoft Research: Personalized Query Auto-Completion - Personalization impact on MRR (+9.42%)
- Elasticsearch: Autocomplete Search - Completion suggester and edge n-gram comparison
- Mike McCandless: FSTs in Lucene - Finite State Transducer internals
- Bing Search Blog: Autosuggest Deep Dive - Parallel processing for suggestions
- CSS-Tricks: Debouncing and Throttling - Client-side optimization patterns