Design Search Autocomplete: Prefix Matching at Scale
A system design for search autocomplete (typeahead) covering prefix data structures, ranking algorithms, distributed architecture, and sub-100ms latency requirements. This design addresses the challenge of returning relevant suggestions within the user’s typing cadence—typically under 100ms—while handling billions of queries daily.
Abstract
Search autocomplete is a prefix-completion problem where every millisecond matters—users expect suggestions before they finish typing. The core mental model:
- Data structure choice determines latency floor: Tries provide O(p) prefix lookup where p is prefix length, independent of corpus size. Finite State Transducers (FST) compress this further for in-memory serving.
- Pre-computed top-K eliminates runtime ranking: Store the top 5-10 suggestions at each trie node. Query becomes pure traversal, no sorting.
- Freshness requires a dual-path architecture: Weekly batch rebuilds for stable rankings + streaming hot updates for trending queries.
- Client-side debouncing is mandatory: Without 300ms debounce, typing “javascript” generates 10 requests in about 1.5 seconds. With debounce: 1 request.
- Personalization adds latency: Generic suggestions serve from cache in <5ms. Personalized suggestions require user context lookup, adding 10-50ms.
The fundamental tradeoff: latency vs. relevance. Pre-computed suggestions are fast but stale. Real-time ranking is relevant but slow. Production systems layer both.
Requirements
Functional Requirements
| Requirement | Priority | Notes |
|---|---|---|
| Return suggestions for partial query | Core | Primary feature |
| Rank by relevance (popularity, freshness) | Core | Not just alphabetical |
| Support trending/breaking queries | Core | News events, viral content |
| Personalized suggestions | Extended | Based on user history |
| Spell correction / fuzzy matching | Extended | Handle typos |
| Multi-language support | Extended | Unicode, RTL scripts |
Out of scope: Full document search (separate system), voice input, image search.
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Latency | P99 < 100ms | User typing cadence ~150ms between keystrokes |
| Availability | 99.99% | User-facing, affects search engagement |
| Throughput | ~1.2M QPS peak (~405K average) | Based on scale estimation below |
| Suggestion freshness | < 1 hour for trending | Breaking news must surface quickly |
| Consistency | Eventual (< 5 min) | Acceptable for suggestions |
Scale Estimation
Assumptions (Google-scale reference):
- Daily Active Users (DAU): 1 billion
- Searches per user per day: 5
- Characters per search: 20 average
- Autocomplete triggers: Every 2-3 characters after minimum prefix
Traffic calculation:
Queries/day = 1B users × 5 searches × (20 chars / 3 chars per trigger) = 1B × 5 × 7 triggers = 35 billion autocomplete requests/day
QPS average = 35B / 86,400 = ~405K QPS
QPS peak (3×) = ~1.2M QPS

Storage calculation:
- Unique queries to index: ~1 billion (estimated from query logs)
- Average query length: 25 bytes
- Metadata per query (score, timestamp): 16 bytes
- Raw storage: 1B × 41 bytes = 41 GB
- Trie overhead (pointers, node structure): 3-5×
- Trie storage: ~150-200 GB per replica

Bandwidth:
- Request size: ~50 bytes (prefix + metadata)
- Response size: ~500 bytes (10 suggestions with metadata)
- Ingress: 1.2M QPS × 50 bytes = 60 MB/s
- Egress: 1.2M QPS × 500 bytes = 600 MB/s

Design Paths
Path A: Trie-Based with Pre-computed Rankings
Best when:
- Suggestion corpus is bounded (millions, not billions of unique queries)
- Latency is critical (<50ms P99)
- Ranking signals are relatively stable (popularity-based)
Architecture:
- In-memory tries sharded by prefix range
- Top-K suggestions pre-computed at each node during indexing
- Weekly full rebuilds + hourly delta updates
Key characteristics:
- Query is pure traversal: O(p) where p = prefix length
- No runtime ranking computation
- Memory-intensive: entire trie must fit in RAM
Trade-offs:
- ✅ Sub-10ms query latency achievable
- ✅ Predictable, consistent performance
- ✅ Simple query path (no scoring logic)
- ❌ High memory footprint (~200GB per shard replica)
- ❌ Freshness limited by rebuild frequency
- ❌ Personalization requires separate lookup
Real-world example: LinkedIn’s Cleo serves generic typeahead in <1ms using pre-computed tries. Network-personalized suggestions take 15ms due to additional context lookups.
Path B: Inverted Index with Completion Suggester
Best when:
- Corpus is large and dynamic (e-commerce catalogs, document search)
- Need flexibility for different query types (prefix, fuzzy, phrase)
- Already running Elasticsearch/OpenSearch infrastructure
Architecture:
- Elasticsearch completion suggester using FST (Finite State Transducers)
- Edge n-gram tokenization for flexible matching
- Real-time indexing via Kafka
Key characteristics:
- FST provides compact in-memory representation
- Supports fuzzy matching and context filtering
- Index updates are near real-time
Trade-offs:
- ✅ Flexible query types (prefix, infix, fuzzy)
- ✅ Real-time updates without full rebuild
- ✅ Built-in sharding and replication
- ❌ Higher latency than pure trie (10-50ms typical)
- ❌ Index size 15-17x larger with edge n-gram analyzer
- ❌ Operational complexity of Elasticsearch cluster
Real-world example: Amazon product search uses inverted indexes with extensive metadata (ratings, sales rank) for ranking. Handles dynamic catalog updates in near real-time.
Path Comparison
| Factor | Path A: Trie | Path B: Inverted Index |
|---|---|---|
| Query latency | <10ms | 10-50ms |
| Memory efficiency | Lower (pointer overhead) | Higher (FST compression) |
| Update latency | Hours (batch) | Seconds (streaming) |
| Fuzzy matching | Requires separate structure | Native support |
| Sharding complexity | Manual prefix-based | Built-in (Elasticsearch) |
| Operational overhead | Custom infrastructure | Managed service available |
| Best for | High-QPS generic suggestions | Dynamic catalogs, flexible queries |
This Article’s Focus
This article focuses on Path A (Trie-Based) because:
- It represents the canonical autocomplete architecture used by Google, Twitter, and LinkedIn for their primary suggestion services
- It demonstrates fundamental prefix data structures that underpin even inverted-index implementations
- Sub-10ms latency is achievable, which Path B cannot match
Path B implementation details are covered in the “Elasticsearch Alternative” section under Low-Level Design.
High-Level Design
Component Overview
| Component | Responsibility | Technology |
|---|---|---|
| API Gateway | Rate limiting, authentication, routing | Kong, AWS API Gateway |
| Shard Router | Route prefix to correct trie shard | Custom service |
| Trie Service | Serve suggestions from in-memory trie | Custom service (Go/Rust) |
| Ranking Service | Re-rank with personalization signals | Custom service |
| Redis Cache | Cache hot prefixes and user history | Redis Cluster |
| Trie Builder | Build/update tries from query logs | Spark/Flink |
| Kafka | Stream query logs and trending signals | Apache Kafka |
| Object Storage | Store serialized trie snapshots | S3, HDFS |
Request Flow
A request travels from the client through the API Gateway (rate limiting, authentication) to the Shard Router, which forwards the prefix to the Trie Service shard that owns it. Hot prefixes are answered from the Redis cache before reaching the trie; requests carrying a user_id additionally pass through the Ranking Service for personalized re-ranking before the response returns to the client.
Sharding Strategy
Prefix-based sharding routes queries by first character(s):
```
Shard 1: a-f (covers ~25% of queries)
Shard 2: g-m (covers ~20% of queries)
Shard 3: n-s (covers ~35% of queries; 's' is most common)
Shard 4: t-z (covers ~20% of queries)
```

Why prefix-based over hash-based:
- Data locality: Related queries (“system”, “systems”, “systematic”) on same shard
- Prefix routing: Query “sys” deterministically routes to shard 3
- Range queries: Can aggregate suggestions across prefix ranges
Handling hotspots:
English letter frequency is uneven. The prefix “s” alone accounts for ~10% of queries. Solutions:
- Finer granularity: Split “s” into “sa-se”, “sf-sm”, “sn-sz”
- Dynamic rebalancing: Shard Map Manager monitors load and adjusts ranges
- Replication: More replicas for hot shards
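A minimal sketch of the prefix-based routing described above, assuming the static ranges shown earlier; RouteShard and the range table are illustrative names, and a production Shard Map Manager would load and rebalance these ranges dynamically:

```go
package autocomplete

import "strings"

// shardRanges maps the first character of a normalized prefix to a shard ID.
// Ranges mirror the example above; hot letters like "s" can be split further.
var shardRanges = []struct {
    lo, hi byte
    shard  int
}{
    {'a', 'f', 1},
    {'g', 'm', 2},
    {'n', 's', 3},
    {'t', 'z', 4},
}

// RouteShard returns the shard responsible for a query prefix.
// Empty or non-alphabetic prefixes fall back to a default shard.
func RouteShard(prefix string) int {
    p := strings.ToLower(strings.TrimSpace(prefix))
    if len(p) == 0 {
        return 1
    }
    c := p[0]
    for _, r := range shardRanges {
        if c >= r.lo && c <= r.hi {
            return r.shard
        }
    }
    return 1 // default shard for digits, punctuation, non-ASCII first bytes
}
```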
API Design
Suggestion Endpoint
```http
GET /api/v1/suggestions?q={prefix}&limit={n}&lang={code}
```

Request Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| q | string | Yes | - | Query prefix (min 2 chars) |
| limit | int | No | 10 | Max suggestions (1-20) |
| lang | string | No | en | Language code (ISO 639-1) |
| user_id | string | No | - | For personalized suggestions |
| context | string | No | - | Search context (web, images, news) |
Response (200 OK):
{ "query": "prog", "suggestions": [ { "text": "programming", "score": 0.95, "type": "query", "metadata": { "category": "technology", "trending": false } }, { "text": "progress", "score": 0.87, "type": "query", "metadata": { "category": "general", "trending": false } }, { "text": "program download", "score": 0.82, "type": "query", "metadata": { "category": "technology", "trending": true } } ], "took_ms": 8, "cache_hit": false}Error Responses:
| Status | Condition | Response |
|---|---|---|
| 400 | Prefix too short (<2 chars) | {"error": "prefix_too_short", "min_length": 2} |
| 429 | Rate limit exceeded | {"error": "rate_limited", "retry_after": 60} |
| 503 | Service overloaded | {"error": "service_unavailable"} |
Rate Limits:
- Anonymous: 100 requests/minute per IP
- Authenticated: 1000 requests/minute per user
- Burst: 20 requests/second max
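As a sketch of how a gateway instance might enforce the anonymous tier, here is an in-process per-IP limiter built on golang.org/x/time/rate; the names (ipLimiter, Allow) are illustrative, and a real deployment would enforce limits in the managed gateway or a shared store so they hold across instances:

```go
package gateway

import (
    "sync"

    "golang.org/x/time/rate"
)

// ipLimiter enforces the anonymous tier from the list above:
// ~100 requests/minute sustained with a burst allowance of 20.
type ipLimiter struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
}

func newIPLimiter() *ipLimiter {
    return &ipLimiter{limiters: make(map[string]*rate.Limiter)}
}

// Allow reports whether the request from ip should be admitted.
func (l *ipLimiter) Allow(ip string) bool {
    l.mu.Lock()
    lim, ok := l.limiters[ip]
    if !ok {
        // 100 req/min sustained, burst of 20
        lim = rate.NewLimiter(rate.Limit(100.0/60.0), 20)
        l.limiters[ip] = lim
    }
    l.mu.Unlock()
    return lim.Allow()
}
```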
Response Optimization
Compression: Enable gzip for responses >1KB. Typical 10-suggestion response compresses from 500 bytes to ~200 bytes.
Cache Headers:
```http
Cache-Control: public, max-age=300
ETag: "a1b2c3d4"
Vary: Accept-Encoding, Accept-Language
```

Pagination: Not applicable; autocomplete returns a bounded set (≤20). If more results are needed, the user should submit a full search.
Data Modeling
Trie Node Structure
```typescript
interface TrieNode {
  children: Map<string, TrieNode>  // Character -> child node
  isEndOfWord: boolean
  topSuggestions: Suggestion[]     // Pre-computed top-K
  frequency: number                // Aggregate frequency for this prefix
}

interface Suggestion {
  text: string          // Full query text
  score: number         // Normalized relevance score [0, 1]
  frequency: number     // Raw query count
  lastUpdated: number   // Unix timestamp
  trending: boolean     // Recently spiking
  metadata: {
    category?: string
    language: string
  }
}
```

Storage Schema
Query Log (Kafka topic: query-logs):
{ "query": "programming tutorials", "timestamp": 1706918400, "user_id": "u123", // Hashed, optional "session_id": "s456", "result_clicked": true, "position_clicked": 2, "locale": "en-US", "platform": "web"}Aggregated Query Stats (Redis Hash):
```
HSET "query:programming tutorials" frequency 1542389 last_seen 1706918400 trending 0 category technology
```

Trie Snapshot (S3/HDFS):
```
s3://autocomplete-data/tries/
└── 2024-02-03/
    ├── shard-a-f.trie.gz   (50GB compressed)
    ├── shard-g-m.trie.gz   (40GB compressed)
    ├── shard-n-s.trie.gz   (60GB compressed)
    ├── shard-t-z.trie.gz   (40GB compressed)
    └── manifest.json
```

Database Selection
| Data | Store | Rationale |
|---|---|---|
| Live trie | In-memory (custom) | Sub-ms traversal required |
| Hot prefix cache | Redis Cluster | <1ms lookups, TTL support |
| Query logs | Kafka → S3 | Streaming ingestion, durable storage |
| Trie snapshots | S3/HDFS | Large files, versioned, cross-region replication |
| User history | DynamoDB/Cassandra | Key-value access pattern, high write throughput |
| Trending signals | Redis Sorted Set | Real-time top-K with scores |
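For the trending-signals row above, a sketch of the Redis sorted-set pattern using go-redis; the key name trending:hourly and the helper names are assumptions:

```go
package trending

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// RecordQuery bumps a query's score in the real-time trending set.
func RecordQuery(ctx context.Context, rdb *redis.Client, query string) error {
    return rdb.ZIncrBy(ctx, "trending:hourly", 1, query).Err()
}

// TopTrending returns the k highest-scored queries with their scores.
func TopTrending(ctx context.Context, rdb *redis.Client, k int64) ([]redis.Z, error) {
    return rdb.ZRevRangeWithScores(ctx, "trending:hourly", 0, k-1).Result()
}
```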
Low-Level Design
Trie Implementation
Design decision: Hash map vs. array children
| Approach | Lookup | Memory | Best for |
|---|---|---|---|
| Array[26] | O(1) | 26 pointers/node | Dense tries, ASCII only |
| Array[128] | O(1) | 128 pointers/node | Full ASCII |
| HashMap | O(1) avg | Variable | Sparse tries, Unicode |
Chosen: HashMap because:
- Unicode support required (multi-language)
- Most nodes have <5 children (sparse)
- Modern hash maps have near-constant lookup
```go
package autocomplete

import (
    "sort"
    "sync"
)

const TopK = 10

type TrieNode struct {
    children       map[rune]*TrieNode
    isEnd          bool
    topSuggestions []Suggestion
    mu             sync.RWMutex
}

type Suggestion struct {
    Text      string
    Score     float64
    Frequency int64
    Trending  bool
}

func (t *TrieNode) Search(prefix string) []Suggestion {
    t.mu.RLock()
    defer t.mu.RUnlock()

    node := t
    for _, char := range prefix {
        child, exists := node.children[char]
        if !exists {
            return nil // No suggestions for this prefix
        }
        node = child
    }
    // Return pre-computed top-K at this node
    return node.topSuggestions
}

func (t *TrieNode) Insert(word string, score float64) {
    t.mu.Lock()
    defer t.mu.Unlock()
    // ... insertion logic with top-K update propagation
}

// BuildTopK propagates top-K suggestions up the trie.
// Called during index build, not at query time.
func (t *TrieNode) BuildTopK() {
    // Post-order traversal: build children first
    for _, child := range t.children {
        child.BuildTopK()
    }

    // Collect all suggestions from children + this node
    var candidates []Suggestion
    if t.isEnd {
        candidates = append(candidates, t.topSuggestions...)
    }
    for _, child := range t.children {
        candidates = append(candidates, child.topSuggestions...)
    }

    // Keep top K by score
    sort.Slice(candidates, func(i, j int) bool {
        return candidates[i].Score > candidates[j].Score
    })
    if len(candidates) > TopK {
        candidates = candidates[:TopK]
    }
    t.topSuggestions = candidates
}
```

Why pre-compute top-K at each node:
Without pre-computation, returning suggestions for prefix “p” requires traversing the entire subtree (potentially millions of nodes). With pre-computed top-K:
- Query time: O(p) where p = prefix length
- No subtree traversal
- Predictable latency regardless of prefix popularity
Trade-off: Index build time increases (top-K must be propagated up the trie), but query time drops from O(p + n), where n is the number of nodes in the prefix's subtree, to O(p).
Ranking Algorithm
Scoring formula:
Score = w1 × Popularity + w2 × Freshness + w3 × Trending + w4 × Personalization

Default weights (generic suggestions):
| Signal | Weight | Calculation |
|---|---|---|
| Popularity | 0.5 | log(frequency) / log(max_frequency) |
| Freshness | 0.2 | 1 - (days_since_last_search / 30) |
| Trending | 0.2 | 1.0 if spiking else 0.0 |
| Personalization | 0.1 | 1.0 if in user history else 0.0 |
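A minimal Go sketch of this weighted formula, assuming the signal inputs are computed as in the table; the constant and function names are illustrative:

```go
package ranking

import "math"

// Weights for generic (non-personalized) suggestions, per the table above.
const (
    wPopularity      = 0.5
    wFreshness       = 0.2
    wTrending        = 0.2
    wPersonalization = 0.1
)

// Score combines the four signals into a relevance score in [0, 1].
func Score(frequency, maxFrequency int64, daysSinceLastSearch float64, trending, inUserHistory bool) float64 {
    // Popularity: log(frequency) / log(max_frequency)
    popularity := 0.0
    if maxFrequency > 1 && frequency > 0 {
        popularity = math.Log(float64(frequency)) / math.Log(float64(maxFrequency))
    }

    // Freshness: 1 - (days_since_last_search / 30), clamped at 0
    freshness := 1.0 - daysSinceLastSearch/30.0
    if freshness < 0 {
        freshness = 0
    }

    trendingSignal := 0.0
    if trending {
        trendingSignal = 1.0
    }
    personalization := 0.0
    if inUserHistory {
        personalization = 1.0
    }

    return wPopularity*popularity + wFreshness*freshness +
        wTrending*trendingSignal + wPersonalization*personalization
}
```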
Trending detection:
A query is “trending” if its frequency in the last hour exceeds 3× its average hourly frequency over the past week.
```python
import redis
from datetime import datetime, timedelta

r = redis.Redis()

def is_trending(query: str) -> bool:
    now = datetime.utcnow()
    hour_key = f"freq:{query}:{now.strftime('%Y%m%d%H')}"

    # Current hour frequency
    current = int(r.get(hour_key) or 0)

    # Average hourly frequency over past week
    total = 0
    for i in range(1, 169):  # 168 hours = 1 week
        past_hour = now - timedelta(hours=i)
        past_key = f"freq:{query}:{past_hour.strftime('%Y%m%d%H')}"
        total += int(r.get(past_key) or 0)

    avg = total / 168
    return current > 3 * avg if avg > 0 else current > 100

# Example usage in ranking service
def rank_suggestions(suggestions, user_id=None):
    for s in suggestions:
        s.trending = is_trending(s.text)
        s.score = calculate_score(s, user_id)
    return sorted(suggestions, key=lambda x: x.score, reverse=True)
```

Elasticsearch Alternative (Path B)
For teams preferring managed infrastructure, Elasticsearch’s completion suggester provides a viable alternative:
Index mapping:
{ "mappings": { "properties": { "suggest": { "type": "completion", "analyzer": "simple", "preserve_separators": true, "preserve_position_increments": true, "max_input_length": 50, "contexts": [ { "name": "category", "type": "category" }, { "name": "location", "type": "geo", "precision": 4 } ] }, "query_text": { "type": "text" },6 collapsed lines
"frequency": { "type": "long" } } }}Query:
{ "suggest": { "query-suggest": { "prefix": "prog", "completion": { "field": "suggest", "size": 10, "skip_duplicates": true, "fuzzy": { "fuzziness": 1 }, "contexts": { "category": ["technology"] } } } }}Performance characteristics:
- Latency: 10-30ms typical (vs. <10ms for custom trie)
- Fuzzy matching: Built-in with configurable edit distance
- Index size: 15-17× larger with edge n-gram analyzer
- Operational: Managed service available (AWS OpenSearch, Elastic Cloud)
When to choose Elasticsearch:
- Already running ES for document search
- Need fuzzy matching without additional infrastructure
- Corpus changes frequently (near real-time indexing)
- Team lacks expertise for custom trie infrastructure
Frontend Considerations
Debouncing Strategy
Problem: Without debouncing, typing “javascript” at normal speed (150ms between keystrokes) generates 10 API requests in 1.5 seconds.
Solution: Debounce with 300ms delay—only send request after 300ms of no typing.
```typescript
import { useState, useCallback, useRef, useEffect } from "react"

const DEBOUNCE_MS = 300
const MIN_PREFIX_LENGTH = 2

export function useAutocomplete() {
  const [query, setQuery] = useState("")
  const [suggestions, setSuggestions] = useState<string[]>([])
  const [isLoading, setIsLoading] = useState(false)
  const abortControllerRef = useRef<AbortController | null>(null)
  const timeoutRef = useRef<number | null>(null)

  const fetchSuggestions = useCallback(async (prefix: string) => {
    // Cancel previous request
    abortControllerRef.current?.abort()
    abortControllerRef.current = new AbortController()

    setIsLoading(true)
    try {
      const response = await fetch(
        `/api/v1/suggestions?q=${encodeURIComponent(prefix)}&limit=10`,
        { signal: abortControllerRef.current.signal },
      )
      const data = await response.json()
      setSuggestions(data.suggestions.map((s: any) => s.text))
    } catch (error) {
      if (error.name !== "AbortError") {
        console.error("Autocomplete error:", error)
      }
    } finally {
      setIsLoading(false)
    }
  }, [])

  const handleInputChange = useCallback(
    (value: string) => {
      setQuery(value)

      // Clear pending debounce
      if (timeoutRef.current) {
        clearTimeout(timeoutRef.current)
      }

      // Don't fetch for short prefixes
      if (value.length < MIN_PREFIX_LENGTH) {
        setSuggestions([])
        return
      }

      // Debounce the API call
      timeoutRef.current = setTimeout(() => {
        fetchSuggestions(value)
      }, DEBOUNCE_MS)
    },
    [fetchSuggestions],
  )

  // Cleanup on unmount
  useEffect(() => {
    return () => {
      timeoutRef.current && clearTimeout(timeoutRef.current)
      abortControllerRef.current?.abort()
    }
  }, [])

  return { query, suggestions, isLoading, handleInputChange }
}
```

Key implementation details:
- AbortController: Cancel in-flight requests when user types more
- Minimum prefix: Don’t fetch for 1-character prefixes (too broad)
- Cleanup: Clear timeouts and abort requests on unmount
Keyboard Navigation
Autocomplete dropdowns must support keyboard navigation for accessibility (WCAG 2.1 compliance):
| Key | Action |
|---|---|
| ↓ / ↑ | Navigate suggestions |
| Enter | Select highlighted suggestion |
| Escape | Close dropdown |
| Tab | Select and move to next field |
ARIA attributes required:
```html
<input
  role="combobox"
  aria-expanded="true"
  aria-controls="suggestions-list"
  aria-activedescendant="suggestion-2"
/>
<ul id="suggestions-list" role="listbox">
  <li id="suggestion-0" role="option">programming</li>
  <li id="suggestion-1" role="option">progress</li>
  <li id="suggestion-2" role="option" aria-selected="true">program</li>
</ul>
```

Optimistic UI Updates
For frequently-typed prefixes, show cached suggestions immediately while fetching fresh results:
```typescript
const handleInputChange = (value: string) => {
  // Show cached results immediately (optimistic)
  const cached = localCache.get(value)
  if (cached) {
    setSuggestions(cached)
  }

  // Fetch fresh results in background
  fetchSuggestions(value).then((fresh) => {
    setSuggestions(fresh)
    localCache.set(value, fresh)
  })
}
```

Indexing Pipeline
Pipeline Architecture
Query logs stream through Kafka into the Trie Builder (Spark/Flink), which aggregates and scores queries, builds trie shards, and writes serialized snapshots to object storage (S3/HDFS). Serving instances load new snapshots via the blue-green deployment described under Infrastructure, while trending signals flow into Redis in near real time.
Batch vs. Streaming Updates
| Aspect | Batch (Weekly) | Streaming (Real-time) |
|---|---|---|
| Latency | Hours | Seconds |
| Completeness | Full corpus | Incremental deltas |
| Compute cost | Higher | Lower |
| Use case | Stable rankings | Trending queries |
Dual-path approach:
- Weekly batch: Complete trie rebuild from all query logs
- Hourly hot updates: Merge trending queries into existing trie
- Real-time streaming: Update trending flags and frequency counters
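A sketch of the hourly hot-update path, reusing the TrieNode Insert and BuildTopK methods shown earlier; HotUpdate and ApplyHotUpdates are illustrative names, and the update feed would really come from the trending Kafka topic rather than a slice:

```go
package autocomplete

// HotUpdate is one trending query to merge into the live trie between rebuilds.
type HotUpdate struct {
    Text  string
    Score float64
}

// ApplyHotUpdates inserts or rescores trending queries in place, then
// re-propagates the pre-computed top-K lists so new entries can surface
// at ancestor prefixes. A full BuildTopK pass is the simple (if blunt)
// approach; an optimized version would only touch ancestors of updated words.
func ApplyHotUpdates(root *TrieNode, updates []HotUpdate) {
    for _, u := range updates {
        root.Insert(u.Text, u.Score)
    }
    root.BuildTopK()
}
```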
Aggregation Job
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, max as spark_max, when

spark = SparkSession.builder.appName("QueryAggregation").getOrCreate()

# Read query logs from S3
query_logs = spark.read.parquet("s3://query-logs/2024/02/")

# Filter and aggregate
aggregated = (
    query_logs
    .filter(col("query").isNotNull())
    .filter(col("query") != "")
    .filter(~col("query").rlike(r"[^\w\s]"))  # Remove special chars
    .groupBy("query")
    .agg(
        count("*").alias("frequency"),
        spark_max("timestamp").alias("last_seen"),
        # Trending: frequency in last 24h > 3x avg
        when(
            col("recent_freq") > 3 * col("avg_freq"), True
        ).otherwise(False).alias("trending")
    )
    .filter(col("frequency") >= 10)  # Min frequency threshold
    .orderBy(col("frequency").desc())
)

# Normalize scores
max_freq = aggregated.agg(spark_max("frequency")).collect()[0][0]
scored = aggregated.withColumn(
    "score",
    col("frequency") / max_freq * 0.5 +          # Popularity
    when(col("trending"), 0.2).otherwise(0.0)    # Trending boost
)

# Write output for trie builder
scored.write.parquet("s3://autocomplete-data/aggregated/2024-02-03/")
```

Trie Serialization
For efficient storage and loading, tries are serialized to a compact binary format:
```go
package autocomplete

import (
    "compress/gzip"
    "encoding/gob"
    "os"
)

// Note: gob encodes only exported struct fields, so the node type used for
// serialization must expose its fields (unlike the lowercase fields in the
// in-memory TrieNode shown earlier).
func (t *TrieNode) Serialize(path string) error {
    file, err := os.Create(path)
    if err != nil {
        return err
    }
    defer file.Close()

    gzWriter := gzip.NewWriter(file)
    defer gzWriter.Close()

    encoder := gob.NewEncoder(gzWriter)
    return encoder.Encode(t)
}

func LoadTrie(path string) (*TrieNode, error) {
    file, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    gzReader, err := gzip.NewReader(file)
    if err != nil {
        return nil, err
    }
    defer gzReader.Close()

    var trie TrieNode
    decoder := gob.NewDecoder(gzReader)
    if err := decoder.Decode(&trie); err != nil {
        return nil, err
    }
    return &trie, nil
}
```

Compression ratios:
| Format | Size | Load Time |
|---|---|---|
| Raw (JSON) | 200 GB | 10 min |
| Gob | 80 GB | 4 min |
| Gob + Gzip | 30 GB | 6 min |
Chosen: Gob + Gzip for storage efficiency. Load time overhead acceptable for weekly rebuilds.
Caching Strategy
Multi-Layer Cache
| Layer | TTL | Hit Rate | Latency |
|---|---|---|---|
| Browser | 1 hour | 30-40% | 0ms |
| CDN/Edge | 5 min | 50-60% | <5ms |
| Redis (hot prefixes) | 10 min | 80-90% | <2ms |
| Trie (in-memory) | N/A | 100% | <10ms |
Browser Caching
```http
HTTP/1.1 200 OK
Cache-Control: public, max-age=3600
ETag: "v1-abc123"
Vary: Accept-Encoding
```

- 1-hour TTL balances freshness and hit rate
- ETag enables conditional requests for stale cache revalidation
- Vary header ensures proper cache keying by encoding
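A sketch of a handler that emits these headers and honors conditional revalidation; the writeCached helper is illustrative, and a real handler would also negotiate compression:

```go
package api

import (
    "crypto/sha1"
    "encoding/hex"
    "net/http"
)

// writeCached sets the cache headers described above and honors conditional
// requests: if the client's If-None-Match matches the current ETag, we return
// 304 Not Modified and skip the body.
func writeCached(w http.ResponseWriter, r *http.Request, body []byte) {
    sum := sha1.Sum(body)
    etag := `"` + hex.EncodeToString(sum[:]) + `"`

    w.Header().Set("Cache-Control", "public, max-age=3600")
    w.Header().Set("ETag", etag)
    w.Header().Set("Vary", "Accept-Encoding, Accept-Language")

    if r.Header.Get("If-None-Match") == etag {
        w.WriteHeader(http.StatusNotModified)
        return
    }
    w.Write(body)
}
```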
CDN Configuration
```yaml
# Cloudflare Page Rules
rules:
  - match: "/api/v1/suggestions*"
    actions:
      cache_level: cache_everything
      edge_cache_ttl: 300       # 5 minutes
      browser_cache_ttl: 3600   # 1 hour
      cache_key:
        include_query_string: true
```

Cache key design:
Include full query string in cache key. /suggestions?q=prog and /suggestions?q=progress must cache separately.
Redis Hot Prefix Cache
```python
import json

import redis

# Cluster-mode client (redis-py exposes RedisCluster for clustered deployments)
r = redis.RedisCluster(host="redis-cluster", port=6379)

def get_suggestions(prefix: str) -> list:
    """Check Redis cache first, fall back to trie."""
    cache_key = f"suggest:{prefix}"

    # Try cache
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss - query trie
    suggestions = query_trie(prefix)

    # Cache hot prefixes (high frequency)
    if is_hot_prefix(prefix):
        r.setex(cache_key, 600, json.dumps(suggestions))  # 10 min TTL

    return suggestions

def is_hot_prefix(prefix: str) -> bool:
    """Prefix is hot if queried >1000 times in last hour."""
    freq_key = f"freq:{prefix}:{current_hour()}"
    return int(r.get(freq_key) or 0) > 1000
```

Infrastructure
Cloud-Agnostic Architecture
| Component | Requirement | Open Source | Managed |
|---|---|---|---|
| Compute | Low-latency, auto-scaling | Kubernetes | EKS, GKE |
| Cache | Sub-ms reads, clustering | Redis | ElastiCache, MemoryStore |
| Message Queue | High throughput, durability | Kafka | MSK, Confluent Cloud |
| Object Storage | Durable, versioned | MinIO | S3, GCS |
| Stream Processing | Real-time aggregation | Flink, Spark | Kinesis, Dataflow |
AWS Reference Architecture
Service configuration:
| Service | Configuration | Cost Estimate |
|---|---|---|
| ECS Fargate | 10 tasks × 16 vCPU, 32GB RAM | $8,000/month |
| ElastiCache | r6g.xlarge × 6 nodes (cluster) | $2,500/month |
| CloudFront | 1TB egress/day | $1,500/month |
| S3 | 500GB storage | $12/month |
| EMR | m5.xlarge × 10 (weekly job) | $200/month |
Total estimated cost: ~$12,000/month for 500K QPS capacity
Deployment Strategy
Blue-green deployment for trie updates:
- Build new trie version in offline cluster
- Load into “green” service instances
- Smoke test green cluster
- Switch load balancer to green
- Keep blue running for 1 hour (rollback capability)
- Terminate blue
Rolling updates for code changes:
```yaml
# ECS Service Definition
apiVersion: ecs/v1
kind: Service
metadata:
  name: trie-service
spec:
  desiredCount: 10
  deploymentConfiguration:
    maximumPercent: 150
    minimumHealthyPercent: 100
  deploymentController:
    type: ECS
  healthCheckGracePeriodSeconds: 60
  loadBalancers:
    - containerName: trie
      containerPort: 8080
      targetGroupArn: !Ref TargetGroup
```

Monitoring and Evaluation
Key Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| P50 latency | <20ms | >50ms |
| P99 latency | <100ms | >200ms |
| Error rate | <0.01% | >0.1% |
| Cache hit rate (CDN) | >50% | <30% |
| Cache hit rate (Redis) | >80% | <60% |
| Trie memory usage | <80% | >90% |
Business Metrics
| Metric | Definition | Target |
|---|---|---|
| Suggestion CTR | Clicks on suggestions / Total suggestions shown | >30% |
| Mean Reciprocal Rank (MRR) | 1/position of clicked suggestion | >0.5 |
| Query completion rate | Searches using suggestion / Total searches | >40% |
| Keystrokes saved | Share of query characters not typed because a suggestion was selected | >50% |
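As a sketch of how MRR could be computed offline from the position_clicked field in the query logs, assuming 1-based positions and 0 for "no suggestion clicked" (MeanReciprocalRank is an illustrative helper):

```go
package metrics

// MeanReciprocalRank averages the reciprocal of the 1-based position of the
// clicked suggestion across impressions; impressions with no click contribute 0.
func MeanReciprocalRank(clickedPositions []int) float64 {
    if len(clickedPositions) == 0 {
        return 0
    }
    var sum float64
    for _, pos := range clickedPositions {
        if pos > 0 {
            sum += 1.0 / float64(pos)
        }
    }
    return sum / float64(len(clickedPositions))
}
```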
Observability Stack
```yaml
monitors:
  - name: Autocomplete P99 Latency
    type: metric alert
    query: "avg(last_5m):p99:autocomplete.latency{*} > 100"
    message: "P99 latency exceeded 100ms threshold"

  - name: Trie Service Error Rate
    type: metric alert
    query: "sum(last_5m):sum:autocomplete.errors{*}.as_rate() / sum:autocomplete.requests{*}.as_rate() > 0.001"
    message: "Error rate exceeded 0.1%"

  - name: Cache Hit Rate Drop
    type: metric alert
    query: "avg(last_15m):autocomplete.cache.hit_rate{layer:redis} < 0.6"
    message: "Redis cache hit rate below 60%"
```

Conclusion
This design delivers sub-100ms autocomplete at scale through:
- Pre-computed top-K suggestions at each trie node, eliminating runtime ranking
- Prefix-based sharding with dynamic rebalancing for even load distribution
- Multi-layer caching (browser → CDN → Redis → trie) achieving >90% cache hit rate
- Dual-path indexing combining weekly batch rebuilds with streaming hot updates
Key tradeoffs accepted:
- Memory over compute: ~200GB RAM per shard for O(p) query latency
- Staleness over freshness: 1-hour maximum for non-trending suggestions
- Generic over personalized: Personalization adds 10-50ms latency
Known limitations:
- Fuzzy matching requires additional infrastructure (not in core trie)
- Cross-language suggestions need separate tries per language
- Cold start after deployment requires pre-warming from snapshots
Alternative approaches not chosen:
- Pure inverted index: Higher latency (10-50ms vs <10ms)
- Machine learning ranking: Adds latency, diminishing returns for autocomplete
- Real-time personalization: Latency cost outweighs relevance benefit for most use cases
Appendix
Prerequisites
- Distributed systems fundamentals (sharding, replication, consistency)
- Data structures: tries, hash maps, sorted sets
- Basic understanding of caching strategies (TTL, cache invalidation)
- Familiarity with stream processing concepts (Kafka, event sourcing)
Terminology
| Term | Definition |
|---|---|
| Trie | Tree structure for prefix-based string storage and retrieval |
| FST | Finite State Transducer—compressed automaton for term dictionaries |
| Top-K | Pre-computed list of K highest-scoring suggestions at a trie node |
| QAC | Query Auto-Completion—suggesting queries, not documents |
| MRR | Mean Reciprocal Rank—evaluation metric for ranked results |
| Edge n-gram | Tokenization that generates prefixes at index time |
| Fan-out | Pattern of distributing data/computation across multiple nodes |
Summary
- Trie with pre-computed top-K provides O(p) query latency independent of corpus size
- Prefix-based sharding enables horizontal scaling with data locality benefits
- 300ms client debouncing reduces API calls by 90%+ without perceived latency
- Multi-layer caching (browser, CDN, Redis) handles >90% of traffic before hitting trie services
- Weekly batch + streaming hot updates balances freshness with stability
- P99 < 100ms is achievable with proper caching and pre-computation
References
- LinkedIn Engineering: Cleo - Open Source Typeahead - Production architecture serving <1ms generic suggestions
- Twitter Engineering: Typeahead.js - Client-side typeahead library design
- Microsoft Research: Personalized Query Auto-Completion - Personalization impact on MRR (+9.42%)
- Elasticsearch: Autocomplete Search - Completion suggester and edge n-gram comparison
- Mike McCandless: FSTs in Lucene - Finite State Transducer internals
- Bing Search Blog: Autosuggest Deep Dive - Parallel processing for suggestions
- CSS-Tricks: Debouncing and Throttling - Client-side optimization patterns