Distributed Search Engine
Search engines trade write-time complexity — building and maintaining inverted indexes — for read-time speed: sub-second queries over collections too large for a single machine. This article reconstructs that bargain end-to-end for senior engineers: how Lucene’s segment architecture works, why every production cluster picks document-based partitioning, what BM25 actually computes, how near-real-time indexing reconciles searchability with durability, and how scatter-gather queries scale (and where they break). The case studies at the end show the same primitives wired into Twitter, Uber, DoorDash, Yelp, and Airbnb.
Mental model
Five primitives explain every design decision in this article:
- Inverted index — the mapping
term → posting listthat turns full-text retrieval from O(corpus) into O(matching docs). Built from a term dictionary and per-term posting lists carrying doc IDs, term frequencies, and optionally positions. - Segment — an immutable mini-index. Lucene never edits a segment; it appends new ones and later merges them. Immutability is what enables lock-free reads, OS page-cache reuse, and crash-safe writes.
- Shard — a single Lucene index. A logical search index is partitioned into N primary shards plus M replicas per primary.
- Coordinator — the node that parses a query, scatters it to shards, gathers and re-ranks the results.
- Translog — a per-shard write-ahead log. Decouples searchability (refresh creates a new searchable segment in the filesystem cache) from durability (flush + fsync makes the segment crash-safe).
Throughout the article, “ES” refers to Elasticsearch and “OS” to OpenSearch; both are thin layers over Apache Lucene. Vendor-specific behavior is called out explicitly.
The inverted index
Why inverted indexes exist
A naive document scan is O(N · L) where N is the corpus size and L is the average document length. The inverted index flips that: instead of asking “which terms appear in document d?” it answers “which documents contain term t?”. Lookup becomes O(1) on the term, plus a streaming scan of the posting list — typically far smaller than the full corpus.
A posting list typically carries:
- Document IDs, ascending and delta-encoded so neighbouring values are small.
- Term frequency (TF) per document — how often the term appears.
- Positions — token offsets, required for phrase queries (
"distributed search"requires adjacent positions). - Payloads — optional per-position metadata (part-of-speech tags, weights, etc.).
Term dictionary: finite state transducers
Lucene stores the term dictionary as a Finite State Transducer (FST), a finite-state machine that maps each term to an opaque value (a block address into the dictionary file). FSTs share both common prefixes and common suffixes; for ["distributed", "distribution", "district"] the path through distri is reused.
The change to a memory-resident FST term dictionary landed under LUCENE-3069 in the 4.x line and was described in the Lucene 4.1 changes as “much more RAM efficient” than the previous fixed-gap index. Mike McCandless’s original write-up walks through the construction algorithm: “We share both prefixes and suffixes… this can be a big space savings.”
Note
The exact memory and lookup wins depend on vocabulary shape. Lucene’s own change log uses qualitative language; specific percentages floating around the web come from cherry-picked vocabularies. Treat them as illustrative, not as a budget.
The trade-off: FSTs are immutable. Adding a term means rebuilding the structure. This is fine in Lucene because segments themselves are immutable — the FST is built once per segment and discarded when the segment is merged.
Posting list compression
Raw posting lists would be enormous — billions of 32- or 64-bit integers per term. Compression is mandatory, and the most-cited survey is Pibiri & Venturini (2020).
| Encoding | Mechanism | Decoding | Compression | When to use |
|---|---|---|---|---|
| Variable-byte (VByte) | 7 data bits per byte; high bit signals continuation. | Byte-aligned, branchy. | Baseline. | Hardware without SIMD; simple implementations. |
| PForDelta | Pick k so most integers fit in k bits; encode larger values as exceptions (“patches”). |
SIMD-friendly block decoding. | ~30–40% better than VByte on skewed integer streams. | Skewed/zipfian streams; modern CPUs. |
| SIMD bit-packing (e.g. ForUtil) | Pack fixed blocks of 128 integers in k bits; decode with vector instructions. |
Decodes whole blocks in one pass. | Comparable to PForDelta with substantially higher decode throughput. | Modern x86/ARM with SIMD; hot scoring loops. |
Lucene’s ForUtil-based block decoding became the default in the 8.x line; SIMD-friendly decoding of postings was tracked under LUCENE-9027 and shipped in Lucene 8.4 (not 8.0), driven by Adrien Grand. Subsequent issues like #12396 push more of the prefix-sum and skip work into vector intrinsics.
Segment-based architecture
Lucene does not maintain a single mutable index. It writes immutable segments — each a complete mini-index with its own term dictionary and posting lists.
Why segments?
- Immutability enables OS-level caching. The kernel page cache holds segment pages indefinitely because nothing rewrites them.
- Lock-free reads. A search uses a point-in-time
SegmentReadersnapshot; writers and merges run concurrently without read locks. - Cheap writes. New documents accumulate in an in-memory buffer and flush as a new segment. Existing data is never rewritten.
- Tombstone deletes. Deletions are recorded in a
.livbitmap; documents are physically removed only during a later merge.
The cost is bookkeeping: every additional segment means an additional term dictionary in RAM, more file handles, and more skip-list cursors at query time. Merge policy is what keeps that cost bounded — see Segment merging below.
Index partitioning strategies
When an index exceeds a single machine’s resources you must partition it. Two strategies exist; one wins in practice.
Document-based (horizontal) partitioning
Each shard holds a complete inverted index over a subset of documents. Writes hash to one shard; queries fan out to all shards.
- Even write distribution: any shard accepts any document.
- Predictable query fanout: always all shards in the routing set.
- Predictable rebalancing: move documents (or split shards in ES 7+).
- Pays for it with high read fanout — every query touches every shard.
This is what Elasticsearch, OpenSearch, Solr, Twitter Earlybird, and Google web search all use 1.
Term-based (vertical) partitioning
Each shard holds posting lists for a subset of terms over all documents. A query for distributed AND search is routed to the shards owning those two terms; their posting lists are intersected (often by streaming the smaller list to the larger one’s owner).
- Selective queries hit fewer shards (good in theory).
- Severely uneven shard sizes: term frequencies are zipfian, so a few shards own enormous posting lists.
- Each indexed document touches one shard per term — write fan-out is O(unique terms in document).
- Rebalancing means moving posting lists across the network.
- Multi-term queries require cross-shard intersection, often shipping postings.
The theoretical fanout savings are real. They are dwarfed by write amplification, hot shards, and load-balancing complexity at scale. No mainstream production search engine uses pure term-based partitioning today.
Decision matrix
| Factor | Document-based | Term-based |
|---|---|---|
| Write cost per doc | 1 shard | O(unique terms) shards |
| Query fanout | All shards in routing set | Only shards owning query terms |
| Shard balance | Even by doc count | Severely zipfian by term frequency |
| Rebalancing | Move documents | Move posting lists |
| Used in production by | Elasticsearch, OpenSearch, Solr, Earlybird, Google | Effectively no one |
Shard sizing guidelines
Elastic’s sizing guidance crystallises the operational defaults:
| Knob | Default rule of thumb |
|---|---|
| Shard size | 10–50 GB (write-heavy workloads bias toward smaller; archival toward larger) |
| Documents per shard | Aim well below 200M |
| Shards per node | Roughly ≤ 20 per GB of JVM heap |
| Primary shards | Fixed at index creation pre-7.0; splittable via the split index API since 7.0 |
The intuition: smaller shards are cheap to recover and rebalance but multiply per-shard overhead (term dictionaries, file handles, coordinator merge work); larger shards minimize that overhead but recover slowly and limit parallelism.
Query processing and ranking
Boolean retrieval and skip lists
Boolean retrieval intersects or unions posting lists:
posting("distributed") = [1, 7, 42, 100, 250]posting("search") = [1, 7, 15, 42, 101]distributed AND search = [1, 7, 42]The naive intersection is O(|A| + |B|). For common_term AND rare_term, where |A| ≈ 10^7 and |B| ≈ 10^2, that wastes nearly all of A. Lucene posting lists embed skip lists: at fixed intervals (one skip per Lucene block — currently 128 docs in the Lucene90 postings format) the postings include the next document ID and a file offset. The intersection iterates the smaller list and advance()s through the larger one, dropping the cost to roughly O(|small| · log(|large|)).
TF-IDF: the foundation
TF-IDF weights term–document pairs by combining how often a term appears in a document with how rare the term is in the corpus.
Where
BM25: the production standard
Okapi BM25 extends TF-IDF with two corrections that matter in practice: term-frequency saturation and document-length normalization.
Where BM25Similarity defaults to
Two reasons BM25 beats raw TF-IDF:
- Term-frequency saturation.
is concave; doubling occurrences past the saturation point yields diminishing relevance. A term appearing 100 times is not 10× more relevant than 10 times. - Length normalization. A 100-word document matching a term is more relevant than a 1000-word document with the same hit. The
parameter controls how much length is penalised.
BM25 became the default Lucene similarity in Lucene 6.0 (April 2016), replacing the legacy vector-space DefaultSimilarity (now ClassicSimilarity). Elasticsearch and OpenSearch inherit that default.
| Variant | Modification | Use case |
|---|---|---|
| BM25F | Per-field weighted scoring | Structured documents (title, body, anchor text) |
| BM25+ | Floor on the length normalization term | Very short documents where standard BM25 over-penalises |
Learning to rank
BM25 is a fixed formula. Learning to rank (LTR) layers a model on top, using BM25 (and other features) as inputs.
| Approach | Training target | Pros | Cons |
|---|---|---|---|
| Pointwise | Predict an absolute relevance score per (query, doc) | Simple; reuses regression / classification stacks | Ignores ranking context — a doc’s score depends on its neighbours |
| Pairwise | Predict the relative order of (query, doc-A, doc-B) | Directly optimises ranking | O(n²) candidate pairs |
| Listwise | Optimise a list-level metric (e.g. NDCG) directly | Best ranking quality | Complex losses; slower training |
Production systems typically run BM25 (or another lexical signal) as a fast first-pass retriever, then re-rank the top-K with a more expensive model — gradient-boosted trees (XGBoost, LambdaMART) or, increasingly, neural rankers. The natural next step is to retrieve from a second, semantic, index in parallel — see Hybrid retrieval below.
Hybrid retrieval: lexical + dense vector
BM25 retrieves what was typed. Dense vectors retrieve what was meant. Modern relevance stacks run both lanes in parallel, fuse the result lists, then re-rank — because each lane fails on the queries the other handles best.
Why fuse instead of pick
| Query class | BM25 | Dense vector |
|---|---|---|
Exact identifier (SKU-219183, ENOENT) |
Wins — token-exact match | Loses — out-of-vocabulary or paraphrased away |
| Rare technical terms not in the embedding’s training set | Wins | Loses — the encoder has never seen the token |
Semantic / paraphrase (beach house → oceanfront cottage) |
Loses — no token overlap | Wins — vectors are close in embedding space |
| Long natural-language questions | Mixed — IDF dominated by stop-word tail | Wins — sentence-level semantics |
| Short keyword queries with synonyms | Mixed | Wins |
Hybrid retrieval is the empirically observed Pareto front, not a hedge.
Approximate-nearest-neighbour indexes
Exact k-NN over millions of vectors is O(N · d). Production hybrid stacks rely on ANN structures that trade a small recall loss for sublinear search.
| Index | Structure | Memory | Recall@k vs. exact | Build / train | When it shines |
|---|---|---|---|---|---|
| HNSW | Hierarchical small-world graph; greedy descent through layers | High — full graph in RAM | Highest at low latency; best Pareto curve at small/medium scale | No training; incremental insert | Default for ES dense_vector, OpenSearch Lucene engine, FAISS in-memory workloads |
| IVF (Inverted File) | k-means coarse quantizer; probe nprobe clusters |
Medium | Depends on nprobe |
k-means training pass | Disk-friendly; recall tunable per query |
| IVF-PQ | IVF + product quantization on residuals | Very low (32–128× smaller) | Drops noticeably under aggressive PQ | Trained; rebuilt on drift | Billion-scale deployments where RAM is the budget |
| ScaNN | Anisotropic vector quantization tuned for inner-product search | Medium | Top of the curve in Google’s benchmarks | Trained | Inner-product-heavy workloads where ScaNN is available |
Engines:
- FAISS (Meta) — the reference C++/CUDA library. Backs OpenSearch’s k-NN plugin (
faissengine) and many in-house systems. - Lucene HNSW — Lucene’s native HNSW codec. Powers Elasticsearch
dense_vector, OpenSearch’sluceneengine, and Solr’s vector field. - NMSLIB / hnswlib — predecessor to Lucene HNSW; still used in OpenSearch’s
nmslibengine path. - ScaNN (Google) — used in Vertex AI Matching Engine.
Note
HNSW recall is sensitive to the build-time parameters M (graph fan-out, default 16) and efConstruction (default 100–200) and the query-time efSearch (default 100). Higher values trade index size and latency for recall. Tune against a labelled set, not against feel.
Score fusion
The two lanes return lists on incompatible scales — BM25 is unbounded, cosine similarity is in [-1, 1]. Two fusion strategies are mainstream:
- Reciprocal Rank Fusion (RRF). Ignore raw scores; combine by rank:
with typically 60. Robust, requires no calibration, and is the default in Elasticsearch’s rrfretriever (GA in 8.16) and is one of the supported normalizations in OpenSearch’s hybrid query. - Linear combination. Min-max normalise each lane, weight by a tuned
: . Higher ceiling than RRF when you have judgment data; brittle when score distributions drift.
Production guidance: ship RRF first; switch to a tuned linear blend (or a learned fusion) only after you can measure NDCG / MRR on a stable judgment set.
Hybrid in Elasticsearch and OpenSearch
- Elasticsearch combines a
standard(BM25) retriever and aknnretriever inside anrrfretriever; sparse-vector retrieval (ELSER) plugs into the same pipeline. - OpenSearch uses a
hybridquery inside a search pipeline with a normalization processor; engines for the vector lane arelucene,faiss, ornmslib.
The serving pattern is the same: two retrievers, one fuser, optional cross-encoder re-ranker on the survivors. The cross-encoder is intentionally small and per-candidate — running it on every shard’s full index would re-introduce the cost ANN was meant to remove.
Multi-tenant isolation in hybrid stacks
Vector indexes inherit the same multi-tenant trade-offs as the lexical side:
- Per-tenant index — strongest isolation, highest fixed cost (one HNSW graph per tenant). Works up to hundreds of tenants; collapses past that under file-handle and recovery overhead.
- Shared index with tenant filter — one HNSW graph per shard, query-time filter on
tenant_id. Cheapest at scale; naive post-filtering kills recall when the filter is highly selective, so use Lucene’s filtered HNSW path which prunes during graph traversal. Filtered kNN landed in Lucene 9.1 and was further accelerated by ACORN-1 in Lucene 10.2, which Elastic reports as up to 5× faster on selective filters. - Custom routing on tenant key — collapses both the lexical and ANN search to a single shard per tenant; matches the pattern in Custom routing above and is the default for high-tenant-count SaaS workloads.
Near-real-time indexing
The NRT bargain
Classical Lucene only made documents searchable once a commit (fsync of all segments) finished — seconds to minutes of write-to-search latency, depending on workload. Elasticsearch’s near-real-time model exposes uncommitted segments by separating searchability from durability.
| Operation | What happens | Durable? | Searchable? |
|---|---|---|---|
| Refresh | Flush in-memory buffer into a new segment in the filesystem cache | No (filesystem cache only) | Yes |
| Flush | Lucene commit: write segment files, fsync, truncate translog | Yes (Lucene level) | Yes |
| Translog fsync | fsync the translog itself per the configured durability mode | Yes (op-level) | n/a |
Elasticsearch defaults 2:
index.refresh_interval = 1s— but only triggers when the index has received a search in the lastindex.search.idle.after = 30s. Idle indexes stop refreshing until the next search. This is what lets bulk-loaded indexes avoid a “merge storm”.index.translog.flush_threshold_size = 512mb, plus a 30-minute time-based trigger.index.translog.durability = request(fsync every operation),sync_interval = 5swhen set toasync.
Translog mechanics
Every indexing op writes to both the in-memory buffer and the translog. Refreshing every second is cheap; fsyncing every second would not be. The translog absorbs the latency cost: on a crash, the most recent committed Lucene snapshot is reopened and the translog tail is replayed.
Important
index.translog.durability=async trades a window of un-fsync’d ops for indexing throughput. With request (the default), each successful index/bulk op is durable on disk before the response. Pick deliberately — the choice is invisible until a node crashes.
Segment merging
NRT indexing creates many small segments. Without merging, every search opens an ever-growing list of files and per-segment dictionaries.
Lucene’s default TieredMergePolicy groups segments by size tier and merges within a tier once the count reaches a threshold. Out-of-the-box defaults:
| Knob | Default | Effect |
|---|---|---|
maxMergedSegmentMB |
5,120 (5 GB) | Segments above this are not merged further. |
segmentsPerTier |
10.0 | Triggering count per size tier. |
maxMergeAtOnce |
10 | Segments combined per natural merge. |
maxMergeAtOnceExplicit |
30 | Used for forceMerge calls. |
Mike McCandless’s merge visualizations make the steady-state shape vivid: a “staircase” of small recent segments at the bottom, mid-sized segments in the middle, a few large segments capped by maxMergedSegment at the top.
Warning
Force-merging a hot index down to one segment is a common foot-gun. The merged segment exceeds maxMergedSegmentMB and becomes ineligible for future natural merges, so deleted documents accumulate forever inside it. Only force-merge indexes you have stopped writing to.
Distributed query execution
Scatter-gather (query then fetch)
Document-based partitioning forces fanout: every search visits every shard in the routing set. Elasticsearch implements this as the query-then-fetch two-phase pattern.
- Parse. Coordinator parses, validates, and rewrites the query.
- Scatter (query phase). Coordinator forwards the query to every routed shard.
- Local execute. Each shard scores its segments and returns the local top-K with
(doc_id, score, sort_values)only. - Merge. Coordinator does an N-way merge across shard results to produce the global top-K.
- Fetch. Coordinator pulls the full source for only the surviving K documents (potentially from different shards than scored them).
- Reply. Final result returned to the caller.
The two-phase split exists because document source bodies are large compared to (doc_id, score) tuples. Returning all K · num_shards full documents in phase one would balloon network usage by 1–2 orders of magnitude.
The deep-pagination problem
Asking for “page 1000” over S shards with size=10 requires every shard to return its top 10,010 documents — the coordinator cannot know which 10 are correct without merging the top 10,010 from each. Cost is O(from · S · size).
| Approach | Mechanism | Trade-offs |
|---|---|---|
from / size |
Each shard returns from + size docs |
Simplest; capped by index.max_result_window (default 10,000) for exactly this reason |
search_after |
Cursor-based pagination using sort values from the previous page | Stateless and efficient; no random page jumps |
scroll (deprecated for live queries) |
Snapshot a point-in-time view; iterate | Consistent over long iterations; resource-intensive; not for user pagination |
Point-in-time + search_after |
Stateless cursor over a stable view | Modern recommendation in place of scroll |
Caution
“Jump to page 100” UIs are the source of most production deep-pagination outages. Ninety-eighth-percentile users do not need it; design pagination as a cursor and the entire class of failures vanishes.
Adaptive replica selection
When a shard has multiple healthy copies, the coordinator must pick one. Round-robin ignores load. Elasticsearch’s Adaptive Replica Selection — on by default since 6.x — ranks copies by a function of recent service time and outstanding queue depth, biasing traffic away from hot or slow nodes automatically.
Custom routing
By default shard = hash(doc_id) mod num_shards. Custom routing overrides the routing key:
{ "name": "Alice", "tenant_id": "tenant_123"}This collapses queries with the same routing key to a single shard:
- Tenant isolation. All of
tenant_123’s documents on one shard; tenant queries are single-shard. - Co-location of related data. A user and their posts on the same shard for joinful queries.
The price is shard skew: a high-cardinality routing key tail (one whale tenant) creates hot shards. Always model the routing distribution before turning this on.
Elasticsearch and OpenSearch architecture
Node roles
| Role | Responsibilities | Resource profile |
|---|---|---|
| Master | Cluster state, shard allocation, mapping changes | Low CPU/RAM, high reliability — run an odd number ≥ 3 |
| Data | Hold segments, execute queries, index | High disk, RAM, CPU |
| Coordinating | Parse and dispatch requests, merge results | High CPU; moderate RAM |
| Ingest | Document pre-processing pipelines | High CPU |
Elastic’s sizing guidance recommends keeping the JVM heap at ≤ 50% of node RAM and capped near 31 GB to stay inside the JVM’s compressed-OOPs window. Dedicated coordinating nodes pay off mostly above ~50-node clusters.
Primary and replica shards
Every index has N primaries (set at creation pre-7.0; splittable since) and M replicas per primary.
Write path:
- Client sends to any node; that node routes to the primary.
- Primary indexes the doc in-memory and appends to its translog.
- Primary forwards to in-sync replicas.
- Replicas apply, acknowledge.
- Primary acknowledges to the client (consistency level controlled by
wait_for_active_shards).
Read path:
- Coordinator routes to one copy per shard (primary or replica).
- Adaptive replica selection picks the healthiest copy.
- Each copy contributes its local top-K; coordinator merges.
Elasticsearch tracks an “in-sync” set per shard; only in-sync replicas can be promoted on primary failure.
Segment replication (OpenSearch)
Default replication is document replication: each replica re-indexes every document, paying the analyzer / tokenizer / posting-list-build cost N times.
OpenSearch’s segment replication, GA in OpenSearch 2.7 (released May 2 2023), inverts the model: only the primary indexes; replicas pull and apply segment files. The original design proposal reported approximately 40–45% lower CPU/memory on replicas and 50%+ higher indexing throughput on early benchmarks; the GA blog post and docs report indexing-throughput gains in the 9–68% range depending on dataset, replica count, and shard size.
The trade-off is network bandwidth: segment files are larger than the operation stream. The choice favours segment replication when indexing is CPU-bound (rich analyzers, many fields, vector indexing) and document replication when network is the constraint.
OpenSearch vs. Elasticsearch — licensing and feature drift
Timeline:
- January 2021: Elastic moves Elasticsearch and Kibana from Apache 2.0 to dual SSPL / Elastic License v2.
- April 2021: AWS announces and launches OpenSearch, a fork of Elasticsearch 7.10.2 / Kibana 7.10.2 under Apache 2.0.
- August / September 2024: Elastic adds AGPLv3 as a third licensing option for Elasticsearch and Kibana.
- September 2024: AWS donates OpenSearch to the Linux Foundation as the OpenSearch Software Foundation.
| Dimension | Elasticsearch | OpenSearch |
|---|---|---|
| License | SSPL + ELv2 + AGPLv3 (since 2024) | Apache 2.0 |
| Replication | Document replication | Document or segment replication |
| Security | Free basic auth; advanced features in paid tiers | Built-in security plugin, free |
| Vector search | Built-in dense_vector field (HNSW) |
k-NN plugin (FAISS/Nmslib backends) |
| Governance | Vendor-controlled (Elastic NV) | Foundation-controlled (Linux Foundation) |
Feature parity has drifted measurably since 2021; pin your decision to the workload (vector search vs. lexical, license constraints, plugin ecosystem) rather than to the brand.
Real-world case studies
Twitter Earlybird — real-time tweet search
The Earlybird paper (Busch et al., SIGIR 2012) describes Twitter’s first real-time search engine: a Lucene fork tuned for write throughput and sub-10-second indexing latency. Headline numbers from the paper: ~50 ms average query latency and over two billion queries per day, with end-to-end indexing latency around 10 seconds. Twitter’s 2014 “Building a complete tweet index” post extended that to a full historical archive going back to 2006 — described in the post as “roughly half a trillion documents” with “average latency under 100ms.” The current incarnation lives under twitter/the-algorithm/src/java/com/twitter/search and still runs Lucene, now split into real-time, protected, and archive clusters.
The interesting engineering choices:
- In-memory inverted indexes for recent tweets to avoid the OS-page-cache warm-up cost.
- Temporal partitioning. Earlybird handles the recent window; archive clusters handle history.
- Merge policy tuned for high write velocity. Newest segments merge aggressively to keep search-time fanout bounded.
Uber — search platform evolution
Uber’s evolution of the search platform describes building Sia (later “LucenePlus”), an in-house Lucene-based engine, to overcome Elasticsearch’s blocking commit semantics for true concurrent read/write workloads, and decoupling write ingestion (Kafka / Flink) from the read path. Uber’s Lucene version-upgrade post notes the platform supports more than 30 internal use cases. More recent posts describe migration toward OpenSearch and a native gRPC contribution to replace REST/JSON serialization. Geospatial routing uses Uber’s H3 hexagonal hierarchical index, which gives more uniform area cells than rectangular bounding boxes — particularly important when partitioning marketplace data by neighbourhood.
Note
Headline scale numbers (“X billion documents, Y million writes/sec”) drift quickly across Uber’s posts and aggregator summaries. Treat any specific figure with skepticism unless it appears in a single primary post; the architectural patterns are stable, the numbers are not.
DoorDash — custom Lucene engine with segment replication
DoorDash’s introduction post describes a custom Lucene engine with strict indexer / searcher separation: a single indexer service writes segments to S3, replicated searcher services pull and serve them. The migration delivered a reported 50% p99.9 latency reduction and a 75% hardware cost decrease versus their prior Elasticsearch deployment. A 2025 optimization follow-up reports another ~30% p99.9 win and ~30% hardware reduction from a Lucene 10.2 upgrade and a Shenandoah → G1 GC switch.
The architectural lesson is the same as OpenSearch’s: when indexing is CPU-bound, segment replication beats document replication, and the storage layer (S3) becomes the synchronization primitive.
Yelp Nrtsearch
Yelp’s Nrtsearch post describes a Lucene-based gRPC search engine they built to replace Elasticsearch for cost and operability reasons. The design choices echo DoorDash’s: native segment replication, decoupled indexer/searcher, gRPC over REST. They report migrating most of Yelp’s search traffic onto Nrtsearch with substantial cost savings.
Airbnb — embedding-based retrieval
Airbnb’s embedding-based retrieval post describes a two-tower neural network that encodes queries and listings into a shared vector space; approximate-nearest-neighbour (ANN) search retrieves candidates that are then blended with traditional BM25 scores. The hybrid is the production reality: lexical search for exact-token queries, vector search for semantic ones (“beach house” matching “oceanfront cottage”), and a re-ranker on top.
Common pitfalls
Over-sharding for “future growth”
Creating too many shards up front because shards are not splittable in older Elasticsearch versions. Each shard carries fixed overhead — heap, file handles, cluster-state entry — so 1,000 shards × 20 replicas = 20,000 shard copies the master must track. Start with the right shard count for current data volume and either use shard splitting (ES 7+) or a rollover strategy as you grow.
Mapping explosion
Dynamic mapping over user-controlled JSON keys creates a new mapping entry per unique field. Cluster state lives in memory on every master and gets replicated on every change, so 100k+ fields will eventually hang the master. Disable dynamic mapping, use the flattened field for arbitrary keys, or set index.mapping.total_fields.limit to a sane ceiling.
Deep pagination in user-facing UIs
Allowing from=10000&size=100 is the canonical way to OOM a coordinator. Default max_result_window is 10,000 for that reason. Replace with cursor pagination (search_after on a stable sort, ideally backed by a point-in-time).
Forgetting to relax refresh_interval during bulk loads
Bulk loading at the default 1-second refresh interval creates a segment-per-second on every active shard. Within minutes you have a segment count that triggers continuous merges. Either set refresh_interval = -1 for the duration of the bulk and refresh manually at the end, or stretch it to 30s for sustained write-heavy workloads.
Ignoring segment count in monitoring
Merges run in the background; the symptom (slow search, high I/O) is downstream. Always alert on indices.segments.count per shard and on merge throttling. After a one-shot bulk load, force-merge to a sensible target — but never against an index you’re still writing to (see the warning in Segment merging).
Practical takeaways
- Default to document-based partitioning. Term-based partitioning’s theoretical fanout savings do not survive write amplification and zipfian shard skew.
- BM25 is your starting line. Tune
and on real judgment data before reaching for LTR; reach for LTR before reaching for embeddings; reach for hybrid (BM25 + vector + rerank) when the queries that matter most are semantic, not lexical, and you can measure the lift on a labelled set. - Keep the NRT trichotomy in your head. Refresh makes documents searchable; flush makes them durable; translog fsync makes individual ops durable. Pick the matching durability mode for your workload — and never assume the defaults.
- Treat segments as the unit of work. Sharding, replication, merging, snapshotting all happen at segment granularity. Numbers you should know:
segments_per_tier=10,max_merged_segment=5GB, defaultrefresh_interval=1s,translog.flush_threshold_size=512mb,max_result_window=10000. - Optimise the read path with cheap moves first. Adaptive replica selection (free), custom routing for tenant-locked queries, cursor pagination, and avoiding
from + sizepast 10,000 fix more incidents than re-architecting your cluster. - Custom search engines pay back at scale. Twitter, Uber, DoorDash, Yelp built their own not because Lucene/Elasticsearch are bad, but because at their scale a specific bottleneck — usually document replication, commit semantics, or operability — becomes existential. For everyone else, managed Elasticsearch or OpenSearch is the right default.
Appendix
Prerequisites
- Comfort with B-tree / hash-index trade-offs.
- Familiarity with distributed-systems primitives (partitioning, replication, quorum, write-ahead logs).
- Working knowledge of basic information retrieval (term frequency, document frequency, precision/recall).
Glossary
- Inverted index —
term → docs containing termmapping; the core read-path data structure. - Posting list — ordered, often delta- and bit-packed list of
(doc_id, tf, [positions], [payloads])for one term. - Segment — immutable Lucene mini-index. The unit of write, search, replication, and merge.
- Shard — a single Lucene index. ES/OS distribute data across primary shards (each replicated M times).
- Coordinator — node that parses, scatters, and merges a search.
- Translog — per-shard write-ahead log that decouples searchability from durability.
- TF-IDF — term-frequency × inverse-document-frequency lexical scoring.
- BM25 — Okapi Best Match 25; default Lucene similarity since 6.0.
- NRT — near-real-time; documents searchable within ~1 s via refresh, before fsync.
- FST — finite state transducer; Lucene’s compact term-dictionary structure.
- Scatter-gather — distributed query pattern: scatter to all routed shards, gather and merge top-K.
- ANN — approximate nearest neighbour; sublinear vector retrieval (HNSW, IVF, IVF-PQ, ScaNN).
- HNSW — Hierarchical Navigable Small World graph; the default ANN index in Lucene, Elasticsearch, and OpenSearch’s
luceneengine. - RRF — Reciprocal Rank Fusion; rank-based score fusion across heterogeneous retrievers, default in Elasticsearch’s
rrfretriever. - Hybrid retrieval — combining a lexical retriever (BM25) and a dense / sparse vector retriever, fused (RRF / linear) and optionally re-ranked.
References
Foundational papers
- Robertson, S. & Zaragoza, H. — The Probabilistic Relevance Framework: BM25 and Beyond, Foundations and Trends in Information Retrieval, 2009.
- Pibiri, G. & Venturini, R. — Techniques for Inverted Index Compression, ACM Computing Surveys, 2020.
- Spark Jones, K. — A statistical interpretation of term specificity and its application in retrieval (cited within the Robertson & Zaragoza review), Journal of Documentation, 1972.
- Busch, M. et al. — Earlybird: Real-Time Search at Twitter, SIGIR 2012.
Official documentation
- Apache Lucene API documentation —
BM25Similarity,TieredMergePolicy,Lucene90PostingsFormat. - Elasticsearch reference — index settings, translog, search APIs.
- OpenSearch documentation — segment replication, shard allocation, k-NN.
- OpenSearch — k-NN methods and engines — HNSW, IVF, PQ, Lucene/FAISS/NMSLIB engine matrix.
- FAISS wiki — index types, training, GPU acceleration.
- ScaNN paper (Guo et al., ICML 2020) — anisotropic vector quantization.
Engineering blogs
- Mike McCandless — Using Finite State Transducers in Lucene; Visualizing Lucene’s Segment Merges.
- Twitter — Building a complete Tweet index.
- Uber — Evolution of Uber’s Search Platform; Powering Billion-Scale Vector Search with OpenSearch.
- DoorDash — Introducing DoorDash’s in-house search engine; Optimizing DoorDash’s in-house search engine.
- Yelp — Nrtsearch.
- Airbnb — Improving Deep Learning for Ranking Stays.
- OpenSearch — Segment replication, GA in 2.7.
- Elastic — License update; Elasticsearch is open source again.
Textbooks
- Manning, C., Raghavan, P., & Schütze, H. — Introduction to Information Retrieval, Cambridge University Press, 2008. The standard IR textbook; freely available online.
Footnotes
-
Document-based partitioning is the default in Elasticsearch, OpenSearch, and Apache Solr; Twitter’s Earlybird paper explicitly hashes tweets across shards by tweet ID. Term-based partitioning has been extensively studied — see Pibiri & Venturini (2020) for the theoretical comparison — but does not appear in any production system at the scale of those above. ↩
-
Sourced from the current Elasticsearch index-settings reference, translog settings, and search-idle behavior. Defaults occasionally shift between major versions (
flush_threshold_sizewas the most recent example) — re-verify against the version you actually run. ↩