Design a Distributed Key-Value Store

A distributed key-value store offers minimal get/put/delete semantics while hiding partitioning, replication, failure detection, and storage-engine mechanics behind a simple API. This article walks the design space through the Dynamo lineage¹ (Amazon Dynamo, Apache Cassandra, Riak)---availability- and partition-tolerance-first systems with tunable consistency---and contrasts with CP alternatives like etcd where linearizability is the product. The focus is the non-obvious mechanisms a senior engineer needs to defend in a design review: how consistent hashing actually places replicas, why R + W > N is necessary but not sufficient, where vector clocks beat last-write-wins, and how an LSM engine’s compaction strategy decides whether your reads will fly or thrash.

High-level architecture: clients contact any node as coordinator, which routes to the correct replicas based on consistent hashing. Each node uses an LSM-tree storage engine.

Mental model

A distributed key-value store is fundamentally about choosing where to sit on the CAP spectrum²³ and then implementing the mechanisms to deliver that choice end-to-end:

AP systems (Dynamo, Cassandra, Riak): accept eventual consistency in exchange for always-on availability. Leaderless replication, sloppy / strict quorums, and conflict resolution via vector clocks or LWW.
CP systems (etcd, Consul, ZooKeeper): unavailable during partitions to preserve linearizability. Leader-based consensus (Raft / Multi-Paxos / ZAB).

Important

CAP is a worst-case partition statement, not a steady-state one. AP systems still offer strong consistency under quorum (R + W > N) when the network is healthy; CP systems still serve reads with low latency when the leader is reachable. The split is about what happens when the network breaks, not about average behavior.

The five mechanisms that recur in every Dynamo-style design:

Consistent hashing with virtual nodes distributes data so that only k/n keys move when topology changes.⁴
Quorum replication (R + W > N) gives the operator a per-call lever between availability and consistency.
Vector clocks or LWW timestamps detect and resolve concurrent writes.
Gossip + an accrual failure detector propagate cluster membership and suspect liveness without a central coordinator.⁵⁶
LSM-tree storage turns random writes into sequential I/O at the cost of read amplification you then claw back with bloom filters and compaction.⁷

Throughout the article, “the Dynamo paper” refers to the 2007 SOSP paper¹; “DynamoDB” refers to the AWS service whose 2022 USENIX ATC paper⁸ documents how its current architecture has diverged from that paper (notably toward Multi-Paxos-based replication and strong consistency as an option).

Requirements

Functional Requirements

Requirement	Priority	Notes
`put(key, value)`	Core	Store a value, return success/version
`get(key)`	Core	Retrieve value(s), handle conflicts
`delete(key)`	Core	Tombstone-based deletion
Range queries	Extended	Only if ordered storage (not covered in AP designs)
TTL expiration	Extended	Automatic key expiry
Transactions	Out of scope	Requires coordination, changes CAP position

Non-Functional Requirements

Requirement	Target	Rationale
Availability	99.99%	Writes must succeed even during failures
Write latency	p99 < 10ms	Local disk + async replication
Read latency	p99 < 5ms	Cache hits, single disk seek
Throughput	100K+ ops/sec per node	LSM-tree optimized for writes
Durability	No acknowledged write lost	WAL before acknowledgment
Consistency	Tunable (eventual to strong)	Application chooses per-operation

Scale Estimation

Cluster sizing example for a 10 TB dataset:

1Data size: 10 TB2Replication factor: 33Total storage needed: 30 TB45Per-node capacity: 2 TB (leaving headroom for compaction)6Nodes required: 30 TB / 2 TB = 15 nodes78Traffic assumptions:9- 80% reads, 20% writes10- Average value size: 1 KB11- Target: 100K ops/sec total1213Per-node throughput: 100K / 15 ~ 6,700 ops/sec14- Reads:  5,400 ops/sec15- Writes: 1,300 ops/sec

Design Paths

Path A: AP with Leaderless Replication (Dynamo Model)

Best when:

Availability is paramount (e-commerce carts, session stores)
Application can handle conflict resolution
Writes must succeed even during network partitions

Architecture:

All nodes are peers—no leader election
Any node can coordinate any request
Replication is synchronous to quorum, async to remaining replicas
Conflicts detected via vector clocks or resolved via LWW

Trade-offs:

Writes always succeed (to any available quorum)
No single point of failure
Application must handle conflicting versions
Weaker consistency guarantees

Real-world examples: Amazon Dynamo (shopping cart), Riak, Cassandra (with eventual consistency)

Path B: CP with Leader-Based Consensus (Raft/Paxos)

Best when:

Strong consistency required (configuration stores, coordination)
Reads must return the latest write
Can tolerate unavailability during leader election

Architecture:

Single leader handles all writes
Raft/Paxos ensures log replication before acknowledgment
Leader election on failure (typically 1-10 seconds)

Trade-offs:

Linearizable reads and writes
Unavailable during leader election
Write throughput limited by leader
Simpler conflict model (no concurrent writes)

Real-world examples: etcd (Kubernetes), Consul, ZooKeeper

Path Comparison

Factor	AP (Dynamo)	CP (Raft)
Write availability	Always (to quorum)	Unavailable during election
Read consistency	Eventual or quorum	Linearizable
Conflict handling	Vector clocks/LWW	None (single writer)
Latency	Lower (no consensus)	Higher (consensus round-trip)
Throughput	Higher (any node writes)	Lower (leader bottleneck)
Cluster size	100s-1000s nodes	3-7 nodes typical
Use case	User data, caches	Config, coordination, locks

This Article’s Focus

This article focuses on Path A (AP/Dynamo model) because:

Most key-value workloads prioritize availability over strong consistency
The techniques (consistent hashing, vector clocks, gossip) are more complex and worth detailed examination
CP systems (etcd, Consul) have well-documented Raft implementations

For CP key-value store design, see etcd’s architecture documentation and the Raft paper.

High-Level Design

Component Overview

Component interactions: coordinator routes requests, gossip maintains membership, storage engine persists data, anti-entropy mechanisms ensure replica convergence.

Request Flow

Write path:

Client SDK hashes key, identifies coordinator node
Coordinator determines N replica nodes from preference list
Coordinator sends write to all N replicas in parallel
Each replica: writes to WAL → updates memtable → acknowledges
Coordinator waits for W acknowledgments
Returns success to client (remaining replicas receive async)

Read path:

Client SDK hashes key, contacts coordinator
Coordinator sends read to all N replicas in parallel
Coordinator waits for R responses
If versions conflict: return all versions (or resolve via LWW)
Trigger read repair if replicas diverged

Data Partitioning

Consistent Hashing

Consistent hashing⁴ maps both keys and nodes to positions on a hash ring (typically 0 to using MD5 / Murmur3, or using xxHash). A key is stored on the first N distinct physical nodes walked clockwise from its hash position.

Hash ring with virtual nodes: a key's hash lands on a vnode, and the preference list is built by walking clockwise and skipping vnodes whose physical node is already in the list. — Hash ring with virtual nodes. The preference list for a key is built by walking clockwise from the key's hash position and skipping any vnode whose physical node is already in the list, so the N replicas always sit on N distinct machines.

Why consistent hashing?

When nodes join or leave, only keys need to move (k = total keys, n = nodes). With naive modulo hashing, nearly all keys would remap.

1Traditional: hash(key) % num_nodes  -> Node changes cause ~100% key movement2Consistent:  next_node(hash(key))   -> Node changes cause ~1/n key movement

Virtual Nodes (vnodes)

Physical nodes own multiple positions on the ring. Each position is a “virtual node” responsible for a range of the hash space.

Design rationale:

Load balancing: A single physical node token can create hotspots if keys cluster. Virtual nodes spread load.
Heterogeneous hardware: Assign more vnodes to powerful machines.
Faster recovery: When a node fails, its vnodes are distributed across many physical nodes, enabling parallel recovery.

Configuration trade-offs:

vnodes per node	Pros	Cons
1 (legacy)	Fewer ring neighbors, simpler	Uneven distribution, slow rebalancing
16 (modern default)	Good balance, deterministic allocation	Moderate neighbor count
256 (legacy Cassandra)	Fine-grained distribution	High memory overhead, slow streaming

Cassandra 4.0+ defaults num_tokens to 16 with the replica-aware token allocator (allocate_tokens_for_local_replication_factor) enabled, down from 256 random tokens in 2.0–3.x.⁹ The reduction improves repair and streaming performance while keeping the distribution balanced. The change is tracked in CASSANDRA-13701.

Replication Strategy

Keys are replicated to N consecutive nodes on the ring (the “preference list”). With virtual nodes, consecutive ring positions may map to the same physical node, so the preference list skips to ensure N distinct physical nodes.

Replication factor selection:

RF	Fault tolerance	Storage overhead	Typical use
1	None	1x	Caches, ephemeral data
3	1 node failure	3x	Standard production
5	2 node failures	5x	Critical data, cross-DC

Hot Key and Hot Partition Mitigation

Consistent hashing balances the key space, not the request space. A single very popular key (or a single fat partition under a wide-row schema) still pins all reads/writes to one preference list of N replicas, capping throughput at one machine’s IOPS regardless of cluster size. The mitigations split into client-side and server-side flavours.

Schema-level (Cassandra-style, manual). Add a bucket / salt component to the partition key — (natural_key, bucket) where bucket = hash(payload) mod K or a time-rounded suffix.¹⁰ Reads then scatter-gather across K partitions, trading extra coordinator work for parallelism across N × K replicas. Pick K against the observed skew, not the cluster size; oversharding wastes coordinator round-trips on cold keys.

Read-side coalescing and caching. Since the same hot key is being requested concurrently, single-flight the read at the coordinator (or in the client SDK) so R replica reads are issued once per in-flight wave instead of once per client request. A short TTL local cache in front of the coordinator (or a dedicated edge cache like Redis) absorbs the rest.

Adaptive partitioning (managed-service style). DynamoDB’s adaptive capacity and split-for-heat automatically isolate a hot item or a hot partition into its own physical partition with elevated throughput.¹¹ The catch: it cannot fix monotonically increasing keys (e.g., created_at-only partitioning) because splitting still leaves all writes on the newest shard. Self-hosted Dynamo-style stores rarely ship this; they rely on the schema-level fix.

Operational signals. Watch p99 per coordinator/replica, per-table compaction throughput, and per-partition tombstone counts. Cassandra surfaces hot partitions via the MaxPartitionSizeInBytes and tracing subsystems; DynamoDB exposes HotKey insights in CloudWatch Contributor Insights.

Multi-datacenter replication:

Cassandra’s NetworkTopologyStrategy places replicas across racks and datacenters:

1Replication settings:2  dc1: 3 replicas (across 3 racks)3  dc2: 3 replicas (across 3 racks)45Total replicas: 66Rack-aware placement prevents correlated failures

Quorum Reads and Writes

Quorum Formula

For a replication factor N, if R (read replicas) + W (write replicas) > N, the read and write quorums must intersect at at least one node:

1R + W > N  ->  At least one node in the read set acked the write23Example with N=3:4- R=2, W=2: Standard quorum (R+W=4 > 3)5- R=1, W=3: Write-heavy  (all replicas must ack writes)6- R=3, W=1: Read-heavy   (fast writes, "consistent" reads)

Quorum overlap with N=3, W=2, R=2: the write and read sets share at least one replica, so any successful read contacts a node that acknowledged the latest write.

Caution

R + W > N only guarantees the latest acked write is visible. It does not give you linearizability: two clients writing concurrently can both succeed against overlapping but distinct quorums, leaving the system with siblings that are surfaced to the next reader. Conflict resolution (vector clocks or LWW) is what closes that gap.

Consistency Levels (Cassandra Model)

Level	Nodes contacted	Use case
ONE	1	Lowest latency, highest availability
QUORUM	⌊N/2⌋ + 1	Standard consistency
LOCAL_QUORUM	⌊local_N/2⌋ + 1	Cross-DC deployments
ALL	N	Strongest consistency, lowest availability

Operational guidance:

Use QUORUM for most operations
Use LOCAL_QUORUM for latency-sensitive cross-DC reads
Avoid ALL in production (single node failure blocks operations)
ONE is acceptable for time-series data where some loss is tolerable

Sloppy Quorum and Hinted Handoff

Problem: Strict quorum requires W of the N designated replicas to acknowledge. If one is down and you only have N-1 reachable replicas, a W = N-1 write still works; if W = N, the write fails.

Dynamo’s sloppy quorum: when a designated replica is unreachable, the coordinator writes to the next healthy node on the ring with a “hint” to forward later, and counts that write toward W.¹ Availability wins; the trade-off is that the temporary holder is not in the read preference list, so a subsequent read may miss the write.

Sloppy quorum: when replica B is unavailable, the coordinator writes to node D with a hint. When B recovers, D forwards the data.

Important

Cassandra implements hinted handoff but uses strict quorum: hints are stored after the consistency level is satisfied and do not count toward W. The only level that lets a hint substitute for a real replica acknowledgment is CL=ANY, where any node in the cluster (including a hint holder) can satisfy the write.¹² In other words, Cassandra’s QUORUM is closer to a strict quorum + best-effort backup than to Dynamo’s classic sloppy quorum.

Hint storage limits. Cassandra defaults max_hint_window to 3 hours (max_hint_window_in_ms = 10800000).¹² Hints for replicas down longer than this are discarded; restoring consistency then requires full anti-entropy repair. This bound is what prevents unbounded hint accumulation during long outages — and is also why operators monitor the PendingHintsByEndpoint metric.

Trade-off. Sloppy quorum (Dynamo-style) improves availability but temporarily lifts the quorum guarantee. Strict quorum (Cassandra-style) preserves the guarantee at the cost of failing writes when too few designated replicas are reachable. Pick deliberately; both ship.

Conflict Detection and Resolution

The Concurrent Write Problem

Without a single leader, two clients can write to the same key simultaneously via different coordinators. Both writes may succeed (each reaching W replicas), but replicas now have different values.

Vector Clocks

Vector clocks track causal relationships between versions. Each write increments a (node, counter) pair, and the context a client sends with a write is the version it most recently observed.¹

1Initial state: {} (empty)23Client A writes via Node1: [(Node1, 1)]4Client B reads [(Node1, 1)], writes via Node2: [(Node1, 1), (Node2, 1)]5Client C reads [(Node1, 1)], writes via Node3: [(Node1, 1), (Node3, 1)]67Now we have concurrent versions:8  V1: [(Node1, 1), (Node2, 1)]  - Client B's write9  V2: [(Node1, 1), (Node3, 1)]  - Client C's write1011Neither dominates the other -> SIBLINGS (concurrent)

Two clients read the same version, then write through different coordinators. The resulting vector clocks are concurrent (neither dominates), so the system stores both as siblings until a reader merges them.

Detecting relationships:

V1 dominates V2: every (node, counter) in V2 is ≤ the corresponding entry in V1, and V1 has at least one strictly greater entry. Discard V2.
Concurrent (siblings): neither dominates. Return both versions.

Resolution strategies:

Application-level merge. Return both versions to the client; the application merges (Dynamo’s classic shopping-cart example takes the union of items, which is why deleted items occasionally reappear under partition).¹
Last-Write-Wins (LWW). Use wall-clock timestamps, discard the older version.
CRDTs. Use conflict-free data structures (counters, OR-sets, RGAs) that merge automatically; trades data-model flexibility for automatic convergence.¹³

Vector Clock Truncation

Vector clocks grow unboundedly as more coordinators write to a key. Dynamo truncates at a configurable threshold (the paper reports 10) by dropping the oldest (node, counter) pair based on the auxiliary timestamp it stores per entry.¹

Risk: truncation can drop causal history and cause two causally-related versions to look concurrent, generating spurious siblings. Amazon’s paper reports this rarely produced visible problems in practice because most keys have a small set of recurring writers; “rarely” is doing a lot of work here, and Riak eventually moved to dotted version vectors¹⁴ to avoid the issue altogether.

Last-Write-Wins (LWW)

Cassandra resolves conflicts with microsecond client- or server-supplied timestamps instead of vector clocks. The cell with the highest timestamp wins; ties are broken by comparing the value bytes lexicographically.¹⁵

1Write 1: value="A", timestamp=10002Write 2: value="B", timestamp=100134Resolution: value="B" wins (higher timestamp)

Advantages. Simpler implementation, no vector-clock growth, constant per-cell metadata, and no sibling-merge plumbing for the application.

Risks. LWW assumes globally comparable timestamps, which assumes well-synced clocks. With NTP-typical skew of ~10 ms across cloud regions a “later” write can lose to an earlier one — silently. Daniel Abadi has called this “the great LWW lie” because the system will discard a more-recent write if its clock is behind.

Mitigation. Run NTP with tight synchronization (target sub-millisecond skew within a DC; ~10 ms cross-DC), prefer server-side timestamps unless you have a strong reason for client-side, and reach for a CRDT or a CP store when you cannot tolerate silent loss.¹⁶ Spanner’s TrueTime¹⁷ is the canonical example of bounded-skew clocks — it costs a hardware atomic-clock fleet, which is why most KV stores do not adopt it.

Failure Detection

Gossip Protocol

Nodes exchange state information periodically with random peers (epidemic-style). Information propagates exponentially — reaching all nodes in rounds with high probability, the classic result from Demers et al.⁵

Gossip protocol details:

Every second, each node picks one (or a few) random peers.
They exchange: membership list, heartbeat counters, schema version, application state.
Merge received state with local state, picking the entry with the higher version per key.

Convergence. With n nodes, gossip reaches all nodes in roughly rounds. A 1,000-node cluster converges in ~10 gossip rounds (≈10 seconds at the typical 1-second period). Beyond a few hundred nodes the variance starts to matter — operators tune the gossip fan-out and period to keep tail latency for membership churn bounded.

Gossip membership: each round a node picks a random peer and exchanges a 3-message SYN/ACK/ACK2 of versioned endpoint state; phi accrual on top of those heartbeats decides when to convict a peer as down. — Gossip membership: 3-phase SYN/ACK/ACK2 exchange of versioned endpoint state with a random peer each round; phi accrual on top of the same heartbeats decides when to convict a peer as down.

Phi Accrual Failure Detector

Rather than a binary alive/dead signal, the φ-accrual detector⁶ outputs a continuous “suspicion level” (φ) based on the empirical distribution of inter-heartbeat arrival times:

where is the CDF of observed heartbeat intervals from that peer.

Threshold configuration (phi_convict_threshold in cassandra.yaml, default 8):¹⁸

φ threshold	Meaning	Use case
5	Aggressive detection	Low-latency networks
8	Default	Standard deployments
10-12	Conservative	AWS/cloud (network congestion)

At φ = 8 with a 1-second heartbeat, a node has to be unresponsive for roughly 18 seconds before being convicted.¹⁹ That sounds slow; in practice it is what keeps a network blip in us-east-1 from cascading into a wave of replica failovers across thousands of nodes.

Why phi accrual over fixed timeout? Fixed timeouts have to be tuned per environment and break when the environment changes (autoscaling, region, time-of-day load). Phi accrual adapts to the observed per-peer distribution, so the same threshold ports across very different latency profiles.

Anti-Entropy Mechanisms

Merkle Trees for Replica Synchronization

Merkle trees enable efficient comparison of large datasets. Each leaf is a hash of a data range; internal nodes are hashes of children.

Merkle tree: comparing root hashes identifies if replicas differ. Traversing mismatched branches locates specific divergent key ranges.

Synchronization algorithm:

Compare root hashes between replicas
If equal: replicas are identical
If different: recursively compare child hashes
Only exchange data for leaf nodes with different hashes

Efficiency: Synchronization is O(log n) comparisons, transferring data proportional to differences rather than total size.

Riak’s implementation: Maintains persistent on-disk Merkle trees, regenerated weekly by default. Real-time updates to trees occur as writes happen.

Read Repair

When a read returns divergent values from replicas, the coordinator triggers repair:

Determine winning value (latest vector clock or timestamp)
Asynchronously write winning value to stale replicas
Return result to client (doesn’t block on repair)

Configuration. Cassandra historically used dclocal_read_repair_chance = 0.1 (10% of reads opportunistically trigger repair). Both read_repair_chance and dclocal_read_repair_chance were removed in Cassandra 4.0 (CASSANDRA-13910); read repair is now controlled at the table level via the read_repair option (BLOCKING or NONE).²⁰

Full Anti-Entropy Repair

Background process that:

Builds Merkle tree for each token range
Compares with replica Merkle trees
Streams missing/divergent data

Frequency. Run within gc_grace_seconds (default 864000 = 10 days in Cassandra)²¹ to prevent zombie data resurrection: a node missing the original delete will resurrect the row once the tombstone is GC’d elsewhere if it is not repaired in time.

Storage Engine: LSM Tree

Why LSM Tree for Write-Heavy Workloads

LSM (Log-Structured Merge) trees convert random writes to sequential I/O:

All writes go to in-memory buffer (memtable)
When full, memtable flushes to immutable on-disk file (SSTable)
Background compaction merges SSTables

Trade-off comparison:

Aspect	LSM Tree	B-Tree
Write amplification	10-30x (compaction)	2-4x (page splits)
Read amplification	Higher (multiple SSTables)	Lower (single tree)
Space amplification	Lower (no fragmentation)	Higher (50-67% page fill)
Write throughput	Higher (sequential I/O)	Lower (random I/O)
Read latency	Higher (bloom filters help)	Lower (single lookup)

Write Path Details

Write path: WAL ensures durability, memtable provides fast writes, flush creates immutable SSTables.

Memtable sizing. Cassandra allocates roughly 1/4 of the JVM heap to memtables by default (memtable_heap_space_in_mb).¹⁸ Larger memtables cut flush frequency (fewer SSTables, less compaction pressure) but increase replay time on restart and risk OOM if you also keep a large bloom-filter / row-cache footprint.

Read Path Details

Check the memtable (in-memory; fastest hit).
For each SSTable, newest to oldest, consult the bloom filter; skip if it answers “definitely not present”.
On a “maybe”, read the partition index, then the data block.
Merge versions across SSTables and return the newest (or surface siblings).

LSM read path: the engine checks the memtable, then walks SSTables newest-first, using a bloom filter per SSTable to short-circuit lookups, and merges any versions found. — LSM read path: memtable first, then bloom-filter-guarded lookups across SSTables newest-first. The bloom filter cuts disk seeks for keys that are not present at all; the merge step handles keys that exist in multiple SSTables.

Bloom filter tuning. Cassandra defaults bloom_filter_fp_chance to 0.01 (1% false positive rate), which costs roughly bits per key — a useful number to memorize when sizing memory for hot tables.²² Smaller fp_chance cuts false-positive disk reads at the cost of bloom-filter memory; for cold tables you can raise it (e.g., 0.1) and trade a few extra reads for a lot less RAM.

Compaction Strategies

Compaction merges SSTables to:

Reclaim space from deleted/overwritten keys
Reduce read amplification (fewer files to check)
Enforce tombstone expiration

Strategy comparison:

Strategy	SSTable sizing	Read amp	Write amp	Best for
Size-Tiered (STCS)	Variable buckets	Higher	Lower	Write-heavy
Leveled (LCS)	Fixed 160MB	Lower	Higher	Read-heavy
Time-Window (TWCS)	Time buckets	Moderate	Low	Time-series + TTL

STCS mechanics. Groups SSTables of similar size into buckets. When a bucket contains min_threshold (default 4) files, they are compacted into one larger file.²³ Read amplification grows because a single key can live in multiple tiers; write amplification stays low because each row is rewritten only when its tier compacts.

LCS mechanics. Organizes SSTables into levels (L0, L1, L2, …). Each level holds non-overlapping SSTables of sstable_size_in_mb (default 160 MB), and each level is roughly 10× larger than the previous. The non-overlapping invariant means ~90% of reads touch at most one SSTable per level, dramatically reducing read amplification at the cost of higher write amplification (~10× of STCS in the worst case).²⁴

Note

Cassandra 5.0 (Sep 2024) ships Unified Compaction Strategy (UCS), an adaptive strategy that subsumes STCS, LCS, and TWCS via a single scaling_parameters knob (e.g. T4 mimics STCS, L10 mimics LCS) plus density-based sharding for parallel compactions.²⁵ STCS remains the default in 5.0 for backwards compatibility, but UCS is the recommended target for new tables — it removes the up-front “pick a strategy and live with it” decision that this section documents.

Compaction strategies side-by-side: STCS groups same-size SSTables into tiers, LCS holds non-overlapping SSTables per level with each level ~10x larger than the prior one. — Side-by-side STCS vs LCS layout. STCS keeps same-size SSTables in tiers (low write amp, higher read amp); LCS keeps non-overlapping SSTables per level so reads touch at most one SSTable per level, at the cost of higher write amplification.

API Design

Core Operations

1PUT /kv/{key}2Content-Type: application/octet-stream3X-Consistency-Level: QUORUM4X-Client-Timestamp: 169990000000056<binary value>78HTTP/1.1 201 Created9X-Version: [(node1,5),(node2,3)]

1GET /kv/{key}2X-Consistency-Level: QUORUM34HTTP/1.1 200 OK5X-Version: [(node1,5),(node2,3)]6<binary value>

1GET /kv/{key}2X-Consistency-Level: QUORUM34HTTP/1.1 300 Multiple Choices5Content-Type: multipart/mixed; boundary=siblings67--siblings8X-Version: [(node1,5),(node2,3)]9<value A>10--siblings11X-Version: [(node1,4),(node3,2)]12<value B>13--siblings--

1DELETE /kv/{key}2X-Consistency-Level: QUORUM34HTTP/1.1 204 No Content

Pagination for Key Listing

1GET /kv?prefix=user:&limit=100&cursor=dXNlcjo1MDA=23HTTP/1.1 200 OK4Content-Type: application/json56{7  "keys": ["user:501", "user:502", "..."],8  "next_cursor": "dXNlcjo2MDA=",9  "has_more": true10}

Error Responses

Status	Meaning
400	Invalid key (too long, invalid characters)
404	Key not found
409	Write conflict (for conditional writes)
503	Insufficient replicas available for requested consistency
504	Timeout waiting for replica responses

Infrastructure Design

Cloud-Agnostic Components

Component	Purpose	Options
Compute	Node processes	VMs, containers, bare metal
Block storage	SSTable persistence	Local SSD, network SSD
Object storage	Backups, cold tier	S3-compatible
Load balancer	Client distribution	HAProxy, cloud LB
Service discovery	Node membership	Gossip (built-in), Consul

AWS Reference Architecture

AWS deployment: i3 instances with local NVMe for low-latency storage, spread across 3 AZs for fault tolerance, S3 for backups.

Instance selection (specs per AWS EC2 instance-type docs):

Instance	Storage	Memory	Use case
i3.xlarge	1× 950 GB NVMe SSD	30.5 GB	Standard nodes
i3.2xlarge	1× 1.9 TB NVMe SSD	61 GB	High-capacity nodes
r5.xlarge + gp3	EBS	32 GB	Lower cost, higher and noisier latency

Why i3 / i4i instances? Local NVMe gives consistent sub-millisecond IOPS latency, which dominates p99 in a write-heavy LSM workload. EBS adds a network round-trip and is throttled per volume, so p99 grows under bursts even when the instance has spare CPU. The newer i4i family is the modern recommendation for Cassandra/ScyllaDB and trades the same trade-off with better $/GB.

Managed Alternatives

Build vs Buy	Option	Trade-off
Self-hosted	Cassandra, ScyllaDB, Riak	Full control, operational burden
Managed	Amazon DynamoDB	No ops, vendor lock-in, cost at scale
Managed	Azure Cosmos DB	Multi-model, global distribution
Managed	DataStax Astra	Managed Cassandra, Cassandra compatibility

Note

DynamoDB ≠ Dynamo paper. Despite the name, AWS DynamoDB has diverged significantly from the 2007 Dynamo design. The 2022 USENIX ATC paper “Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service”⁸ documents the current architecture: Multi-Paxos-based replication with a partition leader, strong consistency as a per-request option, transactional API, and an autoadmin control plane. The Dynamo lineage in this article (leaderless, sloppy/strict quorums, vector clocks) maps to Cassandra and Riak today, not to current DynamoDB.

Conclusion

Designing a distributed key-value store requires explicit CAP positioning. This design chose AP (availability + partition tolerance) with tunable consistency, following the Dynamo lineage:

Key architectural decisions:

Consistent hashing with vnodes for incremental scaling and load distribution.
Quorum replication (N = 3, R = W = 2 default) for per-call consistency tuning.
Strict quorum + hinted handoff (Cassandra-style) for availability during transient failures, with sloppy quorum (Dynamo-style) as an alternative when availability outranks consistency.
LWW timestamps for conflict resolution by default (simpler than vector clocks, but only safe with tightly-synced clocks); vector clocks or CRDTs when silent loss is unacceptable.
LSM-tree storage for write-optimized performance, with bloom filters and compaction strategy chosen against the read/write mix.

What this design sacrifices:

Strong consistency (use etcd/Consul if required)
Range queries (add secondary index or use ordered storage like Bigtable)
Multi-key transactions (requires coordination, changes CAP position)

When to choose this design:

Session stores, shopping carts, user preferences
Cache layers with persistence
Time-series data (with TWCS compaction)
Any workload where availability > consistency

Appendix

Prerequisites

Distributed systems fundamentals: CAP theorem, consistency models
Storage concepts: B-trees, write-ahead logging, compaction
Networking: gossip protocols, failure detection

Terminology

Term	Definition
Consistent hashing	Hash function mapping keys and nodes to a ring, minimizing key movement on topology changes
Vector clock	List of (node, counter) pairs tracking causal ordering between versions
Quorum	Minimum replicas (R or W) that must respond for an operation to succeed
Sloppy quorum	Dynamo-style: quorum satisfied by any healthy nodes, including substitutes outside the preference list. Cassandra’s `QUORUM` is strict — see “Sloppy Quorum” section.
Hinted handoff	Temporary storage of writes for unavailable replicas, forwarded on recovery
SSTable	Sorted String Table—immutable, sorted key-value file on disk
Memtable	In-memory buffer for recent writes, flushed to SSTables periodically
Compaction	Background process merging SSTables to reclaim space and reduce read amplification
Tombstone	Marker indicating a deleted key, expires after gc_grace_seconds
Anti-entropy	Background synchronization to repair replica divergence

Summary

Distributed KV stores sit on a CAP spectrum: AP (Dynamo model) vs CP (Raft model)
Consistent hashing + vnodes enables horizontal scaling with minimal data movement
Quorum replication (R + W > N) provides tunable consistency
Conflict resolution via vector clocks (causal tracking) or LWW (timestamp-based)
Gossip + phi accrual failure detector maintains cluster membership
LSM-tree storage optimizes write throughput; compaction strategy choice depends on workload
Sloppy quorum + hinted handoff + Merkle tree repair ensure eventual convergence

References

Amazon Dynamo Paper (SOSP 2007) — original Dynamo design, sloppy quorum, vector clocks.
Amazon DynamoDB (USENIX ATC 2022) — current DynamoDB architecture; Multi-Paxos, leader-based.
Apache Cassandra documentation — cassandra.yaml, hints, repair, compaction strategies.
Apache Cassandra: dynamo / partitioning architecture — Cassandra’s adaptation of the Dynamo design.
Google Bigtable Paper (OSDI 2006) — origin of the SSTable format.
Raft Consensus Paper — leader-based consensus for CP KV stores.
Redis Cluster Specification — hash-slot partitioning as an alternative to consistent hashing.
etcd architecture overview — Raft-based KV store.
Riak documentation — active anti-entropy, dotted version vectors.
The Log-Structured Merge-Tree (O’Neill et al., 1996) — original LSM-tree paper.
Karger et al., “Consistent Hashing and Random Trees” (STOC 1997) — foundational consistent-hashing result.
Hayashibara et al., “The φ Accrual Failure Detector” (SRDS 2004).
Demers et al., “Epidemic Algorithms for Replicated Database Maintenance” (PODC 1987).
Shapiro et al., “Conflict-free Replicated Data Types” (INRIA tech report, 2011).
LSM Tree vs B-Tree Analysis (TiKV) — storage engine trade-offs.

DeCandia et al., Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007. ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶
Brewer, Towards Robust Distributed Systems, PODC 2000 keynote. ↩
Gilbert and Lynch, Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, SIGACT 2002. ↩
Karger et al., Consistent Hashing and Random Trees, STOC 1997. ↩ ↩²
Demers et al., Epidemic Algorithms for Replicated Database Maintenance, PODC 1987. ↩ ↩²
Hayashibara, Defago, Yared, Katayama, The φ Accrual Failure Detector, SRDS 2004. ↩ ↩²
O’Neill et al., The Log-Structured Merge-Tree (LSM-Tree), Acta Informatica 1996. ↩
Elhemali et al., Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service, USENIX ATC 2022. ↩ ↩²
TheLastPickle, The Impacts of Changing the Number of VNodes in Apache Cassandra, 2021. ↩
Apache Cassandra docs, Data modeling — partition size guidelines; AWS re:Post, Identifying and Resolving Hot Partition Issues in Amazon Keyspaces. ↩
AWS docs, DynamoDB burst and adaptive capacity; AWS Database Blog, Scaling DynamoDB: How partitions, hot keys, and split for heat impact performance. ↩
Apache Cassandra docs, Hinted Handoff. ↩ ↩²
Shapiro, Preguiça, Baquero, Zawirski, Conflict-free Replicated Data Types, INRIA RR-7687, 2011. ↩
Preguiça, Baquero, Almeida et al., Brief Announcement: Efficient Causality Tracking in Distributed Storage Systems with Dotted Version Vectors, DAIS 2012. ↩
Apache Cassandra docs, Storage engine: write path. ↩
Abadi, The dangers of replication and a solution, 2020 — caveats on LWW under clock skew. ↩
Corbett et al., Spanner: Google’s Globally-Distributed Database, OSDI 2012. ↩
Apache Cassandra docs, cassandra.yaml configuration. ↩ ↩²
Digitalis, Understanding phi_convict_threshold in Apache Cassandra. ↩
CASSANDRA-13910 — Remove read_repair_chance / dclocal_read_repair_chance. ↩
Apache Cassandra docs, Compaction and tombstones. ↩
Apache Cassandra docs, Bloom filters. ↩
Apache Cassandra docs, Size-Tiered Compaction Strategy. ↩
Apache Cassandra docs, Leveled Compaction Strategy. ↩
Apache Cassandra docs, Unified Compaction Strategy (UCS); see also the Apache Cassandra 5.0 UCS feature post. ↩