
Failure Modes and Resilience Patterns

Distributed systems fail in complex, often surprising ways. This article covers the taxonomy of failures—from clean crashes to insidious gray failures—and the resilience patterns that mitigate them. The focus is on design decisions: when to use each pattern, how to tune parameters, and what trade-offs you’re accepting.

[Overview diagram] Failure modes (crash, omission, timing, Byzantine, gray) flow through detection mechanisms (heartbeats, timeouts, health checks, differential observability) to mitigation patterns (retries + backoff, circuit breakers, bulkheads, load shedding, fallbacks), with blast radius isolation (cell architecture, shuffle sharding, zone isolation) as the outer defense layer.

Failures in distributed systems form a spectrum from fail-stop (clean crash, easy to detect) to Byzantine (arbitrary behavior, hard to detect). The most dangerous are gray failures: partial degradation that passes health checks but causes cascading impact. Detection requires differential observability—comparing perspectives across system components.

Resilience patterns form a layered defense:

  1. Timeouts bound waiting time but require careful tuning (too short = false positives, too long = resource exhaustion)
  2. Retries recover from transient failures but can cause retry storms that amplify outages
  3. Circuit breakers stop calling failing dependencies but need threshold tuning per dependency
  4. Bulkheads isolate failures to prevent cascade but reduce resource efficiency
  5. Load shedding protects capacity but requires priority decisions

The key insight: every resilience pattern has failure modes of its own. Circuit breakers can flap. Retry budgets can be exhausted by one bad actor. Bulkheads can be sized wrong. Resilience engineering is about layering imperfect defenses, not finding perfect solutions.

Understanding failure modes is prerequisite to designing resilience. The classic taxonomy from distributed systems literature:

Crash Failure (Fail-Stop)

The process crashes and stays crashed. Other processes can (eventually) detect the failure through timeout.

Characteristics:

  • Clean, detectable, recoverable
  • Process halts and performs no further actions
  • Detected via heartbeat timeout

Real-world example: An OOM-killed container. Kubernetes detects the failure through liveness probe timeout, restarts the pod. Recovery is straightforward.

Design implication: If you can guarantee fail-stop semantics (crash rather than corrupt), recovery is simpler. This is why many systems prefer crashing on inconsistency detection over attempting recovery.

Crash-Recovery Failure

The process crashes but may restart with partial state. This is more realistic than fail-stop for real systems.

Characteristics:

  • Process may restart with stale or partial state
  • Need to distinguish “slow” from “dead”
  • Requires stable storage for state recovery

Real-world example: A database replica crashes mid-replication, restarts, and must replay the Write-Ahead Log (WAL) to reach a consistent state. During replay, it may reject reads or serve stale data depending on configuration.

Design implication: Recovery procedures must be idempotent. Operations that were in-flight at crash time may re-execute.

Omission Failure

The process fails to send or receive messages but otherwise operates correctly.

Characteristics:

  • Send omission: Process fails to send a message it should send
  • Receive omission: Process fails to receive a message sent to it
  • Often caused by full queues, network issues, or resource exhaustion

Real-world example: A message queue producer drops messages when the broker is slow (send omission). The producer continues operating, health checks pass, but data is being lost silently.

Design implication: Critical paths need acknowledgment and retry. Fire-and-forget is only acceptable for best-effort telemetry.

Timing Failure

The process responds, but not within the expected time bound.

Characteristics:

  • Response is correct but late
  • Violates performance SLAs
  • May be indistinguishable from omission if timeout fires first

Real-world example: A database query that normally takes 10ms takes 5 seconds due to lock contention. The caller times out at 1 second, assumes failure, retries—potentially worsening the contention.

Design implication: Timeouts must account for the difference between “failed” and “slow.” Speculative execution (hedged requests) can help for idempotent operations.

Byzantine Failure

The process exhibits arbitrary behavior—including malicious or irrational responses.

Characteristics:

  • Can send conflicting information to different parties
  • Can lie about its state
  • Requires BFT (Byzantine Fault Tolerant) protocols to handle

Real-world example: A corrupted replica returns incorrect data that passes format validation but has wrong values. Or a compromised node in a blockchain network attempts double-spending.

Design implication: Byzantine tolerance requires 3f+1 replicas to tolerate f failures and is expensive. Most internal systems assume non-Byzantine failures (crash-only model) and rely on other controls (authentication, checksums) for integrity.

Gray Failure

The most dangerous failure mode in production systems. The term was introduced in the 2017 Microsoft Research paper "Gray Failure: The Achilles' Heel of Cloud-Scale Systems."

Characteristics:

  • Partial degradation that evades health checks
  • Different observers see different health states
  • Often manifests as elevated latency, increased error rate on specific paths, or degraded throughput

Why gray failures are dangerous:

  1. Health checks pass: The failing component responds to synthetic probes
  2. Partial impact: Only certain request types or users affected
  3. Cascading potential: Slow responses tie up caller resources
  4. Detection delay: May take minutes to hours to confirm

Real-world examples:

  • AWS S3 outage (2017): An operator command removed more capacity than intended. S3 continued responding but with elevated latency. Health checks passed. Cascading failures affected services depending on S3.
  • Cloudflare outage (2019): A WAF rule caused CPU exhaustion. Health checks using simple endpoints passed while actual traffic was failing.
  • Meta/Facebook outage (2021): BGP withdrawal caused DNS servers to become unreachable. Internal tools for recovery also depended on DNS.

Detection requires differential observability:

Observer A (health check): "Component is healthy"
Observer B (real traffic): "Requests are timing out"
Observer C (metrics): "P99 latency 10x normal"
Gray failure = divergence between observer perspectives

Cascading Failure

One component's failure triggers failures in dependent components, amplifying the initial impact.

Cascade mechanisms:

Mechanism | Description | Example
Resource exhaustion | Failing component ties up caller resources | Slow DB → thread pool exhaustion → upstream timeout
Retry storms | Failed requests trigger coordinated retries | Service restart → all clients retry simultaneously
Load redistribution | Remaining capacity receives diverted traffic | Node failure → remaining nodes overloaded → more failures
Dependency chain | Failure propagates through the call graph | Auth service down → all authenticated endpoints fail

Real-world example: Amazon’s 2004 “retry storm” incident. A service experienced increased latency. Callers retried. The retries increased load, worsening latency. More retries. The positive feedback loop took down multiple services.

Heartbeats

Mechanism: Periodic messages prove the sender is alive.

Design choices:

Parameter | Consideration
Interval | Shorter = faster detection, more overhead
Timeout | timeout = interval × missed_count + network_jitter
Payload | Include load metrics for proactive detection

Common mistake: Setting the timeout equal to the heartbeat interval. A single delayed heartbeat then triggers a false positive. Require at least 2-3 missed heartbeats before declaring failure.

Real-world tuning (Cassandra): Default phi_convict_threshold of 8 corresponds to ~10-12 seconds detection with default 1-second gossip interval. Tuned lower (6) for faster detection in low-latency networks.
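As an illustration, here is a minimal heartbeat-based failure detector (a Python sketch; the constants are illustrative) applying the timeout formula from the table above:

import time

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats
MISSED_ALLOWED = 3         # declare failure only after 3 missed beats
NETWORK_JITTER = 0.2       # slack for network delay variance

# timeout = interval x missed_count + jitter (formula from the table above)
FAILURE_TIMEOUT = HEARTBEAT_INTERVAL * MISSED_ALLOWED + NETWORK_JITTER

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}          # peer id -> timestamp of last heartbeat

    def record_heartbeat(self, peer_id):
        self.last_seen[peer_id] = time.monotonic()

    def suspected_failed(self, peer_id):
        last = self.last_seen.get(peer_id)
        if last is None:
            return False             # never seen this peer; don't judge yet
        return time.monotonic() - last > FAILURE_TIMEOUT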

Health Checks

Shallow vs. deep health checks:

Type | What It Tests | Failure Modes
Shallow (TCP/HTTP) | Process is running, port open | Misses gray failures, resource exhaustion
Deep (functional) | Critical path works end-to-end | Expensive, may time out, can mask partial failures
Dependency | Checks downstream connectivity | Can cause cascades if a dependency is slow

Design decision: Separate liveness from readiness.

  • Liveness: “Should this process be restarted?” Only fail if the process is fundamentally broken.
  • Readiness: “Should traffic be routed here?” Fail if dependencies are unavailable or load is too high.
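The split can be made concrete with a small sketch (Python; the handler names and dependency check are placeholders): liveness reports only on the process itself, while readiness also consults dependencies and warm-up state, so a slow database takes the instance out of rotation without triggering a restart.

import time

START_TIME = time.monotonic()
DEADLOCK_DETECTED = False            # would be set by a watchdog thread in a real service

def db_ping_ok():
    # Placeholder: a real check would ping the connection pool with a short timeout.
    return True

def liveness():
    # "Should this process be restarted?" Fail only on unrecoverable process state.
    return not DEADLOCK_DETECTED

def readiness():
    # "Should traffic be routed here?" Fail while dependencies are unavailable
    # or during an initial warm-up grace period (5 seconds here, illustrative).
    return db_ping_ok() and (time.monotonic() - START_TIME > 5)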

Anti-pattern: A health check that calls the database. If the DB is slow, health checks time out, the load balancer removes the instance, traffic shifts to the remaining instances, and they in turn slow down under the increased load → cascade.

Differential Observability

Detecting gray failures requires multiple perspectives:

  1. Active probing: Synthetic requests from multiple vantage points
  2. Passive observation: Real traffic metrics (latency, error rate)
  3. Internal instrumentation: Queue depths, thread pool utilization
  4. External correlation: Compare metrics across components

Implementation pattern:

// Simplified gray failure detection (pseudocode)
grayFailureScore =
    healthCheckSuccess      * 0.2 +   // Low weight: often passes during gray failure
    errorRateBelowThreshold * 0.3 +
    p99LatencyNormal        * 0.3 +
    throughputNormal        * 0.2

// Flag a gray failure when health checks pass but the other signals are degraded:
if (healthCheckSuccess && grayFailureScore < 0.7) {
    alertGrayFailure()
}

Timeouts

Every network call needs a timeout. Without one, a hung dependency can exhaust caller resources indefinitely.

Design choices:

Static Timeouts

Mechanism: Fixed timeout value, configured per operation.

Best when:

  • Latency distribution is well-understood and stable
  • Operation has clear SLA requirements
  • Simple is preferred over optimal

Setting static timeouts:

timeout = p99_latency × safety_factor + network_jitter
Example:
- p99 latency: 100ms
- Safety factor: 2-3x
- Network jitter: 10ms
- Timeout: 100ms × 2.5 + 10ms = 260ms (round to 300ms)

Trade-offs:

  • Pros: Simple to understand and debug; predictable behavior
  • Cons: Doesn't adapt to changing conditions; may be too aggressive during GC pauses or too lenient during failures
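For illustration, a sketch applying a static timeout to a network call using the Python standard library (the 300ms value comes from the calculation above; the URL handling is simplified):

import urllib.error
import urllib.request

STATIC_TIMEOUT_S = 0.3   # 300ms, from the p99 x safety-factor + jitter calculation above

def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=STATIC_TIMEOUT_S) as resp:
            return resp.read()
    except (urllib.error.URLError, TimeoutError) as exc:
        # The caller decides what happens next: retry with backoff, fall back, or surface the error.
        raise TimeoutError(f"call to {url} failed or exceeded {STATIC_TIMEOUT_S}s") from exc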

Adaptive Timeouts

Mechanism: The timeout adjusts based on observed latency.

Best when:

  • Latency varies significantly (e.g., batch processing, variable workloads)
  • Operations can tolerate variability
  • Team can operate more complex logic

Implementation approaches:

Approach | Mechanism | Example
Percentile-based | Timeout = p99 + buffer | Netflix's approach: p99 × 1.5
Moving average | Exponentially weighted moving average of recent latencies | gRPC's adaptive approach
Hedging | Start a second request after the p50 latency | Google's "The Tail at Scale"

Trade-offs:

  • Pros: Adapts to changing conditions; can be more aggressive when the service is healthy
  • Cons: More complex to debug; can oscillate under certain conditions; cold start problem (no historical data)

Real-world example (gRPC): gRPC's connection backoff uses exponential backoff with jitter; the initial backoff is configurable, and subsequent reconnection attempts back off up to a configured maximum.
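A sketch of the percentile-based row of the table (Python; the window size, multiplier, and bounds are assumptions, and this is not how gRPC implements it):

from collections import deque

class AdaptiveTimeout:
    def __init__(self, window=1000, multiplier=1.5, floor_s=0.05, cap_s=5.0):
        self.samples = deque(maxlen=window)   # sliding window of recent latencies
        self.multiplier = multiplier
        self.floor_s = floor_s                # lower bound avoids over-aggressive timeouts
        self.cap_s = cap_s                    # upper bound avoids unbounded waits

    def record(self, latency_s):
        self.samples.append(latency_s)

    def current_timeout(self):
        if len(self.samples) < 100:
            return self.cap_s                 # cold start: be lenient until data exists
        ordered = sorted(self.samples)
        p99 = ordered[int(0.99 * (len(ordered) - 1))]
        return min(max(p99 * self.multiplier, self.floor_s), self.cap_s)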

Retries and Backoff

Retries recover from transient failures but can amplify problems when failures are persistent.

Design choices:

Exponential Backoff

Mechanism: Each retry waits longer than the previous one.

wait_time = min(base_delay × 2^(attempt - 1), max_delay)
Example (AWS SDK defaults):
- Base delay: 100ms
- Max delay: 20 seconds
- Attempt 1: 100ms
- Attempt 2: 200ms
- Attempt 3: 400ms
- Attempt 4: 800ms

Why exponential: Linear backoff can still cause coordinated load. Exponential spreads retries over time, reducing coincident retry probability.

Jitter

Critical addition to backoff: Without jitter, clients that fail together retry together.

Jitter strategies:

Strategy | Formula | Use Case
Full jitter | random(0, calculated_delay) | Default choice, maximum spread
Equal jitter | calculated_delay/2 + random(0, calculated_delay/2) | When you need some minimum wait
Decorrelated jitter | random(base, previous_delay × 3) | AWS recommendation, even better spread

AWS’s analysis (Marc Brooker’s blog): Decorrelated jitter showed best results in their simulations, completing work 40% faster than exponential backoff without jitter during contention.
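A sketch of the three jitter strategies, following the formulas in the table (Python; the base and cap values are illustrative):

import random

BASE = 0.1    # 100ms
CAP = 20.0    # 20s

def expo(attempt):
    # Capped exponential backoff; attempt 1 -> BASE, attempt 2 -> 2*BASE, ...
    return min(BASE * 2 ** (attempt - 1), CAP)

def full_jitter(attempt):
    return random.uniform(0, expo(attempt))

def equal_jitter(attempt):
    d = expo(attempt)
    return d / 2 + random.uniform(0, d / 2)

def decorrelated_jitter(previous_delay):
    return min(CAP, random.uniform(BASE, previous_delay * 3))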

Retry Budgets

Problem: Per-request retries can multiply load by the retry factor.

Normal: 100 requests/sec
With 3 retries: Up to 400 requests/sec during failure

Solution: Retry budgets limit total retry rate system-wide.

retry_budget = (total_requests × budget_ratio) - current_retries
Example (Envoy default):
- Budget ratio: 20%
- If baseline is 100 req/s, allow max 20 retries/s
- Once budget exhausted, no more retries until it refills

Real-world implementation (Envoy): Envoy implements retry budgets at the cluster level. Default is 20% of active requests can be retries. Prevents any single misbehaving upstream from consuming all retry capacity.
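A simplified retry-budget sketch in the spirit of the formula above (Python; real implementations such as Envoy's track these counters per cluster over a sliding window):

class RetryBudget:
    def __init__(self, budget_ratio=0.2):
        self.budget_ratio = budget_ratio     # retries may be at most 20% of requests
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        # Allow a retry only while retries stay under budget_ratio of total requests.
        return self.retries < self.requests * self.budget_ratio

    def record_retry(self):
        self.retries += 1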

Not all errors should be retried:

Response | Retry? | Rationale
5xx Server Error | Yes, with backoff | Transient; server may recover
429 Too Many Requests | Yes, with longer backoff | Honor the Retry-After header if present
408 Request Timeout | Yes, with backoff | May be transient
Other 4xx Client Error | No | Request is malformed; retry won't help
Connection refused | Yes, limited | Server may be restarting
Connection reset | Yes | Network blip
SSL handshake failure | No | Certificate/config issue

Idempotency requirement: Only retry operations that are safe to repeat. POST requests that create resources need idempotency keys.
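The table can be condensed into a small decision helper (Python sketch; the error labels are simplified stand-ins for real exception types):

def should_retry(status_code=None, error=None, idempotent=True):
    """Sketch of the retry decision table above."""
    if not idempotent:
        return False                    # only retry operations that are safe to repeat
    if error == "ssl_handshake":
        return False                    # certificate/config issue; retrying won't help
    if error in ("connection_refused", "connection_reset"):
        return True                     # server restarting or network blip
    if status_code in (408, 429):
        return True                     # timeout / throttling: retry with longer backoff
    if status_code is not None and status_code >= 500:
        return True                     # transient server error
    return False                        # other 4xx (or unknown): don't retry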

Circuit Breaker

Mechanism: Stop calling a failing dependency; give it time to recover.

[State diagram] Closed → Open when the failure threshold is exceeded; Open → Half-Open when the reset timeout expires; Half-Open → Closed when a probe succeeds; Half-Open → Open when a probe fails.

States:

State | Behavior
Closed | Normal operation; requests pass through; failures counted
Open | Requests fail immediately; no calls to the dependency
Half-Open | Limited probe requests allowed; success resets to Closed

Design decisions:

Strategy | Mechanism | Best For
Count-based | Open after N failures | Simple, predictable
Rate-based | Open when error rate > X% | Handles varying traffic
Consecutive | Open after N consecutive failures | Avoids flapping on intermittent errors

Real-world example (Hystrix, now deprecated but patterns live on):

// Hystrix default configuration
circuitBreaker.requestVolumeThreshold = 20      // Min requests before evaluation
circuitBreaker.errorThresholdPercentage = 50    // Error rate to trip
circuitBreaker.sleepWindowInMilliseconds = 5000 // Time in Open before probe

Parameter | Consideration
Error threshold | Too low = flapping; too high = slow detection
Volume threshold | Prevents opening on low-traffic services
Reset timeout | Too short = probe during ongoing failure; too long = delayed recovery
Half-open probes | Single probe is fragile; multiple probes add load
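For illustration, a minimal consecutive-failure breaker in Python tying the states and parameters together (thresholds are illustrative, not the defaults of any particular library):

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=5.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                self.state = "half_open"     # reset timeout expired: let a probe through
                return True
            return False                     # fail fast, no call to the dependency
        return True                          # closed, or half-open probe

    def record_success(self):
        self.state = "closed"                # probe (or normal call) succeeded
        self.failures = 0

    def record_failure(self):
        if self.state == "half_open":
            self._trip()                     # probe failed: back to open
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = time.monotonic()
        self.failures = 0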

Anti-pattern: Circuit breaker per request type on same dependency. If the dependency is down, it’s down for all request types. Use per-dependency circuit breakers.

Netflix’s evolution: Hystrix was deprecated in 2018. Netflix moved to adaptive systems (Concurrency Limits) that don’t require manual threshold tuning. The adaptive approach measures actual latency and adjusts concurrency limits using TCP-like AIMD (Additive Increase, Multiplicative Decrease).

Bulkhead

Mechanism: Isolate components so that a failure in one doesn't exhaust shared resources.

Analogy: Ship bulkheads prevent flooding in one compartment from sinking the entire ship.

Design choices:

Thread Pool Isolation

Mechanism: Separate thread pool per dependency.

Service A thread pool: 20 threads
Service B thread pool: 20 threads
Service C thread pool: 10 threads
If Service B hangs, only its 20 threads are blocked.
Service A and C continue operating.

Trade-offs:

  • Pros: Strong isolation; failure is contained
  • Cons: Thread overhead (each pool has threads); under-utilization when traffic is uneven; context switching cost

Semaphore Isolation

Mechanism: Limit concurrent requests via a semaphore; requests run on the caller's thread.

Trade-offs:

  • Pros: No thread overhead; lower latency (no queue/handoff)
  • Cons: If the dependency is slow, the caller thread is blocked; a timeout can't interrupt a blocked thread

When to use which:

Isolation Type | Use When
Thread pool | Dependency is unreliable; need timeout enforcement
Semaphore | Dependency is reliable; latency is critical
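A semaphore-isolation sketch (Python threading; the limits are illustrative): each dependency gets its own compartment, so one slow dependency can block at most its own permitted callers.

import threading

class Bulkhead:
    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing when the compartment is full.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: shedding call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()

# One compartment per dependency, sized to what each dependency can actually handle.
bulkheads = {"db": Bulkhead(50), "cache": Bulkhead(100), "external_api": Bulkhead(20)}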

Connection Pool Isolation

Mechanism: Limit connections per downstream service.

Database connection pool: max 50
Cache connection pool: max 100
External API: max 20

Sizing consideration: Total connections across all instances shouldn’t exceed dependency’s capacity.

Instance count: 10
Per-instance connections: 50
Total: 500 connections
Database max_connections: 500 → OK
Database max_connections: 200 → Risk of connection exhaustion

Load Shedding

Mechanism: Reject excess load to protect capacity for requests that can still be served.

Design choices:

Mechanism: When queue is full, drop oldest requests (they’ve likely already timed out at caller).

Real-world example (Amazon): Amazon uses LIFO processing for some services. Requests that have been waiting longest are most likely to have callers that have already timed out—processing them wastes capacity.

Priority-Based Shedding

Mechanism: Classify requests by importance; shed lower-priority requests first.

Priority | Examples | Shed First?
Critical | Health checks, auth tokens | Never (or last)
High | Paid user requests, real-time APIs | Last resort
Normal | Standard requests | When overloaded
Low | Analytics, batch jobs | First

Implementation: Priority typically in request header or derived from path/user.
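A small sketch of the table above (Python; the load signal, thresholds, and header name are assumptions):

SHED_ORDER = ["low", "normal", "high"]   # "critical" is never shed

def allowed(priority, load_fraction):
    """Shed lower priorities first as load rises (thresholds are illustrative)."""
    if priority == "critical":
        return True
    if load_fraction > 0.95:
        return priority == "high"        # near saturation: only high priority
    if load_fraction > 0.85:
        return priority in ("high", "normal")
    return True                          # healthy: serve everything

def classify(request_headers):
    # Priority typically arrives in a header or is derived from path/user tier.
    return request_headers.get("x-request-priority", "normal")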

Admission Control

Mechanism: Reject requests at the edge before they consume resources.

Approaches:

Approach | Mechanism
Token bucket | Allow N requests per time window
Leaky bucket | Smooth bursty traffic
Adaptive | Adjust the limit based on current latency
Client-based | Per-client quotas

CoDel (Controlled Delay): Admission control based on queue latency rather than queue length. If requests sit in the queue longer than a target (e.g., 5ms), start dropping. This adapts automatically to varying service capacity.
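As a sketch of the token-bucket row of the table above (Python; rate and burst values are illustrative):

import time

class TokenBucket:
    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s        # tokens added per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def admit(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True               # admit the request
        return False                  # reject at the edge before doing any work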

Fallback

When the primary path fails, what's the backup?

Strategy | Description | Trade-off
Cache fallback | Return stale cached data | Data may be outdated
Degraded mode | Return a partial result | Feature may be incomplete
Static default | Return a hardcoded response | May not be appropriate for the user
Fail silent | Return empty, continue | Data loss may go unnoticed
Fail fast | Return an error immediately | Bad UX but honest

Design decision: Fallback appropriateness depends on the use case:

  • Product catalog: Cache fallback is fine; products don’t change every second
  • Account balance: Fail fast; showing stale data causes real harm
  • Recommendations: Degraded mode (show popular items) is reasonable
  • Authentication: No fallback; either authorized or not

Real-world example (Netflix): Netflix’s fallback hierarchy for recommendations:

  1. Personalized recommendations (primary)
  2. Genre-based recommendations (degraded)
  3. Top 10 in region (more degraded)
  4. Static curated list (last resort)

Each level provides worse UX but maintains core functionality.
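As a sketch, a fallback chain in the spirit of this hierarchy (Python; the level functions are placeholders supplied by the caller):

def get_recommendations(user_id, personalized, genre_based, top_in_region, curated):
    """Each argument is a callable returning a list or raising on failure (placeholders)."""
    for source in (personalized, genre_based, top_in_region, curated):
        try:
            result = source(user_id)
            if result:
                return result          # first level that works wins
        except Exception:
            continue                   # degrade to the next level
    return []                          # last resort: empty rather than an error page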

Blast Radius

Beyond individual resilience patterns, architectural decisions determine how far failures spread.

Cell Architecture

Mechanism: Divide the system into independent cells, each serving a subset of users.

AWS’s approach: Many AWS services use cell-based architecture. Each cell:

  • Has independent capacity and resources
  • Serves a subset of customers (by account ID hash)
  • Can fail without affecting other cells

Characteristics:

Aspect | Cell Architecture
Blast radius | Limited to a single cell (a fraction of users)
Scaling | Add more cells
Complexity | Cell-aware routing required
Efficiency | Some redundancy across cells

Real-world example (AWS DynamoDB): DynamoDB uses partition-based isolation. Each partition is a cell. A hot partition affects only data in that partition; other partitions continue normally.
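A sketch of cell-aware routing by account hash (Python; the hash-mod scheme is a simplification, and production routers typically use a mapping table so cells can be rebalanced):

import hashlib

CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]

def cell_for_account(account_id: str) -> str:
    # A stable hash keeps an account pinned to one cell, bounding blast radius to that cell.
    digest = hashlib.sha256(account_id.encode()).digest()
    return CELLS[int.from_bytes(digest[:8], "big") % len(CELLS)]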

Shuffle Sharding

Mechanism: Assign each customer to a random subset of resources.

Why it works: With pure sharding, a bad actor on shard 3 affects all customers on shard 3. With shuffle sharding, the probability that two customers share all resources is exponentially small.

Math:

Workers: 8
Workers per customer: 2
Combinations: C(8,2) = 28
Probability two random customers share both workers: 1/28 ≈ 3.6%

Scaling benefit: Growing the worker pool or the number of workers per customer sharply reduces the chance of full overlap, since that probability falls as 1/C(workers, workers_per_customer).

Real-world example (AWS Route 53): Route 53 uses shuffle sharding for DNS nameservers. Each customer gets 4 nameservers from a pool of hundreds. A DDoS on one customer’s nameservers has minimal probability of affecting another customer’s full set.
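A sketch of shuffle-shard assignment plus the overlap calculation above (Python; seeding the RNG with the customer ID is an assumed way to make assignments deterministic):

import math
import random

WORKERS = list(range(8))
WORKERS_PER_CUSTOMER = 2

def shard_for_customer(customer_id: str):
    # Deterministic per-customer shuffle: the same customer always gets the same subset.
    rng = random.Random(customer_id)
    return sorted(rng.sample(WORKERS, WORKERS_PER_CUSTOMER))

# Probability that two random customers receive the identical subset: 1 / C(8, 2) = 1/28
full_overlap = 1 / math.comb(len(WORKERS), WORKERS_PER_CUSTOMER)
print(shard_for_customer("customer-42"), f"{full_overlap:.1%}")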

Zone Isolation

Design choices:

Scope | Use Case | Trade-off
Single zone | Minimize latency | No zone-failure tolerance
Multi-zone | Survive a zone failure | Cross-zone data transfer cost
Multi-region | Survive a region failure | Complexity, latency, consistency challenges

Capacity planning for zone failure:

Required capacity: 1000 req/s
Zones: 3
Per-zone capacity if any single zone must be able to carry the full load: 1000 req/s each
Total provisioned: 3000 req/s (3x over-provisioned)
Alternative: Accept zero headroom (and possible degradation) during a zone failure
Per-zone capacity: 500 req/s
Normal operation: 1500 req/s capacity for 1000 req/s of load (~67% utilization)
During zone failure: 1000 req/s capacity, 0% headroom

Chaos engineering is empirical validation of system resilience. The goal is to discover weaknesses before they manifest in production incidents.

Core process:

  1. Define steady state (what “normal” looks like)
  2. Hypothesize that steady state continues during failure
  3. Inject failure
  4. Observe if hypothesis holds
  5. Fix discovered weaknesses

Chaos is not: Random destruction in production. It’s controlled experimentation with blast radius limits.

Category | Examples
Infrastructure | VM termination, disk failure, network partition
Application | Process crash, memory exhaustion, CPU contention
Network | Latency injection, packet loss, DNS failure
Dependency | Database slowdown, cache failure, third-party API error

Tool | Scope | Approach
Chaos Monkey | Instance | Randomly terminates instances
Chaos Kong | Region | Simulates entire region failure
Latency Monkey | Network | Injects artificial latency
AWS FIS | AWS resources | Managed fault injection
Gremlin | Multi-platform | Commercial chaos platform
LitmusChaos | Kubernetes | CNCF chaos project

Game days: Scheduled chaos exercises with teams prepared to observe and respond.

Netflix’s approach:

  1. Announce: Teams know chaos is happening (builds confidence)
  2. Define scope: Which services, what failures
  3. Establish abort criteria: When to stop the experiment
  4. Execute: Run chaos scenario
  5. Observe: Monitor dashboards, alerts, user impact
  6. Debrief: Document findings, prioritize fixes

Progression:

  1. Start in non-production (staging/dev)
  2. Graduate to production during low-traffic periods
  3. Eventually run during peak (Netflix runs Chaos Monkey continuously)

Real-world example (AWS): AWS runs “Game Days” internally where teams deliberately break dependencies to validate runbooks and recovery procedures. Findings feed back into service design.

Common Anti-Patterns

The mistake: Immediate retry on failure.

Why it happens: Developer thinks “if it failed, try again immediately.”

The consequence: Retry storm. A service that failed under load receives 3x the load from retries, guaranteeing it stays down.

The fix: Always use exponential backoff with jitter. Consider retry budgets.

The mistake: Service A calls Service B with 5s timeout. Service B calls Service C with 10s timeout.

Why it happens: Each team sets timeouts independently.

The consequence: Service B may wait 10s for C, but A has already timed out at 5s. B’s work is wasted.

The fix: Propagate deadline through the call chain. Each service’s timeout = min(remaining_deadline, own_timeout).
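A deadline-propagation sketch (Python; the absolute-deadline convention and names are assumptions):

import time

OWN_TIMEOUT_S = 5.0

def effective_timeout(deadline_unix_s):
    remaining = deadline_unix_s - time.time()
    if remaining <= 0:
        raise TimeoutError("deadline already expired; don't do wasted work")
    # Each service's timeout = min(remaining_deadline, own_timeout)
    return min(remaining, OWN_TIMEOUT_S)

# Caller side: set an absolute deadline once and pass it downstream (for example in a
# header such as "x-request-deadline"; the name is illustrative) instead of letting each
# hop pick an independent timeout.
deadline = time.time() + 2.0
print(effective_timeout(deadline))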

The mistake: Health check is a simple “ping” endpoint that always returns 200.

Why it happens: Simplest health check to implement.

The consequence: Gray failures go undetected. Service continues receiving traffic despite being degraded.

The fix: Health checks should test critical path. But be careful: deep health checks can cause cascading failures. Use readiness probes that check dependencies, liveness probes that don’t.

The mistake: Error threshold too high or volume threshold never reached.

Why it happens: Defaults may not match your traffic patterns.

The consequence: Circuit breaker provides no protection; might as well not exist.

The fix: Tune based on actual traffic. Log circuit breaker state transitions. Alert on unexpected closed→open transitions.

The mistake: Thread pool of 100 for a dependency that can handle 10 concurrent requests.

Why it happens: Bulkhead sized for expected traffic, not dependency capacity.

The consequence: Bulkhead allows 100 requests through; dependency overwhelmed.

The fix: Size bulkheads to dependency capacity, not caller expectations.

The mistake: Primary path calls Service A. Fallback also calls Service A (different endpoint).

Why it happens: Fallback reuses same client/connection.

The consequence: If Service A is down, both primary and fallback fail.

The fix: Fallback must be independent of primary failure. Use cache, static data, or different service.

Real-world example (Meta outage, 2021): Engineers couldn’t access internal tools to fix the BGP issue because internal tools depended on the same DNS/network that was down. Recovery required physical access to data centers.

Failure handling is not a feature to add later—it’s a fundamental design concern. Key principles:

  1. Assume failure: Every dependency will fail. Design for it.
  2. Detect quickly: Gray failures are the hardest; use differential observability.
  3. Contain blast radius: Bulkheads, cells, and shuffle sharding limit impact.
  4. Degrade gracefully: Fallbacks should provide value even when degraded.
  5. Recover automatically: Self-healing > operator intervention.
  6. Validate continuously: Chaos engineering proves resilience; don’t just hope.

The goal isn’t to prevent all failures—it’s to ensure that when failures occur (and they will), the system degrades gracefully and recovers quickly.

Prerequisites

  • Basic distributed systems concepts (network partitions, replication)
  • Familiarity with microservice architecture
  • Understanding of latency percentiles (p50, p99)
Key Takeaways

  • Failure taxonomy: Fail-stop < crash-recovery < omission < timing < Byzantine. Gray failures are the most dangerous—partial degradation that evades health checks.
  • Detection: Differential observability (multiple perspectives) catches gray failures that single health checks miss.
  • Resilience patterns: Timeouts → retries (with backoff + jitter + budgets) → circuit breakers → bulkheads → load shedding → fallbacks. Layer them.
  • Blast radius: Cell architecture and shuffle sharding limit how many users are affected.
  • Chaos engineering: Empirically validate resilience. Start small, graduate to production.
  • Key insight: Every resilience pattern has its own failure modes. Defense in depth, not silver bullets.
References

  • Huang, P., et al. "Gray Failure: The Achilles' Heel of Cloud-Scale Systems" (HotOS 2017) - Microsoft Research paper defining gray failures
  • Nygard, M. “Release It!” - Foundational book on stability patterns (circuit breaker, bulkhead originated here)
  • Brooker, M. “Exponential Backoff and Jitter” - AWS Architecture Blog analysis of jitter strategies
  • Dean, J. and Barroso, L. “The Tail at Scale” - Google paper on hedged requests and latency percentiles
  • Netflix Tech Blog - Hystrix, resilience patterns, and chaos engineering
  • Amazon Builders’ Library - Timeouts, retries, and backoff; cell-based architecture
  • Google SRE Book - Chapter on handling overload
  • Kubernetes Documentation - Liveness and readiness probes design patterns
