Core Distributed Patterns

Graceful Degradation

Graceful degradation is the discipline of designing distributed systems that maintain partial functionality when components fail, rather than collapsing entirely. The core insight: a system serving degraded responses to all users is preferable to one returning errors to most users. This article covers the pattern variants, implementation trade-offs, and production strategies that separate resilient systems from fragile ones.

[Figure: state diagram of degradation levels. From the Healthy State, triggers such as Component Failure, Overload, Write Path Down, and Cache Miss move the system into degraded states (Stale Data, Reduced Features, Read-Only Mode, Static Fallback); a Cascade or All Fallbacks Exhausted leads to Complete Outage; Recovery paths lead back to Full Functionality.]

Graceful degradation creates multiple intermediate states between full health and complete failure, with recovery paths back to normal operation.

Graceful degradation transforms hard dependencies into soft dependencies through a hierarchy of fallback behaviors. The mental model:

  1. Degradation Hierarchy: Systems define ordered fallback states—from full functionality through progressively simpler modes down to static responses—each trading capability for availability
  2. Failure Isolation: Patterns like circuit breakers, bulkheads, and timeouts contain failures to prevent cascade propagation across service boundaries
  3. Load Management: Admission control and load shedding protect system capacity by rejecting excess work early, keeping latency acceptable for admitted requests
  4. Recovery Coordination: Backoff with jitter prevents thundering herd on recovery; retry budgets cap amplification during degraded states

The key design tension: aggressive fallbacks improve availability but may serve stale or incomplete data. Conservative fallbacks preserve correctness but risk cascade failures. Production systems typically err toward availability, with explicit SLOs defining acceptable staleness.

Approach 1: Fail-Fast Everything

Returning errors immediately when any dependency is unavailable seems honest, but propagates failures upstream. A single slow database query can cascade through dozens of dependent services, each timing out and returning errors to their callers.

Approach 2: Infinite Retries

Retrying failed requests until they succeed appears resilient, but creates retry storms. If a service handles 10,000 requests per second and fails for 10 seconds, naive retries generate 100,000+ additional requests, overwhelming any recovery attempt.

Approach 3: Long Timeouts

Setting generous timeouts (30+ seconds) to “wait for things to recover” exhausts connection pools and threads. A service with 100 threads and 30-second timeouts can only handle 3.3 requests/second during a slowdown—a 1000x capacity reduction.

Distributed systems face a fundamental tension: availability versus correctness. When a dependency fails, you must choose between:

  • Returning an error (correct but unavailable)
  • Returning stale/incomplete data (available but potentially incorrect)
  • Blocking until recovery (neither available nor responsive)

Graceful degradation provides a framework for making this choice deliberately, with explicit trade-offs documented in SLOs.

Graceful degradation works by defining a degradation hierarchy—an ordered list of fallback behaviors activated as failures accumulate:

| Level | State | Behavior | Trade-off |
|---|---|---|---|
| 0 | Healthy | Full functionality | None |
| 1 | Degraded | Serve cached/stale data | Staleness vs availability |
| 2 | Limited | Disable non-critical features | Functionality vs stability |
| 3 | Minimal | Read-only mode | Writes lost vs reads preserved |
| 4 | Static | Return default responses | Personalization vs uptime |
| 5 | Unavailable | Return error | Complete failure |

Each level represents an explicit trade-off. The system progresses through levels only when lower levels become untenable.
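
To make the hierarchy concrete, here is a minimal TypeScript sketch of ordered fallback handlers; the level names follow the table above, and the placeholder functions standing in for real services are hypothetical:

```ts
// Hypothetical sketch of an ordered degradation hierarchy for a read path.
type View = { source: string; items: string[] }

// Placeholder implementations; a real system would call services, caches, etc.
const liveView = async (): Promise<View> => ({ source: "live", items: ["personalized feed"] })
const cachedView = async (): Promise<View> => ({ source: "cache", items: ["cached feed"] })
const limitedView = async (): Promise<View> => ({ source: "limited", items: ["feed without recommendations"] })
const readOnlyView = async (): Promise<View> => ({ source: "read-only", items: ["catalog"] })
const staticView = async (): Promise<View> => ({ source: "static", items: [] })

// Ordered from full functionality (level 0) down to a static response (level 4).
const levels = [
  { level: 0, name: "healthy", fn: liveView },
  { level: 1, name: "degraded (stale data)", fn: cachedView },
  { level: 2, name: "limited (non-critical features off)", fn: limitedView },
  { level: 3, name: "minimal (read-only)", fn: readOnlyView },
  { level: 4, name: "static default", fn: staticView },
]

async function render(): Promise<{ level: number; view: View }> {
  for (const { level, fn } of levels) {
    try {
      return { level, view: await fn() }
    } catch {
      // Fall through to the next, simpler level.
    }
  }
  // Level 5: unavailable; every fallback failed.
  throw new Error("all fallbacks exhausted")
}
```

In practice each handler would also record which level actually served the request, so dashboards can show how often the system runs degraded.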

  1. Monotonic Degradation: Systems move through degradation levels in order; skipping from healthy to unavailable without intermediate states indicates missing fallbacks
  2. Bounded Impact: Any single component failure affects only the features depending on that component; unrelated functionality continues normally
  3. Explicit Recovery: Systems don’t automatically return to healthy state—circuit breakers test recovery, and operators may gate full restoration

| Failure | Impact | Mitigation |
|---|---|---|
| Cascade failure | One service failure propagates to all dependents | Circuit breakers, timeouts, bulkheads |
| Retry storm | Failed requests amplified by retries | Exponential backoff, jitter, retry budgets |
| Thundering herd | Simultaneous recovery overwhelms system | Staggered recovery, jitter on first request |
| Stale data served | Users see outdated information | TTL limits, staleness indicators in UI |
| Split brain | Different servers in different degradation states | Centralized health checks, consensus on state |

Circuit Breaker

When to choose this path:

  • Dependencies have distinct failure modes (timeout, error, slow)
  • You need automatic recovery detection
  • Service mesh or library support available

Key characteristics:

The circuit breaker monitors call success rates and “trips” when failures exceed a threshold, preventing further calls to a failing service. This gives the dependency time to recover without continued load.

Three states:

[Figure: circuit breaker state machine. Closed moves to Open when the failure threshold is exceeded; Open moves to Half-Open after the wait duration elapses; Half-Open returns to Closed when test calls succeed, or back to Open when test calls fail.]

  • Closed: Normal operation, requests pass through, failures tracked
  • Open: Requests fail immediately without calling dependency
  • Half-Open: Limited test requests probe for recovery

Implementation approach:

circuit-breaker.ts

```ts
type CircuitState = "closed" | "open" | "half-open"

interface CircuitBreakerConfig {
  failureThreshold: number // Consecutive failures before opening (e.g., 5)
  successThreshold: number // Successes in half-open before closing (e.g., 2)
  timeout: number          // Time to stay open before probing (e.g., 30000 ms)
  monitorWindow: number    // Sliding window size (e.g., 10); a rate-based variant would track calls in this window
}

class CircuitOpenError extends Error {}

class CircuitBreaker {
  private state: CircuitState = "closed"
  private failures = 0
  private successes = 0
  private lastFailureTime = 0

  constructor(private config: CircuitBreakerConfig) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      // After the open-state timeout, let probe calls through in half-open.
      if (Date.now() - this.lastFailureTime > this.config.timeout) {
        this.state = "half-open"
        this.successes = 0
      } else {
        throw new CircuitOpenError("Circuit is open")
      }
    }
    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  private onSuccess(): void {
    this.failures = 0
    if (this.state === "half-open") {
      this.successes++
      if (this.successes >= this.config.successThreshold) {
        this.state = "closed"
      }
    }
  }

  private onFailure(): void {
    this.failures++
    this.lastFailureTime = Date.now()
    if (this.failures >= this.config.failureThreshold) {
      this.state = "open"
    }
  }
}
```

Production configuration (Resilience4j):

| Parameter | Typical Value | Rationale |
|---|---|---|
| failureRateThreshold | 50% | Opens when half of recorded calls fail |
| slidingWindowSize | 10-20 calls | Enough samples for statistical significance |
| minimumNumberOfCalls | 5 | Don't trip on first few failures |
| waitDurationInOpenState | 10-30s | Time for dependency to recover |
| permittedNumberOfCallsInHalfOpenState | 2-3 | Enough probes to confirm recovery |

Real-world example: Netflix Hystrix

Netflix pioneered circuit breakers at scale, processing tens of billions of thread-isolated calls daily. Their key insight: the circuit breaker’s fallback must be simpler than the primary path.

“The fallback is for giving users a reasonable response when the circuit is open. It shouldn’t try to be clever—a simple cached value or default is better than complex retry logic.” — Netflix Tech Blog

Hystrix is now in maintenance mode; Netflix recommends Resilience4j for new projects, which uses a functional composition model:

Resilience4jExample.java

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.vavr.control.Try;

import java.time.Duration;
import java.util.function.Supplier;

// Configuration
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(10))
    .slidingWindowSize(10)
    .minimumNumberOfCalls(5)
    .permittedNumberOfCallsInHalfOpenState(3)
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("userService", config);

// Usage - functional composition
Supplier<User> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> userService.getUser(userId));

Try<User> result = Try.ofSupplier(decoratedSupplier)
    .recover(throwable -> getCachedUser(userId)); // Fallback
```

Trade-offs vs other paths:

| Aspect | Circuit Breaker | Timeout Only | Retry Only |
|---|---|---|---|
| Failure detection | Automatic (threshold) | Per-request | None |
| Recovery detection | Automatic (half-open) | Manual | None |
| Overhead | State tracking per dependency | Minimal | Minimal |
| Configuration | Multiple parameters | Single timeout | Retry count, backoff |
| Best for | Unstable dependencies | Slow dependencies | Transient failures |

Load Shedding

When to choose this path:

  • System receives traffic spikes beyond capacity
  • Some requests are more important than others
  • You control the server-side admission logic

Key characteristics:

Load shedding rejects excess requests before they consume resources, keeping latency acceptable for admitted requests. The key insight from AWS:

“The goal is to keep serving good latencies to the requests you do accept, rather than serving bad latencies to all requests.” — AWS Builders’ Library

Implementation approaches:

Server-side admission control:

load-shedder.ts

```ts
import { Request, Response, NextFunction } from "express"

interface LoadShedderConfig {
  maxConcurrent: number  // Max concurrent requests
  maxQueueSize: number   // Max waiting requests (queueing not shown in this excerpt)
  priorityHeader: string // Header indicating request priority
}

class LoadShedder {
  private currentRequests = 0

  constructor(private config: LoadShedderConfig) {}

  middleware = (req: Request, res: Response, next: NextFunction) => {
    const priority = this.getPriority(req)
    const capacity = this.getAvailableCapacity()
    // High priority: admit as long as any capacity remains
    // Low priority: admit only while at least 30% capacity remains
    const threshold = priority === "high" ? 0 : 0.3
    if (capacity <= threshold) {
      res.status(503).header("Retry-After", "5").send("Service overloaded")
      return
    }
    this.currentRequests++
    res.on("finish", () => this.currentRequests--)
    next()
  }

  private getPriority(req: Request): "high" | "low" {
    // Payment endpoints are always high priority
    if (req.path.includes("/checkout") || req.path.includes("/payment")) {
      return "high"
    }
    return (req.headers[this.config.priorityHeader] as "high" | "low") ?? "low"
  }

  private getAvailableCapacity(): number {
    return 1 - this.currentRequests / this.config.maxConcurrent
  }
}
```

Priority-based shedding:

| Priority | Shed When Capacity Below | Examples |
|---|---|---|
| Critical | 0% (never shed) | Health checks, payment processing |
| High | 20% | User-facing reads, search |
| Medium | 40% | Background syncs, analytics events |
| Low | 60% | Batch jobs, reports, prefetching |
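
A minimal sketch of wiring those tiers into an admission check; the tier names and thresholds mirror the table above, everything else is hypothetical:

```ts
// Hypothetical mapping of priority tiers to shedding thresholds (mirrors the table above).
type Priority = "critical" | "high" | "medium" | "low"

const shedBelowCapacity: Record<Priority, number> = {
  critical: 0.0, // never shed
  high: 0.2,
  medium: 0.4,
  low: 0.6,
}

// Admit critical traffic unconditionally; otherwise require remaining capacity
// above the tier's threshold.
function shouldAdmit(priority: Priority, availableCapacity: number): boolean {
  return priority === "critical" || availableCapacity > shedBelowCapacity[priority]
}

// Example: at 25% remaining capacity, high traffic is admitted, medium and low are shed.
console.log(shouldAdmit("high", 0.25))   // true
console.log(shouldAdmit("medium", 0.25)) // false
```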

Real-world example: Shopify

Shopify handles 100x traffic spikes during flash sales by shedding non-essential features:

  • Always preserved: Checkout, payment processing, order confirmation
  • Shed early: Product recommendations, recently viewed, wish lists
  • Shed under load: Search suggestions, inventory counts, shipping estimates

“Our pod model means each merchant’s traffic is isolated. But within a pod, we have explicit rules: checkout never degrades. Everything else can pause.” — Shopify Engineering

Real-world example: Google GFE

Google’s Global Front End (GFE) implements multi-tier admission control:

  1. Connection limits: Cap TCP connections per client IP
  2. Request rate limits: Per-user and per-API quotas
  3. Priority queues: Critical traffic bypasses congestion
  4. Adaptive shedding: Increase rejection rate as CPU approaches saturation
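
The adaptive step in item 4 can be implemented by making the rejection probability a function of current utilization. A minimal sketch under that assumption (thresholds and names are illustrative):

```ts
// Hypothetical adaptive shedder: reject a growing fraction of low-priority
// requests as CPU utilization climbs past a target level.
function rejectionProbability(cpu: number, target = 0.7, max = 0.95): number {
  if (cpu <= target) return 0            // below target: accept everything
  if (cpu >= max) return 1               // at saturation: shed all low-priority work
  return (cpu - target) / (max - target) // ramp linearly in between
}

// `readCpuUtilization` would come from the host's metrics; it is a placeholder here.
function admitLowPriority(readCpuUtilization: () => number): boolean {
  return Math.random() >= rejectionProbability(readCpuUtilization())
}

// Example: at 80% CPU with a 70% target, 40% of low-priority requests are rejected.
console.log(rejectionProbability(0.8)) // 0.4
```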

Google SRE recommends:

  • Per-request retry cap: Maximum 3 attempts
  • Per-client retry budget: Keep retries under 10% of normal traffic

Trade-offs:

| Aspect | Load Shedding | No Shedding |
|---|---|---|
| Latency under load | Stable for admitted | Degrades for all |
| Throughput under load | Maintained | Collapses |
| User experience | Some see errors | All see slow |
| Implementation | Requires priority scheme | Simpler |
| Capacity planning | Can overprovision less | Need headroom for spikes |

Bulkheads

When to choose this path:

  • Multiple independent workloads share resources
  • One workload’s failure shouldn’t affect others
  • You need blast radius containment

Key characteristics:

Named after ship compartments that prevent a hull breach from sinking the entire vessel, bulkheads isolate failures to affected components.

Implementation levels:

| Level | Isolation Unit | Use Case |
|---|---|---|
| Thread pools | Per-dependency thread pool | Different latency profiles |
| Connection pools | Per-service connection limits | Database isolation |
| Process | Separate processes/containers | Complete memory isolation |
| Cell | Independent infrastructure stacks | Regional blast radius |

Thread pool isolation:

bulkhead.ts

```ts
interface BulkheadConfig {
  maxConcurrent: number // Max parallel executions
  maxWait: number       // Max queue time (ms)
  name: string          // For metrics/logging
}

class BulkheadFullError extends Error {}
class BulkheadTimeoutError extends Error {}

class Bulkhead {
  private executing = 0
  private waiters: Array<() => void> = [] // simple in-memory wait queue

  constructor(private config: BulkheadConfig) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.executing >= this.config.maxConcurrent) {
      // Bound the wait queue as well so callers fail fast when saturated
      if (this.waiters.length >= this.config.maxConcurrent) {
        throw new BulkheadFullError(`${this.config.name} bulkhead full`)
      }
      await this.waitForCapacity()
    }
    this.executing++
    try {
      return await fn()
    } finally {
      this.executing--
      // Release the next waiter, if any
      this.waiters.shift()?.()
    }
  }

  private waitForCapacity(): Promise<void> {
    return new Promise((resolve, reject) => {
      const timeout = setTimeout(() => {
        reject(new BulkheadTimeoutError(`${this.config.name} wait timeout`))
      }, this.config.maxWait)
      this.waiters.push(() => {
        clearTimeout(timeout)
        resolve()
      })
    })
  }
}
```

Real-world example: AWS Cell-Based Architecture

AWS uses cell-based architecture for blast radius containment:

[Figure: cell-based architecture. A Global Router directs traffic to multiple independent cells (Cell 1, Cell 2, Cell 3), each with its own load balancer, services, and database.]

Each cell:

  • Handles a subset of customers (often by customer ID hash; see the sketch after this list)
  • Has independent database, cache, and service instances
  • Shares no state with other cells
  • Can fail without affecting other cells
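
A minimal sketch of the hash-based cell assignment mentioned in the first bullet (the hash function and cell count are illustrative):

```ts
// Hypothetical sketch: deterministically map a customer ID to one of N cells.
const CELL_COUNT = 8

// Simple FNV-1a string hash; any stable hash works.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

function cellFor(customerId: string): number {
  return fnv1a(customerId) % CELL_COUNT
}

// The same customer always lands in the same cell, so a cell failure
// affects only the customers hashed to it.
console.log(cellFor("customer-1234")) // stable cell index in [0, 7]
```

Because the mapping is deterministic, routing stays cheap and a cell failure maps to a known, bounded set of customers.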

Real-world example: Shopify Pod Model

Shopify isolates each merchant into a “pod”—a fully independent slice with its own:

  • MySQL primary and replicas
  • Redis cluster
  • Memcached nodes
  • Background job workers

A pod failure affects only merchants in that pod, not the entire platform.

Trade-offs:

| Aspect | Bulkhead | Shared Resources |
|---|---|---|
| Resource efficiency | Lower (duplication) | Higher |
| Blast radius | Contained | System-wide |
| Operational complexity | Higher (more units) | Lower |
| Cost | Higher | Lower |
| Recovery time | Faster (smaller scope) | Slower |

Timeouts and Retries

When to choose this path:

  • Failures are transient (network blips, temporary overload)
  • Idempotent operations (safe to retry)
  • Acceptable latency budget for retries

Key characteristics:

Timeouts prevent resource exhaustion from slow dependencies. Retries handle transient failures. Combined poorly, they create retry storms. Combined well, they provide resilience without amplification.

Timeout configuration:

Start with P99.9 latency plus 20-30% buffer:

| Metric | Value | Timeout |
|---|---|---|
| P50 latency | 20ms | |
| P90 latency | 80ms | |
| P99 latency | 300ms | |
| P99.9 latency | 800ms | 1000ms (800ms + 25%) |
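
A minimal sketch of applying such a timeout to an outbound call with `AbortController`; the 1000 ms default follows the table above, and the endpoint in the usage comment is illustrative:

```ts
// Abort the request if it exceeds the timeout derived from P99.9 latency + buffer.
async function fetchWithTimeout(url: string, timeoutMs = 1000): Promise<Response> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    return await fetch(url, { signal: controller.signal })
  } finally {
    clearTimeout(timer)
  }
}

// Usage (illustrative endpoint): callers get a rejection instead of a hung request.
// fetchWithTimeout("https://user-service.internal/users/42")
//   .catch((err) => console.error("timed out or failed:", err))
```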

Retry with exponential backoff and jitter:

retry.ts

```ts
interface RetryConfig {
  maxAttempts: number  // Total attempts (including first)
  baseDelay: number    // Initial delay (ms)
  maxDelay: number     // Cap on delay (ms)
  jitterFactor: number // 0-1, randomness factor
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function retryWithBackoff<T>(fn: () => Promise<T>, config: RetryConfig): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error
      if (attempt < config.maxAttempts - 1) {
        const delay = calculateDelay(attempt, config)
        await sleep(delay)
      }
    }
  }
  throw lastError
}

function calculateDelay(attempt: number, config: RetryConfig): number {
  // Exponential: baseDelay * 2^attempt
  const exponential = config.baseDelay * Math.pow(2, attempt)
  // Cap at maxDelay
  const capped = Math.min(exponential, config.maxDelay)
  // Add jitter: multiply by a random factor between (1 - jitter) and (1 + jitter)
  const jitter = 1 + (Math.random() * 2 - 1) * config.jitterFactor
  return Math.floor(capped * jitter)
}
```

Why jitter matters:

Without jitter, all clients retry at the same intervals, creating synchronized spikes:

```
Time:      0s    1s    2s    4s    8s
Client A:  X     R     R     R     R
Client B:  X     R     R     R     R
Client C:  X     R     R     R     R
           ↓     ↓     ↓     ↓     ↓
Load:      3     3     3     3     3    (synchronized spikes)
```

With jitter, retries spread across time:

```
Time:      0s    1s    2s    3s    4s    5s    6s
Client A:  X     R                 R
Client B:  X           R                 R
Client C:  X                 R                 R
           ↓     ↓     ↓     ↓     ↓     ↓     ↓
Load:      3     1     1     1     1     1     1    (distributed)
```

Jitter the first request too:

Sophie Bits (Cloudflare) notes that even the first request needs jitter when many clients start simultaneously (deployment, recovery):

```ts
// Before making the first request
await sleep(Math.random() * initialJitterWindow)
```

Retry budgets (Google SRE approach):

Instead of per-request retry limits, track retries as a percentage of traffic:

retry-budget.ts

```ts
class RetryBudget {
  private requestCount = 0
  private retryCount = 0
  private readonly budgetPercent = 0.1 // 10% of traffic

  canRetry(): boolean {
    // Allow a retry only if retries are under budget
    return this.retryCount < this.requestCount * this.budgetPercent
  }

  recordRequest(): void {
    this.requestCount++
  }

  recordRetry(): void {
    this.retryCount++
  }

  // Reset counters periodically (e.g., every minute)
  reset(): void {
    this.requestCount = 0
    this.retryCount = 0
  }
}
```
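
One way to combine the budget with backoff is to consult it before each retry. A minimal sketch that reuses the `sleep` helper and `RetryBudget` from the snippets above (wiring details are illustrative):

```ts
// Hypothetical wiring: consult the shared budget before retrying a failed call.
const budget = new RetryBudget()

async function callWithBudget<T>(fn: () => Promise<T>): Promise<T> {
  budget.recordRequest()
  try {
    return await fn()
  } catch (error) {
    // Retry only while retries stay under 10% of recent traffic.
    if (!budget.canRetry()) throw error
    budget.recordRetry()
    await sleep(100 + Math.random() * 100) // backoff with jitter before the single retry
    return fn()
  }
}

// Reset the window periodically so the budget tracks recent traffic, not all-time totals.
setInterval(() => budget.reset(), 60_000)
```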

Production configuration:

| Parameter | Value | Rationale |
|---|---|---|
| Max attempts | 3 | Diminishing returns beyond 3 |
| Base delay | 100ms | Fast enough for user-facing |
| Max delay | 30-60s | Prevent indefinite waits |
| Jitter factor | 0.5 | 50% randomness spreads load |
| Retry budget | 10% | Caps amplification |

Trade-offs:

| Aspect | Aggressive Retries | Conservative Retries |
|---|---|---|
| Transient failure recovery | Better | Worse |
| Amplification risk | Higher | Lower |
| User-perceived latency | Variable | Predictable |
| Resource consumption | Higher | Lower |

Fallbacks

When to choose this path:

  • Some response is better than no response
  • Acceptable to serve stale or simplified data
  • Clear degradation hierarchy defined

Key characteristics:

Fallbacks define what to return when the primary path fails. The hierarchy typically follows: fresh data → cached data → default data → error.

Fallback hierarchy:

fallback-chain.ts

```ts
interface FallbackChain<T> {
  primary: () => Promise<T>
  fallbacks: Array<{
    name: string
    fn: () => Promise<T>
    condition?: (error: Error) => boolean
  }>
  default: T
}

async function executeWithFallbacks<T>(chain: FallbackChain<T>): Promise<{
  result: T
  source: string
  degraded: boolean
}> {
  // Try primary
  try {
    return { result: await chain.primary(), source: "primary", degraded: false }
  } catch (primaryError) {
    // Try fallbacks in order
    for (const fallback of chain.fallbacks) {
      if (fallback.condition && !fallback.condition(primaryError as Error)) {
        continue
      }
      try {
        const result = await fallback.fn()
        return { result, source: fallback.name, degraded: true }
      } catch {
        // Continue to next fallback
      }
    }
    // All fallbacks failed, return default
    return { result: chain.default, source: "default", degraded: true }
  }
}

// Usage example
// (productService, cache, cdnCache, id, and the Product type are assumed to exist elsewhere)
const productChain: FallbackChain<Product> = {
  primary: () => productService.getProduct(id),
  fallbacks: [
    { name: "cache", fn: () => cache.get(`product:${id}`) },
    { name: "cdn", fn: () => cdnCache.getProduct(id) },
  ],
  default: { id, name: "Product Unavailable", price: null },
}
```

Common fallback patterns:

| Pattern | Use Case | Trade-off |
|---|---|---|
| Cached data | Read-heavy workloads | Staleness |
| Default values | Configuration, feature flags | Loss of personalization |
| Simplified response | Complex aggregations | Incomplete data |
| Read-only mode | Write path failures | No updates |
| Static content | Complete backend failure | No dynamic data |

Real-world example: Netflix Recommendations

When Netflix’s recommendation service is slow or unavailable:

  1. Primary: Personalized recommendations from ML pipeline
  2. Fallback 1: Cached recommendations (updated hourly)
  3. Fallback 2: Popular content in user’s region
  4. Fallback 3: Globally popular content
  5. Default: Static “Trending Now” list

The UI doesn’t change—users see recommendations regardless of which tier served them.

Real-world example: Feature Flag Fallbacks

PostHog improved feature flag resilience by implementing local evaluation:

  • Primary: Real-time flag evaluation via API (500ms latency)
  • Fallback: Local evaluation with cached definitions (10-20ms)
  • Default: Hard-coded default values

Result: Flags work even if PostHog’s servers are unreachable.
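
A minimal sketch of that layering, independent of any particular SDK; the flag names, cache shape, and `fetchRemoteFlag` stub are hypothetical:

```ts
// Hypothetical layered flag evaluation: remote API, then locally cached definitions, then hard-coded default.
type FlagDefinitions = Record<string, boolean>

const hardCodedDefaults: FlagDefinitions = { "new-checkout": false }
let cachedDefinitions: FlagDefinitions | null = null

// Stand-in for a real SDK/API call with a short timeout.
async function fetchRemoteFlag(key: string): Promise<boolean> {
  // A real implementation would call the flag service; this stub simulates success.
  return key === "new-checkout"
}

async function isEnabled(key: string): Promise<boolean> {
  try {
    const value = await fetchRemoteFlag(key) // primary: real-time evaluation
    cachedDefinitions = { ...(cachedDefinitions ?? {}), [key]: value }
    return value
  } catch {
    if (cachedDefinitions && key in cachedDefinitions) {
      return cachedDefinitions[key] // fallback: local evaluation from cached definitions
    }
    return hardCodedDefaults[key] ?? false // default: hard-coded value
  }
}
```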

Trade-offs:

| Aspect | Rich Fallbacks | Simple Fallbacks |
|---|---|---|
| User experience | Better degraded UX | Worse degraded UX |
| Implementation complexity | Higher | Lower |
| Testing burden | Higher (more paths) | Lower |
| Cache infrastructure | Required | Optional |
| Staleness risk | Higher | Lower |

[Figure: pattern-selection decision tree for dependency failure handling. Branch on the failure mode: slow or unresponsive dependencies call for a circuit breaker plus timeout (add a bulkhead if isolation is needed, otherwise the circuit breaker alone); overload calls for load shedding when you control server-side admission, or rate limiting on the client side; transient errors call for retry with backoff (retries are safe for idempotent operations, otherwise retry only reads); complete failure calls for a fallback chain.]

Case study: Netflix (Hystrix to Resilience4j)

Context: Netflix serves 200M+ subscribers with hundreds of microservices. A single page load touches dozens of services.

Implementation evolution:

  1. 2011-2012: Hystrix developed internally
  2. 2012: Hystrix open-sourced, became industry standard
  3. 2018: Hystrix enters maintenance mode
  4. 2019+: Resilience4j recommended for new projects

Why the transition?

Hystrix was designed for Java 6/7 with thread pool isolation as the primary mechanism. Resilience4j uses:

  • Java 8 functional composition
  • Lighter weight (only Vavr dependency)
  • Composable decorators (stack circuit breaker + rate limiter + retry)
  • Better metrics integration

Key learnings from Netflix:

“Invest more in making your primary code reliable than in building elaborate fallbacks. A fallback you’ve never tested in production is a fallback that doesn’t work.”

Specific metrics:

  • Hystrix processed tens of billions of thread-isolated calls daily
  • Hundreds of billions of semaphore-isolated calls daily
  • Circuit breaker trip rate: target < 0.1% during normal operation

Case study: AWS cell-based architecture

Context: AWS services must maintain regional isolation—a failure in us-east-1 shouldn't affect eu-west-1.

Implementation:

  • Each cell is an independent stack (compute, storage, networking)
  • Cells share no state; each has own database
  • Shuffle sharding maps customers to multiple cells, minimizing correlated failures (see the sketch after this list)
  • Global routing layer directs traffic to healthy cells
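
A minimal sketch of the shuffle-sharding idea flagged above: each customer gets a small, deterministic subset of cells, so two customers rarely share their entire set (hash, cell count, and shard size are illustrative):

```ts
// Hypothetical shuffle sharding: assign each customer k cells out of N,
// chosen deterministically from the customer ID.
const TOTAL_CELLS = 16
const CELLS_PER_CUSTOMER = 2

// Deterministic per-customer pseudo-random number in [0, 1).
function seededRandom(seed: string, i: number): number {
  let hash = 2166136261
  const input = `${seed}:${i}`
  for (let c = 0; c < input.length; c++) {
    hash ^= input.charCodeAt(c)
    hash = Math.imul(hash, 16777619) >>> 0
  }
  return hash / 0x100000000
}

function shardFor(customerId: string): number[] {
  // Partial Fisher-Yates shuffle: pick CELLS_PER_CUSTOMER distinct cells.
  const cells = Array.from({ length: TOTAL_CELLS }, (_, i) => i)
  for (let i = 0; i < CELLS_PER_CUSTOMER; i++) {
    const j = i + Math.floor(seededRandom(customerId, i) * (TOTAL_CELLS - i))
    ;[cells[i], cells[j]] = [cells[j], cells[i]]
  }
  return cells.slice(0, CELLS_PER_CUSTOMER)
}

// With 16 cells and 2 cells per customer there are 120 possible shards,
// so a single cell failure degrades only customers whose shard includes it.
console.log(shardFor("customer-1234")) // e.g., [5, 11]; stable for this customer
```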

Configuration:

  • Cell size: 1-5% of total capacity per cell
  • Minimum cells: 3 per region for redundancy
  • Health check interval: 5-10 seconds
  • Failover time: < 60 seconds

Results:

  • Blast radius limited to cell size (1-5% of customers)
  • Regional failures don’t cascade globally
  • Recovery is cell-by-cell, not all-or-nothing

Case study: Slack CI/CD circuit breakers

Context: Slack's CI/CD pipeline had cascading failures between systems, causing developer frustration.

Implementation:

Slack’s Checkpoint system implements orchestration-level circuit breakers:

  1. Metrics collection: Error rates, latency percentiles, queue depths
  2. Awareness phase: Alert teams when error rates rise (before tripping)
  3. Trip decision: Automated trip when thresholds exceeded
  4. Recovery: Gradual traffic increase after manual verification

Results (since 2020):

  • No cascading failures between CI/CD systems
  • Increased service availability
  • Fewer flaky developer experiences

“The key insight was sharing visibility before opening the circuit. Teams get a heads-up that their system is approaching the threshold, often fixing issues before the breaker trips.”

Case study: Shopify flash sales

Context: Shopify handles 30TB+ data per minute during peak sales, with 100x traffic spikes.

Implementation:

  • Pod model: Each merchant assigned to a pod
  • Pod contents: Dedicated MySQL, Redis, Memcached, workers
  • Graceful degradation rules:
    • Checkout: Never degrades
    • Cart: Degrades last
    • Recommendations: First to shed
    • Inventory counts: Can show stale data

Tools:

  • Toxiproxy: Simulates network failures before production
  • Packwerk: Enforces module boundaries in monolith

Results:

  • Flash sales handled without checkout degradation
  • Merchant isolation prevents noisy neighbor problems
  • Predictable performance under load

Case study: Discord push notifications

Context: Discord handles 1M+ push requests per minute with extreme traffic spikes during gaming events.

Implementation:

  • GenStage (Elixir): Built-in backpressure for message processing
  • Request coalescing: Deduplicate identical requests in Rust services (see the sketch after this list)
  • Consistent hash routing: Same requests route to same servers, improving deduplication
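
Discord's coalescing layer lives in its Rust services, but the idea is language-agnostic: identical concurrent requests share one in-flight call. A minimal TypeScript sketch of that idea (names are hypothetical):

```ts
// Hypothetical request coalescer: concurrent callers asking for the same key
// share a single in-flight promise instead of issuing duplicate calls.
class RequestCoalescer<T> {
  private inFlight = new Map<string, Promise<T>>()

  async get(key: string, fetcher: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key)
    if (existing) return existing // piggyback on the in-flight request

    const promise = fetcher().finally(() => this.inFlight.delete(key))
    this.inFlight.set(key, promise)
    return promise
  }
}

// Usage sketch: many concurrent lookups for the same user result in one backend call.
// const coalescer = new RequestCoalescer<User>()
// const user = await coalescer.get(`user:${id}`, () => userService.getUser(id))
```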

Results:

  • Eliminated hot partitions
  • Stable latencies during 100x spikes
  • Fewer on-call emergencies

[Figure: tooling decision tree. Starting from the need for graceful degradation, branch on team experience: low experience favors starting with a service mesh (Istio / Linkerd); medium experience favors a resilience library chosen by language (JVM: Resilience4j; Node.js: opossum / cockatiel; Go: go-resiliency / hystrix-go; Python: pybreaker / tenacity); high experience with custom requirements favors building custom patterns.]

| Library | Language | Patterns | Maturity | Notes |
|---|---|---|---|---|
| Resilience4j | Java/Kotlin | CB, Retry, Bulkhead, RateLimiter | Production | Netflix recommended |
| Polly | .NET | CB, Retry, Timeout, Bulkhead | Production | Extensive policy composition |
| opossum | Node.js | Circuit Breaker | Production | Simple, well-tested |
| cockatiel | Node.js | CB, Retry, Timeout, Bulkhead | Production | TypeScript-first |
| go-resiliency | Go | CB, Retry, Semaphore | Production | Simple, idiomatic Go |
| Istio | Service Mesh | CB, Retry, Timeout | Production | No code changes, YAML config |

istio-destination-rule.yaml

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: user-service
spec:
  host: user-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```

istio-virtual-service.yaml

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: user-service
spec:
  hosts:
    - user-service
  http:
    - route:
        - destination:
            host: user-service
      timeout: 5s
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,reset,connect-failure
```

Production checklist:

  • Define degradation hierarchy: Document each level and its trade-offs
  • Identify critical paths: Which features must never degrade?
  • Set timeout SLOs: Based on P99.9 + buffer, not arbitrary values
  • Configure circuit breakers: Tune thresholds for your traffic patterns
  • Implement fallbacks: Test that they actually work under failure
  • Add retry budgets: Prevent amplification (< 10% of traffic)
  • Instrument everything: Metrics for each degradation level
  • Test failure modes: Chaos engineering before production incidents
  • Document recovery procedures: How to verify system is healthy

The mistake: Building fallback paths that are never exercised in production.

Example: A team builds an elaborate cache fallback for their user service. In production, the cache is always warm, so the fallback code path is never executed. When the cache fails during an incident, the fallback has a bug that causes a null pointer exception.

Solutions:

  • Chaos engineering: Regularly fail dependencies in production
  • Fallback testing: Exercise fallback paths in staging/canary
  • Synthetic monitoring: Periodically call fallback paths directly

The mistake: Retrying immediately without backoff or jitter.

Example: A service has 10,000 clients. The database has a 1-second outage. Without jitter, all 10,000 clients retry at exactly 1 second, then 2 seconds, then 4 seconds—creating synchronized spikes that prevent recovery.

Solutions:

  • Always use exponential backoff with jitter
  • Implement retry budgets (< 10% of traffic)
  • Jitter the first request after deployment/restart

The mistake: Circuit breaker opens and closes rapidly, worse than no breaker.

Example: A circuit breaker with failureThreshold=2 and successThreshold=1 trips after two failures, closes after one success, trips again immediately. The constant state changes create overhead without providing stability.

Solutions:

  • Increase minimumNumberOfCalls (at least 5-10)
  • Use sliding window for failure rate, not raw counts
  • Extend waitDurationInOpenState to allow real recovery
  • Require multiple successes in half-open state

The mistake: Serving cached data without indicating staleness to users.

Example: A stock trading app falls back to cached prices during an outage. Users see prices from 30 minutes ago but think they’re current, making trades based on stale data.

Solutions:

  • Show “as of” timestamps prominently
  • Visual indicators for degraded state (yellow banner, icon)
  • Disable actions that require fresh data (trading, purchasing)
  • Set TTL limits on how stale data can be served
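
A minimal sketch of making staleness explicit so the UI can apply the first three solutions above (the response shape and TTL are illustrative):

```ts
// Hypothetical response envelope that makes staleness explicit to the caller/UI.
interface PriceResponse {
  price: number
  asOf: Date               // shown prominently in the UI when degraded
  degraded: boolean        // drives banners/icons
  actionsDisabled: boolean // e.g., disable trading when data is not fresh
}

const MAX_STALENESS_MS = 60_000 // TTL: never serve prices older than 60s

function fromCache(entry: { price: number; fetchedAt: Date }): PriceResponse {
  const age = Date.now() - entry.fetchedAt.getTime()
  if (age > MAX_STALENESS_MS) {
    throw new Error("cached price too stale to serve") // fall through to the next fallback
  }
  return {
    price: entry.price,
    asOf: entry.fetchedAt,
    degraded: true,
    actionsDisabled: true, // fresh data required for trades/purchases
  }
}
```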

The mistake: Sharing thread pools/connection pools across all dependencies.

Example: Service A has 100 threads. Dependency B becomes slow (10s response time). All 100 threads block on B. Dependency C (healthy) becomes unreachable because no threads are available—a slow dependency causes total failure.

Solutions:

  • Separate thread pools per dependency
  • Connection pool limits per downstream service
  • Semaphore isolation for fast dependencies
  • Thread pool isolation for slow/risky dependencies

The mistake: Shedding requests randomly instead of by priority.

Example: Under load, a checkout service sheds 50% of requests randomly. Half of payment confirmations are lost, causing customer support incidents. Meanwhile, prefetch requests for thumbnails continue consuming capacity.

Solutions:

  • Define priority tiers (critical, high, medium, low)
  • Shed lowest priority first
  • Never shed critical paths (payments, health checks)
  • Tag requests with priority at entry point

Graceful degradation is not a single pattern but a discipline: designing systems with explicit failure modes, testing those modes regularly, and accepting that partial functionality serves users better than complete outages.

The most resilient systems share these characteristics:

  • Degradation hierarchy is documented: Teams know exactly what fails first and what never fails
  • Fallbacks are tested: Chaos engineering proves fallbacks work before incidents
  • Amplification is controlled: Retry budgets and jitter prevent self-inflicted outages
  • Blast radius is contained: Bulkheads ensure one failure doesn’t become total failure

Start simple—timeouts and circuit breakers cover most cases. Add complexity (load shedding, cell architecture) only when scale demands it.

Prerequisites:

  • Distributed systems fundamentals (network partitions, CAP theorem)
  • Service-oriented architecture concepts
  • Basic understanding of observability (metrics, traces, logs)

Glossary:

  • Blast Radius: The scope of impact when a component fails—smaller is better
  • Bulkhead: Isolation boundary preventing failures from spreading
  • Circuit Breaker: Pattern that stops calling a failing dependency after threshold exceeded
  • Degradation Hierarchy: Ordered list of fallback behaviors from full to minimal functionality
  • Jitter: Random variation added to timing to prevent synchronized behavior
  • Load Shedding: Rejecting excess requests to maintain latency for admitted requests
  • Retry Budget: Cap on retries as percentage of normal traffic (typically 10%)
  • Thundering Herd: Many clients simultaneously retrying or reconnecting after an outage

Key takeaways:

  • Graceful degradation defines explicit intermediate states between healthy and failed, with documented trade-offs at each level
  • Circuit breakers prevent cascade failures by stopping calls to failing dependencies; tune thresholds based on traffic patterns, not defaults
  • Load shedding protects capacity by rejecting low-priority work early; define priority tiers and never shed critical paths
  • Bulkheads contain blast radius through isolation (thread pools, cells, pods); the trade-off is resource duplication
  • Retries need exponential backoff with jitter plus retry budgets (< 10%) to prevent amplification
  • Fallbacks must be tested regularly through chaos engineering—untested fallbacks don’t work when needed

