Real-World Case Studies

Learning from production incidents and architectural decisions.

All Articles (20 articles)

Facebook 2021 Outage: BGP Withdrawal, DNS Collapse, and the Backbone That Disappeared

System Design / Real-World Case Studies 24 min read

How a routine maintenance command on Facebook’s backbone routers accidentally withdrew all BGP (Border Gateway Protocol) routes for AS32934, making Facebook’s authoritative DNS servers unreachable for ~6 hours on October 4, 2021. With 3.5 billion combined users across Facebook, Instagram, WhatsApp, and Messenger unable to resolve any Meta domain, this incident exposed what happens when DNS infrastructure, remote management, physical security, and internal tooling all depend on the same network backbone—and that backbone vanishes.

AWS Kinesis 2020 Outage: Thread Limits, Thundering Herds, and Hidden Dependencies

System Design / Real-World Case Studies 20 min read

How a routine capacity addition to Amazon Kinesis Data Streams in US-EAST-1 exceeded an OS thread limit on every front-end server, triggering a 17-hour cascading failure that took down CloudWatch, Lambda, Cognito, and dozens of other AWS services on November 25, 2020—the day before Thanksgiving. This incident is a case study in O(N²) scaling patterns, untested failsafes, and the consequences of monitoring systems that depend on the services they monitor.

Shopify: Pod Architecture for Multi-Tenant Isolation at Scale

System Design / Real-World Case Studies 24 min read

How Shopify evolved from a single-database Rails monolith to a pod-based architecture that isolates millions of merchants into self-contained units — surviving 284 million edge requests per minute during Black Friday 2024 while maintaining sub-10-second failover. This case study examines why sharding alone wasn’t enough, how pods enforce blast radius containment, the zero-downtime migration tooling that moves shops between pods in seconds, and the organizational patterns that let 1,000+ developers deploy 40 times a day to a 2.8-million-line Ruby codebase.

Netflix: From Monolith to Microservices — A 7-Year Architecture Evolution

System Design / Real-World Case Studies 23 min read

In August 2008, a database corruption in Netflix’s monolithic Oracle backend prevented DVD shipments for three days — exposing a single point of failure that threatened the business. Rather than patching the existing architecture, Netflix leadership made a radical decision: migrate entirely to AWS and decompose the monolith into independent microservices. Over 7 years (2009–2016), Netflix grew from 9.4 million to 89 million subscribers, scaled from 20 million to 2 billion API requests per day, and built an open-source ecosystem (Eureka, Hystrix, Zuul, Chaos Monkey) that redefined how the industry thinks about cloud-native architecture. This case study traces the technical decisions, migration phases, tools built, and hard-won lessons from one of the most influential architecture transformations in software history.

Twitter/X: Timeline Architecture and the Recommendation Algorithm

System Design / Real-World Case Studies 21 min read

How Twitter evolved from a monolithic Ruby on Rails app delivering reverse-chronological tweets at 4,600 writes/second into a distributed ML pipeline that processes 500 million tweets daily through a multi-stage recommendation system — and how X replaced all of it with a Grok-based transformer in 2026. This case study traces the full arc: fanout-on-write, the hybrid celebrity problem, the algorithmic timeline controversy, the unprecedented open-sourcing of the recommendation algorithm, and the latest Phoenix/Thunder architecture.

Facebook TAO: The Social Graph’s Distributed Cache

System Design / Real-World Case Studies 14 min read

Facebook’s social graph presents a unique data access problem: billions of users generating trillions of edges (friendships, likes, comments) that must be readable at sub-millisecond latencies and writable at millions of operations per second. TAO (The Associations and Objects) replaced the generic memcache+MySQL architecture with a graph-aware caching system that handles over 1 billion reads per second at 96.4% cache hit rate. This case study examines why generic key-value caches failed for graph data, how TAO’s two-tier write-through architecture solves thundering herd and consistency problems, and what design patterns emerge for caching highly connected data.

Uber: From Monolith to Domain-Oriented Microservices

System Design / Real-World Case Studies 20 min read

How Uber evolved from two monolithic services to 4,000+ microservices and then restructured into domain-oriented architecture, demonstrating that the hardest part of microservices is not splitting the monolith but managing what comes after. Each architectural phase solved real scaling bottlenecks while creating new organizational and operational challenges at the next order of magnitude.

Uber Schemaless: Building a Scalable Datastore on MySQL

System Design / Real-World Case Studies 13 min read

In early 2014, Uber faced an infrastructure crisis: trip data growth was consuming database capacity so rapidly that their systems would fail by year’s end without intervention. Rather than adopting an existing NoSQL solution, Uber built Schemaless—a thin, horizontally scalable layer on top of MySQL that prioritized operational simplicity over feature richness. This case study examines the architectural decisions that enabled Schemaless to scale from crisis point to billions of trips, the trade-offs inherent in its append-only cell model, and the design patterns that made MySQL work as a distributed datastore.

Slack: Scaling a Real-Time Messaging Platform from Monolith to Distributed Architecture

System Design / Real-World Case Studies 22 min read

How Slack evolved from a PHP monolith with workspace-sharded MySQL into a distributed system handling 2.3 million queries per second (QPS) across Vitess-managed databases, 4 million concurrent WebSocket connections through a global edge cache, and a cellular infrastructure design achieving 99.99% availability—all without a single big-bang rewrite. This case study traces the architectural decisions behind each layer: data storage, real-time messaging, edge caching, and reliability infrastructure.

LinkedIn and the Birth of Apache Kafka: Solving the O(N²) Data Integration Problem

System Design / Real-World Case Studies 17 min read

In 2010, LinkedIn faced a data infrastructure crisis: connecting 10 specialized data systems point-to-point required 90 custom pipelines, each prone to failure. Oracle SQL*Loader jobs took 2–3 days with manual babysitting. Despite significant engineering effort, only 14% of their data reached Hadoop. Traditional messaging systems like ActiveMQ required a full TCP/IP round trip per message, which was unacceptable for billions of daily events. Jay Kreps, Neha Narkhede, and Jun Rao built Kafka to solve this: a distributed commit log that decoupled data producers from consumers, enabling any-to-any data flow through a single, scalable pipeline. This case study examines the architectural decisions that made Kafka the backbone of modern data infrastructure.
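
The 90-pipeline figure follows from counting one directed pipeline per ordered pair of systems. A minimal sketch of that arithmetic, where the function names and the "one producer link plus one consumer link per system" model of a central log are illustrative assumptions rather than anything from LinkedIn's tooling:

```python
def point_to_point_pipelines(n_systems: int) -> int:
    """Directed pipelines when every system feeds every other system directly."""
    return n_systems * (n_systems - 1)


def central_log_links(n_systems: int) -> int:
    """Upper bound on links when every system talks only to a shared log.

    Assumption: each system needs at most one producer link and one
    consumer link to the log.
    """
    return 2 * n_systems


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, point_to_point_pipelines(n), central_log_links(n))
```

Doubling the number of systems roughly quadruples the point-to-point count, while the central-log count only doubles.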

Dropbox Magic Pocket: Building Exabyte-Scale Blob Storage

System Design / Real-World Case Studies 22 min read

How Dropbox migrated 500+ petabytes off AWS S3 onto custom infrastructure in under two years, saving $74.6 million net while achieving higher durability than the service it replaced. Magic Pocket is a content-addressable, immutable block store built by a team of fewer than six engineers, now serving 700+ million users across 600,000+ storage drives.

Instagram: From Redis to Cassandra and the Rocksandra Storage Engine

System Design / Real-World Case Studies 21 min read

How Instagram migrated critical workloads from Redis to Apache Cassandra, achieving 75% cost savings, then engineered a custom RocksDB-based storage engine to eliminate JVM garbage collection stalls and reduce P99 read latency by 10x. A seven-year evolution from 12 nodes to 1,000+ nodes serving billions of operations daily.

YouTube: Scaling MySQL to Serve Billions with Vitess

System Design / Real-World Case Studies 23 min read

How YouTube built Vitess to horizontally scale MySQL from 4 shards to 256 across tens of thousands of nodes, serving millions of queries per second without abandoning the relational model—and why they eventually migrated to Spanner anyway.

GitHub: Scaling MySQL from One Database to 1,200+ Hosts

System Design / Real-World Case Studies 20 min read

How GitHub evolved its MySQL infrastructure from a single monolithic database to a fleet of 1,200+ hosts across 50+ clusters serving 5.5 million queries per second (QPS)—through vertical partitioning, custom tooling (gh-ost, orchestrator, freno), Vitess adoption, and a major version upgrade—while keeping github.com available throughout.

Pinterest: MySQL Sharding from Zero to Billions of Objects

System Design / Real-World Case Studies 25 min read

In 2012, Pinterest had 3.2 million users doubling every 45 days, 3 engineers, and five different database technologies all breaking simultaneously. Their fix: abandon every NoSQL database, shard MySQL with a 64-bit ID scheme that embeds shard location directly in every object ID, and never move data between shards. This case study examines how Pinterest’s “boring technology” philosophy produced one of the most enduring database architectures in Silicon Valley, still running in production over a decade later, and the specific design decisions that made it work at 150+ billion objects.
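
An ID that embeds its shard location can be sketched with plain bit packing. The summary above states only that the ID is 64-bit and carries the shard; the field widths used here (16 bits of shard, 10 bits of object type, 36 bits of per-shard local ID) and all function names are assumptions for illustration:

```python
# Assumed field widths; the remaining high bits of the 64-bit ID stay zero.
SHARD_BITS, TYPE_BITS, LOCAL_BITS = 16, 10, 36

TYPE_SHIFT = LOCAL_BITS
SHARD_SHIFT = LOCAL_BITS + TYPE_BITS


def pack_id(shard_id: int, type_id: int, local_id: int) -> int:
    """Combine shard, type, and local ID into one integer."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    if not 0 <= type_id < (1 << TYPE_BITS):
        raise ValueError("type_id out of range")
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local_id out of range")
    return (shard_id << SHARD_SHIFT) | (type_id << TYPE_SHIFT) | local_id


def unpack_id(object_id: int) -> tuple[int, int, int]:
    """Recover (shard_id, type_id, local_id) without any lookup table."""
    shard_id = (object_id >> SHARD_SHIFT) & ((1 << SHARD_BITS) - 1)
    type_id = (object_id >> TYPE_SHIFT) & ((1 << TYPE_BITS) - 1)
    local_id = object_id & ((1 << LOCAL_BITS) - 1)
    return shard_id, type_id, local_id


if __name__ == "__main__":
    oid = pack_id(shard_id=3429, type_id=1, local_id=7075733)
    print(oid, unpack_id(oid))
```

Because the shard is read straight out of the ID, routing a request needs no directory service, which is also why moving an object to another shard would change its ID.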

Stripe: Idempotency for Payment Reliability

System Design / Real-World Case Studies 14 min read

How Stripe prevents double charges and enables safe retries across billions of transactions using idempotency keys, atomic phases, and database-backed state machines. This case study examines the design decisions behind Stripe’s approach—why they chose database transactions over distributed consensus, how they handle foreign state mutations, and the patterns that enabled 99.999% uptime while processing $1 trillion in payments.
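
The core idea of an idempotency key backed by a database transaction can be sketched as follows. This is a simplified illustration, not Stripe's implementation: the table name, the `charge_once` function, and the `charge_fn` callback are hypothetical, and the sketch ignores concurrent requests and partial failures in external systems.

```python
import sqlite3
from typing import Callable


def make_store() -> sqlite3.Connection:
    """Create an in-memory table mapping idempotency keys to saved responses."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE idempotency_keys ("
        "key TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    return db


def charge_once(
    db: sqlite3.Connection,
    key: str,
    amount_cents: int,
    charge_fn: Callable[[int], str],
) -> str:
    """Run charge_fn at most once per key; replay the saved response on retry."""
    with db:  # commits on success, rolls back if an exception escapes
        row = db.execute(
            "SELECT response FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # retry: return the stored result, do not charge again
        response = charge_fn(amount_cents)
        db.execute(
            "INSERT INTO idempotency_keys (key, response) VALUES (?, ?)",
            (key, response),
        )
        return response


if __name__ == "__main__":
    calls = []
    fake_charge = lambda cents: calls.append(cents) or f"charged:{cents}"
    store = make_store()
    print(charge_once(store, "order-42", 1999, fake_charge))
    print(charge_once(store, "order-42", 1999, fake_charge))
    print(len(calls))
```

A client that times out can resend the same request with the same key and receive the original response instead of triggering a second charge.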

Discord: Rewriting Read States from Go to Rust

System Design / Real-World Case Studies 22 min read

How Discord eliminated periodic latency spikes in their most heavily accessed service by rewriting it from Go to Rust—and why the garbage collector, not the application code, was the bottleneck. The Read States service tracks which channels every user has read across billions of states, and its performance directly affects every connection, every message send, and every message acknowledgment on the platform.

Discord: From Billions to Trillions of Messages — A Three-Database Journey

System Design / Real-World Case Studies 23 min read

How Discord evolved its message storage from a single MongoDB replica set to Cassandra to ScyllaDB over 8 years, building Rust-based data services and a custom “super-disk” storage architecture along the way — reducing cluster size from 177 to 72 nodes while dropping p99 read latency from 125ms to 15ms.

Figma: Building Multiplayer Infrastructure for Real-Time Design Collaboration

System Design / Real-World Case Studies 23 min read

How Figma built a real-time collaboration engine that supports 200 concurrent editors per document using a CRDT-inspired, server-authoritative protocol — rejecting both Operational Transformation and pure CRDTs in favor of property-level last-writer-wins with fractional indexing, backed by a Rust multiplayer server and a DynamoDB write-ahead journal processing over 2.2 billion changes per day.

WhatsApp: 2 Million Connections Per Server with Erlang

System Design / Real-World Case Studies 26 min read

How WhatsApp scaled from zero to 1 billion users on Erlang/BEAM and FreeBSD — with fewer than 50 engineers, ~800 servers, and a custom protocol that compressed messages to 20 bytes — by pushing per-server density to limits most teams never attempt.