System Design Fundamentals
Core building blocks like caching, load balancing, queues, and consistency.
All Articles (20 articles)
-
Consistency Models and the CAP Theorem
15 min read. Understanding consistency guarantees in distributed systems: from the theoretical foundations of CAP to practical consistency models, their trade-offs, and when to choose each for production systems.
-
Distributed Consensus
13 min read. Understanding how distributed systems reach agreement despite failures: the fundamental algorithms, their design trade-offs, and practical implementations that power modern infrastructure. Consensus is deceptively simple: get N nodes to agree on a value. The challenge is doing so when nodes can fail, messages can be lost, and clocks can drift. This article explores why consensus is provably hard, the algorithms that solve it in practice, and how systems like etcd, ZooKeeper, and CockroachDB implement these ideas at scale.
-
Time and Ordering in Distributed Systems
19 min read. Understanding how distributed systems establish event ordering without a global clock, from the fundamental problem of clock synchronization to the practical algorithms that enable causal consistency, conflict resolution, and globally unique identifiers. Time in distributed systems is not what it seems. Physical clocks drift, networks delay messages unpredictably, and there’s no omniscient observer to stamp events with “true” time. Yet ordering events correctly is essential for everything from database transactions to chat message display. This article explores the design choices for establishing order in distributed systems, when each approach makes sense, and how production systems like Spanner, CockroachDB, and Discord have solved these challenges.
-
Failure Modes and Resilience Patterns
18 min read. Distributed systems fail in complex, often surprising ways. This article covers the taxonomy of failures, from clean crashes to insidious gray failures, and the resilience patterns that mitigate them. The focus is on design decisions: when to use each pattern, how to tune parameters, and what trade-offs you’re accepting.
-
Storage Choices: SQL vs NoSQL
20 min read. Choosing between SQL and NoSQL databases based on data model requirements, access patterns, consistency needs, and operational constraints. This guide presents the design choices, trade-offs, and decision factors that drive storage architecture decisions in production systems.
-
Sharding and Replication
24 min read. Scaling data stores beyond a single machine requires two complementary strategies: sharding (horizontal partitioning) distributes data across nodes to scale writes and storage capacity; replication copies data across nodes to improve read throughput and availability. These mechanisms are orthogonal (you choose a sharding strategy independently from a replication model), but their interaction determines your system’s consistency, availability, and operational complexity. This article covers design choices, trade-offs, and production patterns from systems handling millions of queries per second.
-
Indexing and Query Optimization
19 min read. Database indexes accelerate reads by trading write overhead and storage for faster lookups. This article covers index data structures, composite index design, query planner behavior, and maintenance strategies for production systems.
-
Transactions and ACID Properties
22 min read. Database transactions provide the foundation for reliable data operations: atomicity ensures all-or-nothing execution, consistency maintains invariants, isolation controls concurrent access, and durability guarantees persistence. This article explores implementation mechanisms (WAL, MVCC, locking), isolation level semantics across major databases, distributed transaction protocols (2PC, 3PC, Spanner’s TrueTime), and practical alternatives (sagas, outbox pattern) for systems where traditional transactions don’t scale.
-
Load Balancer Architecture: L4 vs L7 and Routing
14 min read. How load balancers distribute traffic, terminate TLS, and maintain availability at scale. This article covers the design choices behind L4/L7 balancing, algorithm selection, health checking, and session management, with real-world examples from Netflix, Google, and Cloudflare.
-
CDN Architecture and Edge Caching
15 min read. How Content Delivery Networks reduce latency, protect origins, and scale global traffic distribution. This article covers request routing mechanisms, cache key design, invalidation strategies, tiered caching architectures, and edge compute, with explicit trade-offs for each design choice.
-
Caching Fundamentals and Strategies
16 min read. Understanding caching for distributed systems: design choices, trade-offs, and when to use each approach. From CPU cache hierarchies to globally distributed CDNs, caching exploits locality of reference to reduce latency and backend load: the same principle, applied at every layer of the stack.
-
API Gateway Patterns: Routing, Auth, and Policies
18 min read. Centralized traffic management for microservices: design choices for routing, authentication, rate limiting, and protocol translation, with real-world implementations from Netflix, Google, and Amazon.
-
Service Discovery and Registry Patterns
16 min read. How services find each other in dynamic distributed environments where instance locations change continuously. This article covers discovery models, registry design, health checking mechanisms, and production trade-offs across client-side, server-side, and service mesh approaches.
-
DNS Deep Dive
20 min read. The Domain Name System (DNS) is the distributed hierarchical database that maps human-readable domain names to IP addresses. Designed in 1983 by Paul Mockapetris (RFC 882/883, superseded in 1987 by RFC 1034/1035), DNS handles billions of queries per day with sub-100ms latency globally, yet its design choices (UDP transport, caching semantics, hierarchical delegation) create operational nuances that affect failover speed, security posture, and load distribution. This article covers DNS internals, resolution mechanics, record types with design rationale, TTL strategies, load balancing approaches, security mechanisms (DNSSEC, DoH/DoT), and production patterns from major providers.
-
Queues and Pub/Sub: Decoupling and Backpressure
24 min read. Message queues and publish-subscribe systems decouple producers from consumers, enabling asynchronous communication, elastic scaling, and fault isolation. The choice between queue-based and pub/sub patterns (and the specific broker implementation) determines delivery guarantees, ordering semantics, and operational complexity. This article covers design choices, trade-offs, and production patterns from systems handling trillions of messages daily.
-
Event-Driven Architecture
21 min read. Designing systems around events rather than synchronous requests: when events beat API calls, event sourcing vs. state storage, CQRS trade-offs, saga patterns for distributed transactions, and production patterns from systems processing trillions of events daily.
-
RPC and API Design
14 min read. Choosing communication protocols and patterns for distributed systems. This article covers REST, gRPC, and GraphQL: their design trade-offs, when each excels, and real-world implementations. It also covers API versioning, pagination, rate limiting, and documentation strategies that scale.
-
Rate Limiting Strategies: Token Bucket, Leaky Bucket, and Sliding Window
19 min read. Rate limiting protects distributed systems from abuse, prevents resource exhaustion, and ensures fair access. This article examines five core algorithms (their internal mechanics, trade-offs, and production implementations) plus distributed coordination patterns that make rate limiting work at scale.
-
Circuit Breaker Patterns for Resilient Systems
17 min read. Circuit breakers prevent cascading failures by failing fast when downstream dependencies are unhealthy. This article examines design choices (threshold-based vs time-window detection, thread pool vs semaphore isolation, per-host vs per-service scoping), along with production configurations from Netflix, Shopify, and modern implementations in Resilience4j.
-
Capacity Planning and Back-of-the-Envelope Estimates
13 min read. Capacity planning validates architectural decisions before writing code. This article covers the mental models, reference numbers, and calculation techniques that let you estimate QPS, storage, bandwidth, and server counts, transforming a vague “we need to handle millions of users” into concrete infrastructure requirements.