
Load Balancer Architecture: L4 vs L7 and Routing

How load balancers distribute traffic, terminate TLS, and maintain availability at scale. This article covers the design choices behind L4/L7 balancing, algorithm selection, health checking, and session management—with real-world examples from Netflix, Google, and Cloudflare.

[Architecture diagram: Clients (1…N) resolve via DNS/GeoDNS to the edge layer, where an L4 load balancer (TCP/UDP routing) fronts an L7 load balancer (HTTP routing, TLS termination) in the application layer. The L7 tier distributes requests to the backend pool (Backend 1…N). A health checker probes the backends and reports status to service discovery, which maintains the backend registry.]
Two-tier load balancing: L4 for connection distribution, L7 for intelligent request routing. Health checkers continuously verify backend availability.

Load balancing operates at two fundamentally different layers:

  • L4 (Transport): Routes TCP/UDP connections by 5-tuple (protocol, IPs, ports). Latency: 50-100 μs. Throughput: 10-40 Gbps per node. No payload inspection—just packet forwarding.
  • L7 (Application): Parses HTTP/gRPC, routes by URL/header/cookie. Latency: 0.5-3 ms. Throughput: 1-5 Gbps. Enables content-aware routing but requires full request parsing.

Most production systems use both: L4 at the edge for raw throughput, L7 behind it for intelligent routing.

The key design decisions:

Decision | Options | Primary Factor
---------|---------|---------------
L4 vs L7 | TCP routing vs HTTP routing | Need for content inspection
Algorithm | Round robin, least conn, consistent hash, P2C | Access pattern + statefulness
TLS | Terminate, passthrough, re-encrypt | Security requirements vs performance
Session | Sticky, shared store, stateless | Horizontal scaling needs

L4 load balancers operate on TCP/UDP connection metadata without inspecting payloads. The routing decision uses the 5-tuple: protocol, source IP, source port, destination IP, destination port.

How it works:

  1. Client initiates TCP handshake to load balancer VIP (Virtual IP)
  2. LB selects backend using configured algorithm
  3. LB either NATs the connection (changes dest IP) or uses Direct Server Return (DSR)
  4. All packets for that connection flow to the same backend
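
To make this concrete, here is a minimal sketch of 5-tuple backend selection (addresses and pool are invented; real L4 balancers do this in the kernel or in optimized packet-processing code, and pair the hash with a connection table so existing flows survive pool changes):

import hashlib

def pick_backend(proto, src_ip, src_port, dst_ip, dst_port, backends):
    # Hash the 5-tuple so every packet of a connection maps to the same backend.
    key = f"{proto}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return backends[digest % len(backends)]

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
print(pick_backend("tcp", "203.0.113.7", 51432, "198.51.100.10", 443, backends))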

Performance characteristics:

  • Latency overhead: 50-100 microseconds
  • Throughput: 10-40 Gbps per commodity server
  • Connection table size: Millions of concurrent connections
  • CPU usage: Minimal—no payload parsing

Direct Server Return (DSR):

In DSR mode, the load balancer only handles inbound traffic. Responses go directly from backend to client, bypassing the LB entirely. This eliminates the return-path bottleneck but requires:

  • L4 only (no application-layer features)
  • Kernel configuration on backends (disable ARP on loopback)
  • Loss of TLS termination capability

Best for: High-volume TCP services (databases, message queues), UDP workloads (DNS, gaming, video streaming), and TLS passthrough where backend decryption is required.

L7 load balancers terminate the TCP connection and parse the application protocol (HTTP, gRPC, WebSocket). Routing decisions use request content: URL path, headers, cookies, method.

How it works:

  1. Client completes TCP + TLS handshake with LB
  2. LB receives full HTTP request
  3. LB inspects headers/path and selects backend
  4. LB opens new connection to backend (connection pooling)
  5. LB forwards request, possibly modifying headers
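
A toy version of the routing decision in step 3, with hypothetical rules and pool names (production proxies such as Envoy or NGINX express the same logic as configuration rather than code):

ROUTES = [
    ("path_prefix", "/api/", "api-pool"),
    ("path_prefix", "/static/", "cdn-pool"),
    ("header", ("Host", "admin.example.com"), "admin-pool"),
]
DEFAULT_POOL = "web-pool"

def route(path, headers):
    # Return the backend pool for an already-parsed HTTP request.
    for kind, match, pool in ROUTES:
        if kind == "path_prefix" and path.startswith(match):
            return pool
        if kind == "header" and headers.get(match[0]) == match[1]:
            return pool
    return DEFAULT_POOL

print(route("/api/v1/users", {"Host": "www.example.com"}))    # api-pool
print(route("/index.html", {"Host": "admin.example.com"}))    # admin-pool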

Performance characteristics:

  • Latency overhead: 0.5-3 milliseconds (TLS + HTTP parsing)
  • Throughput: 1-5 Gbps per server
  • Memory: Higher—must buffer requests
  • CPU: Significant—TLS termination, header parsing

Capabilities enabled by L7:

Feature | How It Works
--------|-------------
Path-based routing | Route /api/* to API servers, /static/* to CDN
Header-based routing | Route by Host, Authorization, custom headers
Request transformation | Add/remove headers, rewrite URLs
Rate limiting | Limit by client ID, API key, IP
A/B testing | Route a percentage of traffic by cookie
Circuit breaking | Stop routing to failing backends

Best for: HTTP APIs, microservices, WebSocket applications, anything requiring content-aware routing or TLS termination.

Choosing between L4 and L7:

Factor | Choose L4 | Choose L7
-------|-----------|----------
Protocol | Non-HTTP (raw TCP/UDP) | HTTP/HTTPS/gRPC
Throughput need | > 10 Gbps | < 5 Gbps acceptable
Latency sensitivity | Sub-millisecond required | Milliseconds acceptable
Routing complexity | Simple round robin | Content-based routing
TLS handling | Passthrough to backend | Terminate at LB
Connection reuse | Not needed | Beneficial (connection pooling)

Production pattern: Most architectures use both. L4 at the edge distributes across L7 pools. L7 handles application routing. Example: AWS NLB (L4) fronting ALB (L7), or Google’s Maglev (L4) fronting Envoy (L7).

Round Robin

Mechanism: Distribute requests to backends in sequence. Backend 1, then 2, then 3, then back to 1.

When to use:

  • Homogeneous backend capacity
  • Request durations are similar
  • No state requirements

Trade-offs:

  • Simple, deterministic, minimal overhead
  • Even distribution over time
  • Ignores backend load—slow backends get same traffic
  • Long requests can pile up on one server

Weighted variant: Assign weights proportional to capacity. A server with weight 3 gets 3x the requests of weight 1.
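
A minimal sketch of the weighted variant (names and weights are invented; NGINX and HAProxy use a smoother interleaving, but the proportions work out the same):

import itertools

# Hypothetical pool: server "c" has 3x the capacity of "a" and "b".
weights = {"a": 1, "b": 1, "c": 3}

# Naive weighted round robin: repeat each server in proportion to its weight.
schedule = itertools.cycle([s for s, w in weights.items() for _ in range(w)])

print([next(schedule) for _ in range(10)])   # ['a', 'b', 'c', 'c', 'c', 'a', 'b', 'c', 'c', 'c']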

Least Connections

Mechanism: Route to the backend with the fewest active connections.

When to use:

  • Variable request durations
  • Database connections (long-lived)
  • Backends with different capacities (weighted)

Trade-offs:

  • Adapts to current load dynamically
  • Handles slow requests gracefully
  • Requires connection tracking state
  • Doesn’t account for connection “weight” (one heavy query vs many light ones)

Weighted formula:

score = active_connections / weight

Route to server with lowest score.
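
The same scoring rule as a runnable sketch, with an invented pool:

def least_connections(backends):
    # Pick the backend with the lowest active_connections / weight score.
    return min(backends, key=lambda b: b["active"] / b["weight"])

pool = [
    {"name": "db1", "active": 40, "weight": 1},   # score 40.0
    {"name": "db2", "active": 70, "weight": 2},   # score 35.0  <- selected
    {"name": "db3", "active": 90, "weight": 2},   # score 45.0
]
print(least_connections(pool)["name"])            # db2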

Real-world: HAProxy’s leastconn is standard for database connection pools where query times vary from 1ms to 10s.

Consistent Hashing

Mechanism: Hash the request key (client IP, user ID, session ID) to a position on a ring. Route to the next server clockwise on the ring.

When to use:

  • Session affinity without cookies
  • Cache locality (same client hits same cache)
  • Distributed systems with topology changes

Trade-offs:

  • Minimal remapping when servers join/leave (only K/N keys move)
  • Deterministic—same key always hits same server
  • Uneven distribution without virtual nodes
  • Hot keys create hot servers

Virtual nodes: Each physical server maps to 100-1000 positions on the ring. Discord uses 1000 virtual nodes per physical node, achieving <5% load variance after node failures.
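
A minimal ring with virtual nodes (node names are hypothetical, and MD5 is used only for illustration):

import bisect
import hashlib

def _h(key):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    # Toy consistent-hash ring: each node occupies `vnodes` positions.
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def lookup(self, key):
        # First virtual node clockwise from the key's position (wraps around).
        idx = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.lookup("user:42"))   # the same key always maps to the same node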

Real-world example—DynamoDB: Uses consistent hashing for partition distribution. Hot partition keys remain a top operational issue—the algorithm can’t fix uneven key access patterns.

Power of Two Choices (P2C)

Mechanism: Select two random healthy backends and route to the one with fewer active requests.

When to use:

  • Large backend pools (50+ servers)
  • Need simplicity with good load distribution
  • High request rates where O(N) scans are expensive

Trade-offs:

  • O(1) algorithm—constant time regardless of pool size
  • Avoids herding (multiple LBs won’t pick same “best” server)
  • Nearly as effective as full least-connections scan
  • Slightly less optimal than true least-connections

Why it works: The “power of two choices” is a fundamental result in randomized algorithms. Choosing from 2 random options exponentially reduces maximum load compared to choosing 1 random option.
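
The whole algorithm fits in a few lines; this sketch assumes each backend exposes a simple active-request counter:

import random

def p2c(backends):
    # Sample two distinct backends, keep the less loaded one.
    a, b = random.sample(backends, 2)
    return a if a["active"] <= b["active"] else b

pool = [{"name": f"s{i}", "active": random.randint(0, 100)} for i in range(50)]
print(p2c(pool)["name"])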

Real-world: Envoy Proxy uses P2C as its default algorithm. Netflix’s Zuul uses a P2C variant with probation for overload mitigation.

Maglev Hashing

Mechanism: Consistent hashing with a lookup table optimized for speed. Maps the connection 5-tuple to a backend via a precomputed permutation-based lookup table.

When to use:

  • Software load balancers at massive scale
  • Need consistent hashing with faster lookup than ring hash
  • Connection-oriented protocols (connection must stay on same backend)

Trade-offs:

  • Sub-millisecond lookup times
  • Minimal disruption on backend changes
  • Even distribution across backends
  • Slightly less stable than ring hash on churn
  • Complex to implement correctly

Real-world: Google’s Maglev handles millions of connections per second across 100+ backend clusters. The lookup table provides O(1) access while maintaining consistent hashing properties.
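
A compact sketch of Maglev-style table population, shrunk to a 13-entry table for readability (production tables use a large prime such as 65,537; names and hash choice are illustrative):

import hashlib

M = 13  # lookup table size; must be prime so each backend's permutation covers every slot

def _h(name, salt):
    return int.from_bytes(hashlib.sha256(f"{salt}:{name}".encode()).digest()[:8], "big")

def build_table(backends):
    # Each backend has a preferred order of slots ((offset + j * skip) mod M).
    # Backends take turns claiming their next unclaimed preferred slot until the
    # table is full, which yields a near-even and mostly stable assignment.
    offset = {b: _h(b, "offset") % M for b in backends}
    skip = {b: _h(b, "skip") % (M - 1) + 1 for b in backends}
    next_idx = {b: 0 for b in backends}
    table, filled = [None] * M, 0
    while filled < M:
        for b in backends:
            while True:
                slot = (offset[b] + next_idx[b] * skip[b]) % M
                next_idx[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == M:
                break
    return table

print(build_table(["backend-1", "backend-2", "backend-3"]))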

Algorithm comparison:

Factor | Round Robin | Least Conn | Consistent Hash | P2C
-------|-------------|------------|-----------------|----
Request duration variability | Low | High | Any | High
Backend count | Any | Any | Any | 50+
State requirement | None | Connection tracking | None | Connection tracking
Session affinity | No | No | Yes (by key) | No
Computational cost | O(1) | O(N) or O(log N) | O(1) | O(1)
Handles hot spots | No | Partially | No | Partially

TCP Health Checks

How it works: The LB opens a TCP connection to the backend port. If the SYN-ACK arrives within the timeout, the backend is considered healthy.

Parameters:

  • Interval: 5-30 seconds typical
  • Timeout: 2-10 seconds
  • Unhealthy threshold: 2-5 consecutive failures

Limitations: Only verifies network connectivity. Backend may accept TCP but fail HTTP requests (app crashed but socket still bound).
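
A TCP probe is essentially a connect with a deadline; a minimal sketch (the address is hypothetical):

import socket

def tcp_healthy(host, port, timeout=2.0):
    # Healthy if the backend completes the TCP handshake within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(tcp_healthy("10.0.0.1", 8080))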

HTTP Health Checks

How it works: The LB sends an HTTP request (typically GET or OPTIONS) to a health endpoint and expects a 2xx/3xx response.

Parameters:

  • Path: /health, /healthz, /ready
  • Expected status: 200-399
  • Response body validation (optional)

Best practice: Health endpoint should verify critical dependencies:

GET /health

{
  "status": "healthy",
  "database": "connected",
  "cache": "connected",
  "uptime_seconds": 3600
}

But keep checks fast—health endpoints shouldn’t query databases on every call.
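
One way to reconcile "verify dependencies" with "keep it fast" is to cache the dependency probes for a few seconds. A framework-agnostic sketch with hypothetical probe functions:

import time

CACHE_TTL = 5.0                      # re-probe dependencies at most every 5 seconds
_cached = {"at": 0.0, "body": None}

def check_database():                # stand-in for a cheap "SELECT 1"
    return True

def check_cache():                   # stand-in for a Redis PING
    return True

def health():
    # Returns (status_code, body). Cached probes keep the endpoint cheap even
    # when every load balancer polls it every few seconds.
    now = time.monotonic()
    if _cached["body"] is None or now - _cached["at"] > CACHE_TTL:
        db_ok, cache_ok = check_database(), check_cache()
        _cached["at"] = now
        _cached["body"] = {
            "status": "healthy" if db_ok and cache_ok else "unhealthy",
            "database": "connected" if db_ok else "down",
            "cache": "connected" if cache_ok else "down",
        }
    code = 200 if _cached["body"]["status"] == "healthy" else 503
    return code, _cached["body"]

print(health())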

Active vs. Passive Health Checks

Active: LB probes each backend at intervals regardless of traffic.

  • More responsive to failures
  • Adds load to backends (every LB probes every backend, so probe traffic scales with LBs × backends)
  • Can detect failures before user traffic hits them

Passive: LB monitors real traffic responses.

  • Zero additional load
  • Only detects failures when traffic flows
  • Can track gradual degradation (rising error rates)

Production pattern: Use both. Active checks for fast failure detection. Passive checks for anomaly detection (backend returning 500s but responding to health checks).

The total time to detect and react to a failure:

Detection time = (interval × unhealthy_threshold) + timeout

With interval=5s, threshold=3, timeout=2s: 17 seconds worst case.

For faster detection:

Setting | Aggressive | Balanced | Conservative
--------|------------|----------|-------------
Interval | 2s | 5s | 30s
Timeout | 1s | 2s | 5s
Unhealthy threshold | 2 | 3 | 5
Detection time | 5s | 17s | 155s

Trade-off: Aggressive settings increase false positives (transient network blips mark healthy servers as down) and health check load.

When removing a backend (deployment, scaling down), abrupt removal causes in-flight requests to fail.

Connection draining process:

  1. LB stops sending NEW requests to backend
  2. Existing connections continue until complete or timeout
  3. After drain timeout, LB forcibly closes remaining connections
  4. Backend removed from pool

Parameters:

  • Drain timeout: 30-300 seconds typical (AWS default: 300s)
  • In-flight request grace period

Real-world: Zero-downtime deployments require drain timeouts longer than your longest request. If you have 60-second report generation requests, drain timeout must be > 60s.
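
The drain sequence in sketch form (the Backend class is a stand-in; real load balancers drive this from their data plane or via APIs such as target-group deregistration):

import time

class Backend:
    def __init__(self):
        self.accepting = True
        self.in_flight = 3            # pretend three requests are still running

    def close_remaining(self):
        self.in_flight = 0

def drain(backend, timeout=300.0, poll=0.1):
    # 1) stop sending new requests, 2) let in-flight requests finish,
    # 3) force-close whatever is left once the drain timeout expires.
    backend.accepting = False
    deadline = time.monotonic() + timeout
    while backend.in_flight > 0 and time.monotonic() < deadline:
        time.sleep(poll)
        backend.in_flight -= 1        # stand-in for requests completing naturally
    backend.close_remaining()

drain(Backend(), timeout=5.0)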

Thundering Herd on Recovery

When a backend recovers or new capacity is added, all LBs may route traffic to it simultaneously, causing immediate overload.

Mitigation strategies:

  1. Slow start: Gradually increase traffic to recovered backends over 30-60 seconds (sketched after this list)
  2. Connection limits: Cap connections per backend during ramp-up
  3. Health check jitter: Randomize health check timing across LBs
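
A sketch of the slow-start ramp from item 1, assuming a linear ramp with a small floor so a recovered backend immediately receives a trickle of traffic (Envoy exposes a comparable knob as a slow-start window on its round-robin and least-request policies):

import time

def effective_weight(full_weight, added_at, ramp_seconds=60.0, floor=0.05):
    # Scale the backend's weight from near-zero up to full over the ramp window.
    elapsed = time.monotonic() - added_at
    return full_weight * min(1.0, max(elapsed / ramp_seconds, floor))

# A backend re-added 15 seconds ago with weight 100 gets an effective weight of ~25.
print(effective_weight(100, time.monotonic() - 15))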

DNS-level thundering herd: When DNS TTL expires after a failover, all clients resolve simultaneously and hit the same (recovered) endpoint. Mitigation: Use 30-60 second TTLs for failover scenarios, implement staggered TTLs.

TLS Termination

How it works: The LB decrypts TLS, inspects HTTP, and forwards plaintext to the backends.

Advantages:

  • Enables L7 features (routing, transformation, logging)
  • Centralizes certificate management
  • Offloads CPU from backends (TLS is expensive)
  • Enables connection pooling to backends

Disadvantages:

  • LB has access to plaintext traffic
  • LB becomes a performance bottleneck for TLS
  • Certificates must be stored on LB (security surface)

Performance: Modern LBs use hardware acceleration (AES-NI, dedicated TLS ASICs). AWS NLB handles 2.5M TLS handshakes/second per node.

TLS Passthrough

How it works: The LB forwards encrypted traffic unchanged; the backend terminates TLS.

Advantages:

  • End-to-end encryption (LB never sees plaintext)
  • Backends control cipher suites, certificates
  • Simpler LB (L4 only)

Disadvantages:

  • No L7 features (routing must be by IP/port)
  • Each backend manages certificates
  • No connection pooling

Use case: Compliance requirements where LB cannot access plaintext (PCI DSS, healthcare).

TLS Re-Encryption

How it works: The LB terminates the client's TLS session, inspects the content, then establishes a new TLS connection to the backend.

Advantages:

  • Full L7 features
  • Backend-to-LB traffic encrypted
  • Can use different certificates (internal PKI)

Disadvantages:

  • Double TLS overhead (decrypt + re-encrypt)
  • ~20-30% more CPU than termination-only
  • Key management complexity

Use case: Zero-trust networks, sensitive data requiring encryption in transit everywhere.

Choosing a TLS mode:

Requirement | Recommendation
------------|---------------
Need L7 routing | Terminate or re-encrypt
Compliance: LB can't see data | Passthrough
Internal network untrusted | Re-encrypt
Maximum performance | Terminate only
Backend certificate rotation | Terminate (centralize certs)

Production pattern: 68% of enterprise deployments use termination at LB with plaintext to backends in trusted networks. Re-encryption is growing with zero-trust adoption.

Cookie-Based Affinity

How it works: The LB injects a cookie identifying the backend server. Subsequent requests carrying that cookie route to the same backend.

Types:

  • Duration-based: LB-generated cookie with TTL
  • Application-controlled: App sets cookie, LB reads it

Example (HAProxy):

backend servers
    # insert: HAProxy adds a SERVERID cookie to the response
    # indirect: the cookie is consumed by the LB and not forwarded to the server
    # nocache: responses carrying the cookie are marked non-cacheable
    cookie SERVERID insert indirect nocache
    server s1 10.0.0.1:8080 cookie s1
    server s2 10.0.0.2:8080 cookie s2

IP-Hash Affinity

How it works: Hash the client IP to a backend. The same IP always hits the same backend.

Problems:

  • NAT: Thousands of users behind same IP
  • Mobile: IP changes as users move
  • IPv6: Clients may have multiple addresses

Header-Based Affinity

How it works: Route based on a custom header (user ID, session ID, tenant ID).

Use case: Multi-tenant systems where tenant data is sharded.

Benefits of session affinity:

  • Session data stays local (no replication needed)
  • Better cache hit rates
  • Simpler for stateful protocols (WebSocket, gRPC streaming)

Problems with session affinity:

  • Uneven load: Server with long sessions gets overloaded
  • Failure impact: Sticky server dies = all its sessions lost
  • Scaling difficulty: Can’t easily move sessions during scale-out
  • DDoS vulnerability: Attackers can concentrate load

Alternatives to sticky sessions:

Approach | How It Works | Trade-off
---------|--------------|----------
Shared session store | Redis/Memcached stores sessions | Network latency per request
Stateless tokens | JWT or encrypted session in cookie | Token size, can't revoke easily
Client-side storage | LocalStorage/SessionStorage | Limited size, security considerations
Database session table | Session rows in PostgreSQL/MySQL | Higher latency, DB load

Modern recommendation: Avoid sticky sessions for horizontally-scaled services. Use stateless design or shared session stores. Reserve stickiness for truly stateful protocols (WebSocket where server holds connection state).
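
As a concrete instance of the "stateless tokens" row above, a minimal HMAC-signed session token built from the standard library (the key handling and payload are purely illustrative; production systems typically use JWTs plus rotation and revocation strategies):

import base64, hashlib, hmac, json, time

SECRET = b"rotate-me"        # hypothetical key; load from a secret store in practice

def _b64(raw):
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def _unb64(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def issue_token(session, ttl=3600):
    # Payload plus expiry, signed with HMAC and stored client-side, so any
    # backend can validate it without a shared session store.
    payload = json.dumps({**session, "exp": int(time.time()) + ttl}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return _b64(payload) + "." + _b64(sig)

def verify_token(token):
    p64, s64 = token.split(".")
    payload, sig = _unb64(p64), _unb64(s64)
    if not hmac.compare_digest(sig, hmac.new(SECRET, payload, hashlib.sha256).digest()):
        return None                                     # signature mismatch
    data = json.loads(payload)
    return data if data["exp"] > time.time() else None  # None once expired

print(verify_token(issue_token({"user_id": 42})))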

Case Study: Netflix Zuul

Scale: 80+ Zuul clusters, 100+ backend clusters, 1M+ requests/second.

Architecture:

  1. AWS ELB (L4) distributes to Zuul instances
  2. Zuul (L7) handles routing, auth, rate limiting
  3. Ribbon (client-side) balances to backend services

Algorithm choice: P2C with probation. Servers returning errors enter probation—they receive reduced traffic until recovery.

Zone-aware routing: Zuul tracks success rate per availability zone. If zone-A has 90% success and zone-B has 70%, traffic shifts to zone-A.

Why this design: Netflix needs to handle correlated failures (entire AZ goes down). Zone-aware routing provides automatic failover without DNS changes.

Case Study: Google Maglev

Scale: Millions of connections/second, hundreds of backends per service.

Design decisions:

  • L4 only: Maglev doesn’t parse HTTP. Speed is paramount.
  • Consistent hashing: 5-tuple hash ensures connection stability
  • No state replication: Each Maglev instance is stateless. ECMP distributes across Maglevs.
  • Lookup table: 65,537-entry lookup table (the size must be prime) for O(1) backend selection

Why consistent hashing: Connection-oriented protocols (HTTP/2, gRPC) multiplex requests over long-lived connections. Breaking connections means re-establishing TLS, losing request context.

Trade-off accepted: Maglev doesn’t provide L7 features. Those are handled by Envoy proxies behind Maglev.

Case Study: Cloudflare Anycast

Approach: Announce the same IP addresses from all 300+ data centers via BGP. Internet routing delivers users to the nearest DC.

Why no traditional LBs:

  • BGP handles geographic distribution for free
  • Each DC runs identical stack
  • No single point of failure

Challenge: TCP over anycast is tricky—route changes mid-connection break sessions. Cloudflare engineered its edge stack to keep connections working across route flaps.

Trade-off: Requires massive PoP presence. Works for Cloudflare’s edge network, not for typical enterprise deployments.

Case Study: AWS NLB and ALB

Network Load Balancer (NLB):

  • L4 only, handles 1M+ connections/second
  • Preserves source IP (no NAT in passthrough mode)
  • Static IPs (important for firewall rules)
  • Very low, consistent forwarding latency (no application-layer processing)

Application Load Balancer (ALB):

  • L7, HTTP/HTTPS parsing
  • Content-based routing, path/header matching
  • WebSocket and HTTP/2 support
  • Connection pooling to backends

When AWS recommends NLB: Non-HTTP protocols, extreme scale, need for static IPs, latency-sensitive workloads.

When AWS recommends ALB: HTTP APIs, microservices, need for content routing, WebSocket support.

Pitfall: Shallow Health Checks

The mistake: The health endpoint returns 200 but the service can't handle real traffic (database disconnected, cache full, dependencies down).

Why it happens: Health check only verifies process is running, not that it can serve requests.

The consequence: LB routes traffic to “healthy” server that fails every request.

The fix: Health endpoint should verify critical path. At minimum, check database connectivity and essential dependencies. But keep it fast—don’t run full queries.

Pitfall: No Connection Draining

The mistake: Terminating backends immediately during deployments.

Why it happens: Drain timeout slows deployments. Developers disable it for faster iteration.

The consequence: In-flight requests get 502/503 errors. Users see failures during every deploy.

The fix: Set drain timeout >= your longest expected request. For long-running requests, implement graceful shutdown in the application.

Pitfall: Sticky Sessions Without Failover

The mistake: Session affinity with no handling for backend failure.

Why it happens: Session stored only on sticky backend, no replication.

The consequence: When sticky backend fails, user loses session (logged out, cart emptied).

The fix: Either replicate sessions to shared store, or accept session loss and ensure graceful re-authentication.

Pitfall: Long DNS TTLs

The mistake: A DNS TTL of 3600s (1 hour) for load-balanced services.

Why it happens: Default TTL values, reduces DNS query load.

The consequence: After failover, clients continue hitting dead endpoint for up to 1 hour.

The fix: Use 30-60 second TTLs for services requiring fast failover. Accept increased DNS query volume as the trade-off.

Pitfall: Ignoring Backend Saturation

The mistake: The load balancer keeps routing to overloaded backends.

Why it happens: Health checks pass (backend responds) but backend is saturated.

The consequence: Cascade failure—one slow backend makes all requests slow via head-of-line blocking.

The fix: Implement connection limits per backend. Use circuit breakers. Monitor latency percentiles, not just health status.

Load balancer architecture requires matching the technology to your constraints:

  • L4 vs L7: Choose based on throughput needs vs routing intelligence. Use both for large systems—L4 at edge, L7 behind.
  • Algorithm: Round robin for homogeneous workloads, least connections for variable request durations, consistent hashing for stateful routing, P2C for large pools.
  • TLS: Terminate at LB unless compliance requires passthrough. Re-encrypt only when internal network is untrusted.
  • Session affinity: Avoid unless truly necessary. Use shared session stores for horizontal scaling.
  • Health checks: Active for fast failure detection, passive for anomaly tracking. Tune aggressiveness to your availability requirements.

The recurring theme: there’s no universally correct answer. Netflix needs zone-aware routing for regional failures. Google needs sub-millisecond consistent hashing at massive scale. Cloudflare eliminated traditional LBs entirely. Your architecture depends on your specific scale, latency requirements, and operational capabilities.

Prerequisites:

  • TCP/IP networking fundamentals (three-way handshake, connection states)
  • TLS handshake process
  • HTTP/1.1 and HTTP/2 basics
  • DNS resolution

Key takeaways:

  • L4 load balancers route by connection (50-100μs overhead, 10-40 Gbps throughput)
  • L7 load balancers route by request content (0.5-3ms overhead, enables content-aware routing)
  • Algorithm choice depends on request patterns: round robin for uniform, least-conn for variable, consistent hash for stateful
  • Health checks should verify service capability, not just process liveness
  • Connection draining is required for zero-downtime deployments
  • Sticky sessions impede horizontal scaling; prefer shared session stores
