24 min read · Last updated on Feb 8, 2026

Shopify: Pod Architecture for Multi-Tenant Isolation at Scale

How Shopify evolved from a single-database Rails monolith to a pod-based architecture that isolates millions of merchants into self-contained units — surviving 284 million edge requests per minute during Black Friday 2024 while maintaining sub-10-second failover. This case study examines why sharding alone wasn’t enough, how pods enforce blast radius containment, the zero-downtime migration tooling that moves shops between pods in seconds, and the organizational patterns that let 1,000+ developers deploy 40 times a day to a 2.8-million-line Ruby codebase.

Mermaid diagram
Shopify's pod architecture: stateless workers are shared, stateful stores are isolated per pod, and cross-pod data flows only through an eventually consistent CDC pipeline.

A pod is a fully isolated set of datastores — MySQL shard, Redis, Memcached, cron runners — that serves a subset of shops with zero cross-pod communication. Stateless workers (app servers, job runners) are shared and route to the correct pod via a shop_id header injected by the Sorting Hat (Lua on OpenResty). The core insight: sharding databases horizontally solved write throughput, but left shared resources (Redis, Memcached) as single points of failure. The “Redismageddon” incident (~2016) — where one Redis failure took down all of Shopify — proved that isolation at the data layer, not just partitioning, was the missing design constraint.

Key architectural properties:

  • Blast radius containment: A pod failure affects only the shops on that pod, not the entire platform
  • Independent failover: Each pod has an active + recovery datacenter pair; failover completes in ~1 minute via a Slack command
  • Zero-downtime rebalancing: Ghostferry (open-source, Go) moves shops between pods with 2.5-second average downtime using binlog tailing and formal verification (TLA+)
  • Noisy neighbor elimination: Resource-intensive merchants (celebrity flash sales) get dedicated pods
  • Operational independence: No pod-to-pod queries; cross-pod analytics flow through Debezium CDC into Kafka (65,000 records/second average, P99 < 10 seconds)

Shopify is a multi-tenant e-commerce platform built on Ruby on Rails. The monolith has been in continuous operation since 2006, never rewritten — only restructured.

| Metric | Value (2018) | Value (2024–2025) |
|---|---|---|
| Merchants | 600,000+ | 5.5+ million |
| Database pods | 100+ | 100+ (larger capacity per pod) |
| Peak requests/minute (edge) | 4.8 million | 284 million |
| Peak requests/minute (app server) | | 80 million |
| Ruby codebase | 2.8 million+ lines | Growing |
| Developers | 1,000+ | 3,000+ |
| Deploys per day | ~40 | ~40 |
| Production services | ~100 | 400+ |

Tech stack: Ruby on Rails (monolith), MySQL (primary datastore, petabyte-scale), Redis, Memcached, Apache Kafka, Elasticsearch, Kubernetes on GKE (Google Kubernetes Engine), nginx + OpenResty (load balancing), Go and Lua (infrastructure tooling).

2015: Shopify hit the vertical scaling ceiling — it was no longer possible to buy a larger database server. The team implemented horizontal database sharding using shop_id as the partition key.

Sharding solved write throughput, but it introduced a new class of risk: when any individual shard failed, every shop on that shard became unavailable. As shard count grew, the probability of at least one shard being down at any given moment grew roughly linearly. Worse, critical shared resources — Redis for caching and session storage, Memcached for content caching — remained un-sharded. Every shop on the platform shared these instances.
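The growth in risk is easy to quantify under a simple model (assuming independent shard failures, which real incidents only approximate):

```ruby
# Probability that at least one of n independent shards is down at a
# given instant, if each shard is down with probability p. Assumes
# independent failures, which is only an approximation in practice.
def p_any_shard_down(n, p)
  1.0 - (1.0 - p)**n
end

p_down = 0.001  # illustrative per-shard failure probability

p_any_shard_down(1,   p_down)  # ~0.001
p_any_shard_down(100, p_down)  # ~0.095, close to the linear estimate n * p
```

With 100 shards and even a modest per-shard failure rate, some part of the platform is degraded almost 10% of the time, which is why bounding what "some part" means becomes the real problem.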

A single Redis instance failure took down all of Shopify. Every shop, every storefront, every checkout — offline because of one crashed process. The team had sharded MySQL for throughput, but every shard still depended on the same Redis cluster. The architecture had traded one single point of failure (the database) for another (shared caching infrastructure).

This incident reframed the problem. The question shifted from “how do we scale throughput?” to “how do we contain blast radius?”

Key constraints:

  • No rewrite: Shopify’s monolith is the product. Rewriting to microservices was not an option for a profitable, growing platform with 1,000+ developers shipping daily.
  • Zero-downtime requirement: Merchants depend on Shopify for revenue 24/7. Any migration must be invisible to shop owners.
  • Multi-tenancy is the business model: Shopify runs millions of shops on shared infrastructure. The architecture must handle extreme variance in tenant load — from a single-product hobby shop to Kylie Jenner’s flash sale generating thousands of orders per second.
  • Organizational velocity: ~400 commits merged to master daily (as of 2019). The architecture cannot slow down developer productivity.

Before pods, Shopify’s sharded architecture exhibited three failure patterns:

1. Cross-tenant blast radius: A failure in any shared resource (Redis, Memcached, load balancer config) affected every shop on the platform. Shopify’s SRE (Site Reliability Engineering) team couldn’t bound the impact of any single failure.

2. Noisy neighbor amplification: Celebrity product launches (flash sales) generate traffic spikes 100x normal. When Kylie Jenner drops a new product, the traffic spike for that one shop competes for shared Redis connections, Memcached capacity, and cron scheduling with every other merchant.

3. Failover granularity too coarse: Failing over the entire platform between datacenters is slow, risky, and affects all merchants. There was no way to fail over just the affected shops.

The root cause was architectural, not operational: sharding partitions data but doesn’t partition failure domains.

After database sharding, Shopify’s architecture looked like this:

Mermaid diagram

The MySQL shards were isolated from each other, but every other stateful component was shared. A shard failure affected operations for shops on that shard. A Redis failure affected all shops.

Why it wasn’t obvious: Database sharding is commonly treated as a complete horizontal scaling solution. The failure mode — shared auxiliary services becoming the new bottleneck — only manifests under specific failure conditions or extreme load variance between tenants. In normal operation, shared Redis and Memcached perform well. The problem is revealed by failures and extreme spikes, not by steady-state traffic.

Alternative 1: Microservices

Approach: Break the Rails monolith into independent services, each with its own datastore and scaling characteristics.

Pros:

  • Independent scaling per service
  • Technology flexibility per service
  • Smaller blast radius per service failure

Cons:

  • Massive engineering effort for a 2.8-million-line codebase with 1,000+ developers
  • Distributed systems complexity (distributed transactions, eventual consistency, service mesh)
  • Loss of Rails productivity benefits (ActiveRecord associations, shared models, unified deployment)
  • Years of migration work with uncertain payoff

Why not chosen: Shopify’s monolith is the product — the tight coupling enables rapid feature development. Tobi Lütke (CEO) and the engineering leadership explicitly rejected microservices in favor of keeping the monolith and isolating the infrastructure beneath it. The team observed that many companies’ microservices migrations created more operational complexity than they solved.

Alternative 2: Per-Tenant Databases

Approach: Give each merchant their own database, Salesforce-style.

Pros:

  • Perfect tenant isolation
  • Simple mental model

Cons:

  • 5.5 million separate databases is operationally untenable
  • Schema migrations across millions of databases take weeks or months
  • Connection pooling and resource management at that scale is unsolved
  • Most shops are small — a dedicated database per hobby shop wastes resources

Why not chosen: The variance in merchant size makes per-tenant databases impractical. A pod containing thousands of small shops and a dedicated pod for one large merchant provides better resource utilization.

Alternative 3: Pod Architecture (chosen)

Approach: Group shops into pods. Each pod gets its own complete set of stateful infrastructure (MySQL, Redis, Memcached, cron). Stateless workers remain shared and route to the correct pod per request.

Pros:

  • Blast radius bounded to pod size (typically thousands of shops, not millions)
  • Shared infrastructure benefits retained for stateless components
  • Incremental migration — shops move between pods without downtime
  • Noisy neighbors isolated by assigning them dedicated pods
  • Failover granularity at pod level (~1 minute) instead of platform level

Cons:

  • Cross-pod queries prohibited — analytics and reporting need separate solutions
  • Operational complexity of managing 100+ pods
  • Migration tooling must guarantee zero data loss during shop moves
  • Global features (platform-wide analytics, admin dashboards) require CDC pipelines

Why chosen: Pods solve the blast radius problem without abandoning the monolith. The architecture constrains where failures propagate rather than trying to prevent them entirely.

| Factor | Microservices | Per-Tenant DB | Pod Architecture |
|---|---|---|---|
| Migration effort | Years, high risk | Months, moderate risk | Months, low risk |
| Developer productivity impact | High (new paradigm) | Low | Low |
| Blast radius containment | Per service | Per tenant | Per pod (tunable) |
| Operational complexity | Very high | Very high at scale | Moderate |
| Preserves monolith benefits | No | Yes | Yes |
| Incremental adoption | Difficult | Difficult | Natural |

Before (Sharded Monolith):

Mermaid diagram
Pre-pod architecture: MySQL shards provide write throughput scaling, but Redis, Memcached, and cron are shared across all shops — single points of failure.

After (Pod Architecture):

Mermaid diagram
Pod architecture: each pod is a complete, isolated set of stateful infrastructure. Dedicated pods serve high-traffic merchants. Stateless workers are shared and route via shop_id.

Key differences:

  1. Redis isolated per pod: A Redis failure in Pod 1 affects only Pod 1’s shops
  2. Memcached isolated per pod: Cache invalidation storms are pod-scoped
  3. Cron isolated per pod: Background jobs for one pod can’t starve another pod’s jobs
  4. Dedicated pods for large merchants: Kylie Jenner gets her own pod — her flash sale can’t affect other shops

The Sorting Hat is a Lua script running on nginx + OpenResty load balancers. It determines which pod should handle each incoming request.

Routing flow:

  1. Request arrives at the nginx load balancer
  2. Sorting Hat looks up the request’s host/domain in a routing table
  3. Injects a header identifying the target pod
  4. Forwards the request to the shared application worker pool
  5. The application worker reads the pod header and connects to the correct MySQL, Redis, and Memcached instances

Routing rules are stored as JSON payloads in a control plane, managed via a chatbot interface called spy. This makes routing changes operationally simple — an engineer can move a shop to a different pod by updating the routing table through a chat command.
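In sketch form, the lookup-and-inject step looks like this. The real Sorting Hat is Lua on OpenResty; the Ruby below, the header name, and the example domains are illustrative only:

```ruby
require "json"

# Hypothetical routing-table payload, shaped like the JSON the control
# plane might serve. Domains and the header name are made up.
ROUTING_TABLE = JSON.parse(<<~TABLE)
  {
    "snowdevil.myshopify.com": { "pod": 7 },
    "kylie.myshopify.com":     { "pod": 42 }
  }
TABLE

# Sorting Hat sketch: resolve the request host to a pod and inject a
# header for the shared worker pool downstream.
def route(headers)
  entry = ROUTING_TABLE[headers["Host"]]
  raise "no pod mapping for #{headers["Host"]}" unless entry

  headers.merge("X-Sorting-Hat-Pod" => entry["pod"].to_s)
end

routed = route("Host" => "kylie.myshopify.com")
routed["X-Sorting-Hat-Pod"]  # => "42"
# A worker reads the header and picks that pod's MySQL/Redis/Memcached.
```

Moving a shop to a different pod is then a one-key update to the routing table, which is why the chat-command workflow is operationally safe.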

Shopify runs several other Lua scripts on their OpenResty load balancers — what the team calls their “secret weapon for surviving spikes”:

| Script | Function |
|---|---|
| Sorting Hat | Routes requests to the correct pod |
| BotSquasher | Analyzes Kafka request streams to identify and block bot traffic using pattern detection |
| EdgeCache | Bypasses the application stack entirely, serving cached content directly from Memcached at the load balancer level |
| Pauser | Queues requests during pod failover to prevent error responses reaching merchants |

The Pauser script is particularly notable for failover: instead of returning errors while a pod moves between datacenters, the load balancer holds requests in a queue and replays them once the pod is available on the target datacenter. This is what enables sub-10-second perceived downtime during failover.
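The hold-and-replay behavior can be modeled as a simple queue that drains when the pod returns. This is a sketch, not Shopify's implementation; the real queuing happens in Lua at the load balancer:

```ruby
# Minimal model of the Pauser: while a pod fails over, requests are
# held in a queue instead of being rejected; once the pod is available
# again, queued requests are replayed in arrival order.
class Pauser
  def initialize
    @paused = false
    @queue  = []
  end

  def pause!
    @paused = true
  end

  # Handles a request immediately, or queues it while paused.
  def handle(request, &handler)
    if @paused
      @queue << request
      :queued
    else
      handler.call(request)
      :served
    end
  end

  # Pod is back: drain the queue in order, then serve normally again.
  def resume!(&handler)
    @paused = false
    handler.call(@queue.shift) until @queue.empty?
  end
end

served = []
pauser = Pauser.new
pauser.handle("GET /products") { |r| served << r }   # served immediately
pauser.pause!                                        # pod failover begins
pauser.handle("POST /checkout") { |r| served << r }  # held, no error returned
pauser.resume! { |r| served << r }                   # replayed after failover
```

From the merchant's side, the paused request just looks slow; no 5xx ever leaves the edge.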

MySQL is the primary datastore, operating at petabyte scale. The sharding key is shop_id, which is attached to every shop-owned table.

Design decisions:

  • Application-level sharding: The Rails application determines which shard to query based on shop_id, not a database proxy. This keeps the routing logic in the same codebase where the queries are written.
  • One shard per pod: Each pod contains exactly one MySQL shard. This simplifies the mapping — pod identity equals shard identity.
  • shop_id on every table: Every table that stores shop-specific data includes a shop_id column. This is an invariant enforced across the codebase. Without it, the application can’t route queries to the correct pod.
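A minimal sketch of what those decisions imply in the application layer. The helpers below are hypothetical: the real shop-to-pod mapping is a routing table rather than a modulus, and real connection switching in Rails goes through the framework's multi-database support:

```ruby
# Hypothetical pod lookup for application-level sharding. One shard
# per pod, so pod identity equals shard identity.
POD_COUNT = 100

# Illustrative per-pod connection settings.
POD_DATABASES = (0...POD_COUNT).map do |pod|
  { adapter: "mysql2",
    host: "mysql-pod-#{pod}.internal",
    database: "shopify_pod_#{pod}" }
end

# A modulus stands in for the real routing-table lookup.
def pod_for(shop_id)
  shop_id % POD_COUNT
end

def connection_config(shop_id)
  POD_DATABASES[pod_for(shop_id)]
end

# Because every shop-owned table carries shop_id, any record can be
# routed to exactly one pod's MySQL instance.
connection_config(123)[:host]  # => "mysql-pod-23.internal"
```

This is why the shop_id-on-every-table invariant is non-negotiable: a row without it has no routable home.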

The strict no-cross-pod-queries rule means global views of the data (analytics dashboards, platform-wide reporting) require a separate data path. Shopify built a Change Data Capture (CDC) pipeline using Debezium, Kafka, and Presto.

| Component | Role |
|---|---|
| Debezium | Reads MySQL binlogs from each pod’s shard |
| Kafka Connect | ~150 connectors across 12 Kubernetes pods |
| Confluent Schema Registry | Avro schemas for change events |
| Apache Kafka | Central event bus (400+ TB of CDC data) |
| Presto | Cross-pod analytical queries on the data warehouse |

Pipeline numbers (2021):

  • 65,000 records/second average throughput
  • 100,000 records/second at peak
  • P99 latency under 10 seconds from MySQL insertion to Kafka availability

This pipeline is read-only and eventually consistent. There are no cross-pod writes. If an analytics dashboard shows data that is a few seconds stale, that’s acceptable. If a transactional operation needs data from another pod, it’s a sign the data model needs restructuring — not that the architecture needs cross-pod queries.
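The eventual-consistency contract can be illustrated with a toy materializer that applies change events to an analytics copy, using per-pod binlog position for last-write-wins ordering. This is a sketch only; real events are Avro-encoded Debezium records:

```ruby
# Toy materializer for CDC change events. Each event carries its
# source pod, a monotonically increasing binlog position within that
# pod, the row's primary key, and the new row image (nil for deletes).
class AnalyticsView
  def initialize
    @rows      = {}  # [pod, pk] => row
    @positions = {}  # [pod, pk] => last applied binlog position
  end

  def apply(event)
    key = [event[:pod], event[:pk]]
    # Events may arrive out of order; apply only if newer than what we
    # already have (last-write-wins by binlog position).
    return :stale if (@positions[key] || -1) >= event[:pos]

    @positions[key] = event[:pos]
    if event[:row].nil?
      @rows.delete(key)
    else
      @rows[key] = event[:row]
    end
    :applied
  end

  def row(pod:, pk:)
    @rows[[pod, pk]]
  end

  def count
    @rows.size
  end
end

view = AnalyticsView.new
view.apply(pod: 1, pos: 12, pk: 500, row: { "total" => 25 })
view.apply(pod: 1, pos: 10, pk: 500, row: { "total" => 20 })  # stale, ignored
view.apply(pod: 2, pos: 3,  pk: 501, row: { "total" => 40 })
# The view lags each pod by seconds and never serves a write path.
```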

Ghostferry is Shopify’s open-source (Go) tool for moving shops between MySQL shards — and therefore between pods — without downtime.

Three-phase migration:

Phase 1 — Batch copy + binlog tailing: Ghostferry iterates through source tables, selecting rows by shop_id. Simultaneously, it streams the MySQL binary log, filtering for changes to the relevant shop’s data. Both operations run concurrently across multiple tables.

Phase 2 — Cutover: When the binlog queue reaches near real-time (seconds of lag), Ghostferry acquires application-level locks (multi-reader-single-writer, backed by Redis) to halt writes to the shop being moved. It records the final binlog coordinate as the stopping point.

Phase 3 — Routing update: The Sorting Hat’s routing table is updated to point the shop’s domain at the target pod. Locks are released, writes resume against the new pod. A verification suite confirms data integrity.
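The three phases can be condensed into a toy simulation. In-memory hashes stand in for the source and target shards, an array stands in for the shop-filtered binlog stream, and locking and verification are elided:

```ruby
# Toy model of a Ghostferry-style shop move between two shards.
# source/target are { pk => row } hashes; binlog is the stream of
# [pk, row] change events (row = nil means delete) captured for this
# shop while the copy runs.
def move_shop!(source:, target:, binlog:, routing:, shop:, target_pod:)
  # Phase 1: batch copy the shop's rows while writes continue upstream.
  source.each { |pk, row| target[pk] = row.dup if row[:shop_id] == shop }

  # Phase 2: cutover. With writes to this shop briefly locked (not
  # modeled here), drain the binlog tail so the target catches up.
  binlog.each do |pk, row|
    if row.nil?
      target.delete(pk)
    else
      target[pk] = row.dup
    end
  end

  # Phase 3: flip the routing table; writes resume against the new pod.
  routing[shop] = target_pod
end

source  = { 1 => { shop_id: 10, total: 5 }, 2 => { shop_id: 11, total: 7 } }
target  = {}
routing = { 10 => :pod_a }
binlog  = [[1, { shop_id: 10, total: 6 }]]  # a write that landed mid-copy

move_shop!(source: source, target: target, binlog: binlog,
           routing: routing, shop: 10, target_pod: :pod_b)
# target now holds only shop 10's rows, including the mid-copy update.
```

The hard part the toy hides is exactly what the TLA+ specification covers: proving that no interleaving of copy, binlog replay, and concurrent writes can lose or duplicate a row.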

Performance: 2.5 seconds average downtime per shop (2018 numbers). Bart de Water reported at QCon 2022 that most stores experience under 10 seconds of downtime, with larger stores under 20 seconds.

Scale: Hundreds of thousands of shops are moved between pods every year for rebalancing.

Correctness: The central migration algorithm is formally specified in TLA+ (Temporal Logic of Actions). This formal verification caught edge cases that testing alone would not — particularly around race conditions when concurrent writes land on different shards during migration.

As traffic patterns change and new merchants onboard, pods can become imbalanced. The pod balancer uses historical database utilization and traffic data to:

  1. Classify shops by resource requirements
  2. Identify overloaded pods
  3. Move resource-intensive shops to less crowded pods
  4. Assign prominent merchants to dedicated pods for complete isolation

This is a continuous, data-driven process — not a one-time migration. The goal is to keep each pod’s resource utilization within acceptable bounds while ensuring noisy neighbors don’t share pods with many small merchants.
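Greedy longest-processing-time assignment captures the spirit of steps 1 through 4. This is a sketch with made-up load scores; the real balancer works from historical utilization data:

```ruby
# Toy rebalancer: place the biggest shops first, each onto the
# currently least loaded pod. A very large merchant naturally ends up
# alone on a pod, which mirrors the dedicated-pod case. Load scores
# are made-up units.
def rebalance(shops, pod_count:)
  pods = Array.new(pod_count) { {} }
  shops.sort_by { |_, load| -load }.each do |shop, load|
    pods.min_by { |pod| pod.values.sum }[shop] = load
  end
  pods
end

pods = rebalance(
  { "kylie" => 90, "shop-b" => 10, "shop-c" => 10, "shop-d" => 10 },
  pod_count: 2
)
# One pod carries only "kylie"; the three small shops share the other.
```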

Each pod is paired with two datacenters: active and recovery. The Pod Mover tool moves a pod to its recovery datacenter in approximately one minute.

Operational model: Failover is triggered via a Slack command. An SRE types a command, and the pod’s traffic redirects to the recovery datacenter. The Pauser script queues in-flight requests during the switch, so merchants don’t see errors.

Datacenter evacuation: Moving an entire datacenter means evacuating each pod one at a time. Since each pod moves independently in ~1 minute, a full datacenter evacuation takes minutes, not hours.

In 2018, Shopify partnered with Google Cloud to migrate from physical datacenters to GKE. Key numbers from the migration:

  • 800,000+ tenants migrated
  • 50%+ of datacenter workloads moved to GCP (Google Cloud Platform)
  • 400+ production services consolidated on Kubernetes
  • Dale Neufeld (VP of Production Engineering) led the effort

The pod architecture made this migration tractable: each pod could be moved to GCP independently, validated, and rolled back if needed. Without pods, the migration would have been an all-or-nothing event.

Shopify built and open-sourced several tools for ensuring pod-level and platform-level resilience:

| Tool | Purpose | Notes |
|---|---|---|
| Semian | Circuit breaker + bulkheading for Ruby | In production since October 2014. Prevents cascading failures by failing fast when downstream services are unhealthy. |
| Toxiproxy | Network condition simulation and fault injection (Go) | Simulates latency, partitions, and failures in test and staging. |
| Genghis | Internal load generator | Runs scripted user workflows using Lua scripts across 3 GCP regions. Used to validate pod capacity before BFCM. |
| Resiliency Matrix | Operational playbook | Documents service status, failure scenarios, recovery procedures with RTOs (Recovery Time Objectives), on-call coverage. |
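The circuit-breaker half of Semian's job boils down to a small state machine. The sketch below is a generic illustration of the pattern, not Semian's actual API:

```ruby
# Minimal circuit breaker: after `threshold` consecutive failures the
# circuit opens and calls fail fast; after `timeout` seconds one trial
# call is allowed through (half-open), and a success closes it again.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold:, timeout:, clock: -> { Time.now.to_f })
    @threshold, @timeout, @clock = threshold, timeout, clock
    @failures, @opened_at = 0, nil
  end

  def call
    if @opened_at && @clock.call - @opened_at < @timeout
      raise OpenError, "failing fast"  # open: don't touch the sick service
    end
    result = yield                     # closed or half-open: attempt the call
    @failures, @opened_at = 0, nil     # success closes the circuit
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @threshold
    raise
  end
end

# Wrap calls to a flaky dependency; after two consecutive failures the
# breaker fails fast instead of piling requests onto the sick service.
breaker = CircuitBreaker.new(threshold: 2, timeout: 30)
```

Failing fast is what keeps one unhealthy pod dependency from tying up the shared worker pool that serves every pod.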

Parallel to the pod infrastructure work, Shopify restructured the monolith’s internal code organization using a “componentization” approach (initially called “Break-Core-Up-Into-Multiple-Pieces”).

2017: The team catalogued all 6,000 Ruby classes into a spreadsheet, manually labeling each class’s target component. They reorganized from the default Rails structure (models/views/controllers) to domain-driven organization. This was executed via automated scripts in a single large PR.

2020: Shopify released Packwerk, an open-source Ruby gem for static dependency analysis. The monolith was organized into 37 components with defined public entrypoints.

2024 retrospective: Packwerk became less central over time. A retrospective noted that the tool identified problems faster than teams could fix them, and the domain-based component boundaries sometimes misaligned with actual code dependencies. Privacy checks were removed in Packwerk v3.0 due to architectural misalignment.

The componentization story illustrates an important lesson: infrastructure isolation (pods) and code organization (components) are complementary but independent concerns. Pods solved the operational blast radius problem. Components addressed developer cognitive load. Both were necessary, but solving one didn’t solve the other.

Black Friday/Cyber Monday (BFCM) is Shopify’s annual stress test. The pod architecture’s impact is visible in the platform’s ability to handle exponentially growing traffic:

| Metric | BFCM 2018 | BFCM 2024 |
|---|---|---|
| Total sales | $1.5 billion | $11.5 billion |
| Peak sales rate | | $4.6 million/minute |
| Edge requests (total) | | 1.19 trillion |
| Database queries | | 10.5 trillion |
| Database writes | | 1.17 trillion |
| Data processed | | 57.3 petabytes |
| Customers served | | 76+ million |
| Platform uptime | 99.98% | 99.99%+ |

| Metric | Before Pods | After Pods |
|---|---|---|
| Blast radius of Redis failure | All shops | One pod’s shops |
| Failover granularity | Entire platform | Per pod (~1 minute) |
| Flash sale isolation | None (noisy neighbor) | Dedicated pod |
| Shop migration downtime | Hours (manual) | 2.5 seconds (automated) |
| Datacenter evacuation | Hours, high risk | Minutes (pod by pod) |

Shopify’s readiness process demonstrates pod architecture maturity. The team runs escalating load tests in the months before BFCM:

| Test | Peak RPM | Purpose |
|---|---|---|
| Test 1 | Baseline | Validate monitoring |
| Test 4 | 146 million RPM | Sustained load |
| P99 test | 200 million RPM | Extreme spike |
| Peak checkout | 80,000+ checkouts/minute | Payment path stress |

These tests run against the actual production pod infrastructure. Each pod is independently validated for capacity. If a specific pod is underprovisioned, the pod balancer redistributes shops before BFCM.

  • Multi-region deployment: Pods made multi-region trivial — each pod can run in a different region based on merchant geography. This was impractical with a shared-everything architecture.
  • GCP migration tractability: The cloud migration to GKE could proceed pod by pod, with independent validation and rollback per pod.
  • Compliance isolation: Merchants with specific data residency requirements can be placed on pods in compliant regions.
  • No cross-pod transactions: Features that need data from multiple pods require eventual consistency via the CDC pipeline. This constrains certain product features.
  • Operational complexity: Managing 100+ pods requires sophisticated tooling. The pod balancer, Ghostferry, Pod Mover, and Sorting Hat are all critical infrastructure that must be maintained.
  • Schema migrations: Altering MySQL schemas across 100+ shards requires coordination. Shopify uses online schema change tools, but complex migrations still require careful orchestration.
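Coordinating a change across every shard usually means applying it pod by pod with a verification gate between steps. In outline, with `apply` and `verify` as hypothetical stand-ins for an online schema change tool and an integrity check:

```ruby
# Sketch of rolling a schema change across N pods, one at a time,
# stopping the rollout on the first pod that fails verification so a
# bad migration is contained to one pod's blast radius.
def rolling_migrate(pods, apply:, verify:)
  done = []
  pods.each do |pod|
    apply.call(pod)
    return { migrated: done, failed: pod } unless verify.call(pod)
    done << pod
  end
  { migrated: done, failed: nil }
end

applied = []
result = rolling_migrate(
  (1..5).to_a,
  apply:  ->(pod) { applied << pod },  # stand-in for an online DDL run
  verify: ->(pod) { pod != 4 }         # pretend pod 4 fails its check
)
result  # => { migrated: [1, 2, 3], failed: 4 }
```

The same containment logic that bounds runtime failures bounds bad migrations: one pod's shops see the problem, not the whole platform.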

The insight: Horizontal database sharding solves throughput scaling but does nothing for blast radius. Every shared resource above the shard layer (caches, queues, schedulers) remains a single point of failure. True multi-tenant isolation requires isolating the entire stateful stack — not just the database.

How it applies elsewhere:

  • If you’ve sharded your database but share Redis/Memcached across shards, you have the same vulnerability Shopify had pre-pods
  • SaaS platforms with per-tenant databases but shared cache layers are exposed to the same cross-tenant blast radius

Warning signs to watch for:

  • A cache failure affects tenants on multiple shards
  • A single background job queue serves all shards
  • Failover is all-or-nothing, not per-shard

The insight: Pods sacrifice some efficiency (duplicated Redis and Memcached per pod) for isolation. This is the correct tradeoff for multi-tenant systems at Shopify’s scale. The cost of duplicated caches is far less than the cost of a platform-wide outage caused by a shared cache failure.

How it applies elsewhere:

  • Cell-based architecture (AWS, Azure) applies the same principle at cloud-provider scale
  • Kubernetes namespace isolation, while not as strong as Shopify’s pods, follows a similar philosophy

Warning signs to watch for:

  • Cost optimization efforts that consolidate shared resources across tenants
  • Arguments that “cache hit rates will be lower with smaller pools” — true, but irrelevant if the alternative is global outages

The insight: Ghostferry’s 2.5-second shop migration downtime enables continuous rebalancing. Without this capability, pod assignment would be static, leading to growing imbalances and eventually the same noisy-neighbor problem pods were designed to solve.

How it applies elsewhere:

  • Any sharded system that can’t rebalance tenants between shards will develop hotspots
  • The investment in live migration tooling pays dividends in operational flexibility for years

Warning signs to watch for:

  • Shard rebalancing requires maintenance windows
  • Moving a tenant between shards takes hours or involves data loss risk
  • Operations teams avoid rebalancing because it’s too risky

The insight: Shopify used TLA+ to formally specify Ghostferry’s migration algorithm. During the later Vitess migration (for the Shop app), the team encountered approximately 25 bugs — 6+ in Vitess itself — including race conditions with UPDATE row → INSERT row_2 → DELETE row sequences landing on different shards. Formal verification is particularly valuable for stateful operations where ordering and concurrency bugs can cause data loss.

How it applies elsewhere:

  • Any data migration tool handling concurrent reads and writes benefits from formal specification
  • TLA+ is especially suited to verifying invariants in distributed state machines

The insight: Shopify rejected the microservices migration path. Instead, they kept the monolith and isolated the infrastructure it runs on. This preserved developer productivity (~40 deploys/day, ~400 commits/day) while solving the operational isolation problem. The monolith provides a single deployment unit, unified type system, and shared abstractions that microservices sacrifice.

How it applies elsewhere:

  • Not every scaling problem requires decomposing the application. Sometimes the problem is infrastructure isolation, not code organization.
  • The “modular monolith” pattern (pods + componentized code) provides many benefits of microservices without the distributed systems overhead.

The insight: Redismageddon was the catalyst for pods. The most impactful architectural changes often emerge from specific production failures that reveal hidden assumptions. Shopify didn’t design pods in a vacuum — they designed pods because a concrete failure demonstrated that sharding alone was insufficient.

What they’d do differently: Simon Eskildsen and the team have noted that proactively identifying shared resources as blast radius risks — before an outage — would have been preferable. The Resiliency Matrix that Shopify now maintains is the systematic answer to this reactive pattern.

The insight: Pod failover via Slack command, shop moves via chatbot — Shopify built operational tooling that any on-call engineer can invoke without SSH access or deep infrastructure knowledge. This reduces MTTR by removing the “find the right person” bottleneck.

The insight: Pods (infrastructure isolation) and Packwerk (code modularity) solved different problems. Pods contained blast radius. Packwerk reduced cognitive load. Solving one didn’t solve the other. Teams that assume “if we fix the infrastructure, the code will follow” (or vice versa) will be disappointed.

How organization structure affected the outcome: Shopify’s production engineering team owned pod infrastructure independently from the application teams that used Packwerk for code organization. This separation of concerns allowed both efforts to progress without blocking each other.

You might benefit from a pod-like architecture if:

  • You run a multi-tenant platform where tenant traffic variance is high (10x-1000x between smallest and largest)
  • A single cache or queue failure currently affects all tenants
  • Failover is all-or-nothing for your entire platform
  • You need data residency isolation for compliance (GDPR, data sovereignty)
  • Your monolith works well for development but has operational blast radius problems
Pods are probably the wrong fit if:

  • Single-tenant applications: Pods solve multi-tenant isolation. If you have one tenant, you have one pod.
  • Low scale / small team: Managing 100+ pods requires dedicated infrastructure engineering. If you have fewer than 10 engineers, the operational overhead outweighs the benefits.
  • Already decomposed into services: If you’ve already adopted microservices with per-service databases, you’ve solved (or created) a different set of problems. Pods are specifically for monolithic or shared-database architectures.
Questions to ask about your current architecture:

  • Can a single Redis/Memcached failure affect multiple tenants?
  • Is failover granularity coarser than you’d like?
  • Do noisy tenants (traffic spikes) affect other tenants?
  • Do you need per-tenant infrastructure isolation for compliance?
  • Can you move tenants between shards with zero or minimal downtime?

If you want to explore this approach:

  1. Map your shared stateful resources: Identify every component shared across tenant boundaries (cache, queue, scheduler, session store)
  2. Simulate a shared resource failure: What’s the blast radius of a Redis failure? A Memcached failure? A background job queue failure?
  3. Prototype a single “pod”: Pick one stateful resource (e.g., Redis) and isolate it for a subset of tenants. Measure the operational impact.
  4. Build routing first: A Sorting Hat equivalent (tenant-to-pod mapping at the load balancer) is the prerequisite for pod isolation. Without routing, you can’t direct tenants to isolated infrastructure.
  5. Invest in migration tooling early: The value of pods depends on the ability to rebalance tenants between them. If moving a tenant requires downtime, you’ll resist rebalancing, and pods will drift into imbalance.
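Steps 1 and 2 can be made mechanical with a dependency map. Resource and tenant names below are illustrative:

```ruby
# Toy blast-radius map: which tenants go down if one resource fails?
DEPS = {
  "shop-a" => ["mysql-shard-1", "redis-shared", "jobs-shared"],
  "shop-b" => ["mysql-shard-2", "redis-shared", "jobs-shared"],
  "shop-c" => ["mysql-shard-2", "redis-pod-2", "jobs-pod-2"],
}

def blast_radius(resource, deps = DEPS)
  deps.select { |_tenant, resources| resources.include?(resource) }.keys
end

blast_radius("redis-shared")  # => ["shop-a", "shop-b"]  (shared cache)
blast_radius("redis-pod-2")   # => ["shop-c"]            (pod-isolated)
```

Any resource whose blast radius spans multiple shards is a candidate for the pod treatment.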

Shopify’s pod architecture demonstrates that the monolith vs. microservices debate presents a false dichotomy. The scaling problem wasn’t in the application layer — it was in the infrastructure’s failure domain boundaries. By isolating stateful resources into pods while keeping the monolith intact, Shopify achieved blast radius containment, per-pod failover, and noisy neighbor elimination without sacrificing the developer productivity that a unified codebase provides.

The critical lesson is the distinction between throughput scaling and isolation scaling. Sharding solved throughput. Pods solved isolation. Both are necessary at Shopify’s scale, and conflating them — as many teams do when they equate “sharding” with “multi-tenant isolation” — leads to architectures that handle normal load well but fail catastrophically under partial failures.

Assumed background:

  • Understanding of database sharding and horizontal partitioning strategies
  • Familiarity with multi-tenant SaaS architecture patterns
  • Basic knowledge of load balancer request routing
  • Understanding of MySQL replication and binary logs
Key terms:

  • Pod: A fully isolated set of stateful infrastructure (MySQL, Redis, Memcached, cron) serving a subset of shops
  • Sorting Hat: Lua script on nginx/OpenResty that routes requests to the correct pod based on host/domain lookup
  • Ghostferry: Shopify’s open-source Go tool for zero-downtime MySQL data migration between shards/pods
  • Pod Mover: Internal tool for failing a pod over to its recovery datacenter in ~1 minute
  • Pod Balancer: Automated system that redistributes shops between pods based on historical resource utilization
  • CDC (Change Data Capture): Pipeline that streams database changes from pod shards into a central Kafka topic for cross-pod analytics
  • Redismageddon: Internal name for the ~2016 incident where a single Redis failure took down all of Shopify
  • BFCM: Black Friday/Cyber Monday — Shopify’s annual peak traffic event
  • Packwerk: Open-source Ruby gem for enforcing modular boundaries within the monolith’s codebase
  • Semian: Open-source Ruby gem implementing circuit breakers and bulkheading to prevent cascading failures
Key takeaways:

  • Shopify hit vertical scaling limits in 2015 and implemented database sharding, but shared Redis/Memcached remained single points of failure
  • The “Redismageddon” incident (~2016) — a single Redis failure taking down all shops — proved that sharding partitions data but not failure domains
  • Pods isolate the complete stateful stack (MySQL, Redis, Memcached, cron) per subset of shops, containing blast radius to a single pod
  • The Sorting Hat (Lua on OpenResty) routes requests to the correct pod by injecting a pod header based on domain lookup
  • Ghostferry enables zero-downtime shop migration between pods (2.5-second average downtime), with its core algorithm formally verified in TLA+
  • Pod Mover fails over individual pods between datacenters in ~1 minute, triggered via Slack command
  • The architecture scaled from 600K merchants and 80K RPS (2018) to 5.5M merchants and 284M edge RPM (2024) while maintaining 99.99%+ uptime during BFCM