AWS Kinesis 2020 Outage: Thread Limits, Thundering Herds, and Hidden Dependencies
How a routine capacity addition to Amazon Kinesis Data Streams in US-EAST-1 exceeded an OS thread limit on every front-end server, triggering a 17-hour cascading failure that took down CloudWatch, Lambda, Cognito, and dozens of other AWS services on November 25, 2020—the day before Thanksgiving. This incident is a case study in O(N²) scaling patterns, untested failsafes, and the consequences of monitoring systems that depend on the services they monitor.
Abstract
The core mental model for this incident is a three-stage failure amplification:
| Stage | What Happened | Why It Was Hard to Fix |
|---|---|---|
| Trigger | Adding servers to the front-end fleet pushed per-server OS thread count past the configured limit | Thread-per-peer gossip protocol scaled O(N) per server, O(N²) aggregate—the limit was latent until capacity growth crossed it |
| Cascade | Kinesis front-end servers could not build shard-map caches, killing request routing for the entire fleet | The front-end was monolithic (not cellularized)—every server depended on every other server, so the failure was fleet-wide |
| Recovery | On restarted servers, shard-map construction and request processing competed for CPU and memory | Restarting too fast triggered a thundering herd: servers were deemed unhealthy and removed, resetting progress. Only a few hundred servers per hour could safely rejoin |
The secondary failures are equally instructive: CloudWatch depended on Kinesis for metric ingestion, so monitoring went blind. Cognito’s “graceful degradation” buffer had a latent bug that blocked forever when full. The Service Health Dashboard could not post updates because its publishing tool authenticated through Cognito. Each layer of supposed fault isolation turned out to share a common dependency.
Context
The System
Amazon Kinesis Data Streams is AWS’s managed service for real-time data streaming. The architecture relevant to this incident has two layers:
- Front-end fleet: Many thousands of servers responsible for authentication, throttling, and routing requests to the correct back-end cluster. Each front-end server maintains a shard-map cache—a routing table containing fleet membership, shard ownership, and back-end health data.
- Back-end clusters: A cell-based architecture where individual cells own subsets of shards. The back-end was already cellularized at the time of the incident, providing isolation between cells.
The shard-map cache on each front-end server is populated from three sources:
- A microservice that vends fleet membership information
- DynamoDB for authoritative configuration
- Inter-server messaging (a gossip-like protocol) from other front-end servers
The critical architectural detail: each front-end server creates one OS thread for each other server in the fleet. This is used for the inter-server communication that propagates shard-map updates, cluster health information, and administrative changes. In a fleet of N servers, each server requires N-1 threads for gossip alone, plus threads for request processing and cache construction.
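The thread arithmetic described above can be sketched in a few lines. The fleet sizes, the number of non-gossip threads, and the 10,000-thread limit are hypothetical round numbers chosen only to show the shape of the problem; AWS did not publish the real figures.

```python
def threads_per_server(fleet_size: int, other_threads: int) -> int:
    """Threads one server needs: one per peer plus everything else."""
    return (fleet_size - 1) + other_threads


def aggregate_gossip_threads(fleet_size: int) -> int:
    """Gossip threads summed over the whole fleet: N * (N - 1)."""
    return fleet_size * (fleet_size - 1)


# Hypothetical values, not taken from the AWS post-event summary.
THREAD_LIMIT = 10_000   # assumed per-server OS thread limit
OTHER_THREADS = 500     # assumed request-processing and cache threads

for fleet_size in (9_000, 9_400, 9_600):
    needed = threads_per_server(fleet_size, OTHER_THREADS)
    status = "over limit" if needed > THREAD_LIMIT else "ok"
    print(f"N={fleet_size}: {needed} threads per server ({status})")
```

With these assumed numbers, growing the fleet by about 2% (9,400 to 9,600 servers) is enough to push every server over the limit at once, because every server sees the same peer count.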
The Trigger
November 25, 2020, 2:44 AM PST: A routine, relatively small capacity addition to the Kinesis front-end fleet begins—likely in anticipation of Thanksgiving holiday traffic. The addition completes by 3:47 AM PST after roughly one hour of propagation across the fleet.
Constraints
- Fleet size: “Many thousands of servers” in the US-EAST-1 front-end fleet
- Thread scaling: Thread count per server scales linearly with fleet size (O(N) per server)
- Propagation time: New server discovery takes up to one hour to fully propagate
- Restart rate: Only “a few hundred servers per hour” can safely rejoin without triggering cascading failures
- Dependent services: CloudWatch, Lambda, Cognito, EventBridge, and dozens of others use Kinesis internally
The Problem
Symptoms
Between 3:47 AM and 5:15 AM PST, the newly added servers propagated through the fleet. As each existing server discovered new peers, it attempted to create additional OS threads. This pushed the total thread count on every front-end server past the maximum allowed by the operating system configuration.
When thread creation failed:
- Shard-map cache construction could not complete
- Front-end servers were left with incomplete (“useless”) shard-maps
- Servers could no longer route requests to the correct back-end clusters
- Kinesis put/get API calls began failing fleet-wide
At 5:15 AM PST, the first alarms fired for Kinesis read/write errors.
Incident Timeline
| Time (PST) | Event | Action |
|---|---|---|
| 2:44 AM | Capacity addition to front-end fleet begins | Automated routine operation |
| 3:47 AM | Capacity addition completes | — |
| ~3:47–5:15 AM | Thread limits exceeded; shard-map construction fails silently | Latent failure period—no alarms yet |
| 5:15 AM | First Kinesis error alarms fire; EventBridge begins experiencing API errors | Team paged |
| 6:15 AM | Lambda invocation errors begin (memory contention from CloudWatch metric buffering) | Investigation ongoing |
| 7:51 AM | Root cause narrowed to candidate set; full fleet restart determined necessary | Decision to restart |
| 9:39 AM | Root cause confirmed: OS thread limit exceeded on all front-end servers | ~4.5 hours after first alarm |
| 10:07 AM | First front-end servers successfully return to handling traffic | ~2 hours after restart decision |
| 10:15 AM | Cognito mitigation deployment begins | — |
| 10:36 AM | Lambda error rates resolved | — |
| 12:15 PM | Cognito shows significant improvement | — |
| 2:18 PM | Cognito returns to normal | ~9 hours after first alarm |
| 4:15 PM | Majority of ECS/EKS issues resolved | ~11 hours after first alarm |
| 5:47 PM | CloudWatch errors begin declining | — |
| 10:23 PM | Kinesis fully recovered | ~17 hours after first alarm |
| 10:31 PM | CloudWatch fully recovered; metric data backfill begins | — |
Root Cause Analysis
Investigation process:
- 5:15–7:51 AM: The team investigated multiple hypotheses. The shard-map construction failure was evident, but the underlying cause—thread exhaustion—was not immediately obvious because the symptoms (bad routing data) pointed to the cache layer, not the OS.
- 7:51 AM: The team narrowed the root cause to a set of candidates. Regardless of which candidate was correct, any likely cause would require a full restart of the front-end fleet. They made the decision to begin restarting.
- 9:39 AM: Root cause confirmed—the OS thread limit had been exceeded on every front-end server.
The actual root cause: The thread-per-peer gossip protocol created O(N) thread usage per server. When N increased past a threshold where N-1 gossip threads plus request-handling threads exceeded the OS-configured maximum, all servers failed simultaneously. The failure was silent during the ~90-minute propagation window because shard-map construction fails gracefully (it produces incomplete data rather than crashing), and the alarms that would have detected the degradation had a delay before firing.
The Specific OS Thread Limit
AWS did not disclose the exact numeric limit. Analysis from multiple sources points to Linux-level thread constraints:
- /proc/sys/kernel/pid_max: defaults to 32,768 on older kernels (each thread consumes a PID)
- ulimit -u (RLIMIT_NPROC): maximum user processes/threads
- cgroup limits: if the fleet ran in containers
The Downtime Project’s analysis suggests the default Linux thread limit of ~32K was the likely threshold. The precise number matters less than the structural issue: any thread-per-peer design has a hard ceiling that grows closer as the fleet grows.
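To see which of these limits apply on a given Linux host, a minimal sketch such as the following can read them. It assumes a Linux system with /proc mounted. The threads-max file is an additional system-wide kernel limit not discussed above, and cgroup limits are left out because their paths depend on the cgroup version and layout.

```python
import resource
from pathlib import Path
from typing import Optional


def read_int(path: str) -> Optional[int]:
    """Return the integer stored in a /proc file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def thread_related_limits() -> dict:
    """Collect limits that can cap thread creation for this process."""
    soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "pid_max": read_int("/proc/sys/kernel/pid_max"),
        "threads_max": read_int("/proc/sys/kernel/threads-max"),
        # A value equal to resource.RLIM_INFINITY means "unlimited".
        "rlimit_nproc_soft": soft_nproc,
        "rlimit_nproc_hard": hard_nproc,
    }


if __name__ == "__main__":
    for name, value in thread_related_limits().items():
        print(f"{name}: {value}")
```

The effective ceiling is whichever of these is reached first, so comparing them side by side is more useful than checking any single value.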
Why It Wasn’t Obvious
- The capacity addition was small: This was not a dramatic change. The fleet was already near the thread limit; the addition simply crossed it.
- The failure was latent: Thread limits were exceeded during the ~90-minute gossip propagation window, not at the instant of the capacity change. The gap between cause and visible symptoms was ~2.5 hours.
- Shard-map failures were silent: Rather than crashing, servers produced incomplete routing data. This looked like a cache or metadata issue, not a thread exhaustion issue.
- The monitoring system was also affected: CloudWatch—the primary observability tool—depended on Kinesis, creating a partial blindness during investigation.
Detection Gap
Why it took ~2.5 hours to detect (2:44 AM trigger → 5:15 AM first alarm):
The capacity addition itself was a normal operation. The thread limit was exceeded gradually as gossip propagated. Shard-map construction degraded silently rather than failing loudly. The detection gap existed because:
- There was no alarm on per-server thread count approaching the OS limit
- Shard-map health monitoring did not alert on incomplete (but non-empty) maps
- The metric pipeline that would surface these signals was itself degrading
Detection improvements made post-incident:
- Fine-grained thread consumption alarming—proactive alerts before limits are reached
- Testing of increased OS thread limit configurations
- Dedicated monitoring that does not depend on Kinesis
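A thread-consumption alarm of the kind listed above can be as simple as comparing usage against a known limit with warning and paging thresholds. This is a sketch, not AWS's implementation: the function name, the 70% and 85% thresholds, and the 10,000-thread limit are all assumptions for illustration.

```python
import threading


def thread_alarm_level(current: int, limit: int,
                       warn_at: float = 0.70, page_at: float = 0.85) -> str:
    """Classify thread usage relative to a known per-server limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    usage = current / limit
    if usage >= page_at:
        return "PAGE"
    if usage >= warn_at:
        return "WARN"
    return "OK"


# Example: check this Python process against an assumed limit.
# threading.active_count() only counts threads started through Python's
# threading module, so a real agent would read the OS-level count instead.
ASSUMED_LIMIT = 10_000
print(thread_alarm_level(threading.active_count(), ASSUMED_LIMIT))
```

The point of alarming on a ratio is that it fires on the cause (shrinking headroom) well before any symptom such as routing errors appears.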
Recovery
Why Initial Recovery Was Difficult
The recovery was constrained by a fundamental resource contention problem. On each front-end server, the resources used to construct the shard-map cache compete with the resources used to process incoming requests. When a server restarts:
- It must build a fresh shard-map by querying DynamoDB and communicating with peers
- While building the map, it must also handle incoming traffic
- If too many servers restart simultaneously, shard-map construction consumes most CPU and memory
- Servers that cannot process requests quickly enough are marked unhealthy by health checks
- Unhealthy servers are removed from the fleet
- This resets recovery progress—a classic thundering herd loop
Recovery Strategy
The team implemented a controlled, multi-step recovery:
- Rolled back the capacity addition — removed the newly added servers to restore the fleet to its previous (known-good) size, ensuring the thread limit would not be exceeded on restarted servers
- Changed the metadata fetching strategy — configured servers to fetch shard-map data directly from DynamoDB (the authoritative store) rather than relying on the gossip protocol, reducing thread requirements during cold start
- Controlled rolling restart — brought servers back at the rate of a few hundred per hour, carefully monitoring each batch to ensure shard-map construction completed before adding more
- 10:07 AM: First servers successfully returned to traffic (~2 hours after the restart decision at 7:51 AM)
- Progressive improvement through the day as more servers came online
- 10:23 PM: Full Kinesis recovery, roughly 12 hours after the first servers returned to traffic and about 14.5 hours after the restart decision
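The controlled restart described above amounts to a loop that restarts one batch, confirms it is healthy, and only then moves on. The sketch below shows that control flow under simplifying assumptions: `restart` and `is_healthy` are hypothetical callbacks supplied by the caller, and health is checked once rather than polled with a timeout.

```python
from typing import Callable, Iterable, List


def rolling_restart(servers: Iterable[str],
                    batch_size: int,
                    restart: Callable[[str], None],
                    is_healthy: Callable[[str], bool]) -> List[str]:
    """Restart servers batch by batch; stop after the first unhealthy batch.

    Returns the servers confirmed healthy, in restart order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pending = list(servers)
    healthy: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for server in batch:
            restart(server)
        failed = [s for s in batch if not is_healthy(s)]
        healthy.extend(s for s in batch if s not in failed)
        if failed:
            # Halt rather than pile more cold servers onto a struggling fleet.
            break
    return healthy
```

Stopping at the first unhealthy batch is the design choice that avoids the thundering herd: the batch size caps how much cold-start work competes for resources at any moment.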
The Instance Type Decision
A critical corrective action, announced as a near-term step after the incident: AWS moved the front-end fleet to servers with more CPU and memory. The reasoning:
- Fewer total servers needed to handle the same load → fewer servers in the fleet
- Fewer servers → fewer threads per server (since thread count scales with fleet size)
- More CPU/memory per server → shard-map construction and request processing can coexist without resource contention
- Greater headroom before hitting OS thread limits
This addresses the immediate problem but does not fix the structural O(N²) scaling. That requires cellularizing the front-end fleet, as the back-end already was.
Dependent Services Impact
The Kinesis outage revealed a hidden dependency graph across AWS. Services that used Kinesis internally—often for analytics, logging, or event routing—experienced cascading failures with distinct failure modes.
CloudWatch: The Monitoring System Goes Blind
How it depended on Kinesis: CloudWatch uses Kinesis as a buffer for PutMetricData and PutLogEvents (CloudWatch Logs). When Kinesis failed, metric data could not be ingested.
Impact:
- PutMetricData and PutLogEvents API calls returned errors
- CloudWatch Alarms transitioned to INSUFFICIENT_DATA state because no metric data was flowing
- Reactive AutoScaling policies that depended on CloudWatch metrics stopped triggering
- AWS internal teams lost visibility into their own services
The meta-irony: The monitoring system’s dependency on Kinesis meant that during the outage, operators had degraded ability to observe what was happening across their infrastructure. This blind spot extended to customers: teams could see their services failing but could not correlate the failures through their CloudWatch dashboards.
Recovery: 5:47 PM PST early signs of recovery; 10:31 PM full recovery. Metric data was backfilled subsequently.
Post-incident fix: AWS deployed a 3-hour local metrics data store in US-EAST-1 (with global rollout planned) so CloudWatch can persist recent metric data locally even when Kinesis is unavailable.
Cognito: The Latent Buffer Bug
How it depended on Kinesis: Amazon Cognito used Kinesis to ship API usage analytics data.
The failure: When Kinesis became unavailable, Cognito’s analytics buffer was designed to absorb the backlog gracefully. Instead, a latent bug in the buffering code caused it to block forever when the buffer filled. Cognito webservers hung on backlogged buffers, causing elevated API failures and increased latencies for User Pools, Identity Pools, and external authentication flows (social sign-in, SAML, OIDC).
This is a textbook example of an untested failsafe. The buffer was designed for Kinesis unavailability, but the scenario (prolonged, complete Kinesis failure) had never been tested. The buffer’s blocking behavior transformed a non-critical analytics dependency into a service-killing failure.
Recovery: Mitigation deployment began at 10:15 AM; significant improvement by 12:15 PM; normal operations restored at 2:18 PM PST.
Post-incident fix: Cognito webservers modified to sustain Kinesis API errors without exhausting buffers; internal buffer capacity increased.
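The failure mode here is a buffer whose "full" behavior is to block the caller. A minimal sketch of the alternative, a bounded buffer that drops and counts instead of blocking, is shown below. It is not Cognito's code: the class name and methods are hypothetical, and it uses Python's standard `queue.Queue`.

```python
import queue


class DroppingBuffer:
    """Bounded buffer that drops new records instead of blocking when full."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, so reject it.
            raise ValueError("capacity must be at least 1")
        self._queue: "queue.Queue[bytes]" = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def offer(self, record: bytes) -> bool:
        """Try to enqueue; never blocks the calling request thread."""
        try:
            self._queue.put_nowait(record)
            return True
        except queue.Full:
            self.dropped += 1  # worth exporting as a metric
            return False

    def drain(self, max_records: int) -> list:
        """Remove up to max_records for a background sender to ship."""
        batch = []
        while len(batch) < max_records:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch
```

With this shape, a prolonged downstream outage costs analytics data (visible in the `dropped` counter) rather than request-serving threads.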
Lambda: Memory Contention from Metric Buffering
How it depended on Kinesis: Indirectly, through CloudWatch. Lambda host systems buffer CloudWatch metric data locally. When CloudWatch ingestion backed up (because Kinesis was down), the buffered data caused memory contention on Lambda host systems, crowding out actual function execution resources.
Impact: Elevated Lambda invocation errors starting at 6:15 AM PST.
Recovery: Mitigation actions resolved the issue by 10:36 AM PST.
EventBridge: Event Processing Delays
How it depended on Kinesis: Amazon EventBridge (CloudWatch Events) relies on Kinesis for event processing pipelines.
Impact: Increased API errors and delays in event processing starting at 5:15 AM PST. This cascaded to ECS and EKS, which rely on EventBridge for orchestration events (cluster provisioning, scaling, task de-provisioning).
Recovery: ECS/EKS majority resolved by 4:15 PM PST (~11 hours after errors started).
The Service Health Dashboard: Can’t Report the Outage
How it depended on Kinesis: Indirectly, through Cognito. The tool AWS operators use to post updates to the public Service Health Dashboard (status.aws.amazon.com) authenticated through Cognito. With Cognito down, operators could not post status updates.
A backup manual tool with fewer dependencies existed, but operators were unfamiliar with its procedures. The result: for an extended period, the official status page showed services as healthy while customers experienced widespread failures.
Post-incident fix: Support operators now receive regular training on the backup Service Health Dashboard tool, which already had minimal service dependencies.
Other Affected Services
The following services reported degradation based on status page updates and third-party reporting:
| Service | Dependency Path |
|---|---|
| Amazon Data Firehose | Direct Kinesis dependency |
| AWS IoT Services | Reported to use Kinesis internally |
| Amazon S3 (event notifications) | EventBridge dependency |
| AWS CloudFormation | Multiple AWS API dependencies |
| Amazon Redshift | CloudWatch/logging dependencies |
| Amazon SageMaker | Multiple AWS service dependencies |
| AWS Glue | CloudWatch/logging dependencies |
| Amazon WorkSpaces | Cognito/authentication dependencies |
Lessons Learned
Technical Lessons
1. O(N²) Scaling Patterns Are Time Bombs
The insight: Any system where every node must connect to every other node creates O(N²) aggregate resource consumption. In Kinesis’s case, N servers each creating N-1 threads meant the total gossip thread count across the fleet was N×(N-1). The aggregate grows quadratically, but the limit that matters is per server, where usage grows linearly with N. As the fleet grows, the headroom below the per-server limit shrinks, and a “small” capacity addition can cross the threshold.
How it applies elsewhere:
- Service meshes with full-mesh connectivity between sidecar proxies
- Gossip protocols where every node communicates with every other node
- Coordination services (ZooKeeper, etcd) where leader election or consensus involves all participants
- Database connection pools where every application server holds a connection to every database node
Warning signs to watch for:
- Per-node resource consumption (threads, connections, file descriptors) that grows with cluster size
- Capacity additions that cause disproportionate resource growth on existing nodes
- Resource usage approaching OS or hardware limits that was “fine” at smaller scale
2. Untested Failsafes Are Worse Than No Failsafes
The insight: Cognito’s analytics buffer was explicitly designed for Kinesis unavailability. But because it was never tested against a prolonged, complete Kinesis failure, it contained a latent bug (blocking when full) that transformed a non-critical analytics dependency into a service-killing failure. A service without any buffer would have simply dropped analytics data and continued serving requests.
How it applies elsewhere:
- Circuit breakers that have never tripped in production
- Fallback code paths that have never been exercised under realistic conditions
- Graceful degradation logic that was code-reviewed but never chaos-tested
- Retry policies with backoff that have never been tested under sustained failure
Warning signs to watch for:
- Fallback paths that exist in code but have no production metrics (hit rate, latency, error rate)
- Dependencies classified as “non-critical” that have never actually been unavailable
- Buffer implementations that lack bounded capacity or non-blocking overflow behavior
3. Monitoring Must Not Depend on the Things It Monitors
The insight: CloudWatch’s dependency on Kinesis created a blindness loop—the system designed to detect failures was itself impaired by the failure. This delayed both detection and recovery for AWS internal teams and customers.
How it applies elsewhere:
- Application monitoring that routes through the same load balancers, networks, or services as production traffic
- Alerting systems that depend on the message queues they are supposed to monitor
- Health check infrastructure that shares fate with the services it checks
- Logging pipelines that fail silently when the services they log to are down
Warning signs to watch for:
- Your monitoring/alerting pipeline shares any infrastructure component with production services
- During past incidents, you noticed gaps in your observability data
- Your status page or incident communication tools authenticate through production services
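One way to check for shared fate is to walk the dependency graph from each monitoring or communication tool and see what it can reach. The sketch below does this with a plain dictionary. The edges are a simplified reading of the dependencies described in this article, and the service names are illustrative labels, not real identifiers.

```python
from typing import Dict, List, Set


def transitive_deps(graph: Dict[str, List[str]], root: str) -> Set[str]:
    """All services reachable from root by following dependency edges."""
    seen: Set[str] = set()
    stack = list(graph.get(root, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


# Simplified edges based on the dependencies described in this article.
DEPENDS_ON = {
    "status-dashboard-tool": ["cognito"],
    "cognito": ["kinesis"],
    "cloudwatch": ["kinesis"],
    "lambda": ["cloudwatch"],
    "eventbridge": ["kinesis"],
}

for tool in ("cloudwatch", "status-dashboard-tool"):
    if "kinesis" in transitive_deps(DEPENDS_ON, tool):
        print(f"{tool} shares fate with kinesis")
```

The indirect path is the one that tends to be missed: the dashboard tool never calls Kinesis directly, yet the traversal still finds it two hops away.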
4. Cell-Based Architecture Contains Blast Radius
The insight: Kinesis’s back-end was already cellularized and survived the incident. The front-end was monolithic—every server connected to every other server—so the failure was fleet-wide. Cellularization would have bounded the failure to a single cell, affecting only a fraction of traffic.
How it applies elsewhere:
- Any stateless fleet where every node must know about every other node
- Services where a configuration change or deployment rolls across the entire fleet simultaneously
- Systems where a single bad deployment or configuration can affect 100% of capacity
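To make the blast-radius argument concrete, the sketch below assigns routing keys to cells with a stable hash and computes how many peers a server tracks when gossip stays inside its cell. This is an illustration only: it is not how Kinesis routes requests, and the function names, the choice of SHA-256, and the fleet numbers in the usage note are assumptions.

```python
import hashlib


def cell_for(key: str, num_cells: int) -> int:
    """Map a routing key (for example a stream name) to a stable cell index."""
    if num_cells < 1:
        raise ValueError("num_cells must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def peers_per_server(fleet_size: int, num_cells: int) -> int:
    """Peers each server tracks when gossip stays inside its own cell."""
    if num_cells < 1:
        raise ValueError("num_cells must be at least 1")
    cell_size = -(-fleet_size // num_cells)  # ceiling division
    return cell_size - 1
```

For a hypothetical fleet of 8,000 servers, `peers_per_server(8_000, 1)` is 7,999 while `peers_per_server(8_000, 20)` is 399, so per-server peer state stops growing with total fleet size. Simple modulo hashing reshuffles keys when the cell count changes; a production design would likely use a mapping table or consistent hashing instead.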
5. Control Plane Operations Must Not Take Down the Data Plane
The insight: A capacity change (control plane operation) brought down request processing (data plane). The control plane and data plane shared the same OS-level resources (threads). Strict separation—where control plane operations cannot exhaust data plane resources—would have prevented the outage.
How it applies elsewhere:
- Any service where administrative operations (scaling, configuration, deployment) run on the same instances as request processing
- Database systems where schema changes or replication can starve query processing
- Message queues where topic creation or partition rebalancing affects message delivery
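One common way to keep peer communication from exhausting threads needed for request handling is to cap it with a fixed-size worker pool instead of one thread per peer. The sketch below uses Python's standard `ThreadPoolExecutor`. It is a swapped-in technique for illustration, not what AWS did, and `send` is a hypothetical callback that delivers an update to one peer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def broadcast(peers: Iterable[str],
              send: Callable[[str], bool],
              max_workers: int = 16) -> Dict[str, bool]:
    """Send an update to every peer using at most max_workers threads."""
    peer_list = list(peers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with peer_list.
        results = list(pool.map(send, peer_list))
    return dict(zip(peer_list, results))
```

Thread usage is now bounded by `max_workers` no matter how many peers exist, at the cost of slower propagation when the peer list is long. A long-running service would keep one pool alive rather than creating one per call.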
Process Lessons
1. Know Your Recovery Rate Before You Need It
What happened: The Kinesis team knew that a full fleet restart was possible but slow. The actual rate (a few hundred servers per hour in a fleet of many thousands) meant full recovery came about 14.5 hours after the restart decision.
What this means: For any large stateful fleet, calculate the worst-case recovery time. If it is unacceptable (as 12+ hours would be for most teams), invest in faster cold-start procedures, dedicated cache warming infrastructure, or smaller blast radius through cellularization before the incident occurs.
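The worst-case calculation is simple enough to keep in a runbook. The sketch below assumes restarts proceed one safe batch at a time; the fleet size, batch size, and batch duration in the example are hypothetical, since AWS only described the fleet as "many thousands" of servers restarted at "a few hundred" per hour.

```python
import math


def worst_case_recovery_hours(fleet_size: int,
                              safe_batch_size: int,
                              hours_per_batch: float) -> float:
    """Hours to restart the whole fleet one safe batch at a time."""
    if safe_batch_size < 1:
        raise ValueError("safe_batch_size must be at least 1")
    batches = math.ceil(fleet_size / safe_batch_size)
    return batches * hours_per_batch


# Hypothetical fleet: 5,000 servers, 400 per batch, one batch per hour.
print(worst_case_recovery_hours(5_000, 400, 1.0))
```

Comparing this figure against the Recovery Time Objective before an incident shows whether the safe batch size, the cold-start time, or the fleet layout needs to change.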
2. Status Communication Infrastructure Must Be Independent
What happened: The Service Health Dashboard could not be updated because its publishing tool depended on Cognito, which depended on Kinesis. The backup tool existed but was unfamiliar to operators.
What this means: Test your incident communication tools during gamedays. Ensure they have zero dependency on the services they report on. Drill operators on backup procedures regularly.
Applying This to Your System
When This Pattern Applies
You might face similar challenges if:
- Your service uses a full-mesh communication pattern (every node connects to every other node)
- Your per-node resource consumption (threads, connections, file descriptors) scales with cluster size
- Your monitoring depends on the same infrastructure as your production services
- You have fallback or degradation code paths that have never been tested under prolonged failure
- Your control plane and data plane share resources on the same instances
Checklist for Evaluation
- Map your service’s per-node resource usage and how it scales with cluster size
- Identify per-node resource limits (OS thread limits, file descriptor limits, connection pool maximums) and current headroom
- Trace your monitoring pipeline’s dependencies—does it share fate with production?
- Identify all “non-critical” dependencies and test what happens when each is unavailable for 1+ hours
- Calculate worst-case fleet restart time and whether it meets your Recovery Time Objective (RTO)
- Verify your incident communication tools work when your core services are down
- Test your circuit breakers and fallback paths under sustained (not transient) failure
Starting Points
If you want to evaluate your risk:
- Audit per-node scaling: For each service, plot per-node resource consumption (threads, connections, memory) against cluster size. If any resource grows linearly with cluster size, you have an O(N²) aggregate pattern.
- Test your failsafes: Run a gameday where a “non-critical” dependency is unavailable for 2+ hours. Observe whether graceful degradation actually works or whether it blocks, leaks, or crashes.
- Map monitoring dependencies: Draw the dependency graph for your monitoring and alerting pipeline. Any shared node with production is a blind-spot risk.
- Measure cold-start time: Restart a single instance of your service and measure time-to-healthy. Multiply by fleet size divided by safe restart batch size. That is your worst-case recovery time.
Conclusion
The 2020 Kinesis outage is a study in how latent architectural decisions compound into systemic failures. The thread-per-peer gossip protocol was a reasonable design at smaller fleet sizes. The monolithic front-end fleet was simpler to operate than a cellularized one. Cognito’s analytics buffer was a thoughtful attempt at graceful degradation. CloudWatch’s use of Kinesis for metric ingestion was an efficient architectural choice.
Each decision was defensible in isolation. Together, they created a system where a small capacity addition could exceed a thread limit, collapse an entire fleet, blind the monitoring system, block an authentication service, and prevent operators from even posting status updates—all in a single correlated failure.
The transferable insight is not about thread limits or gossip protocols specifically. It is about recognizing when independently reasonable design choices create hidden coupling: O(N²) resource patterns that approach hard limits, monitoring that shares fate with production, failsafes that have never been tested under realistic failure, and recovery procedures that take longer than anyone has planned for. These patterns exist in most large systems. The question is whether you find them before a capacity change does.
Appendix
Prerequisites
- Understanding of distributed systems concepts (gossip protocols, shard-based partitioning, fleet management)
- Familiarity with OS-level resource limits (thread limits, process limits, file descriptors)
- Knowledge of cascading failure patterns (thundering herd, resource contention, correlated failures)
- Basic understanding of AWS service architecture (Kinesis, CloudWatch, Lambda, Cognito)
Terminology
| Term | Definition |
|---|---|
| Shard-map cache | A per-server routing table in Kinesis front-end servers containing fleet membership, shard ownership, and back-end health data |
| Front-end fleet | The layer of Kinesis servers that handle authentication, throttling, and request routing to back-end clusters |
| Gossip protocol | A peer-to-peer communication pattern where each node exchanges state with other nodes to propagate updates across the fleet |
| Cell-based architecture | A pattern where infrastructure is divided into independent cells (sub-fleets), each serving a subset of traffic, to contain blast radius |
| Thundering herd | A failure pattern where many processes simultaneously attempt the same resource-intensive operation, causing contention that prevents any of them from succeeding |
| Control plane | The management layer responsible for configuration, scaling, and orchestration (as opposed to the data plane that handles actual requests) |
| Data plane | The layer that processes actual customer requests (as opposed to the control plane that manages the system) |
| O(N²) scaling | A growth pattern where aggregate resource consumption grows quadratically with the number of nodes—common in full-mesh communication topologies |
| RLIMIT_NPROC | A Linux resource limit that caps the maximum number of processes (and threads) a user can create |
| pid_max | A Linux kernel parameter (/proc/sys/kernel/pid_max) that sets the maximum PID value, effectively capping the total number of concurrent processes/threads |
Summary
- A routine capacity addition to the Kinesis front-end fleet pushed per-server OS thread counts past the configured limit, because the gossip protocol created one thread per peer (O(N) per server, O(N²) aggregate)
- Every front-end server failed simultaneously—the fleet was monolithic (not cellularized), so there was no blast radius containment
- Recovery took ~17 hours because restarting servers too quickly triggered thundering herd effects: shard-map construction starved request processing, health checks failed, and servers were removed faster than they could rejoin
- CloudWatch depended on Kinesis for metric ingestion, creating a monitoring blind spot during the incident; Cognito’s untested buffer bug blocked authentication; the Service Health Dashboard could not post updates because it authenticated through Cognito
- AWS committed to cellularizing the front-end fleet, migrating to larger instances, deploying fine-grained thread alarming, and isolating CloudWatch onto a separate partitioned fleet
References
- Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region — AWS Official Post-Event Summary, November 2020. Primary source for all timeline, root cause, and corrective action details.
- AWS Post-Event Summaries — AWS index of all public post-event summaries.
- Guidance for Cell-Based Architecture on AWS — AWS architectural guidance directly influenced by lessons from this and similar incidents.
- Lessons from the AWS Kinesis Outage — Evan Jones. Analysis of O(N²) scaling, recovery constraints, and Cognito’s untested failsafe.
- AWS Kinesis Outage — LinuxCzar. Analysis of Linux kernel thread limits and symptom-based vs. cause-based alerting.
- Kinesis Hits the Thread Limit — The Downtime Project podcast. Detailed architectural reconstruction of the gossip protocol and thread model.
- AWS Kinesis Outage Analysis — Ryan Frantz. Analysis of dependency chains and the coupling of CloudWatch with EventBridge.
- AWS reveals it broke itself by exceeding OS thread limits — The Register. Reporting on the AWS postmortem with customer impact context.
- AWS admits to ‘severely impaired’ services in US-EAST-1 — The Register. Day-of reporting with affected service list and customer impact.