Facebook 2021 Outage: BGP Withdrawal, DNS Collapse, and the Backbone That Disappeared
How a routine backbone-maintenance command at Facebook (now Meta) — meant to assess capacity, but waved through by a buggy audit tool — severed every inter-datacenter link in AS32934, caused edge facilities to withdraw the BGP (Border Gateway Protocol) routes for the prefixes hosting their authoritative DNS servers, and made Facebook, Instagram, WhatsApp, and Messenger unresolvable for ~6 hours on October 4, 2021. With ~3.5 billion combined users1 unable to resolve any Meta domain, this incident is the canonical demonstration of what happens when DNS, remote management, physical security, and internal tooling all share fate with the network backbone — and that backbone vanishes.
Abstract
The mental model for this incident is a dependency inversion collapse—where the systems designed to recover from failures (remote management, internal tools, physical security) all depended on the very infrastructure that failed:
| Layer | What Failed | Why Recovery Was Hard |
|---|---|---|
| Backbone | A maintenance command severed all inter-datacenter links | The audit tool meant to prevent this had a bug that let the command execute |
| BGP | Edge facilities auto-withdrew BGP routes because they could not reach data centers | This is a safety mechanism—advertising routes to unreachable servers causes black-hole traffic. The safety feature became the propagation mechanism for the outage. |
| DNS | Authoritative nameservers were running but unreachable—no BGP route existed to reach them | DNS TTLs expired within minutes; recursive resolvers worldwide returned SERVFAIL |
| Management | Remote access (SSH, OOB consoles) routed through the same backbone | With the backbone down, every remote recovery path was severed simultaneously |
| Physical | Badge readers, door locks, and security systems used the network | Engineers dispatched to data centers faced physical access delays |
The key insight is not that a configuration error caused an outage—that happens regularly. The insight is that every recovery mechanism shared fate with the failure domain. The backbone was a single dependency that, when removed, eliminated not just the service but every path to restoring the service. This is the canonical example of why out-of-band management must be truly out-of-band—not just a separate VLAN on the same physical network.
Context
The System
Facebook (now Meta) operates one of the largest internet infrastructures in the world. The architecture relevant to this incident spans three layers:
- Backbone network: Tens of thousands of miles of fiber-optic cables connecting Facebook’s data centers globally. This is Facebook’s private WAN (Wide Area Network)—distinct from the public internet. It carries inter-datacenter replication, internal service traffic, and management plane communications.
- Edge facilities (PoPs): Smaller facilities at the edge of the network, co-located with major internet exchange points. These serve as the interface between Facebook’s backbone and the public internet. Edge facilities run Facebook’s authoritative DNS servers and handle BGP peering with upstream providers and peer networks.
- Data centers: Large-scale compute facilities housing the servers that run Facebook, Instagram, WhatsApp, Messenger, and internal services. These connect to edge facilities exclusively through the backbone.
Facebook’s edge facilities perform a critical health check: they continuously verify that they can communicate with the data centers through the backbone. If an edge facility loses backbone connectivity, it automatically withdraws its BGP route advertisements. This is a deliberate safety mechanism—advertising routes to servers you cannot reach would cause a traffic black hole, where packets arrive but can never be delivered.
The Trigger
October 4, 2021, ~15:39 UTC: A maintenance command is issued to backbone routers. The command is intended to assess the availability of global backbone capacity. Instead, it severs all backbone connections between Facebook’s data centers.
Key metrics at the time:
| Metric | Value |
|---|---|
| Monthly active users (Facebook) | 2.89 billion |
| Combined users (FB + Instagram + WhatsApp + Messenger) | ~3.5 billion |
| BGP prefixes normally advertised by AS32934 | ~349 (133 IPv4 + 216 IPv6, per Catchpoint snapshot at 08:00 UTC)2 |
| Backbone span | Tens of thousands of miles of fiber3 |
| Revenue per hour (press estimate) | ~$13 million4 |
Constraints
- Backbone is the single transport: All inter-datacenter communication—production traffic, management, monitoring—traverses the backbone
- DNS on the edge: Facebook’s authoritative DNS servers live in edge facilities that depend on backbone connectivity to declare themselves healthy
- Out-of-band management: Remote management tools (SSH, console servers) route through the same backbone infrastructure
- Physical security: Badge readers and door access controls at data centers use the corporate network
The Problem
Symptoms
At 15:39 UTC, backbone routers began processing the maintenance command. Within minutes:
- All backbone links between data centers went down
- Edge facilities detected loss of connectivity to data centers
- Edge routers executed their safety logic: withdrew BGP route advertisements for AS32934
- Facebook’s authoritative DNS servers—physically operational in edge facilities—became unreachable from the public internet
- DNS resolvers worldwide began returning SERVFAIL for all Meta-owned domains
From a user perspective: Facebook, Instagram, WhatsApp, and Messenger simultaneously became unreachable. No loading spinner, no error page—just connection timeouts and DNS resolution failures.
Incident Timeline
All times in UTC. External observatories (Cloudflare 1.1.1.1, RIPE NCC RIS, Kentik Synthetics, Catchpoint) and Meta’s own postmortem agree on the major beats; the precise minute boundaries are taken from RIPEstat’s BGPlay timeline of prefix 129.134.30.0/245.
| Time (UTC) | Event | Impact |
|---|---|---|
| ~15:39 | Maintenance command issued to backbone routers; audit tool fails to block it | — |
| 15:40–15:42 | First BGP UPDATE spike from AS32934; route withdrawals begin | Cloudflare and Catchpoint alarms trigger |
| ~15:44 | Short DNS TTLs begin expiring at recursive resolvers worldwide | Cached Facebook records evaporate within ~5 min6 |
| 15:53:47 | All AS paths to 129.134.30.0/24 (the prefix hosting a.ns.facebook.com) gone from BGPlay vantage points5 |
Facebook’s authoritative DNS is unreachable from the public internet |
| ~16:00 | HTTP 503 (proxy-status: no_server_available) visible to users that beat the cache; SERVFAIL from 1.1.1.1, 8.8.8.8, ISP resolvers78 |
All Meta properties down globally |
| ~16:00–17:00 | Workplace, internal wikis, dashboards, remote access all unavailable; engineers move to phones and external chat3 | Investigation forced onto out-of-band channels |
| ~17:00 onward | Engineers physically dispatched to data centers; secure-access protocols slow entry3 | On-site recovery begins |
| ~21:00 | First re-announcements appear in BGP; Cloudflare sees activity peak at 21:17 UTC7 | Initial recovery |
| ~21:05 | Catchpoint observes Meta services “gradually return to normality”8 | HTTP-level recovery begins |
| ~21:11 | Brief reannouncement instability visible in BGPlay5 | Minor hiccup as routes propagate |
| ~21:20 | facebook.com resolves again on Cloudflare’s 1.1.1.17 |
Public DNS recovers |
| ~21:28 | Cloudflare reports Facebook “appears to be reconnected to the global Internet”7; BGP stable5 | ~5h 50m total outage |
Root Cause Analysis
Investigation process:
- 15:39–16:00 UTC: External observers (Cloudflare, Kentik, RIPE NCC) detected the BGP withdrawal within minutes. Inside Facebook, engineers received alerts but quickly discovered that their investigation tools were also down—internal dashboards, remote router access, and communication platforms all depended on the backbone.
- 16:00–17:00 UTC: Engineers shifted to out-of-band communication (phone calls, non-Facebook messaging). They identified the backbone as the root failure but could not access backbone routers remotely—the management plane routed through the same backbone.
- 17:00+ UTC: Engineers were physically dispatched to data centers to access backbone routers via console connections.
The actual root cause: A maintenance command intended to assess backbone capacity contained an error that disconnected all backbone links. Facebook’s systems include an audit tool designed to catch commands that would cause this kind of damage. The audit tool had a bug that failed to flag this specific command as dangerous. The command executed, the backbone went down, and the automated BGP safety mechanism did the rest.
The BGP Cascade (Technical Detail)
Facebook operates AS32934. Under normal conditions, AS32934 originates ~349 IPv4 and IPv6 prefixes — Catchpoint counted 133 IPv4 and 216 IPv6 networks at 08:00 UTC2. Crucially, Meta’s authoritative DNS lives in some of those prefixes, originated specifically from edge facilities that depend on backbone reachability to declare themselves healthy.
Meta’s edge routers implement a health-check-based advertisement policy3:
- Edge facility continuously verifies backbone connectivity to data centers.
- If connectivity is confirmed: advertise BGP routes for the prefixes terminated at this edge (“you can reach this part of Facebook through me”).
- If connectivity is lost: withdraw those advertisements.
This is correct behavior. Advertising routes to an unreachable destination creates a black hole — packets arrive but cannot be delivered, consuming resources and causing silent failures. The withdrawal prevents this. The catastrophic part is that every edge facility lost backbone connectivity at the same instant, so every edge facility withdrew at the same instant, and the prefixes that mattered most — the DNS-hosting ones — disappeared from the global routing table together.
Important
Not every Facebook prefix was withdrawn. Catchpoint’s RIS data shows only 8 IPv4 and 14 IPv6 networks went away at the trigger time2. The handful that did go away happened to host the authoritative DNS, and DNS sits underneath every other interaction with Meta’s properties. In Catchpoint’s words: “it is not the quantity of networks that matters.”
Specific prefix example: a.ns.facebook.com resolves to 129.134.30.12, which lives in prefix 129.134.30.0/24 (covered by 129.134.30.0/23) originated by AS329349. When 129.134.30.0/24 was withdrawn, no router on the internet had a path to 129.134.30.12. The DNS server was running — it simply could not receive or respond to queries.
RIPE NCC’s BGPlay captured the withdrawal in real time: over ~10 minutes starting at 15:42 UTC, AS paths to 129.134.30.0/24 progressively disappeared, until at 15:53:47 UTC zero routes remained at the BGPlay vantage points5.
The DNS Cascade
With BGP routes withdrawn, the DNS failure was immediate and total:
- Authoritative DNS servers unreachable: Facebook’s nameservers (
a.ns.facebook.comthroughd.ns.facebook.com) were all hosted in Facebook’s own edge infrastructure. With BGP routes gone, no recursive resolver could reach them. - TTL expiration: DNS records have a TTL (Time to Live) specifying how long resolvers may cache them. Facebook’s DNS TTLs were relatively short. Within ~5 minutes of the BGP withdrawal, cached records began expiring at recursive resolvers worldwide.
- SERVFAIL storm: As TTLs expired, recursive resolvers (1.1.1.1, 8.8.8.8, ISP resolvers) attempted to refresh records by querying Facebook’s authoritative servers. Every query timed out. Resolvers returned SERVFAIL to clients.
- Retry amplification: Clients (apps, browsers) retried DNS queries aggressively. Cloudflare measured roughly 30× normal query volume at 1.1.1.1 during the incident7. That additional load spilled onto every shared resolver — public (1.1.1.1, 8.8.8.8) and ISP — and degraded resolution for unrelated domains on the smaller resolvers.
The HTTP-level symptoms: users’ browsers and apps received DNS resolution failures. Connections that beat the BGP withdrawal landed on Meta’s edge proxies, which then returned HTTP/2 503 with the header proxy-status: no_server_available8 — the edge proxy could not locate any upstream server because the backbone was severed.
Why It Wasn’t Obvious to Fix
- No remote access: The most natural response — SSH into backbone routers and revert the configuration — was impossible. Remote management traversed the backbone. With the backbone down, every remote path was severed3.
- Out-of-band management shared fate: Meta has stated explicitly that “our primary and out-of-band network access was down”3. The OOB plane that was supposed to be the escape hatch routed through enough of the same infrastructure that it failed with the primary network.
- Internal tools down: Workplace (internal Facebook for companies), wikis, ticketing, and monitoring dashboards all ran on Facebook’s own infrastructure. Engineers could not use their normal incident-response tools and fell back to phones and external messaging.
- Physical access friction: Data centers are “designed with high levels of physical and system security in mind … hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access”3. Secure-access protocols had to be activated manually before engineers could touch the routers.
Options Considered
For an outage of this nature, the recovery options are constrained:
Option 1: Remote Reconfiguration
Approach: SSH into backbone routers via the management network and revert the configuration change.
Why not possible: The management network depended on the backbone. With the backbone down, all remote access paths were severed. This was the first option attempted and the first eliminated.
Option 2: Out-of-Band Console Access
Approach: Use dedicated OOB management consoles (serial consoles, IPMI/iLO, dedicated management network) to access backbone routers remotely.
Why not possible: Facebook’s OOB infrastructure was not fully independent of the backbone network. The OOB paths still routed through infrastructure that depended on the backbone being operational. This is the critical design flaw the incident exposed.
Option 3: Physical Access and Manual Reconfiguration (Chosen)
Approach: Dispatch engineers to data center locations to physically access backbone router consoles and revert the configuration change.
Pros:
- Console access on a physical router does not depend on any network
- Definitive—once at the console, engineers can make changes
- No dependency on any software system
Cons:
- Time-consuming: engineers must physically travel to data center locations
- Physical security systems dependent on the network cause additional delays
- Hardware tamper protections slow manual intervention
- Requires personnel with physical proximity and appropriate access credentials
Why chosen: It was the only option that worked. Every remote path was severed. Physical presence was the only remaining out-of-band channel.
Recovery
Physical Access and Router Reconfiguration
Engineers were dispatched to data center locations to gain direct console access to backbone routers. The recovery process faced several friction points:
- Travel time: Engineers needed to physically reach data center locations, which are typically in remote areas for cost, power, and cooling reasons
- Security delays: Badge readers and automated door access systems depended on the corporate network. Manual security overrides and identity verification were required
- Hardware access: Data center hardware is designed with physical tamper resistance. Accessing router consoles required navigating security measures designed to prevent unauthorized changes
- Configuration complexity: The engineers needed to understand and revert the specific configuration change across backbone routers—under pressure, without access to internal documentation systems
Once engineers gained physical console access to backbone routers, they identified and reverted the configuration change that had severed backbone connectivity.
Staged Service Restoration
Restoring backbone connectivity did not immediately restore services. Facebook’s engineering team managed the restoration carefully to avoid a secondary failure from the traffic surge:
- Backbone connectivity restored: Router configurations reverted; inter-datacenter links came back online
- Edge health checks pass: Edge facilities detected restored backbone connectivity and began re-announcing BGP routes
- BGP propagation: Routes propagated through the global routing table. First routes re-appeared around 21:00 UTC; a brief instability at 21:11 UTC was resolved by 21:30 UTC
- DNS resolution restored: As BGP routes propagated, recursive resolvers could again reach Facebook’s authoritative DNS servers. Fresh DNS records replaced the expired caches
- Traffic surge management: With ~3.5 billion users’ devices retrying connections simultaneously, the returning traffic surge was massive. Meta has said data centers reported power-usage dips “in the range of tens of megawatts” during the outage and that “suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk”3. Engineers leaned on prior “storm” drills — exercises that simulate the loss of a service, data center, or whole region — to ramp services back carefully. Meta noted in the postmortem that they had never simulated a full backbone-loss storm before this incident; they have committed to running such drills going forward.
The staged approach was critical. A system that has been down for six hours accumulates state — queued messages, pending replication, deferred batch jobs — and turning everything on simultaneously risks a thundering herd that causes a secondary outage.
The Audit Tool Bug
The configuration audit system represents the most consequential failure in this incident. Facebook’s backbone configuration pipeline included an automated check designed to reject commands that would disconnect backbone links. This tool should have prevented the maintenance command from executing.
The tool had a bug: it did not correctly identify this specific command as dangerous. The command passed the audit check and executed. This is a single point of failure in the deployment pipeline—the one safeguard between a human error and a global outage failed silently.
Facebook stated post-incident that fixing the audit tool and strengthening configuration change safeguards was a top priority.
Cascading Failures
Third-Party Service Impact
The outage extended beyond Meta’s own properties:
- Facebook Login (OAuth): Millions of third-party websites and apps that use “Login with Facebook” for authentication were affected. Users could not sign in to services that depended on Facebook’s OAuth endpoints.
- Facebook SDK and Ads: Websites embedding Facebook Like buttons, Share buttons, comments widgets, and advertising tags experienced degraded performance. Catchpoint measured that the 95th-percentile document-complete time on the IR Top 100 sites “spikes and sustains at 20+ second higher” starting at 15:40 UTC8 — even for sites that were not Facebook properties — because embedded Facebook SDKs attempted to load resources from unreachable domains, blocking page rendering until the per-resource timeout fired.
- WhatsApp as communication infrastructure: In many countries, WhatsApp serves as primary communication infrastructure for businesses, healthcare, and government services. The outage disrupted these use cases globally.
Internal Communication Breakdown
Facebook’s internal tools all ran on Facebook’s infrastructure:
- Workplace (internal Facebook for companies): Down. Engineers could not post updates or coordinate through their primary communication platform.
- Internal wikis and documentation: Unreachable. Runbooks, incident response procedures, and contact information stored internally were inaccessible.
- Monitoring and alerting dashboards: Unable to function because they depended on the backbone.
Engineers coordinated through personal cell phones, non-Facebook messaging apps, and in some reported cases, showed up at offices physically to coordinate. This is a practical demonstration of why incident response communication tools must have zero dependency on the infrastructure they manage.
DNS Resolver Collateral Damage
The global DNS resolver infrastructure experienced significant collateral load:
- ~3.5 billion devices retried DNS queries for Meta domains every few seconds.
- Recursive resolvers (ISP resolvers and public resolvers like 1.1.1.1 and 8.8.8.8) saw massive query-volume spikes — Cloudflare measured ~30× normal traffic at 1.1.1.17.
- The retry storm consumed resolver capacity that would otherwise serve queries for unrelated domains.
- Smaller ISP resolvers experienced degraded performance for all domains — not just Meta’s — due to the query volume.
This is an inherent risk when an extremely high-traffic set of domains becomes simultaneously unresolvable: the retry behavior of billions of clients creates a distributed denial-of-service effect on the DNS resolver layer.
Lessons Learned
Technical Lessons
1. Out-of-Band Management Must Be Truly Out-of-Band
The insight: Facebook’s OOB management systems shared fate with the production backbone. When the backbone failed, the management plane failed with it. An OOB system that routes through any component of the system it manages is not actually out-of-band—it is an in-band system with a different name.
How it applies elsewhere:
- Any management network that shares physical infrastructure (same routers, same fiber, same switches) with the production network
- Cloud-based management consoles that depend on the same cloud region they manage
- VPN-based management access where the VPN terminates on infrastructure that can fail
- Monitoring systems that share network paths with the services they monitor
Warning signs to watch for:
- Your management access fails when you test it during a network outage drill
- Your OOB consoles route through the same L3 network as production traffic
- You have never tested management access with the primary network completely unavailable
What truly independent OOB looks like:
- Dedicated physical links (cellular, satellite, or dedicated ISP connection) that share zero infrastructure with the backbone
- Console servers with independent power and independent network connectivity
- Management credentials and procedures stored offline or on systems with independent hosting
2. Automated Safety Mechanisms Can Amplify Failures
The insight: The BGP route withdrawal was a safety mechanism working as designed—edge facilities should not advertise routes to unreachable destinations. But in this case, the safety mechanism transformed a backbone failure into a global DNS outage. The safety feature became the propagation path for the failure.
How it applies elsewhere:
- Health-check-based load balancer ejection that removes all backends during a transient failure, causing a complete outage instead of degraded service
- Circuit breakers that open simultaneously across all instances, causing total service unavailability instead of partial degradation
- Automated rollback systems that roll back a deployment across all regions simultaneously when one region has issues
Warning signs to watch for:
- Safety mechanisms that operate independently per-node but can trigger simultaneously across all nodes
- No distinction between “one node is unhealthy” and “all nodes are unhealthy” in your health check logic
- Automated responses that have no rate limiting or quorum requirements before taking global action
3. DNS Is a Hidden Single Point of Failure
The insight: Facebook ran its own authoritative DNS servers in its own infrastructure. This meant DNS availability was coupled to infrastructure availability. When the infrastructure failed, DNS failed, and no fallback existed—users could not even be redirected to a “we’re down” page because the domain names could not be resolved.
How it applies elsewhere:
- Self-hosted authoritative DNS with all nameservers in a single infrastructure
- DNS zones where all nameservers share a common failure domain (same network, same cloud provider, same physical location)
- Internal service discovery systems that depend on a single DNS infrastructure
Warning signs to watch for:
- All your authoritative nameservers resolve to IP addresses in the same /16 or /24 network block
- Your nameservers are hosted in the same data centers as the services they resolve
- You have not tested what happens to DNS when your primary network is unreachable
4. Configuration Change Audit Systems Are Critical Infrastructure
The insight: The audit tool that should have blocked the dangerous backbone command had a bug. A single bug in a safety system removed the last safeguard between human error and a global outage. Audit systems must be tested as rigorously as the systems they protect.
How it applies elsewhere:
- Pre-deployment validation checks that are trusted but not tested against the specific failure modes they claim to prevent
- Approval workflows where the automated check is treated as the definitive safety gate
- Configuration management systems where the validation logic is complex and its coverage is not verified
Warning signs to watch for:
- Your configuration audit system has never blocked a deployment in production
- The audit system is tested with synthetic inputs but not with actual dangerous commands
- There is no secondary validation or human review for changes that could affect global infrastructure
Process Lessons
1. Incident Communication Must Be Infrastructure-Independent
What happened: Engineers could not use Workplace, internal wikis, or monitoring tools because they all depended on the backbone. Coordination fell back to personal phones and non-Facebook messaging.
What this means: Every organization must maintain incident communication channels that have zero dependency on the organization’s own infrastructure. This includes:
- Contact lists stored outside the corporate network (printed, personal devices)
- Communication channels on third-party infrastructure (not self-hosted Slack, not corporate email if it is self-hosted)
- Incident response runbooks accessible offline
2. Physical Access Procedures Must Be Tested Under Network Failure
What happened: Physical security at data centers depended on the network. Engineers faced delays getting into the buildings they needed to access to fix the problem.
What this means: Physical security override procedures must be tested regularly with the assumption that the network is completely unavailable. Badge readers, door locks, and security cameras that fail-closed on network outage create a catch-22 where you cannot physically access the systems you need to restore the network.
Organizational Lessons
1. Scale Creates Unique Failure Modes
At Facebook’s scale, with 3.5 billion users across multiple products all sharing common infrastructure, the blast radius of an infrastructure failure is inherently global. The DNS retry storm from billions of devices is an emergent behavior that only occurs at this scale. The recovery traffic surge from 6 hours of accumulated state is similarly unique to services of this size.
Organizations operating at this scale must plan for failure modes that do not exist at smaller scales—including the second-order effects of their failures on shared internet infrastructure like DNS resolvers.
Applying This to Your System
When This Pattern Applies
You might face similar challenges if:
- Your management and monitoring infrastructure shares fate with your production infrastructure
- Your DNS authoritative servers are hosted in your own infrastructure without geographic or network diversity
- Your configuration change pipeline has automated validation that has never been tested with actually dangerous inputs
- Your data center physical access depends on network-connected security systems
- Your incident response communication tools run on the same infrastructure you operate
Checklist for Evaluation
- Trace the network path of your OOB management access—does it share any infrastructure component with production?
- Test your management access with the primary network completely down (not just degraded)
- Verify your DNS authoritative nameservers are hosted across genuinely independent network paths
- Confirm your incident communication tools work when your primary infrastructure is fully unavailable
- Test your configuration audit system with actually dangerous inputs (in a safe environment)
- Verify data center physical access procedures under complete network failure
- Identify all internal tools that depend on your production infrastructure and establish offline alternatives
- Conduct a “total backbone failure” tabletop exercise with your operations team
Starting Points
If you want to evaluate your risk:
- Map your management plane dependencies: Draw the network path from your laptop to your critical network equipment’s management console. Every hop that shares infrastructure with production is a shared-fate risk.
- Test DNS independence: Use
dig +traceto follow the resolution path for your domains. If all authoritative nameservers resolve to addresses in your network, you have a single point of failure. - Audit your audit tools: Review your configuration change validation. What specific failure modes does it check? Has it been tested with the actual commands it is meant to block? When was the last time it rejected a change?
- Run a communication drill: Attempt to coordinate an incident response using only channels that do not depend on your infrastructure. If your team cannot do this smoothly, invest in establishing and practicing alternative communication paths.
Conclusion
The Facebook October 2021 outage is the canonical example of shared-fate failure in internet infrastructure. A configuration error severed the backbone. The backbone failure triggered a safety mechanism (BGP withdrawal of the DNS-hosting prefixes) that was working as designed. The BGP withdrawal made DNS servers unreachable. DNS failure made all Meta properties unreachable for ~3.5 billion users. And every path to recovery — remote management, OOB consoles, internal documentation, physical building access — depended on the same backbone that had failed.
The root cause was not exotic: a maintenance command that should have been caught by an audit tool, which had a bug. What made the incident extraordinary was the depth of dependency coupling. The principle of defense in depth assumes that your recovery mechanisms are independent of your failure domains. This incident demonstrated what happens when that assumption is violated at every layer simultaneously.
The transferable lesson is not about BGP or DNS specifically. It is about dependency inversion in recovery paths: if the system you need to fix the problem depends on the problem being fixed, you have a recovery deadlock. Every organization should trace their recovery paths—management access, communication tools, physical access, documentation—and verify that none of them share fate with the infrastructure they are meant to recover.
Appendix
Prerequisites
- Understanding of BGP (Border Gateway Protocol) and how internet routing works (route announcements, AS paths, prefix advertisement and withdrawal)
- Familiarity with DNS resolution (authoritative vs. recursive resolvers, TTL, SERVFAIL behavior)
- Knowledge of network architecture concepts (backbone networks, edge/PoP facilities, peering)
- Understanding of out-of-band management concepts and why management plane separation matters
Terminology
| Term | Definition |
|---|---|
| AS (Autonomous System) | A collection of IP networks under a single administrative domain with a unified routing policy. Facebook operates AS32934. |
| BGP (Border Gateway Protocol) | The routing protocol used between autonomous systems to exchange reachability information. Defined in RFC 4271. |
| BGP route withdrawal | A BGP UPDATE message that removes a previously announced route, telling peers that a prefix is no longer reachable through this path. |
| Authoritative DNS server | The DNS server that holds the definitive records for a domain. Facebook’s authoritative servers (a–d.ns.facebook.com) lived in their edge facilities. |
| Recursive DNS resolver | A DNS server that resolves queries on behalf of clients by following the delegation chain from root servers to authoritative servers. Examples: 1.1.1.1 (Cloudflare), 8.8.8.8 (Google). |
| TTL (Time to Live) | A value in DNS records specifying how long a resolver may cache the record before it must re-query the authoritative server. |
| SERVFAIL | A DNS response code indicating the resolver could not complete the query—typically because the authoritative servers are unreachable or returned an error. |
| OOB (Out-of-Band) management | A management network that is physically or logically separate from the production network, intended to provide access when the production network fails. |
| Backbone network | A private WAN connecting an organization’s data centers, distinct from the public internet. Facebook’s backbone spans tens of thousands of miles of fiber. |
| Edge facility / PoP | A Point of Presence at the network edge, typically co-located at internet exchange points, serving as the interface between the backbone and the public internet. |
| Black hole (routing) | A condition where packets are forwarded to a destination that cannot deliver them, causing silent packet loss. BGP withdrawal prevents this. |
Summary
- A maintenance command intended to assess Facebook’s backbone capacity contained an error that severed all inter-datacenter backbone connections. An audit tool designed to block such commands had a bug that let it through.
- Edge facilities detected loss of backbone connectivity and withdrew the BGP advertisements for the prefixes whose servers they could no longer reach — a safety mechanism that prevents black-holing traffic. Only ~22 of AS32934’s ~349 prefixes were withdrawn, but those prefixes hosted the authoritative DNS that everything else depended on.
- With those DNS prefixes gone from the global routing table,
a.ns.facebook.com–d.ns.facebook.combecame unreachable. DNS TTLs expired within ~5 minutes, and recursive resolvers worldwide returned SERVFAIL for all Meta domains, affecting ~3.5 billion users and driving 1.1.1.1 to ~30× normal query volume. - Every recovery path shared fate with the failure: remote management, OOB consoles, Workplace, wikis, dashboards, and even physical data center badge readers all depended on the backbone. Engineers had to physically travel to data centers and reconfigure backbone routers from the console, then bring services back gradually using “storm” drill playbooks to avoid a thundering-herd second outage.
- The transferable lesson is dependency inversion in recovery paths: out-of-band management must be truly independent of production infrastructure, automated safety mechanisms can amplify failures when they trigger globally, and configuration audit tools are critical infrastructure that must be tested as rigorously as the systems they protect.
References
- More Details About the October 4 Outage — Santosh Janardhan, VP of Infrastructure, Meta Engineering. Primary source for root cause, backbone architecture, audit tool bug, and recovery details.
- Update About the October 4th Outage — Meta Newsroom. Official company statement on the outage.
- Understanding How Facebook Disappeared from the Internet — Cloudflare Blog. Analysis of BGP withdrawal patterns and DNS resolver impact observed from Cloudflare’s 1.1.1.1 resolver infrastructure.
- Facebook’s Historic Outage, Explained — Doug Madory, Kentik. Detailed BGP analysis including AS32934 prefix counts (300+ prefixes), specific withdrawn prefixes (129.134.30.0/23, 129.134.30.0/24), and traffic pattern analysis.
- Facebook Down and Out in BGPlay — RIPE Labs. Real-time BGP visualization showing route withdrawal from 15:42 UTC to complete withdrawal at 15:53:47 UTC, and recovery beginning at ~21:00 UTC.
- Incident Review for the Facebook Outage — Catchpoint. HTTP-level analysis including 503 errors,
proxy-status: no_server_availableheaders, DNS SERVFAIL patterns, and third-party website impact (20+ second load time increase on IR Top 100 sites). - RFC 4271 — A Border Gateway Protocol 4 (BGP-4) — IETF. The specification governing BGP behavior, including UPDATE messages for route withdrawal.
- RFC 1034 — Domain Names: Concepts and Facilities — IETF. DNS architecture specification relevant to authoritative server delegation and resolver behavior.
- RFC 1035 — Domain Names: Implementation and Specification — IETF. DNS protocol specification including TTL behavior and SERVFAIL response semantics.
Footnotes
-
“Facebook’s historic outage, explained” — Doug Madory, Kentik, 5 Oct 2021 cites the 3.5B combined-user figure and computes the corresponding “1.2 tera-lapse” of unavailability. ↩
-
“Incident Review for the Facebook Outage” — Catchpoint, 4 Oct 2021 — RIS rrc10 (Milan IXP) showed AS32934 originating 133 IPv4 + 216 IPv6 networks before the event, and 8 IPv4 + 14 IPv6 of those withdrawn at ~15:40 UTC. ↩ ↩2 ↩3
-
“More details about the October 4 outage” — Santosh Janardhan, Meta Engineering, 5 Oct 2021. ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7 ↩8
-
Press estimates derived from Meta’s 2021 revenue of ~
13.5 M/h. Meta has not published an official outage-cost figure. ↩ -
“Facebook Down and Out in BGPlay” — RIPE Labs, 5 Oct 2021. ↩ ↩2 ↩3 ↩4 ↩5
-
Catchpoint observed the DNS records expiring in cache “five minutes later” — see “Incident Review for the Facebook Outage” — Catchpoint, 4 Oct 2021. ↩
-
“Understanding how Facebook disappeared from the Internet” — Celso Martinho & Tom Strickx, Cloudflare, 4 Oct 2021. ↩ ↩2 ↩3 ↩4 ↩5 ↩6
-
Catchpoint reproduced the
HTTP/2 503response withproxy-status: no_server_available(e.g.date: Mon, 04 Oct 2021 16:48:36 GMT) and reports the Meta properties were “down until 21:05 UTC, when things began to gradually return to normality” — “Incident Review” (op. cit.). ↩ ↩2 ↩3 ↩4 -
Kentik traced
a.ns.facebook.com → 129.134.30.12 → 129.134.30.0/24 + 129.134.30.0/23in “Facebook’s historic outage, explained”. ↩