22 min read Last updated on Feb 8, 2026

Dropbox Magic Pocket: Building Exabyte-Scale Blob Storage

How Dropbox migrated 500+ petabytes off AWS S3 onto custom infrastructure in under two years, saving $74.6 million net while achieving higher durability than the service it replaced. Magic Pocket is a content-addressable, immutable block store built by a team of fewer than six engineers, now serving 700+ million users across 600,000+ storage drives.

Diagram: Magic Pocket's core architecture: clients interact with Edgestore for metadata, which routes to Magic Pocket frontends. Frontends consult the Block Index for placement and store blocks across three geographic zones with cross-zone replication.

Magic Pocket is a content-addressable, immutable block store where the key is a block’s SHA-256 hash and the value is a compressed, encrypted blob up to 4 MB. The mental model has three layers:

  • Metadata layer (Edgestore + Block Index): Maps file paths to block hashes, and block hashes to physical locations. Sharded MySQL under the hood.
  • Placement and routing layer (Frontends + Master): Determines which cell and volume stores each block. Handles deduplication at write time via hash lookup. The Master orchestrates repair and garbage collection but holds no authoritative state---the Replication Table does.
  • Physical storage layer (OSD nodes + Diskotech servers): 100 Shingled Magnetic Recording (SMR) drives per chassis, ~2 PB per machine, written append-only with no filesystem. Blocks start as 4x replicas (hot), then get erasure-coded to 1.5x overhead as they cool.

The design bets: immutable blocks eliminate consistency complexity, content-addressing enables deduplication, and SMR’s sequential-write constraint aligns with append-only semantics. The verification system (Pocket Watch) spends over 50% of disk I/O on continuous integrity checking---an unusual choice that reflects Dropbox’s stance that detecting corruption matters more than optimizing throughput.
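The three-layer model can be made concrete with a toy sketch. This is illustrative Python only, with an in-memory dict standing in for the real metadata and storage layers; the production system also encrypts blobs and shards them across cells:

```python
import hashlib
import zlib

class BlockStore:
    """Toy content-addressable, immutable block store.

    Illustration only: the real system compresses, AES-256-encrypts,
    and places blocks across cells and zones."""

    MAX_BLOCK = 4 * 1024 * 1024  # 4 MB block ceiling

    def __init__(self):
        self._blobs = {}  # SHA-256 hex -> compressed blob

    def put(self, data: bytes) -> str:
        assert len(data) <= self.MAX_BLOCK
        key = hashlib.sha256(data).hexdigest()
        if key not in self._blobs:       # dedup: identical content stored once
            self._blobs[key] = zlib.compress(data)
        return key

    def get(self, key: str) -> bytes:
        return zlib.decompress(self._blobs[key])

store = BlockStore()
k1 = store.put(b"hello world")
k2 = store.put(b"hello world")           # same content -> same key, one copy
assert k1 == k2 and len(store._blobs) == 1
assert store.get(k1) == b"hello world"
```

Because blocks never change after creation, the store needs no update path at all: a key either exists or it does not.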

Dropbox’s core product is file synchronization and storage. By 2015, the platform served 500+ million users, making it one of the largest consumers of AWS S3 globally.

| Metric | Value (2015) |
| --- | --- |
| Total data stored | 500+ PB |
| Users | 500+ million |
| Primary storage backend | AWS S3 |
| Data growth rate | ~40 PB/year (estimated from 2012’s 40 PB baseline) |
| AWS relationship | One of S3’s largest customers |

Storage is Dropbox’s core product---not a supporting service. When your entire business is storing files, the economics of renting per-GB storage from a cloud provider diverge sharply from owning infrastructure at scale.

Three factors drove the decision:

  1. Cost: Dropbox’s S-1 filing revealed they reduced infrastructure costs by **$74.6 million net over 2016-2017** after migrating. The gross AWS bill reduction in 2016 alone was $92.5 million, offset by $53 million in new depreciation and facility costs.
  2. Control: A custom stack allowed co-designing hardware and software for Dropbox’s specific workload---immutable 4 MB blobs, write-once-read-many access patterns, extreme deduplication potential.
  3. Performance: End-to-end optimization of network, disk I/O, and encoding strategies that S3’s general-purpose design could not provide.
The constraints were equally demanding:

  • Team size: Fewer than 6 engineers initially. The VP of Engineering noted they could not hire 200 engineers, so they “got the best people and created a small team.”
  • Zero downtime: Migration had to be invisible to users. No maintenance windows, no degraded service.
  • Durability guarantee: Must match or exceed S3’s 11 nines of durability.
  • Geographic distribution: US-only data centers (colocation in West, Central, and East regions). European data remained on S3.
  • Timeline: Prototype started summer 2013; production serving by February 2015.

At Dropbox’s scale, the unit economics flip. For most companies, S3’s operational burden elimination justifies its per-GB premium. For Dropbox, storage is the product, not supporting infrastructure. Three properties of their workload made custom infrastructure particularly attractive:

Immutable blocks with content-addressing. Dropbox splits files into blocks (up to 4 MB), compresses them, and identifies each by its SHA-256 hash. A block never changes after creation. This eliminates the consistency complexity that makes general-purpose storage hard---no in-place updates, no concurrent modification conflicts, no cache invalidation.

Write-once-read-rarely access patterns. Over 40% of all file retrievals are for data uploaded in the last 24 hours. Over 90% of retrievals are for data less than a year old. This extreme skew means most stored data is cold, and cold storage can be optimized aggressively for density over latency.

Massive deduplication potential. Content-addressing means identical blocks (common across users who share files, or across versions of the same file) are stored once. At 500+ PB, even small deduplication ratios yield enormous savings.

S3 is a general-purpose object store. It cannot:

  • Allow co-design of disk firmware, server chassis, and storage software as a unified system
  • Use Shingled Magnetic Recording (SMR) drives that require application-level sequential write management
  • Provide erasure coding tuned to Dropbox’s specific durability and read-latency trade-offs
  • Enable the 50%+ I/O budget dedicated to integrity verification that Dropbox’s Pocket Watch system requires

Option 1: Optimize S3 usage

Approach: Negotiate better rates, use S3 storage tiers (Glacier for cold data), optimize deduplication before upload.

Pros:

  • Zero engineering investment
  • Operational burden stays with AWS

Cons:

  • Per-GB economics still unfavorable at 500+ PB scale
  • No control over hardware or storage engine optimization
  • Cannot co-design for immutable-block workload

Why not chosen: Even with aggressive optimization, the cost differential at exabyte scale was too large. Storage is Dropbox’s core product, not ancillary infrastructure.

Option 2: Migrate to another cloud or an open-source stack

Approach: Migrate to a cheaper cloud provider (Google Cloud, Azure) or deploy an open-source distributed storage system (Ceph, HDFS) on leased hardware.

Pros:

  • Faster time to market than custom-built
  • Existing community and tooling

Cons:

  • General-purpose systems carry overhead for features Dropbox does not need (mutable objects, POSIX semantics, directory hierarchies)
  • Cannot optimize for immutable, content-addressed blocks
  • Limited hardware co-design possibilities

Why not chosen: The workload is narrow enough (immutable blobs, hash-keyed, append-only) that a purpose-built system could be dramatically simpler and more efficient than any general-purpose alternative.

Option 3: Build a custom block store (chosen)

Approach: Build a purpose-built, content-addressable block store optimized for immutable 4 MB blobs.

Pros:

  • Hardware-software co-design (custom server chassis, SMR drives, no filesystem)
  • Erasure coding and replication tuned to exact durability requirements
  • 50%+ I/O budget for verification without impacting cost efficiency
  • Long-term cost advantage compounds as data grows

Cons:

  • Multi-year engineering investment
  • Operational expertise must be built in-house
  • Single point of organizational risk (small team)

Why chosen: The narrow workload (immutable blobs) meant the system could be architecturally simple despite the scale. A team of fewer than 6 engineers could build it because the design eliminated most distributed systems complexity.

| Factor | Optimize S3 | Alternative Stack | Custom Build |
| --- | --- | --- | --- |
| Cost at 500+ PB | High | Medium | Low (after amortization) |
| Hardware co-design | No | Limited | Full |
| Time to production | Immediate | 6-12 months | ~18 months |
| Workload optimization | None | Partial | Complete |
| Operational complexity | Low | Medium | High (initially) |
| Long-term compounding savings | No | Partial | Yes |

Magic Pocket’s architecture reflects a single design principle: immutable, content-addressed blocks eliminate most distributed systems complexity. There are no distributed transactions, no consensus protocols, no cache invalidation for data. The system is a distributed key-value store where keys are SHA-256 hashes and values are compressed, encrypted blobs.

Diagram: Magic Pocket's data hierarchy: blocks aggregate into buckets, buckets into volumes replicated across OSDs, stored as extents within cells, organized by geographic zone.
| Level | Size | Description |
| --- | --- | --- |
| Block | Up to 4 MB | Compressed, AES-256 encrypted chunk; identified by SHA-256 hash |
| Bucket | ~1 GB | Logical container aggregating blocks uploaded around the same time |
| Volume | One or more buckets | Replicated or erasure-coded across multiple OSD nodes |
| Extent | 1-2 GB | Physical on-disk unit; written append-only, immutable once sealed |
| Cell | ~50 PB raw | Self-contained storage cluster with its own frontends, coordinators, and storage nodes |
| Zone | Multiple cells | Geographic region (US West, Central, East) |

The Block Index is a sharded MySQL (InnoDB) cluster fronted by an RPC (Remote Procedure Call) service layer. It maps block_hash -> (cell, bucket_id, checksum). The primary key is the SHA-256 hash of each block.

Why MySQL? At write time, the only coordination required is checking whether a block hash already exists (deduplication) and recording where it was placed. This is a simple key-value lookup and insert---exactly what sharded MySQL excels at. The immutability of blocks means no updates, no conflicting writes, and no consistency concerns beyond initial placement.
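A minimal sketch of that lookup-and-insert pattern, using SQLite as a stand-in for one MySQL shard (the column set is a simplification of the mapping described above):

```python
import hashlib
import sqlite3

# Toy schema standing in for the Block Index (sharded MySQL in production).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE block_index (
    block_hash TEXT PRIMARY KEY,   -- SHA-256 of the block
    cell TEXT, bucket_id INTEGER, checksum TEXT)""")

def put_block(data: bytes, cell: str, bucket_id: int) -> tuple:
    """Returns (hash, newly_stored). Dedup is a single primary-key lookup."""
    h = hashlib.sha256(data).hexdigest()
    row = db.execute("SELECT 1 FROM block_index WHERE block_hash=?",
                     (h,)).fetchone()
    if row:
        return h, False            # already placed; nothing to write
    db.execute("INSERT INTO block_index VALUES (?,?,?,?)",
               (h, cell, bucket_id, h[:8]))
    return h, True

h1, new1 = put_block(b"block-a", "cell-7", 42)
h2, new2 = put_block(b"block-a", "cell-9", 99)  # duplicate content
assert new1 and not new2 and h1 == h2
```

The primary-key lookup on `block_hash` is the entire deduplication mechanism: no locks, no cross-shard transactions.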

A smaller MySQL database within each cell that maps bucket -> volume and volume -> (OSDs, open/closed, type, generation). The entire working set fits in memory for fast lookups. This is the authoritative record of where data physically lives within a cell.

Design choice: The Replication Table is the source of truth, not the Master coordinator. The Master is entirely soft-state---it can be restarted without losing any placement information. This avoids the complexity of distributed consensus for coordination.

Custom storage machines running the Diskotech server chassis:

| Specification | Value |
| --- | --- |
| Form factor | 4U chassis |
| Drives per chassis | ~100 Large Form Factor (LFF) |
| Capacity per chassis | ~2 PB (at 20 TB drives) |
| Capacity per rack | ~8 PB |
| Memory | 96 GB per host |
| CPU | 20 cores, 40 threads |
| Network | 50 Gbps NIC |
| Controller | Host Bus Adapter (HBA), no RAID controller |
| Disk access | Direct via libzbc/libzbd, no filesystem |

The OSD storage engine was rewritten from Go to Rust by Jamie Turner’s team. Go’s garbage collector caused unpredictable latency spikes, and the constant threat of OOM (Out of Memory) conditions required C++-level control. The Rust rewrite achieved 3-5x improvement in tail latencies and enabled handling more disks per machine without increased CPU overhead.

Organizational lesson: James Cowling later noted that the Rust components worked well and required little maintenance, but when the original authors departed, insufficient Rust expertise remained. The components continued running reliably but innovation on them slowed.

The cell-level coordinator handles OSD health monitoring, triggering repair operations, creating storage buckets, and orchestrating garbage collection. Critically, the Master holds no authoritative state---the Replication Table is the source of truth. The Master can crash and restart without data loss or placement confusion.

This design avoids Paxos, Raft, or any quorum-based consensus for the coordination layer. The trade-off: recovery after Master failure requires rebuilding soft state from the Replication Table, but this is fast since the working set fits in memory.

Gateway nodes that accept storage requests, determine block placement by balancing cell load and network traffic, and route read/write commands to appropriate OSD nodes.

Asynchronously replicates all PUTs to at least one other geographic zone within approximately one second. Each block exists in at least two zones, ensuring zone-level fault tolerance.

  1. Frontend checks the Block Index for existing block (deduplication via SHA-256 hash).
  2. If the block does not exist, the frontend selects a target volume balancing cell load and network traffic.
  3. Issues store commands to all OSDs in the volume. All must fsync before responding---no acknowledgment until data is durable on disk.
  4. Adds entry to Block Index upon success.
  5. Retries with a different volume on failure.
  6. Cross-zone replication daemon asynchronously replicates within ~1 second.

Why synchronous replication within a volume? Data loss from a partial write (acknowledged to the client but not durable on all replicas) would violate the immutability contract. By requiring all replicas to fsync before acknowledging, the system guarantees that a successful write is immediately durable at the volume’s replication factor.
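The write path above can be sketched in toy Python. The `OSD` class and the volume layout are hypothetical stand-ins, and the fsync is simulated:

```python
import hashlib

class OSD:
    """Stand-in storage node; store_fsync pretends to make data durable."""
    def __init__(self):
        self.extents = {}
        self.healthy = True
    def store_fsync(self, key, blob):
        if not self.healthy:
            raise IOError("osd down")
        self.extents[key] = blob   # pretend this is fsync'd to disk

def write_block(blob, block_index, volumes):
    """Sketch of the write path: dedupe, then all-replica fsync before ack."""
    key = hashlib.sha256(blob).hexdigest()
    if key in block_index:                   # 1. dedup via hash lookup
        return key
    for vol_id, osds in volumes.items():     # 2. pick a volume; 5. retry on failure
        try:
            for osd in osds:                 # 3. every replica must fsync
                osd.store_fsync(key, blob)
            block_index[key] = vol_id        # 4. record placement only once durable
            return key
        except IOError:
            continue                         # orphaned partial writes get GC'd later
    raise IOError("no writable volume")

good = [OSD() for _ in range(4)]
bad = [OSD() for _ in range(4)]
bad[2].healthy = False                       # one replica in this volume is down
index = {}
k = write_block(b"data", index, {"v-bad": bad, "v-good": good})
assert index[k] == "v-good"                  # failed volume skipped, retried elsewhere
```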

  1. Frontend queries Block Index for (cell, bucket_id).
  2. Consults Replication Table for volume and OSD locations.
  3. For replicated volumes: fetches from a single OSD.
  4. For erasure-coded volumes: reconstructs from multiple OSD nodes with Volume Manager assistance.
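A corresponding read-path sketch, again with hypothetical stand-in objects; the erasure-coded branch below concatenates fragments in place of a real Reed-Solomon decode:

```python
class FakeOSD:
    """Stand-in storage node holding blobs in memory."""
    def __init__(self):
        self.blobs = {}
    def get(self, key):
        return self.blobs[key]

def reconstruct(fragments):
    # stand-in for Reed-Solomon reconstruction of an erasure-coded volume
    return b"".join(fragments)

def read_block(key, block_index, replication_table, osds):
    cell, bucket_id = block_index[key]              # 1. Block Index: locate bucket
    volume = replication_table[(cell, bucket_id)]   # 2. Replication Table: volume + OSDs
    if volume["type"] == "replicated":
        return osds[volume["osds"][0]].get(key)     # 3. single-OSD fetch
    fragments = [osds[o].get(key) for o in volume["osds"]]
    return reconstruct(fragments)                   # 4. multi-OSD rebuild

osds = {"osd-1": FakeOSD(), "osd-2": FakeOSD()}
osds["osd-1"].blobs["h"] = b"payload"
block_index = {"h": ("cell-1", 7)}
replication_table = {("cell-1", 7): {"type": "replicated",
                                     "osds": ["osd-1", "osd-2"]}}
assert read_block("h", block_index, replication_table, osds) == b"payload"
```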

Magic Pocket uses a variant of Reed-Solomon erasure coding similar to Local Reconstruction Codes (LRC):

| Scheme | Configuration | Storage Overhead | Use Case |
| --- | --- | --- | --- |
| Replicated | 4x within zone, 2x cross-zone | Up to 8x total | Recently uploaded (hot) data |
| Reed-Solomon | 6+3 (6 data, 3 parity) | 1.5x | Standard encoding for warm data |
| LRC | ~12+2+2 | ~1.33x | Optimized reads, tolerates 3 failures |

Why start with replication? Over 40% of file retrievals target data uploaded in the last 24 hours. High-redundancy replication ensures fast single-OSD reads for the hottest data. As blocks cool (access frequency drops), they are aggregated and erasure-coded---trading read simplicity for storage efficiency. This tiered approach means the system never pays the full 8x replication cost for the 90%+ of data that is rarely accessed.
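The overhead figures above fall directly out of the shard counts; a quick check, treating the ~12+2+2 LRC layout as 12 data shards plus 4 parities for the purposes of the ratio:

```python
def ec_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw-to-logical storage ratio for an erasure-coded scheme."""
    return (data_shards + parity_shards) / data_shards

assert ec_overhead(6, 3) == 1.5                # Reed-Solomon 6+3 -> 1.5x
assert round(ec_overhead(12, 4), 2) == 1.33    # LRC ~12+2+2 -> ~1.33x
# Hot data: 4x in-zone replication, mirrored cross-zone -> up to 8x raw bytes
assert 4 * 2 == 8
```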

Introduced around 2019, the cold tier uses fragment-based striping across geographic regions:

  • 2+1 scheme: A block is split into 2 data fragments plus 1 XOR parity fragment, each placed in a different geographic zone.
  • Storage overhead: 1.5x (compared to 2x+ for warm tier cross-zone replication).
  • 25% disk reduction with 3 zones; 33% savings with a 4-region 3+1 model.
  • Reads issue requests to all regions in parallel, use the fastest two responses, and cancel the rest.

Counter-intuitively, the 99th percentile read latency for cold storage was lower than warm storage. The parallel cross-region reads with hedged cancellation effectively reduced tail latency, even though individual region latency was higher.
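The 2+1 XOR scheme can be demonstrated directly: either data fragment is recoverable from the other fragment plus the parity, which is exactly why a read can complete with the first two of the three regional responses:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

block = b"0123456789abcdef"
half = len(block) // 2
d0, d1 = block[:half], block[half:]  # two data fragments, in different zones
parity = xor(d0, d1)                 # XOR parity fragment, in a third zone

# Any single zone can be lost: the missing fragment is the XOR of the other two.
assert xor(d1, parity) == d0
assert d0 + xor(d0, parity) == block  # rebuild d1, then reassemble the block
```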

Dropbox was the first major technology company to deploy Shingled Magnetic Recording (SMR) drives at petabyte scale. SMR drives overlap write tracks like roof shingles, increasing areal density but requiring sequential writes aligned to zone boundaries.

Why this is a perfect fit: Magic Pocket’s append-only, immutable write pattern naturally produces sequential writes. The OSD daemon manages sequential zones directly via libzbc/libzbd, bypassing the filesystem entirely.

| Milestone | Date |
| --- | --- |
| First petabyte-scale SMR deployment | June 2018 |
| All new drives deployed as SMR | September 2018 |
| ~40% of total fleet on SMR | 2019 |
| 90% of HDD fleet on SMR | 2022 |

Drive capacity progression: 4 TB -> 8 TB -> 14 TB -> 18 TB -> 20 TB -> 26 TB (as of 2023).

Power efficiency gains: 5-6x reduction in power per terabyte since initial 4 TB deployment. Latest 18-20 TB drives operate at ~0.30 watts/TB idle and ~0.50 watts/TB under random reads. The sixth-generation server requires 5 fewer megawatts serving one exabyte compared to the fourth generation.

  • Data at rest uses AES-256 encryption.
  • Each block has its own block encryption key (BEK).
  • BEKs are wrapped with versioned global key encryption keys (KEKs).
  • Key rotation happens at the metadata level without rewriting encrypted data.
  • Deletion uses crypto-shredding: deleting encryption metadata renders the encrypted block permanently inaccessible. Physical disk space is reclaimed asynchronously.
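Crypto-shredding is easy to illustrate with a toy envelope scheme. The stream cipher below is a SHA-256 counter keystream for demonstration only (not AES-256), and the key vault is an in-memory stand-in for the KEK-wrapped key metadata:

```python
import hashlib
import secrets

def keystream_cipher(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher via a SHA-256 counter keystream.
    Demonstration only -- the real system uses AES-256."""
    stream, counter = bytearray(), 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

key_vault = {}  # block_hash -> BEK (production: BEK wrapped by a versioned KEK)

def put(block_hash: str, plaintext: bytes) -> bytes:
    bek = secrets.token_bytes(32)        # fresh per-block encryption key
    key_vault[block_hash] = bek
    return keystream_cipher(bek, plaintext)

def shred(block_hash: str) -> None:
    del key_vault[block_hash]            # deleting the key deletes the data

ciphertext = put("h1", b"user data")
assert keystream_cipher(key_vault["h1"], ciphertext) == b"user data"
shred("h1")
assert "h1" not in key_vault  # ciphertext may linger on disk but is unreadable
```

Key rotation in this model only re-wraps the BEKs with a new KEK; the ciphertext on disk is never rewritten.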

Dropbox’s data center network evolved to a quad-plane, 3-tier Clos fabric:

| Specification | Value |
| --- | --- |
| Racks per pod | 16 |
| Pods per fabric | 16 (expandable to 31 for ~500 racks) |
| Link speed | 100G everywhere |
| Oversubscription | 1:1 (non-blocking) |
| Routing | eBGP with ECMP |
| Uplink per rack | 4x100G |

The non-blocking fabric is essential for Magic Pocket’s repair operations, which must move terabytes of data across hundreds of OSD nodes simultaneously when a drive fails.

Originally, Magic Pocket used an NVMe SSD caching layer for block metadata on each host. As drive density increased (14-20 TB per disk, 100 drives per host), the single NVMe SSD became a bottleneck at 15-20 Gbps write throughput.

Solution: Store both metadata and raw data inline on SMR disks using a new extent format that serializes blocks with their metadata directly to disk.
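A sketch of the idea (the on-disk layout below, a length prefix plus SHA-256 checksum plus payload, is hypothetical; Dropbox has not published the exact extent format):

```python
import hashlib
import io
import struct

def append_block(extent: io.BytesIO, data: bytes) -> int:
    """Append one block with inline metadata; returns its offset."""
    offset = extent.seek(0, io.SEEK_END)
    extent.write(struct.pack(">I", len(data)))       # 4-byte length prefix
    extent.write(hashlib.sha256(data).digest())      # 32-byte checksum
    extent.write(data)                               # payload
    return offset

def read_block_at(extent: io.BytesIO, offset: int) -> bytes:
    extent.seek(offset)
    (length,) = struct.unpack(">I", extent.read(4))
    digest = extent.read(32)
    data = extent.read(length)
    assert hashlib.sha256(data).digest() == digest   # verify on every read
    return data

ext = io.BytesIO()                                   # stands in for an SMR zone
off1 = append_block(ext, b"first")
off2 = append_block(ext, b"second block")
assert read_block_at(ext, off1) == b"first"
assert read_block_at(ext, off2) == b"second block"
```

Because metadata travels with the data, no separate SSD index is needed; the append-only pattern also stays aligned with SMR's sequential-write requirement.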

| Metric | Before | After |
| --- | --- | --- |
| Peak write throughput | Baseline | 2-2.5x improvement |
| Production write throughput | Baseline | 15-20% improvement |
| P95 write latency under load | Baseline | 15-20% improvement |
| Monthly SSD failure repairs | 8-10 | 0 |

Completed fleet-wide by Q1 2022.

| Date | Milestone |
| --- | --- |
| Summer 2013 | Small team (fewer than 6 engineers) begins coding prototype |
| August 2014 | Dark launch: Mirroring data from 2 regional locations |
| February 27, 2015 | Launch: Production serving exclusively from own infrastructure |
| April 30, 2015 | BASE Jump: Scaling phase begins |
| October 7, 2015 | 90% milestone: 90% of all user data on custom infrastructure |
| March 14, 2016 | Public announcement via engineering blog |
March 14, 2016Public announcement via engineering blog

Dual-write approach: Every uploaded file was simultaneously saved to both AWS S3 and Magic Pocket. This allowed instant rollback at the file level if Magic Pocket returned incorrect data.

Peak transfer rate: Over half a terabit per second on the internal network during bulk migration.

Gradual cutover: The team scaled from double-digit-petabyte prototypes to a multi-exabyte system within approximately 6 months.

Post-Migration

Approximately 10% of data remained on AWS S3---primarily for European users and regions where Dropbox lacked physical data center presence. In 2020, Dropbox built the Object Store abstraction layer, providing a unified API (PUT, GET, DELETE, LIST) that routes requests transparently between S3 and Magic Pocket. Object Store added write batching, range-based reads, and encryption, while also enabling deprecation of most HDFS (Hadoop Distributed File System) usage by late 2021.

More than 50% of the workload on disks and databases is internal verification traffic. This is an unusual design choice---most storage systems optimize for user-facing throughput. Dropbox chose to dedicate half their I/O capacity to catching corruption before it becomes data loss.

Diagram: Pocket Watch's five verification subsystems, each targeting a different failure mode.

Continuously reads back every bit on disk and validates against checksums. Full disk sweep every 1-2 weeks, with recently modified areas scanned more frequently. Automatically schedules re-replication and disk remediation upon detecting corruption.

Iterates through the Block Index at approximately 1 million blocks per second, confirming that blocks exist on their designated storage nodes. Full scans complete roughly weekly per zone.

Independent black-box verification that samples 1% of all written blocks. Retrieves each sampled block after intervals of 1 minute, 1 hour, 1 day, 1 week, and 1 month. This multi-interval approach detects end-to-end failures that manifest at different timescales---immediate write failures, short-term disk degradation, and long-term bit rot.
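The sampling logic can be sketched as a priority queue of scheduled read-backs (toy Python; the interval values follow the text above, everything else is illustrative):

```python
import heapq
import random

# Read-back intervals in seconds: 1 minute, 1 hour, 1 day, 1 week, ~1 month
INTERVALS = [60, 3_600, 86_400, 604_800, 2_592_000]

def schedule_checks(block_hashes, now=0.0, sample_rate=0.01,
                    rng=random.Random(7)):
    """Sample ~1% of written blocks; queue one read-back per interval."""
    heap = []
    for h in block_hashes:
        if rng.random() < sample_rate:
            for dt in INTERVALS:
                heapq.heappush(heap, (now + dt, h))
    return heap

queue = schedule_checks([f"blk-{i}" for i in range(10_000)])
sampled = len(queue) // len(INTERVALS)
assert 50 < sampled < 200       # roughly 1% of 10,000 blocks
assert queue[0][0] == 60        # the earliest check is the 1-minute read-back
```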

Verifies that deleted extents were safely migrated before permanent removal. Maintains a temporary “trash” storage for an additional day after inspection, providing a last-chance recovery window.

Confirms that all user files exist in the appropriate storage zones by walking the File Journal system. Catches cross-zone replication gaps that the one-second async replication might miss during transient failures.

When an OSD goes offline for more than 15 minutes:

  1. Master closes all volumes on the failed OSD.
  2. Builds a reconstruction plan distributing load across hundreds of OSD nodes to avoid creating a new hotspot.
  3. Performs data transfer with erasure coding reconstruction where needed.
  4. Increments the generation number on new volumes (enabling stale-read detection if the old OSD comes back online with outdated data).
  5. Updates the Replication Table to commit changes atomically.
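Step 4's generation numbers are worth illustrating: a returning OSD that still advertises the old generation is refused, which prevents stale reads (hypothetical names throughout):

```python
class Volume:
    def __init__(self, volume_id: str, osds: list, generation: int):
        self.id, self.osds, self.generation = volume_id, osds, generation

replication_table = {"v1": Volume("v1", ["osd-a", "osd-b", "osd-c"],
                                  generation=3)}

def serve_read(volume_id: str, osd_generation: int) -> str:
    """Reject data from a replica whose generation lags the committed one."""
    if osd_generation < replication_table[volume_id].generation:
        raise ValueError("stale replica: must be repaired before serving")
    return "ok"

# After repair, the Replication Table commits the bumped generation atomically.
replication_table["v1"].generation = 4
assert serve_read("v1", 4) == "ok"
try:
    serve_read("v1", 3)   # the old OSD rejoined with outdated data
    raise AssertionError("stale read should have been rejected")
except ValueError:
    pass
```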

Repair rate: 4 extents per second (each 1-2 GB).

Repair SLA: Less than 48 hours for a full OSD reconstruction.

Auto-repair volume: Hundreds of hardware failures are repaired automatically per day across the fleet.

On November 18, 2021, Dropbox physically unplugged their San Jose (SJC) data center from the network for 30 minutes. Three on-site technicians physically disconnected optical fiber at three colocation facilities.

Result: Zero impact to global availability. The active-active design across three zones handled the loss transparently. Recovery Time Objective (RTO) improved by more than an order of magnitude compared to pre-2020 capabilities.

This was not a simulated failover---it was a real disconnection of production infrastructure serving user traffic.

| Year | Gross AWS Savings | New Infra Costs (depreciation, facilities) | Net Savings |
| --- | --- | --- | --- |
| 2016 | $92.5M | $53.0M | $39.5M |
| 2017 | (continued) | (continued) | $35.1M |
| Total | | | $74.6M |

Source: Dropbox S-1 filing (2018).

| Metric | Value |
| --- | --- |
| Theoretical durability (Markov model) | 27 nines |
| Published durability guarantee | Over 12 nines |
| Availability | 99.99% |
| Cross-zone replication latency | ~1 second |
| Repair SLA (full OSD) | Less than 48 hours |

| Metric | Value |
| --- | --- |
| Storage drives deployed | 600,000+ |
| SMR fleet percentage | ~90% |
| Users served | 700+ million |
| Request throughput | Tens of millions per second |
| Auto-repaired hardware failures | Hundreds per day |
| Geographic zones | 3 US + AWS for Europe |
| Latest drive capacity | 26 TB |
| Hardware generation | 6th (7th in development) |
| Next-gen NIC requirement | 200G per host |

What worked well:

  • SMR alignment: The append-only, immutable write pattern turned out to be ideal for SMR drives, enabling Dropbox to adopt higher-density drives years before competitors.
  • Cold storage latency improvement: Cross-region parallel reads with hedged cancellation produced lower P99 latency than warm storage single-region reads.
  • Crypto-shredding simplification: The per-block encryption key design meant data deletion required only metadata deletion, not physical overwrite---simplifying compliance with data deletion regulations.

Limitations:

  • US-only custom infrastructure: European data still relies on AWS S3 (~10% of total).
  • Rust expertise concentration: The OSD storage engine’s Rust codebase remained robust but innovation-constrained after original authors departed.
  • Single-vendor risk: The Diskotech server design creates dependency on specific drive vendors for SMR qualification and firmware collaboration.

The insight: Magic Pocket avoids distributed consensus, distributed transactions, and cache invalidation for data---the three hardest problems in distributed storage. It can do this because blocks are immutable and content-addressed. Fewer than 6 engineers built a system that replaced the world’s most sophisticated object store.

How it applies elsewhere:

  • Before building general-purpose infrastructure, ask whether your workload has properties that eliminate entire categories of complexity.
  • Append-only, immutable data models are dramatically simpler to distribute than mutable ones.

Warning signs to watch for:

  • Your storage system handles consistency edge cases that your actual data model does not require.
  • You are paying for POSIX semantics or mutable-object support that your application never uses.

The insight: By designing the Diskotech chassis, choosing SMR drives, writing a custom OSD engine without a filesystem, and building a non-blocking Clos network, Dropbox created an integrated stack where each layer amplifies the others. The power efficiency gains (5-6x per TB) and the SSD cache removal are examples of optimizations only possible with full-stack control.

How it applies elsewhere:

  • At sufficient scale, the cost of understanding hardware intimately pays for itself.
  • General-purpose layers (filesystems, RAID controllers, general-purpose SSDs) add overhead and limit optimization options.

The insight: Pocket Watch’s five verification subsystems consume over 50% of disk and database I/O. This seems wasteful until you consider that undetected corruption at exabyte scale would mean certain data loss. The cost of verification is the cost of maintaining durability guarantees.

How it applies elsewhere:

  • Durability claims without continuous verification are aspirational, not actual.
  • Multi-interval sampling (1 minute to 1 month) catches failure modes that single-point-in-time checks miss.

Warning signs to watch for:

  • Your durability SLA depends on assumptions about disk reliability that you are not actively validating.
  • You discover corruption through user reports rather than automated systems.

What they’d do differently: James Cowling later reflected that the Rust rewrite, while technically successful, created a skills gap when original authors departed. A small team building critical infrastructure must plan for knowledge transfer from day one.

When storage is your product (not a supporting service), the economics of building custom infrastructure become favorable at a lower scale threshold than most engineers assume. Dropbox’s $74.6M net savings in two years funded the ongoing engineering investment many times over.

Magic Pocket stores blocks, but mapping files to blocks requires a metadata system. Dropbox built Edgestore, a strongly consistent, read-optimized, horizontally scalable, geo-distributed metadata store.

  • Storage engine: MySQL (InnoDB)
  • Data model: Entities (objects with attributes) and Associations (relationships between entities)
  • Architecture: Language-specific SDKs (Go, Python) -> stateless Cores -> Caching Layer -> Engines -> MySQL
  • Consistency: Strong by default (cache invalidation on writes); eventual consistency optional
  • Scale: Thousands of machines, several trillion entries, millions of queries per second, 5 nines of availability

In later years, Dropbox built Panda, a petabyte-scale transactional key-value store sitting between Edgestore and sharded MySQL. Panda provides Atomicity, Consistency, Isolation, Durability (ACID) transactions with Multi-Version Concurrency Control (MVCC), eliminating the fleet-doubling expansion model of traditional MySQL sharding.

Dropbox is preparing for Heat-Assisted Magnetic Recording (HAMR) drives, which temporarily heat the recording media to 450 degrees Celsius during writes, enabling smaller thermally stable grains and projected capacities of 50+ TB per drive. The transition from SMR to HAMR represents the next major hardware-software co-design cycle for Magic Pocket.

Magic Pocket’s success rests on a single architectural insight: immutable, content-addressed blocks eliminate the hardest problems in distributed storage. This allowed a team of fewer than six engineers to build a system that replaced AWS S3, saved $74.6 million in two years, and now serves 700+ million users across 600,000+ drives.

The design decisions compound: immutability enables append-only writes, append-only writes enable SMR drives, SMR drives enable higher density, higher density drives justify custom chassis, custom chassis justify custom networks. Each choice opens the next optimization.

The most counterintuitive decision---spending 50%+ of I/O on verification rather than throughput---reflects a maturity about durability that distinguishes production storage systems from academic designs. At exabyte scale, undetected corruption is not a risk to be modeled; it is a certainty to be continuously hunted.

Prerequisites:

  • Distributed storage fundamentals (replication, erasure coding)
  • Content-addressable storage concepts
  • Basic understanding of disk technologies (HDD, SSD, SMR)

Glossary:

| Term | Definition |
| --- | --- |
| SMR | Shingled Magnetic Recording---HDD technology that overlaps write tracks for higher density, requiring sequential writes |
| LRC | Local Reconstruction Codes---erasure coding variant enabling local repair without reading all shards |
| OSD | Object Storage Device---a Magic Pocket storage node with ~100 drives |
| Extent | 1-2 GB append-only physical storage unit on disk |
| Cell | ~50 PB self-contained storage cluster within a zone |
| Crypto-shredding | Data deletion by destroying encryption keys rather than overwriting ciphertext |
| HAMR | Heat-Assisted Magnetic Recording---next-generation HDD technology using laser heating |
| Clos fabric | Non-blocking network topology providing full bisection bandwidth |
| BEK/KEK | Block Encryption Key / Key Encryption Key---two-layer encryption hierarchy |

Key takeaways:

  • Magic Pocket is a content-addressable, immutable block store using SHA-256 hashes as keys, built by fewer than 6 engineers.
  • The system migrated 500+ PB from AWS S3 using a dual-write strategy over ~18 months, saving $74.6M net in the first two years.
  • Data starts as 4x replicated (hot), then gets erasure-coded to 1.33-1.5x overhead as access frequency drops.
  • Pocket Watch verification consumes 50%+ of disk I/O across five subsystems, continuously validating durability at exabyte scale.
  • Custom Diskotech servers with 100 SMR drives per chassis (~2 PB per machine) enable 5-6x power efficiency gains per TB generation over generation.
  • The narrow workload (immutable blobs) eliminates distributed consensus, transactions, and cache invalidation, making extreme scale tractable for a small team.