Uber Schemaless: Building a Scalable Datastore on MySQL

In early 2014, Uber faced an infrastructure crisis: trip data growth was consuming database capacity so rapidly that their systems would fail by year’s end without intervention. Rather than adopting an existing NoSQL solution, Uber built Schemaless—a thin, horizontally scalable layer on top of MySQL that prioritized operational simplicity over feature richness. This case study examines the architectural decisions that enabled Schemaless to scale from crisis point to billions of trips, the trade-offs inherent in its append-only cell model, and the design patterns that made MySQL work as a distributed datastore.

Mermaid diagram — Schemaless architecture: worker nodes route requests to sharded MySQL storage nodes. Each cell is immutable and versioned.

Abstract

The core insight: Instead of adopting Cassandra or building a custom distributed database, Uber layered a simple abstraction over MySQL. The key innovation was the cell model—an immutable, versioned data unit identified by (row_key, column_name, ref_key). Writes append new versions rather than updating in place, making every operation idempotent and enabling simple replication.

Why it worked:

Operational familiarity: Uber engineers knew MySQL. Cassandra and Riak offered linear scaling but lacked operational maturity at Uber’s scale and the consistency guarantees needed for trip data.
Fixed sharding: 4,096 shards mapped deterministically from row keys. No dynamic rebalancing complexity.
Buffered writes: If a master fails, writes buffer to another master (hinted handoff). Write availability prioritized over immediate read consistency.
Triggers: The append-only model created a natural change log, enabling event-driven architectures for billing, analytics, and cross-datacenter replication.

The trade-off: Eventual consistency, denormalized indexes, and application-level complexity in exchange for horizontal scalability and operational simplicity.

Context

The System

In early 2014, Uber’s trip data infrastructure consisted of a single PostgreSQL instance. This was the largest percentage of storage, grew fastest, and contributed the most IOPS (Input/Output Operations Per Second) of any data type.

Metric	Value (Early 2014)
Annual trips	140 million
Monthly growth rate	~20%
Database	Single PostgreSQL instance
Pain point	Vertical scaling limit

The Trigger

Problem statement: Growth would exhaust infrastructure by end of 2014 without intervention.

Why PostgreSQL couldn’t scale:

Write amplification: PostgreSQL’s MVCC (Multi-Version Concurrency Control) stores the ctid (tuple ID) in every index. When a row updates, all indexes add new entries even if only one column changed. This caused excessive disk writes and storage costs.
Replication overhead: WAL-based (Write-Ahead Log) replication logged all low-level changes, leading to high bandwidth usage for cross-datacenter replication.
Connection model: Process-per-connection was less efficient than MySQL’s thread-per-connection for high connection counts.

Alternatives Evaluated

Option	Pros	Cons	Outcome
Cassandra	Linear scaling, tunable consistency	Eventual consistency complicates index implementation; operational immaturity at Uber’s scale	Rejected
Riak	Masterless, conflict resolution	Similar consistency challenges; CRDT complexity	Rejected
MongoDB	Document model, familiar query language	Scaling concerns; consistency model issues for trip data	Rejected
Custom on MySQL	Operational familiarity, proven reliability	Build cost, no built-in distribution	Chosen

The deciding factors: Uber engineers had confidence in operating MySQL. NoSQL systems offered features but lacked the operational maturity and consistency guarantees needed for critical trip data. Building a thin layer on MySQL let them leverage existing expertise.

Design Inspirations

FriendFeed’s schemaless approach: Store JSON blobs with minimal schema constraints
Pinterest’s operational principles: Simple, sharded MySQL-based architecture
Google Bigtable: Sparse, three-dimensional persistent hash map (row, column, timestamp)

The Cell Model

Core Data Structure

The fundamental unit is an immutable cell:

1
Cell = (row_key, column_name, ref_key) → JSON body

Component	Type	Purpose
`row_key`	UUID	Primary identifier (e.g., trip_uuid)
`column_name`	String	Attribute category (e.g., “BASE”, “FARE”)
`ref_key`	Integer	Version number, monotonically increasing
`body`	JSON	Arbitrary JSON payload, no schema enforcement

Example: A trip’s base data:

1
{
2
  "row_key": "trip_uuid_abc123",
3
  "column_name": "BASE",
4
  "ref_key": 3,
5
  "body": {
6
    "driver_uuid": "driver_xyz",
7
    "rider_uuid": "rider_456",
8
    "status": "completed",
9
    "pickup_lat": 37.7749,
10
    "pickup_lng": -122.4194
11
  }
12
}

Immutability and Versioning

Key design decision: Cells are never updated in place. To modify data, write a new cell with a higher ref_key.

Why immutability:

Idempotent writes: Retrying a write is always safe. If the network fails mid-write, retry without side effects.
Built-in versioning: Every write creates a change log entry. Full history is preserved.
Simplified replication: No conflict resolution needed—highest ref_key wins.
Natural trigger support: Triggers can track all mutations by scanning new ref_keys.

Trade-off: Storage growth. Old versions accumulate. Compaction strategies needed for long-lived entities.

Reading and Writing Cells

Write path:

Client sends cell to worker node
Worker hashes row_key to determine shard (0-4095)
Worker routes to storage node master for that shard
Master writes to MySQL, replicates to minions asynchronously

Read path:

Client requests cell by row_key and column_name
Worker routes to storage node (default: master, configurable per request)
Storage node returns cell with highest ref_key

Sharding Architecture

Fixed Shard Count

Schemaless uses 4,096 fixed shards. A cell maps to a shard based on its row_key hash:

1
shard_id = hash(row_key) % 4096

Why fixed shards:

No rebalancing: Adding capacity means moving entire shards, not rehashing data
Deterministic routing: Any worker can compute the shard without coordination
Operational simplicity: Fixed mapping simplifies debugging and operations

Trade-off: 4,096 is a hard limit on parallelism. For Uber’s trip data, this was sufficient—each shard handles roughly 1/4096th of traffic.

Storage Node Topology

Each shard runs on a cluster with one master and two minions:

1
Shard N:
2
  └── Master (writes, reads by default)
3
      ├── Minion A (async replica, DC 1)
4
      └── Minion B (async replica, DC 2)

Master: Receives all writes. Handles reads when strong consistency is needed.

Minions: Async replicas in different datacenters. Handle read traffic to reduce master load. Eventually consistent—lag is typically sub-20ms.

MySQL as Storage

Each Schemaless shard is a separate MySQL database containing:

Table	Purpose
Entity table	Stores cells with compound index on (row_key, column_name, ref_key)
Index tables	One per secondary index, denormalized
Coordination tables	Trigger offsets, metadata

Why MySQL worked:

B-tree index on (row_key, column_name, ref_key) enables efficient cell lookups
InnoDB provides ACID within a shard
Mature replication, backup, and operational tooling

Later optimization: Uber migrated from InnoDB to MyRocks (RocksDB-based storage engine) for improved compression and write performance.

Worker and Storage Nodes

Worker Node Responsibilities

Worker nodes are stateless HTTP servers that:

Route requests: Hash row_key → shard → storage node
Aggregate results: For queries spanning multiple shards, fan out and merge
Implement circuit breakers: Detect storage node failures, failover to replicas
Buffer writes: If master is down, persist writes to another master (hinted handoff)

Scaling: Worker nodes scale horizontally. Add more for capacity. By mid-2016, Uber ran several thousand worker nodes.

Storage Node Responsibilities

Storage nodes are MySQL instances that:

Store cells: Primary data persistence
Execute queries: Cell lookups, index scans
Replicate: Master → minion async replication

Scaling: Add more shards (up to 4096) or increase hardware per shard. By mid-2016, Uber ran thousands of storage nodes across 40+ Schemaless instances.

Fault Tolerance

Worker node failure: Other workers continue serving. Client load balancer routes around failure.

Storage node master failure:

Worker detects via circuit breaker
Writes buffer to another master (hinted handoff)
Reads failover to minion (eventually consistent)
Minion can be promoted to master

Storage node minion failure: No impact on writes. Reads failover to other replicas or master.

Buffered writes (hinted handoff):

Trade-off: Buffered writes are not immediately readable. Client knows the master is down but cannot read the buffered data until recovery.

Secondary Indexes

Eventually Consistent Design

Secondary indexes are not updated in the same transaction as cell writes. This avoids two-phase commit overhead.

Write flow:

Cell written to entity table (ACK to client)
Trigger fires asynchronously
Index table updated with denormalized data

Consistency guarantee: Index lag is typically <20ms but not bounded. Applications must handle stale reads.

Denormalized Index Structure

Indexes store denormalized data directly, enabling single-shard queries:

1
// Index: trips by driver
2
row_key: driver_uuid
3
column_name: "TRIP_INDEX"
4
body: {
5
  trip_uuid: "abc123",
6
  status: "completed",
7
  timestamp: 1609459200,
8
  // Frequently-queried fields denormalized here
9
}

Query path:

Query index by driver_uuid → get trip_uuids and denormalized data
If denormalized data is sufficient, return immediately
If full trip needed, fetch from entity table

Why denormalization: Avoids cross-shard joins. All data for an index query lives in one shard (the shard of the index row_key).

Trade-off: Storage duplication. Index updates lag entity writes. Application complexity in deciding what to denormalize.

Trigger System

Event-Driven Architecture

The append-only cell model creates a natural change log. Triggers subscribe to cell changes and execute application logic asynchronously.

1
@trigger(column="BASE", table="trips")
2
def on_trip_update(cell):
3
    if cell.body.status == "completed":
4
        charge_rider(cell.row_key)
5
        notify_driver(cell.row_key)

Trigger Architecture

Implementation:

Each shard maintains an auto-incrementing added_id for cells
Trigger workers track their offset (last processed added_id) per shard
Workers poll for cells with added_id > offset
Process cell, advance offset

Delivery guarantee: At-least-once. Triggers must be idempotent.

Scaling: Up to 4,096 trigger workers (one per shard). Workers can be added/removed without stopping processing.

Use Cases

Use Case	Trigger Column	Action
Billing	`BASE` (status = completed)	Charge rider
Analytics	Any	Write to data warehouse
Cross-DC replication	Any	Replicate to remote datacenter
Notifications	`BASE`	Push to driver/rider

Fault Tolerance

Worker failure: Leader redistributes shard assignments to remaining workers.

Leader failure: New leader elected. Processing continues without redistribution.

Offset tracking: Offsets persisted to ZooKeeper or Schemaless itself. On restart, resume from last committed offset.

Migration: Project Mezzanine

Dual-Write Strategy

Migration from PostgreSQL to Schemaless used a dual-write pattern:

Phase 1: All writes go to PostgreSQL. Shadow writes to Schemaless.
Phase 2: Shadow reads from Schemaless, compare with PostgreSQL results.
Phase 3: Primary reads from Schemaless, PostgreSQL as fallback.
Phase 4: Schemaless primary for reads and writes. PostgreSQL deprecated.

Timeline

Date	Milestone
Early 2014	Crisis identified: growth would exhaust capacity by year end
~5 months	Schemaless design and implementation
6 weeks	Final intensive development sprint
Mid-October 2014	Go-live (“a month before Halloween”)
October 31, 2014	Halloween (high-traffic day) - no incidents

Why “Mezzanine”: The project codename referred to the trip datastore—Uber’s first production Schemaless instance.

Migration Safeguards

Abstraction layer: lib/tripstore with feature flags for routing
Shadow verification: All queries replayed to both systems, results compared
Gradual cutover: Traffic shifted incrementally with rollback capability
Zero-downtime requirement: No planned maintenance windows

Outcome

Scale Achieved

Metric	Value (2016)
Schemaless instances	40+
Storage nodes	Thousands
Worker nodes	Several thousand
Trips milestone	2 billion (October 2016)

Performance Improvements

After rewriting the sharding layer from Python to Go (“Project Frontless”, December 2016):

Metric	Improvement
Median latency	85% decrease
P99 latency	70% decrease
CPU utilization	85% decrease

Replication Performance

Multi-datacenter replication (Herb engine):

Metric	Value
Replication throughput	550-900 cells/second median
Replication latency	82-90 milliseconds

Operational Benefits

Horizontal scaling: Add shards/nodes without application changes
Operational familiarity: MySQL tooling, monitoring, backup procedures
Fault isolation: Shard failures don’t cascade
Development velocity: Teams work independently on different Schemaless instances

Evolution to Docstore

Why Schemaless Needed Replacement

By 2020, Schemaless limitations became friction:

Limitation	Impact
Immutability constraint	Developer friction for general-purpose use cases
Eventual consistency only	Complex application logic for consistency-sensitive workloads
No transactions	Multi-entity operations required application-level coordination
Limited query capability	Cell-level lookups only; no complex queries

Docstore Architecture

Published February 2021, Docstore provides:

Feature	Schemaless	Docstore
Data model	Immutable cells	Mutable documents
Consistency	Eventual	Strict serializability per partition
Transactions	None	Full transaction support
Schema	Schemaless	Schema with enforcement
Query capability	Cell lookups	Rich query engine

Key innovation: MySQL transactions replicated via Raft consensus. Users can imagine transactions execute sequentially within a partition.

Docstore design goal: “Provide the best of both worlds: schema flexibility of document models, and schema enforcement seen in traditional relational models.”

Lessons Learned

Technical Lessons

1. Operational Familiarity Trumps Features

The insight: Uber chose MySQL not because it was the best distributed database, but because engineers knew how to operate it. Cassandra’s features were less valuable than MySQL’s operational predictability.

When this applies:

Team has deep expertise in a technology
Production reliability matters more than feature richness
Debugging and recovery procedures are well-established

Warning signs you might need this:

Incidents caused by unfamiliar database behavior
Operations team can’t debug production issues
Recovery procedures are undocumented

2. Immutability Simplifies Distribution

The insight: By making cells immutable, Schemaless eliminated entire classes of problems: update conflicts, partial updates, consistency during replication. The trade-off—storage growth—was acceptable for trip data.

When immutability works:

Event-like data (transactions, trips, logs)
When full history has value
When storage is cheaper than coordination complexity

When it doesn’t:

Frequently-updated entities
Large objects with small updates
When storage costs dominate

3. Fixed Sharding Avoids Coordination

The insight: Dynamic resharding is complex. Fixed 4,096 shards meant no coordination for routing, no rebalancing during growth. The limit was acceptable for trip data volumes.

Trade-off: 4,096 is a hard scaling limit. Beyond this, each shard handles more load rather than spreading to more shards.

When fixed shards work:

Predictable upper bound on data/traffic
Operational simplicity matters
Resharding complexity isn’t worth the flexibility

4. Triggers Enable Event-Driven Architecture

The insight: The append-only model created a natural change log. Triggers turned the database into an event source, enabling billing, analytics, and replication without additional infrastructure.

Pattern: If your writes are append-only, you already have event sourcing. Expose it.

Process Lessons

1. Build on What You Know Under Pressure

The insight: With a year to prevent infrastructure collapse, Uber chose to build on MySQL rather than learn a new system. This was the right trade-off for their timeline and risk profile.

When to apply:

Tight timelines with high stakes
Team has deep expertise in an alternative
New technology would add operational risk

2. Abstraction Layers Enable Safe Migration

The insight: lib/tripstore abstracted the storage layer, enabling dual-write, shadow verification, and gradual cutover. Without this abstraction, migration would have been riskier.

Pattern: Before migrating storage, introduce an abstraction layer. This adds short-term complexity but de-risks the migration.

Applying This to Your System

When Schemaless-Style Architecture Applies

You might benefit from this pattern if:

Your data is event-like (trips, transactions, logs)
You need horizontal write scaling beyond a single database
Your team has MySQL/PostgreSQL expertise
You can tolerate eventual consistency for secondary indexes
Immutability makes sense for your data model

When It Doesn’t Apply

This pattern is wrong if:

You need strong consistency across entities
Your data requires frequent in-place updates
You need complex queries (joins, aggregations)
Storage costs are a primary concern
Your team prefers managed services over operational control

Starting Points

If exploring this architecture:

Model your data as cells: Can entities be decomposed into (key, column, version)?
Prototype on single MySQL: Implement cell model, indexes, triggers
Add sharding layer: Fixed hash-based routing to multiple databases
Implement buffered writes: Handle master failures gracefully
Build trigger infrastructure: Enable event-driven workflows

Conclusion

Schemaless succeeded by being deliberately limited. Instead of building a feature-complete distributed database, Uber built a thin layer that solved their specific problem: horizontal scaling for trip data using technology they knew how to operate. The cell model’s immutability traded storage efficiency for operational simplicity. Fixed sharding traded flexibility for predictability. Eventually consistent indexes traded correctness for scalability.

These trade-offs were right for trip data in 2014. They became limitations by 2020, leading to Docstore. The lesson isn’t that Schemaless was the “right” architecture—it’s that architecture choices are contextual. Uber’s constraints (team expertise, timeline, data characteristics) led to different choices than a greenfield system might make today.

Appendix

Prerequisites

Understanding of sharding and replication concepts
Familiarity with MySQL/PostgreSQL operations
Basic knowledge of eventual consistency trade-offs
Understanding of event-driven architecture patterns

Terminology

Cell: Immutable data unit identified by (row_key, column_name, ref_key) with JSON body
Shard: One of 4,096 fixed partitions; determined by hash(row_key)
Worker node: Stateless routing layer between clients and storage
Storage node: MySQL instance storing cells for a shard
Master/Minion: Primary write server and async replicas
Hinted handoff: Buffering writes to another master when target is unavailable
Trigger: Async function invoked on cell changes

Summary

Uber built Schemaless in 2014 to escape single-PostgreSQL scaling limits
Core innovation: immutable cells versioned by ref_key, stored in sharded MySQL
Fixed 4,096 shards with deterministic routing eliminated resharding complexity
Eventually consistent secondary indexes traded consistency for scalability
Trigger system enabled event-driven billing, analytics, and replication
Migration used dual-write with shadow verification for zero-downtime cutover
Scaled to 40+ instances, thousands of nodes, billions of trips by 2016
Evolved to Docstore by 2021 when strong consistency and transactions became priorities

References

Designing Schemaless, Uber Engineering’s Scalable Datastore Using MySQL - Part 1: Core design and cell model
The Architecture of Schemaless, Uber Engineering’s Trip Datastore Using MySQL - Part 2: Worker/storage architecture
Using Triggers On Schemaless, Uber Engineering’s Datastore Using MySQL - Part 3: Trigger system
Project Mezzanine: The Great Migration - Migration strategy and execution
Code Migration in Production: Rewriting the Sharding Layer of Uber’s Schemaless Datastore - Python to Go rewrite (Project Frontless)
Evolving Schemaless into a Distributed SQL Database - Evolution to Docstore
Herb: Multi-DC Replication Engine for Uber’s Schemaless Datastore - Multi-datacenter replication
Why Uber Engineering Switched from Postgres to MySQL - PostgreSQL limitations analysis