
Transactions and ACID Properties

Database transactions provide the foundation for reliable data operations: atomicity ensures all-or-nothing execution, consistency maintains invariants, isolation controls concurrent access, and durability guarantees persistence. This article walks through the implementation mechanisms (WAL, ARIES, MVCC, locking), the isolation-level semantics that the major engines actually deliver, the distributed protocols (2PC, 3PC, Spanner’s TrueTime, Calvin), and the practical alternatives (sagas, outbox, idempotency keys) that take over when synchronous global transactions become too expensive.

Figure: ACID delivered by three coordination regimes — single-node ACID via WAL and MVCC, synchronous coordinators (2PC, Spanner) for global atomicity, and async alternatives (sagas, outbox) when coordinators are too expensive.

Abstract

Transactions wrap multiple operations into a logical unit with specific guarantees. The mental model: ACID is not monolithic—each property addresses a distinct failure mode:

| Property | Failure Addressed | Implementation Mechanism |
| --- | --- | --- |
| Atomicity | Partial failures (crash mid-operation) | Write-Ahead Logging, shadow paging |
| Consistency | Invariant violations | Application logic + constraints |
| Isolation | Concurrent access anomalies | MVCC, locking, SSI |
| Durability | Data loss after commit | fsync, replication |

Isolation is the complex one—it’s a spectrum, not binary:

  • Read Uncommitted → sees uncommitted changes (dirty reads)
  • Read Committed → sees only committed data (most databases’ default)
  • Repeatable Read → same query returns same results (but phantoms possible in SQL standard)
  • Serializable → equivalent to serial execution (no anomalies)

Distributed transactions are expensive—2PC blocks on failures, 3PC doesn’t solve network partitions, and both add latency. Modern systems often avoid them:

  • Google Spanner: Uses TrueTime to achieve external consistency without classic 2PC blocking
  • Microservices: Sagas with compensating transactions provide eventual atomicity
  • Event-driven systems: Outbox pattern guarantees local consistency + eventual delivery

Key insight: The right transaction strategy depends on your consistency requirements, latency budget, and failure tolerance—not all data needs the same guarantees.

ACID Properties Deep Dive

Atomicity: All or Nothing

Atomicity guarantees that a transaction either completes entirely or has no effect. If a system crashes mid-transaction, partial changes are rolled back.

Write-Ahead Logging (WAL)

WAL is the dominant atomicity mechanism in modern databases. The principle: log changes before applying them to data files.

Core rule: “Log first, apply later.” A write is durable when its log record reaches disk, not when the data page is updated.

Implementation:

wal-concept.ts
  operation: "INSERT" | "UPDATE" | "DELETE"  table: string  beforeImage: any // For undo (rollback)  afterImage: any // For redo (recovery)  prevLsn: number // Previous record in same transaction}// WAL guarantees:// 1. Log record written before data page modification// 2. All log records for a transaction written before COMMIT// 3. COMMIT record written before acknowledging client// Recovery uses log to redo committed, undo uncommitted

PostgreSQL’s WAL (as of PostgreSQL 17):

  • WAL segments are 16 MB by default; the size is fixed at initdb time and can be changed with initdb --wal-segsize (since PostgreSQL 11).
  • wal_level controls what is logged: minimal, replica, or logical.
  • fsync = on (default) is the global durability switch; turning it off invites unrecoverable corruption on crash.
  • synchronous_commit controls whether the COMMIT acknowledgment waits for WAL flush. With off, the worst-case data-loss window is bounded at three times wal_writer_delay (default 200 ms, so ~600 ms), and the database remains self-consistent.

Why WAL beats direct writes:

  1. Sequential I/O: Log writes are sequential (fast); data page writes are random (slow)
  2. Group commit: Multiple transactions’ commits batched into single fsync
  3. Reduced writes: Data pages updated in memory, flushed lazily via checkpoints
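
To make the write path concrete, here is a minimal TypeScript sketch of the rules above, in the spirit of wal-concept.ts. The log array, buffer pool, and function names are invented stand-ins for illustration, not any engine's actual API:

TypeScript
// wal-write-path.ts — illustrative sketch only
const logFile: object[] = []                 // stand-in for an append-only file on disk

function appendToLog(rec: object): void {    // sequential write: fast
  logFile.push(rec)
}

function fsyncLog(): void {
  // fs.fsyncSync(walFd) in a real engine; a no-op in this sketch
}

class BufferPool {
  private dirtyPages = new Map<number, Uint8Array>()

  writePage(pageNo: number, bytes: Uint8Array, lsn: number): void {
    appendToLog({ lsn, pageNo })             // rule: log BEFORE touching the page
    this.dirtyPages.set(pageNo, bytes)       // page changes only in memory for now
  }

  checkpoint(): void {
    // dirty pages are flushed lazily, long after their transactions committed
    this.dirtyPages.clear()
  }
}

function commitTransaction(txId: number): void {
  appendToLog({ txId, type: "COMMIT" })
  fsyncLog()                                 // commit is durable once this returns;
}                                            // group commit batches this one fsync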

ARIES Recovery Algorithm

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), developed at IBM Research in 1992, remains the foundation of modern database recovery. It popularized the “repeating history” paradigm.

Three recovery phases:

  1. Analysis: Scan log from last checkpoint, determine which transactions were active and which pages were dirty
  2. Redo: Replay all logged operations (including uncommitted) to restore database to crash state
  3. Undo: Roll back uncommitted transactions by applying undo records in reverse order

Key insight: ARIES redoes everything first, then undoes selectively. This “repeat history” approach simplifies recovery logic—the database reaches exactly the crash-time state before selective rollback.

Compensation Log Records (CLRs): During undo, ARIES logs the undo operations themselves. If crash occurs during recovery, these CLRs prevent re-undoing already-undone work.
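
The three phases can be compressed into a short TypeScript sketch. This is a deliberately simplified model over an in-memory log (no checkpoints, no dirty page table, callbacks instead of page operations), so treat it as the shape of the algorithm rather than a faithful ARIES implementation:

TypeScript
// aries-sketch.ts — simplified; real ARIES also tracks per-page LSNs, checkpoints, etc.
type LogRec = {
  lsn: number
  txId: number
  type: "UPDATE" | "COMMIT" | "CLR"
  redo?: () => void    // re-apply the change
  undo?: () => void    // reverse the change (UPDATE records only)
}

function recover(log: LogRec[]): void {
  // Phase 1 (Analysis): transactions with no COMMIT record are "losers"
  const committed = new Set(log.filter(r => r.type === "COMMIT").map(r => r.txId))
  const losers = new Set(log.filter(r => !committed.has(r.txId)).map(r => r.txId))

  // Phase 2 (Redo): repeat history, replaying EVERY update (even losers')
  for (const r of log) if (r.type === "UPDATE") r.redo?.()

  // Phase 3 (Undo): roll back losers newest-first, logging a CLR per undo
  // so that a crash during recovery never re-undoes finished work
  let nextLsn = log.length
  for (const r of [...log].reverse()) {
    if (r.type === "UPDATE" && losers.has(r.txId) && r.undo) {
      r.undo()
      log.push({ lsn: nextLsn++, txId: r.txId, type: "CLR", redo: r.undo })
    }
  }
}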

Shadow Paging: The Alternative

Shadow paging maintains two copies of database pages: current (being modified) and shadow (stable reference).

Mechanism:

  1. Transaction starts: shadow page table points to committed pages
  2. Modifications create new page copies; current page table updated
  3. Commit: atomically swap current page table to become the new shadow
  4. Abort: discard current page table, shadow remains unchanged
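
A toy TypeScript rendering of that mechanism, with an in-memory Map standing in for the on-disk page table; the atomic pointer swap at commit is the whole trick:

TypeScript
// shadow-paging-sketch.ts — toy model; a real system swaps a root pointer on disk
type PageTable = Map<number, Uint8Array>

let shadow: PageTable = new Map()            // last committed state, never mutated

function updatePage(table: PageTable, pageNo: number, mutate: (p: Uint8Array) => void): void {
  // copy-on-write: modify a COPY so the shadow's version of the page stays intact
  const copy = new Uint8Array(table.get(pageNo) ?? new Uint8Array(8192))
  mutate(copy)
  table.set(pageNo, copy)
}

function runTransaction(write: (current: PageTable) => void): void {
  const current = new Map(shadow)            // current table shares all pages with shadow
  write(current)                             // writes go through updatePage (copy-on-write)
  shadow = current                           // COMMIT: one atomic pointer swap
  // ABORT would simply discard `current`; the shadow was never touched
}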

Trade-offs vs WAL:

  • Simpler recovery (just use shadow pages)
  • No undo phase needed
  • Data fragmentation (pages scattered on disk)
  • Requires copying pages, not just logging changes
  • Garbage collection overhead

Real-world usage: SQLite uses a variant (rollback journal), LMDB uses copy-on-write B-trees. Most production OLTP databases use WAL because of better write performance.

Consistency: Application Responsibility

Consistency in ACID means the database moves from one valid state to another. This is primarily the application’s responsibility, not the database’s.

What databases guarantee:

  • Constraint enforcement: Primary keys, foreign keys, unique constraints, CHECK constraints
  • Trigger execution: Automatic actions on data changes
  • Cascade operations: Referential integrity maintenance

What applications must ensure:

  • Business rules (account balance >= 0)
  • Cross-table invariants (order total = sum of line items)
  • Domain logic (valid state transitions)

The distinction matters: A database can enforce “balance is a number” but not “balance reflects all valid transactions.” The latter requires correct application logic.

Example inconsistency:

SQL
-- Database constraints satisfied, business rule violated
-- Application bug: double-credited an account
UPDATE accounts SET balance = balance + 100 WHERE id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'A';  -- Duplicate!
-- No constraint violation, but data is inconsistent with business intent

Isolation: The Complex Property

Isolation determines what concurrent transactions see of each other’s changes. It’s a spectrum of guarantees with different performance costs.

Isolation Anomalies

Understanding anomalies is key to choosing isolation levels:

| Anomaly | Description | Example |
| --- | --- | --- |
| Dirty Read | Reading uncommitted data | T1 writes, T2 reads, T1 aborts → T2 saw data that never existed |
| Non-Repeatable Read | Same query, different results | T1 reads row, T2 updates and commits, T1 re-reads → different value |
| Phantom Read | New rows appear in range query | T1 queries range, T2 inserts row in range, T1 re-queries → extra row |
| Lost Update | Concurrent updates overwrite each other | T1 and T2 read balance=100, both update to balance+10 → balance=110 (should be 120) |
| Write Skew | Constraint violated by concurrent non-overlapping writes | Two doctors both check "≥1 on call", both remove themselves → zero on call |

Write skew is particularly insidious because each transaction individually maintains the invariant, but together they violate it. Martin Kleppmann’s “Designing Data-Intensive Applications” popularized the doctors on-call example:

SQL
-- T1: Doctor Alice checks if she can leave
SELECT COUNT(*) FROM on_call WHERE shift = 'tonight';  -- Returns 2 (Alice, Bob)
-- Alice decides it's safe to leave (at least 1 will remain)
DELETE FROM on_call WHERE doctor = 'Alice' AND shift = 'tonight';
COMMIT;

-- T2: Simultaneously, Doctor Bob runs the same logic
SELECT COUNT(*) FROM on_call WHERE shift = 'tonight';  -- Also returns 2 (snapshot)
DELETE FROM on_call WHERE doctor = 'Bob' AND shift = 'tonight';
COMMIT;

-- Result: No doctor on call! Both saw 2, both removed themselves.

SQL Standard Isolation Levels

The SQL-92 standard defines four isolation levels by which anomalies they permit:

| Level | Dirty Read | Non-Repeatable Read | Phantom Read |
| --- | --- | --- | --- |
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Not possible | Possible | Possible |
| Repeatable Read | Not possible | Not possible | Possible |
| Serializable | Not possible | Not possible | Not possible |

Critical caveat: The SQL standard defines these levels only by which anomalies are permitted, not by which must occur. Databases can (and do) provide stronger guarantees than required.

Durability: Surviving Crashes

Durability guarantees that committed transactions survive system failures. Implementation involves:

fsync and Write Barriers

The problem: Operating systems buffer writes. A “successful” write may sit in OS cache when power fails.

fsync() forces buffered data to persistent storage. But it’s expensive—SSDs can fsync at ~10K-100K ops/sec; HDDs at ~100-200 ops/sec.

PostgreSQL durability settings:

SQL
-- Full durability (the defaults; fsync is a server setting in postgresql.conf)
fsync = on
synchronous_commit = on

-- Relaxed: don't wait for WAL to reach disk (settable per session)
SET synchronous_commit = off;
-- Risk: lose recent commits on crash (up to 3 x wal_writer_delay,
-- default 200 ms => ~600 ms window).
-- No corruption risk — just uncommitted data loss.

PostgreSQL’s fsync surprise (2018): in early 2018, Craig Ringer and PostgreSQL hackers found that the engine had been assuming failed fsync() calls could be safely retried. On Linux that assumption is wrong — after an I/O error the kernel can mark the dirty page clean, so a retried fsync() returns success while the data is gone.[1] Since PostgreSQL 11.2 (2019‑02), the default is to PANIC on fsync() failure (the new data_sync_retry GUC defaults to off), forcing crash recovery from the WAL rather than trusting that any in-memory state is salvageable. The same assumption held in many other engines that buffer through the page cache.

Group Commit Optimization

Individual fsync per transaction is too slow. Group commit batches multiple commits into one fsync:

Text
T1: BEGIN → ... → COMMIT (waiting)
T2: BEGIN → ... → COMMIT (waiting)
T3: BEGIN → ... → COMMIT (waiting)
--- fsync once for all three ---
T1, T2, T3: acknowledged

PostgreSQL’s commit_delay and commit_siblings control grouping. Under high concurrency, group commit dramatically reduces fsync overhead.
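
A sketch of the batching logic in TypeScript; the one-millisecond timer plays the role of commit_delay, and everything here is illustrative rather than PostgreSQL's actual implementation:

TypeScript
// group-commit-sketch.ts
const waiters: Array<() => void> = []        // transactions waiting for durability
let flushScheduled = false

function fsyncWal(): void {
  // one forced write covers every commit in the batch
}

function commit(): Promise<void> {
  return new Promise<void>((ack) => {
    waiters.push(ack)
    if (!flushScheduled) {
      flushScheduled = true
      setTimeout(() => {                     // brief wait lets sibling commits join
        fsyncWal()                           // ONE fsync for the whole group
        flushScheduled = false
        waiters.splice(0).forEach(done => done())  // acknowledge every committer
      }, 1)
    }
  })
}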

Durability vs. Performance

| Setting | Durability | Performance | Use case |
| --- | --- | --- | --- |
| fsync=on, sync_commit=on | Full — committed = durable | Baseline | Financial, audit, ledger |
| fsync=on, sync_commit=off | Bounded loss (≤ 3× wal_writer_delay) | Faster commits under high concurrency; magnitude workload-dependent | Most OLTP where ~hundreds-of-ms loss is tolerable |
| fsync=off | None — corruption likely on crash | Maximum | Throwaway / dev only |

Caution

The two knobs are not symmetric. synchronous_commit = off trades a bounded data-loss window (≤ 3× wal_writer_delay) for noticeably faster commits — safe for most OLTP. fsync = off trades unrecoverable on-disk corruption for raw throughput — never appropriate for production. Benchmark deltas vary widely with storage media, group-commit pressure, and workload; treat published “X% faster” numbers as upper bounds, not budget.

Isolation Level Implementations

MVCC vs. Locking

Two fundamental approaches to isolation:

Pessimistic Locking (2PL - Two-Phase Locking):

  • Acquire locks before accessing data
  • Hold locks until transaction ends
  • Readers block writers, writers block readers (with shared/exclusive locks)
  • Guarantees serializability but limits concurrency

MVCC (Multi-Version Concurrency Control):

  • Maintain multiple versions of each row
  • Readers see a consistent snapshot; never block writers
  • Writers create new versions; never block readers
  • Higher concurrency but more complex conflict detection

The hybrid reality: Most modern databases combine both. MVCC for read/write concurrency, locking for write/write conflicts.

PostgreSQL Implementation (v17)

PostgreSQL uses MVCC with full tuple versioning:

System columns on every row:

  • xmin: Transaction ID that created this version
  • xmax: Transaction ID that deleted/updated this version (0 if live)
  • cmin/cmax: Command IDs within transaction

Visibility rules: A row version is visible to transaction T if:

  1. xmin is committed and less than T’s snapshot
  2. xmax is either 0, aborted, or greater than T’s snapshot
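
Those two rules compress into a small predicate. The sketch below ignores subtransactions, command IDs, and the in-progress transaction list, so it is a simplified model of PostgreSQL's real visibility check, not a reimplementation:

TypeScript
// visibility-sketch.ts
interface Tuple {
  xmin: number              // xid that created this version
  xmax: number              // xid that deleted/updated it (0 if live)
}
interface Snapshot {
  xmax: number              // first xid NOT visible to this snapshot
  committed: Set<number>    // xids known committed
}

function isVisible(t: Tuple, snap: Snapshot): boolean {
  const creatorVisible = snap.committed.has(t.xmin) && t.xmin < snap.xmax
  const deleterInvisible =
    t.xmax === 0 || !snap.committed.has(t.xmax) || t.xmax >= snap.xmax
  return creatorVisible && deleterInvisible
}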

Snapshot timing by isolation level:

| Level | Snapshot Taken | Effect |
| --- | --- | --- |
| Read Committed | Each statement | Sees all commits before statement |
| Repeatable Read | Transaction start | Sees all commits before BEGIN |
| Serializable | Transaction start | Plus SSI conflict detection |

Figure: PostgreSQL MVCC tuple visibility across concurrent transactions — how xmin/xmax and snapshot timing determine which row version a transaction sees at Read Committed vs. Repeatable Read.

PostgreSQL exceeds the SQL standard: Repeatable Read in PostgreSQL prevents phantom reads (the SQL-92 standard only requires this at Serializable), and READ UNCOMMITTED is silently treated as READ COMMITTED — MVCC simply never exposes uncommitted data.[2]

Serializable Snapshot Isolation (SSI): PostgreSQL’s Serializable level uses SSI, which tracks read/write dependencies to detect serialization anomalies. When a cycle is detected, one transaction is aborted:

Text
ERROR: could not serialize access due to read/write dependencies among transactions

SSI has lower overhead than strict 2PL because it doesn’t acquire locks; it only tracks potential conflicts and aborts when necessary.
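
Because SSI resolves conflicts by aborting, callers must be prepared to retry. A sketch of the standard retry loop, assuming a node-postgres-style client (the client shape is an assumption, but SQLSTATE 40001 is PostgreSQL's real serialization_failure code):

TypeScript
// ssi-retry.ts
interface PgClient {
  query(sql: string): Promise<unknown>
}

async function withSerializableRetry<T>(
  client: PgClient,
  body: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    await client.query("BEGIN ISOLATION LEVEL SERIALIZABLE")
    try {
      const result = await body()
      await client.query("COMMIT")
      return result
    } catch (err: any) {
      await client.query("ROLLBACK")
      // 40001 = serialization_failure: safe to retry the WHOLE transaction
      if (err?.code === "40001" && attempt < maxAttempts) continue
      throw err
    }
  }
}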

VACUUM: Dead tuples (old versions) accumulate in tables. VACUUM reclaims space. Without it, tables bloat indefinitely. Autovacuum runs automatically but may lag under heavy write load.

MySQL/InnoDB Implementation (v8.x)

InnoDB uses MVCC with undo logs:

Hidden columns per row:

  • DB_TRX_ID: Transaction ID of last modification
  • DB_ROLL_PTR: Pointer to undo log record

Key difference from PostgreSQL: Old row versions live in separate undo logs, not in the table. Reading old versions requires traversing undo chain—more overhead for long-running transactions.

Isolation level behavior:

| Level | Read Behavior | Write Behavior |
| --- | --- | --- |
| Read Uncommitted | Current data (no MVCC) | Standard locks |
| Read Committed | Fresh snapshot per statement | Record locks |
| Repeatable Read (default) | Transaction-start snapshot | Gap locks + record locks |
| Serializable | Same as RR | All reads acquire shared locks |

Gap locking and next-key locking: InnoDB prevents phantoms at Repeatable Read using gap locks:

SQL
-- T1 at Repeatable Read
SELECT * FROM orders WHERE amount > 100 FOR UPDATE;
-- Locks not just matching rows, but gaps between them
-- Prevents T2 from inserting amount=150
-- This is why MySQL RR prevents phantoms (stronger than SQL standard)

Trade-off: Gap locks can cause deadlocks and reduced concurrency. PostgreSQL’s SSI approach detects conflicts after the fact rather than preventing them with locks.

SQL Server Implementation

SQL Server offers both locking-based and MVCC-based isolation:

Pessimistic (locking) levels: Read Uncommitted, Read Committed, Repeatable Read, Serializable—traditional lock-based isolation with blocking.

Optimistic (MVCC) levels:

  • Read Committed Snapshot Isolation (RCSI): MVCC version of Read Committed
  • Snapshot Isolation: Full transaction-level snapshot

RCSI and Snapshot store row versions in tempdb, not the main tables. This avoids table bloat but adds tempdb I/O.

Enabling MVCC:

SQL
-- Enable snapshot isolation
ALTER DATABASE MyDB SET ALLOW_SNAPSHOT_ISOLATION ON;
-- Make Read Committed use snapshots by default
ALTER DATABASE MyDB SET READ_COMMITTED_SNAPSHOT ON;

Isolation Level Comparison Matrix

| Aspect | PostgreSQL | MySQL/InnoDB | SQL Server |
| --- | --- | --- | --- |
| Default level | Read Committed | Repeatable Read | Read Committed |
| MVCC storage | Heap (in-table) | Undo logs | tempdb |
| Cleanup | VACUUM | Purge threads | Version store cleanup |
| Phantom prevention | SSI at Serializable | Gap locks at RR | Locks or MVCC |
| Read Uncommitted | = Read Committed | True dirty reads | True dirty reads |
| Serializable approach | SSI (optimistic) | 2PL (pessimistic) | 2PL (pessimistic) |

Distributed Transactions

When a transaction spans multiple databases or services, local ACID is insufficient. Distributed commit protocols coordinate global atomicity.

Two-Phase Commit (2PC)

2PC ensures all participants either commit or abort together.

Protocol:

Text
Phase 1 - Prepare (Voting):
1. Coordinator sends PREPARE to all participants
2. Each participant:
   - Writes redo/undo logs
   - Acquires locks
   - Responds VOTE-YES (can commit) or VOTE-NO (must abort)

Phase 2 - Commit (Decision):
3. If all voted YES:
   - Coordinator writes COMMIT to its log
   - Sends COMMIT to all participants
   - Participants commit and release locks
4. If any voted NO:
   - Coordinator writes ABORT
   - Sends ROLLBACK to all participants

Figure: The 2PC happy path (PREPARE → vote → COMMIT) and its blocking failure mode — if the coordinator crashes between phase 1 and phase 2, YES-voters are stuck holding locks until it recovers.

The blocking problem: if the coordinator crashes after sending PREPARE but before sending COMMIT/ABORT, participants that voted YES are stuck. They hold locks, can’t commit (no decision received), can’t abort (might have been committed). They must wait for coordinator recovery — Gray’s original term for this is “blocking”, and it is the defining weakness of 2PC.

Real-world impact: 2PC is used in Java EE (JTA/XA), database-to-message-broker transactions, and cross-database joins. The blocking risk means participants can hold locks for minutes during coordinator failures, which is why most modern microservice architectures avoid it.

Timeout behavior: participants typically abort if no decision arrives within timeout. But this can cause inconsistency if the coordinator did decide to commit and later delivers the COMMIT to other participants.
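
The protocol is small enough to sketch end to end. Participant is a hypothetical interface, not a real library's API; the durable decision log is the step whose timing creates the blocking problem described above:

TypeScript
// two-phase-commit-sketch.ts
type Vote = "YES" | "NO"

interface Participant {
  prepare(txId: string): Promise<Vote>      // write redo/undo log, acquire locks
  commit(txId: string): Promise<void>
  rollback(txId: string): Promise<void>
}

function logDecisionDurably(txId: string, decision: "COMMIT" | "ABORT"): void {
  // stand-in for a forced write to the coordinator's own log
}

async function twoPhaseCommit(txId: string, participants: Participant[]) {
  // Phase 1: voting
  const votes = await Promise.all(participants.map(p => p.prepare(txId)))
  const decision = votes.every(v => v === "YES") ? "COMMIT" : "ABORT"

  // If the coordinator crashes anywhere between collecting votes and
  // delivering the decision, YES-voters block holding their locks
  logDecisionDurably(txId, decision)

  // Phase 2: decision
  if (decision === "COMMIT") {
    await Promise.all(participants.map(p => p.commit(txId)))
  } else {
    await Promise.all(participants.map(p => p.rollback(txId)))
  }
  return decision
}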

Three-Phase Commit (3PC)

Skeen’s 1981 “Nonblocking Commit Protocols” proposed an extra “pre-commit” phase to avoid 2PC’s blocking under site failures:

Protocol:

Text
Phase 1 - CanCommit:
1. Coordinator asks "Can you commit?"
2. Participants respond YES/NO

Phase 2 - PreCommit:
3. If all YES: Coordinator sends PRECOMMIT
4. Participants acknowledge and prepare to commit

Phase 3 - DoCommit:
5. Coordinator sends COMMIT
6. Participants commit

Why 3PC doesn’t solve the fundamental problem: Skeen’s original protocol explicitly assumes a fail-stop model (nodes crash, don’t return wrong answers) and a reliable network that does not partition. In real networks none of those assumptions hold:

  • Network partitions can make a live coordinator appear dead.
  • Two partitions might independently elect different coordinators and decide differently.
  • Messages can be delayed beyond timeout and then arrive.

The standard impossibility result — that no atomic-commit protocol is non-blocking in the presence of both site failures and network partitions — applies to 3PC as much as 2PC.

Practical result: 3PC is rarely used in production. The added round-trip and complexity do not eliminate blocking in realistic failure scenarios, so most systems either accept 2PC’s blocking risk or avoid synchronous distributed commit entirely.

Google Spanner: TrueTime and External Consistency

Spanner achieves something remarkable: globally distributed, externally consistent transactions without the blocking problems of 2PC.

The insight: If you can bound clock uncertainty, you can assign globally meaningful timestamps without coordination.

TrueTime API (Corbett et al., OSDI 2012):

  • TT.now() returns an interval [earliest, latest] guaranteed to contain the true wall-clock time.
  • The uncertainty bound ε is a sawtooth: in the original paper Google reports it varying between 1 and 7 ms over a 30-second poll interval (≈ 4 ms most of the time), with the implementation “generally less than 10 ms”.
  • If two intervals don’t overlap, the calls were definitely ordered in real time.

Spanner’s commit protocol:

  1. Prepare phase: Participants prepare and report their prepare timestamps
  2. Timestamp assignment: Coordinator chooses commit timestamp > all prepare timestamps
  3. Commit wait: Coordinator waits until TT.now().earliest > commit_timestamp
  4. Commit: Transaction becomes visible

Why commit wait matters: The wait ensures that any transaction starting “after” this one (in wall-clock time) will see this transaction’s effects. Without commit wait, a transaction could commit at timestamp T, but a later transaction might get timestamp T-1 due to clock skew.
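
The wait itself is a few lines. In this TypeScript sketch, TT.now() is stubbed with a fixed uncertainty bound epsilon; in real Spanner the interval comes from the TrueTime service backed by GPS and atomic clocks:

TypeScript
// commit-wait-sketch.ts
interface TTInterval { earliest: number; latest: number }

const EPSILON_MS = 5   // assumed clock-uncertainty bound, for the sketch only

const TT = {
  now(): TTInterval {
    const t = Date.now()
    return { earliest: t - EPSILON_MS, latest: t + EPSILON_MS }
  },
}

async function commitWait(commitTimestamp: number): Promise<void> {
  // After this returns, every correct clock reads > commitTimestamp,
  // so any transaction starting later must pick a larger timestamp.
  while (TT.now().earliest <= commitTimestamp) {
    await new Promise(resolve => setTimeout(resolve, 1))
  }
}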

External consistency guarantee: for any two transactions T1 and T2, if T1 commits before T2 starts, T2’s timestamp > T1’s timestamp. This is stronger than serializability — it respects real-time ordering.

Figure: Spanner's commit wait converts TrueTime uncertainty into external consistency — the coordinator picks commit timestamp s, then blocks until TT.now().earliest > s; the wait is paid in commit latency, but any later real-time transaction gets a timestamp > s.

Trade-off accepted: commit latency includes the commit wait (roughly 2 × ε on average — single-digit milliseconds with Google’s clocks). For most workloads with tight TrueTime bounds, this is negligible compared to the WAN round trips a multi-region transaction already pays.

Calvin: Deterministic Transaction Protocol

Calvin takes a radically different approach: avoid 2PC by making execution deterministic.

Key insight: If all replicas execute the same transactions in the same order, they’ll reach the same state without coordination.

Protocol:

  1. Sequencer: Batches transactions into a global order, replicates log via Paxos
  2. Scheduler: Each replica reads the log and executes in order
  3. Execution: Deterministic—same inputs = same outputs

Why it avoids 2PC: In 2PC, coordinator failure is problematic because participants don’t know the decision. In Calvin, if a node fails, it recovers by replaying the deterministic log from checkpoint. No commit protocol needed—execution is deterministic.

Limitations:

  • Transactions must be submitted as complete units (no interactive transactions)
  • Read set and write set must be known before execution
  • Not suitable for transactions that depend on external input mid-execution

Real-world: FaunaDB shipped a Calvin-derived protocol[3] and was the most prominent commercial product to do so; the hosted service shut down on 2025‑05‑30. The protocol suits specific workloads, but the requirement that read- and write-sets be known up-front rules out interactive transactions and limits general applicability.

Alternatives to Distributed Transactions

When distributed transactions are too expensive or don’t fit the architecture, alternatives provide eventual atomicity.

Saga Pattern

A saga is a sequence of local transactions with compensating transactions for rollback.

Example: E-commerce order saga:

Text
1. Order Service: Create order (PENDING)
   Compensation: Cancel order
2. Payment Service: Reserve payment
   Compensation: Release payment hold
3. Inventory Service: Reserve items
   Compensation: Release reservation
4. Shipping Service: Schedule shipment
   Compensation: Cancel shipment
5. Order Service: Mark COMPLETED

If step 3 fails (items out of stock), execute compensations in reverse: release payment hold, cancel order.
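
An orchestrated version of this flow fits in a few lines of TypeScript. The step/compensation shape follows the example above; the service calls themselves would be hypothetical stubs:

TypeScript
// saga-orchestrator-sketch.ts
interface SagaStep {
  name: string
  action: () => Promise<void>
  compensation: () => Promise<void>   // must be idempotent (see the list below)
}

async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = []
  for (const step of steps) {
    try {
      await step.action()
      completed.push(step)
    } catch (err) {
      // Failure: run compensations of completed steps in REVERSE order
      for (const done of completed.reverse()) {
        await done.compensation()     // semantic undo, not row-level rollback
      }
      throw err
    }
  }
}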

Figure: An orchestrated saga — each forward step has a compensating step; on failure, the orchestrator runs the compensations in reverse (semantic undo, not row-level rollback).

Two coordination approaches:

Choreography (event-driven):

  • Services publish events; others react
  • No central coordinator
  • Loosely coupled, simpler for small sagas
  • Hard to track global state, complex with many steps

Orchestration (centralized):

  • Central orchestrator invokes services in sequence
  • Handles compensation logic centrally
  • Easier to understand flow, centralized monitoring
  • Orchestrator is SPOF, coupling to orchestrator

Compensation design challenges:

  1. Idempotency: compensations may be retried; they must be safe to apply more than once.
  2. Non-reversible side effects: you can’t un-send an email or un-charge a card; design semantic reversals (refund, “we cancelled order #X”) instead of mechanical undo.
  3. No isolation between steps: concurrent sagas can observe each other’s intermediate states. If that matters (inventory, reservations), add semantic locks (status = PENDING) or pessimistic reservations at the saga boundary.

Warning

Sagas trade isolation for availability. Two sagas mid-flight can see each other’s half-applied state, so any invariant that crosses services needs an explicit reservation or a uniqueness check on the compensating path. Treat sagas as eventual-atomicity, not as drop-in 2PC replacements.

Real-world: Uber, Netflix, Airbnb use saga patterns for cross-service transactions. Frameworks like Temporal, Axon, and Camunda provide saga orchestration.

Transactional Outbox Pattern

Solves the dual-write problem: updating a database AND publishing an event must be atomic.

The problem:

Text
1. Write to database  ✓
2. Publish to Kafka   ✗ (crash here)
// Event lost, subscribers never notified

The solution:

Text
1. In single transaction:
   - Write to business table
   - Write event to OUTBOX table
2. Separate process (or CDC) reads OUTBOX, publishes to Kafka
3. After successful publish, delete/mark processed in OUTBOX

Implementation options:

  • Polling: background job queries outbox, publishes, marks complete.
  • Change Data Capture (CDC): Debezium tails the WAL/binlog, captures outbox inserts, streams to Kafka.
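
A sketch of both halves in TypeScript. Db and Producer are hypothetical minimal client interfaces (not any specific library's API), and the table and column names are illustrative:

TypeScript
// outbox-sketch.ts
interface Db {
  query(sql: string, params?: unknown[]): Promise<any[]>
  transaction(fn: (tx: Db) => Promise<void>): Promise<void>
}
interface Producer {
  send(topic: string, payload: string): Promise<void>
}

async function placeOrder(db: Db, order: { id: string; total: number }) {
  await db.transaction(async (tx) => {
    await tx.query("INSERT INTO orders (id, total) VALUES ($1, $2)", [order.id, order.total])
    await tx.query(
      "INSERT INTO outbox (event_id, topic, payload) VALUES ($1, $2, $3)",
      // crypto.randomUUID() is a Node 19+/browser global
      [crypto.randomUUID(), "orders", JSON.stringify({ type: "OrderPlaced", ...order })]
    )
  }) // both rows commit atomically, or neither does — no dual-write gap
}

async function pollOutbox(db: Db, kafka: Producer) {
  const rows = await db.query(
    "SELECT event_id, topic, payload FROM outbox WHERE published = false ORDER BY event_id LIMIT 100"
  )
  for (const row of rows) {
    await kafka.send(row.topic, row.payload)
    // a crash between send() and this UPDATE republishes the event,
    // hence at-least-once delivery: consumers must dedupe by event_id
    await db.query("UPDATE outbox SET published = true WHERE event_id = $1", [row.event_id])
  }
}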

Figure: Transactional outbox — the business row and the event payload commit in one local ACID transaction; a separate poller or CDC stream publishes the event later, giving at-least-once delivery without 2PC.

Guarantees:

  • At-least-once delivery: If publishing succeeds, event is delivered (possibly multiple times)
  • Eventual consistency: Downstream sees changes after publish delay
  • No 2PC required: Single database transaction + async delivery

Consumer idempotency: Since events may be delivered multiple times, consumers must be idempotent. Include event ID; track processed IDs.

Event Sourcing Approach

Event sourcing stores state as a sequence of events rather than current state. This naturally integrates with the outbox pattern.

Mechanism:

Text
Traditional:
  UPDATE accounts SET balance = 150 WHERE id = 'A'

Event Sourced:
  Event Store: [
    { type: 'AccountCreated', accountId: 'A', initialBalance: 100 },
    { type: 'MoneyDeposited', accountId: 'A', amount: 50 }
  ]
  Current State: Computed by replaying events (balance = 150)
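
The "replay" step is just a fold over the event list, as in this TypeScript sketch mirroring the example above:

TypeScript
// event-sourcing-sketch.ts
type AccountEvent =
  | { type: "AccountCreated"; accountId: string; initialBalance: number }
  | { type: "MoneyDeposited"; accountId: string; amount: number }

function currentBalance(events: AccountEvent[]): number {
  return events.reduce((balance, e) => {
    switch (e.type) {
      case "AccountCreated": return e.initialBalance
      case "MoneyDeposited": return balance + e.amount
    }
  }, 0)
}

// Matches the example above: created with 100, deposited 50
currentBalance([
  { type: "AccountCreated", accountId: "A", initialBalance: 100 },
  { type: "MoneyDeposited", accountId: "A", amount: 50 },
]) // => 150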

Benefits for transactions:

  • Events are immutable—no update anomalies
  • Event log IS the outbox—no dual write
  • Point-in-time recovery by replaying to any timestamp
  • Audit trail built-in

Trade-offs:

  • Natural event publishing, audit trail, temporal queries
  • Complex querying (must replay or maintain projections)
  • Schema evolution of events is challenging
  • Eventual consistency for read models

Real-world: Banking systems (immutable transactions fit naturally), CQRS architectures, audit-heavy domains.

Real-World Examples and Trade-offs

When Companies Choose Distributed Transactions

Google Spanner: Global financial systems, multi-region consistency requirements. Spanner’s TrueTime investment (GPS, atomic clocks in every datacenter) is justified by the need for external consistency across continents.

CockroachDB: teams wanting Spanner-like guarantees without Google’s hardware. Uses Hybrid Logical Clocks with NTP and a configurable max-offset (default 500 ms). Reads landing inside the uncertainty interval restart at a higher timestamp; nodes that drift past 80% of max-offset shut themselves down to protect ACID. The trade is “no atomic clocks, but more transaction restarts and a wider uncertainty window than Spanner’s single-digit-millisecond ε.”

When Companies Avoid Transactions

Netflix: Uses Cassandra (eventually consistent) for viewing history. Reasoning: “Since we are dealing with movie data and not financial data, the world won’t end if there are temporary inconsistencies.” They run continuous consistency checkers to detect and fix divergence.

Amazon: Strong consistency for cart/checkout, eventual consistency for product catalog. The shopping cart uses Dynamo’s quorum reads for checkout but tolerates stale inventory counts on product pages.

Stripe: Idempotency Keys for Payment Reliability

Stripe’s approach to transaction reliability uses idempotency keys rather than distributed transactions:

Mechanism:

Text
POST /v1/charges
Idempotency-Key: "order-123-attempt-1"

// First request: processes charge, stores result with key
// Retry (same key): returns stored result, no double-charge

Implementation:

  1. Client generates a unique idempotency key per logical attempt.
  2. Server checks whether the key has been seen.
  3. If it has: return the cached result with the original status code.
  4. If it is new: process the request, store the result keyed by the idempotency key.
  5. Stripe retains keys for at least 24 hours, then prunes them; later retries with the same key are treated as new requests.
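
Server-side, the bookkeeping looks roughly like this sketch. An in-memory Map stands in for the durable key store, and none of this is Stripe's actual implementation:

TypeScript
// idempotency-key-sketch.ts
interface StoredResult { status: number; body: string; expiresAt: number }

const results = new Map<string, StoredResult>()   // a durable DB/Redis in practice
const TTL_MS = 24 * 60 * 60 * 1000                // mirrors a >= 24h retention window

async function handleCharge(
  idempotencyKey: string,
  processCharge: () => Promise<{ status: number; body: string }>,
) {
  const seen = results.get(idempotencyKey)
  if (seen && seen.expiresAt > Date.now()) {
    return { status: seen.status, body: seen.body }  // replay: no double charge
  }
  // NOTE: a real store must also lock the key so two concurrent
  // requests with the same key cannot both reach this point
  const result = await processCharge()
  results.set(idempotencyKey, { ...result, expiresAt: Date.now() + TTL_MS })
  return result
}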

Why this works for payments:

  • Retries are safe (network failures don’t cause double charges)
  • No distributed transaction with Stripe’s banking partners
  • Local ACID + idempotency achieves practical atomicity

Trade-off: requires disciplined client-side key management. Once Stripe’s retention window passes, a retry with the same key is no longer deduplicated — long retry queues need their own idempotency layer on top.

Write skew in the wild: While specific public incidents are rare, write skew bugs are common in booking systems. A classic pattern: two users simultaneously booking the last seat, both checking availability before either writes. Without serializable isolation, both succeed.

The fsync saga (2018): PostgreSQL’s assumption that failed fsync could be retried was wrong on Linux. After a storage error, retrying fsync might return success while data was actually lost. Discovery led to PostgreSQL treating fsync failures as fatal—a database restart is required.

How to Choose

Decision Framework

| Requirement | Solution |
| --- | --- |
| Single database, strong consistency | Use highest isolation level needed (often Serializable for financial) |
| Single database, high throughput | Read Committed + application-level conflict handling |
| Cross-service atomicity, low latency tolerance | Saga pattern with idempotent compensations |
| Database + message broker atomicity | Transactional outbox + CDC |
| Global strong consistency, can invest in infrastructure | Spanner or CockroachDB |
| Eventual consistency acceptable | Leaderless replication (Cassandra, DynamoDB) |

Isolation Level Selection Guide

| Anomaly Risk | Minimum Level | Notes |
| --- | --- | --- |
| Dirty reads unacceptable | Read Committed | Almost always—dirty reads cause application bugs |
| Non-repeatable reads problematic | Repeatable Read | Analytics, reports, multi-step reads |
| Phantoms problematic | Serializable (PostgreSQL RR prevents phantoms) | Aggregations, existence checks |
| Write skew possible | Serializable | Any read-then-write pattern with constraints |

Performance Considerations

Serializable isn’t always slow: SSI (PostgreSQL) has lower overhead than pessimistic 2PL. Benchmark your workload—the difference may be smaller than expected.

fsync tuning: synchronous_commit = off provides most of fsync performance benefit with bounded data loss risk (no corruption). Acceptable for many OLTP workloads.

Hot path optimization: Use strong consistency for critical paths (payment, inventory decrement); relax for read-heavy paths (product display, recommendations).

Common Pitfalls

1. Assuming Default Isolation Is Sufficient

The mistake: Using database default (often Read Committed) without analyzing anomaly risk.

Why it happens: “It’s the default, must be safe.”

The consequence: Lost updates, write skew, inconsistent reads in reports.

The fix: Analyze each transaction’s read/write patterns. Identify where anomalies cause business impact. Explicitly choose isolation level.

2. Distributed Transactions in Microservices

The mistake: Using 2PC across microservices for “consistency.”

Why it happens: Attempting to preserve monolith transaction guarantees after decomposition.

The consequence: Tight coupling, availability degradation, coordinator bottleneck.

The fix: Accept eventual consistency with sagas. Design compensating transactions. Embrace idempotency.

3. Ignoring Compensation Complexity in Sagas

The mistake: Assuming any operation can be “undone.”

Why it happens: Treating compensation as simple reversal.

The consequence: Sent emails can’t be un-sent. Charged cards require refunds (not reversal). Published events can’t be un-published.

The fix: Design for semantic reversal, not mechanical undo. “Cancel order” is not “DELETE FROM orders.”

4. Missing Idempotency in Distributed Systems

The mistake: Retrying failed operations without idempotency protection.

Why it happens: “Retries make it reliable.”

The consequence: Double charges, duplicate orders, overcounted metrics.

The fix: Every retry-able operation needs an idempotency key. Track processed keys. Design operations to be safe for re-execution.

5. fsync Disabled in Production

The mistake: Disabling fsync for performance without understanding risk.

Why it happens: “10x faster writes!”

The consequence: Unrecoverable data corruption after power failure. Not just data loss—the entire database may become unusable.

The fix: Never disable fsync in production. Use synchronous_commit = off for performance with acceptable bounded data loss.

Conclusion

Transactions provide essential guarantees for reliable data operations, but those guarantees come with trade-offs:

Single-node ACID is well-understood:

  • WAL provides atomicity and durability with excellent performance
  • MVCC enables high concurrency without read/write blocking
  • Isolation levels trade consistency for throughput—choose based on anomaly risk

Distributed transactions are expensive:

  • 2PC adds latency and blocks on coordinator failure
  • 3PC doesn’t solve network partition problems
  • Spanner’s TrueTime achieves external consistency but requires specialized infrastructure

Practical alternatives often suffice:

  • Sagas provide eventual atomicity for microservices
  • Outbox pattern solves dual-write reliably
  • Idempotency keys enable safe retries without distributed coordination

The right choice depends on:

  1. Consistency requirements: What’s the cost of anomalies?
  2. Latency budget: How much coordination overhead is acceptable?
  3. Failure tolerance: Can the system block waiting for recovery?

Most systems need different transaction strategies for different data—strong consistency for critical paths, eventual consistency for others.

Appendix

Prerequisites

  • Basic understanding of database operations (SQL, transactions)
  • Familiarity with distributed systems concepts (replication, partitioning)
  • Understanding of failure modes (crashes, network partitions)

Terminology

  • WAL (Write-Ahead Log): Append-only log of changes for atomicity and recovery
  • MVCC (Multi-Version Concurrency Control): Maintaining multiple row versions for concurrent access
  • SSI (Serializable Snapshot Isolation): Optimistic serializable isolation via conflict detection
  • 2PC (Two-Phase Commit): Distributed commit protocol with prepare and commit phases
  • Saga: Sequence of local transactions with compensating actions for rollback
  • Outbox Pattern: Storing events in database table for reliable async publishing
  • Idempotency: Property where repeated execution has same effect as single execution
  • External Consistency: Transactions ordered by real-time (stronger than serializability)
  • TrueTime: Google’s API providing bounded clock uncertainty via GPS/atomic clocks
  • Compensation: Operation that semantically reverses an earlier operation

Summary

  • ACID properties address distinct failure modes: atomicity (partial failures), consistency (invariant violations), isolation (concurrency anomalies), durability (data loss)
  • WAL is the dominant atomicity mechanism: log first, apply later; ARIES recovery repeats history then undoes selectively
  • Isolation is a spectrum: Read Committed for most OLTP; Serializable for financial/constraint-heavy workloads
  • PostgreSQL/MySQL exceed SQL standard: Both prevent phantoms at Repeatable Read; PostgreSQL never allows dirty reads
  • 2PC blocks on coordinator failure: Acceptable for batch operations; problematic for real-time systems
  • Sagas trade isolation for availability: No distributed locking, but requires idempotent compensations
  • Outbox pattern solves dual-write: Single database transaction + async delivery = reliable event publishing
  • Idempotency is essential: Any retry-able operation needs duplicate protection


Footnotes

  1. LWN: PostgreSQL’s fsync() surprise (2018) and PostgreSQL wiki: Fsync Errors.

  2. PostgreSQL Documentation: 13.2. Transaction Isolation — “in PostgreSQL READ UNCOMMITTED is treated as READ COMMITTED” and “the Repeatable Read isolation level only allows … reads of data that were committed before the transaction began”, which excludes phantoms.

  3. Jepsen analysis: FaunaDB 2.5.4 confirms the Calvin lineage.