Event Sourcing
A deep-dive into event sourcing: understanding the core pattern, implementation variants, snapshot strategies, schema evolution, and production trade-offs across different architectures.
Abstract
Event sourcing stores application state as a sequence of immutable events rather than current state. The core insight: events are facts that happened—they cannot be deleted or modified, only appended. State is derived by replaying events (a “left fold” over the event stream).
Key mental model:
- Write path: Commands → Validation → Event(s) → Append to stream
- Read path: Subscribe to events → Project into read-optimized models
- Recovery: Snapshot + Replay events since snapshot = Current state
Design decisions after choosing event sourcing:
- Stream granularity: Per-aggregate (common) vs shared streams (rare)
- Projection strategy: Synchronous (strong consistency) vs async (eventual)
- Snapshot policy: Event-count vs time-based vs state-triggered
- Schema evolution: Upcasting (transform on read) vs stream transformation (migrate data)
Event sourcing is not CQRS (Command Query Responsibility Segregation), but event sourcing effectively requires CQRS: reads come from projections, not from the event store.
The Problem
Why Naive Solutions Fail
Approach 1: Mutable state in a database (CRUD)
- Fails because: Current state overwrites history. You cannot answer “what was the balance at 3pm yesterday?” without separate audit logging.
- Example: A bank account shows $500. A customer disputes a charge. Without event history, proving what transactions occurred requires mining application logs—if they exist.
Approach 2: Adding audit tables alongside CRUD
- Fails because: Audit tables and current state can diverge. The audit is a second-class citizen—queries hit current state, audits are bolted on.
- Example: A bug updates the balance without writing to the audit table. Now the audit and state disagree. Which is correct?
Approach 3: Database triggers for audit logging
- Fails because: Triggers capture “what changed,” not “why it changed.” You see “balance changed from 600 to 500” but not “customer withdrew $100 at ATM #4421.”
- Example: Debugging why an order was cancelled. Trigger logs show “status changed to CANCELLED” but not “cancelled by customer due to shipping delay.”
The Core Challenge
The fundamental tension: optimized writes vs queryable history. CRUD optimizes for writes (overwrite current state) but sacrifices history. Audit logging preserves history but creates dual-write complexity and semantic loss.
Event sourcing exists to make events the single source of truth—current state becomes a derived view that can be rebuilt at any time.
Pattern Overview
Core Mechanism
Every state change is captured as an immutable event appended to an event stream. Current state is derived by replaying events from the beginning (or from a snapshot).
From Martin Fowler’s foundational definition:
“Capture all changes to an application state as a sequence of events.”
Events are structured facts:
```typescript
// Event structure
interface DomainEvent {
  eventId: string          // Unique identifier
  streamId: string         // Aggregate this event belongs to
  eventType: string        // e.g., "OrderPlaced", "ItemAdded"
  data: unknown            // Event-specific payload
  metadata: {
    timestamp: Date
    version: number        // Position in stream
    causationId?: string   // What caused this event
    correlationId?: string // Request/saga identifier
  }
}
```
State reconstruction follows a deterministic left-fold:
```typescript
currentState = events.reduce(applyEvent, initialState)
```
Key Invariants
- Append-only: Events are never updated or deleted. The event store is an immutable log.
- Deterministic replay: Given the same events in the same order, replay produces identical state.
- Event ordering: Events within a stream are totally ordered by version number.
- Idempotent projections: Projections must handle duplicate event delivery gracefully.
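The ordering and concurrency invariants are typically enforced with an optimistic-concurrency check on append: the writer states which stream version it expects, and the store rejects the append if another writer got there first. Below is a minimal sketch of that write path with a retry loop; the `EventStore` shape and `ConcurrencyError` are illustrative assumptions, not the API of any specific library.

```typescript
// Sketch: optimistic-concurrency append with retry (names are illustrative)
class ConcurrencyError extends Error {}

interface EventStore {
  readStream(streamId: string): Promise<DomainEvent[]>
  // Appends only if the stream is still at `expectedVersion`; otherwise throws ConcurrencyError
  append(streamId: string, expectedVersion: number, events: DomainEvent[]): Promise<void>
}

async function executeWithRetry(
  store: EventStore,
  streamId: string,
  decide: (history: DomainEvent[]) => DomainEvent[],
  maxAttempts = 3,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const history = await store.readStream(streamId) // re-read on every attempt
    const newEvents = decide(history)                // re-run business logic against fresh state
    try {
      await store.append(streamId, history.length, newEvents)
      return
    } catch (err) {
      if (err instanceof ConcurrencyError && attempt < maxAttempts) continue // lost the race; retry
      throw err
    }
  }
}
```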
Failure Modes
| Failure | Impact | Mitigation |
|---|---|---|
| Event store unavailable | Commands blocked; reads from projections may continue | Multi-region replication; read replicas |
| Projection lag | Stale read models | Monitor lag; circuit breaker on staleness threshold |
| Event schema mismatch | Projection crashes or produces incorrect state | Schema registry; upcasting; versioned projections |
| Unbounded event growth | Replay becomes prohibitively slow | Snapshots; event archival to cold storage |
| Concurrent writes to same stream | Optimistic concurrency violation | Retry with conflict resolution; smaller aggregates |
Design Paths
Path 1: Pure Event Sourcing
When to choose this path:
- Audit trail is a regulatory requirement (finance, healthcare)
- Temporal queries are core functionality (“what was the state on date X?”)
- Domain naturally expresses state changes as events
- You need to rebuild projections for new query patterns
Key characteristics:
- Events are the only source of truth
- All read models are projections derived from events
- No direct database writes outside the event store
- State can be fully rebuilt from events at any time
Implementation approach:
```typescript
// Pure event sourcing: aggregate loaded from events
interface Aggregate<TState, TEvent> {
  state: TState
  version: number
}

// Command handler pattern
async function handle<TState, TEvent>(
  command: Command,
  loadAggregate: (id: string) => Promise<Aggregate<TState, TEvent>>,
  decide: (state: TState, command: Command) => TEvent[],
  appendEvents: (streamId: string, expectedVersion: number, events: TEvent[]) => Promise<void>,
): Promise<void> {
  const aggregate = await loadAggregate(command.aggregateId)
  const newEvents = decide(aggregate.state, command)
  await appendEvents(command.aggregateId, aggregate.version, newEvents)
}

// Loading aggregate from event store
async function loadAggregate<TState, TEvent>(
  streamId: string,
  eventStore: EventStore,
  evolve: (state: TState, event: TEvent) => TState,
  initialState: TState,
): Promise<Aggregate<TState, TEvent>> {
  const events = await eventStore.readStream(streamId)
  const state = events.reduce(evolve, initialState)
  return { state, version: events.length }
}
```
Real-world example:
LMAX Exchange processes 6 million orders per second using pure event sourcing. Their Business Logic Processor runs entirely in-memory, single-threaded. Events are the recovery mechanism—on restart, they replay from last snapshot plus subsequent events to reconstruct state in milliseconds.
Key implementation details:
- Ring buffer (Disruptor) handles I/O concurrency: 20 million slots for input
- Latency: sub-50 nanosecond event processing (mean latency 3 orders of magnitude below queue-based approaches)
- Nightly snapshots; BLP restarts every night with zero downtime via replication
Trade-offs vs other paths:
| Aspect | Pure Event Sourcing | Hybrid (ES + CRUD) |
|---|---|---|
| Audit completeness | Full history guaranteed | Partial—only ES domains audited |
| Query flexibility | Build any projection from events | Some data may not be projectable |
| Complexity | Higher—must project all reads | Lower—CRUD for simple domains |
| Migration cost | Very high (all-or-nothing) | Incremental |
| Team expertise required | Event-driven thinking throughout | Can isolate ES to specific bounded contexts |
Path 2: Hybrid Event Sourcing (ES + CRUD)
When to choose this path:
- Some domains require audit trails; others don’t
- Team has mixed experience with event-driven architecture
- Migrating from existing CRUD system incrementally
- Read-heavy workloads where projection lag is unacceptable for some data
Key characteristics:
- Core business domains use event sourcing (orders, transactions, inventory)
- Supporting domains use CRUD (user preferences, product catalog)
- Clear boundaries between ES and CRUD contexts
- Events may be published from CRUD systems for integration
Implementation approach:
```typescript
// Hybrid: Event-sourced orders, CRUD product catalog

// Order aggregate (event-sourced)
class OrderAggregate {
  status: string = "new"
  items: OrderItem[] = []

  apply(event: OrderEvent): void {
    switch (event.type) {
      case "OrderPlaced":
        this.status = "placed"
        this.items = event.data.items
        break
      case "OrderShipped":
        this.status = "shipped"
        break
    }
  }
}

// Product service (CRUD with event publishing)
class ProductService {
  constructor(
    private db: Database,       // relational store for catalog data
    private eventBus: EventBus, // integration events only
  ) {}

  async updatePrice(productId: string, newPrice: number): Promise<void> {
    // CRUD update
    await this.db.products.update(productId, { price: newPrice })

    // Publish event for downstream consumers (not source of truth)
    await this.eventBus.publish({ type: "PriceUpdated", productId, newPrice })
  }
}
```
Real-world example:
Walmart’s Inventory Availability System uses hybrid event sourcing. The inventory domain is fully event-sourced to track every stock movement. Product catalog data (descriptions, images) remains in traditional databases. Events from inventory updates feed projections that power real-time availability queries for millions of customers.
Key details:
- Events partitioned by product-node ID for scalability
- Command processor validates business rules before event emission
- Read and write sides use different tech stacks, scaled independently
Trade-offs vs pure ES:
| Aspect | Hybrid | Pure ES |
|---|---|---|
| Migration path | Gradual, bounded context by context | Big-bang or strangler pattern |
| Team adoption | Easier—ES expertise not required everywhere | Requires organization-wide buy-in |
| Consistency | Mixed—ES domains eventual, CRUD immediate | Eventual consistency throughout |
| Operational complexity | Higher—two paradigms to operate | Single paradigm (but complex) |
Path 3: Event Sourcing with Synchronous Projections
When to choose this path:
- Strong consistency between writes and reads required
- Projection latency is unacceptable
- Lower throughput acceptable for consistency guarantee
- Single-node or low-scale deployments
Key characteristics:
- Projections updated in same transaction as event append
- Read model immediately reflects write
- No eventual consistency concerns
- Limited scalability—projection work blocks command processing
Implementation approach:
```typescript
// Synchronous projection: update read model in same transaction
async function handleCommand(command: PlaceOrderCommand): Promise<void> {
  await db.transaction(async (tx) => {
    // 1. Load aggregate
    const order = await loadFromEvents(tx, command.orderId)

    // 2. Execute business logic
    const events = order.place(command.items)

    // 3. Append events
    await appendEvents(tx, command.orderId, events)

    // 4. Update projections synchronously
    for (const event of events) {
      await updateOrderListProjection(tx, event)
      await updateInventoryProjection(tx, event)
    }
  })
  // Transaction commits: events + projections atomically
}
```
Real-world example:
Marten (for .NET/PostgreSQL) supports inline projections that run within the same database transaction as event appends. This guarantees that when a command succeeds, the read model is immediately consistent.
From Marten documentation: inline projections are “run as part of the same transaction as the events being captured and updated in the underlying database as well.”
Trade-offs:
| Aspect | Synchronous Projections | Async Projections |
|---|---|---|
| Consistency | Strong—reads reflect writes immediately | Eventual—lag between write and read |
| Throughput | Lower—projection work blocks writes | Higher—writes return without waiting |
| Scalability | Limited by projection cost | Projections scale independently |
| Failure isolation | Projection failure fails the write | Write succeeds; projection can retry |
Path 4: Event Sourcing with Asynchronous Projections
When to choose this path:
- High write throughput required
- Eventual consistency acceptable
- Projections are expensive (aggregations, joins)
- Need to scale reads independently of writes
Key characteristics:
- Events appended to store; command returns immediately
- Background processes subscribe to event streams
- Projections catch up asynchronously
- Must handle out-of-order delivery, duplicates, and lag
Implementation approach:
```typescript
// Async projection with checkpointing
interface ProjectionState {
  lastProcessedPosition: number
  // ...projection-specific state
}

async function runProjection(
  subscription: EventSubscription,
  project: (state: ProjectionState, event: DomainEvent) => ProjectionState,
  checkpoint: (position: number) => Promise<void>,
): Promise<void> {
  let state = await loadProjectionState()

  for await (const event of subscription.fromPosition(state.lastProcessedPosition)) {
    state = project(state, event)

    // Checkpoint periodically (not every event—batching for performance)
    if (event.position % 100 === 0) {
      await checkpoint(event.position)
    }
  }
}

// Idempotent projection handling duplicates
function projectOrderPlaced(state: OrderListState, event: OrderPlacedEvent): OrderListState {
  // Skip if already processed (idempotency)
  if (state.processedEventIds.has(event.eventId)) {
    return state
  }

  return {
    ...state,
    orders: [...state.orders, { id: event.data.orderId, status: "placed" }],
    processedEventIds: state.processedEventIds.add(event.eventId),
  }
}
```
Real-world example:
Netflix’s Downloads feature (launched November 2016) uses Cassandra-backed event sourcing with async projections. The team built this in six months, transforming from a stateless to stateful service.
Key implementation details:
- Three components: Event Store, Aggregate Repository, Aggregate Service
- Projections rebuild on-demand for debugging
- Horizontal scaling limited only by disk space
Trade-offs:
| Aspect | Async Projections | Sync Projections |
|---|---|---|
| Write latency | Low—return immediately | Higher—wait for projections |
| Read freshness | Eventual (seconds to minutes lag) | Immediate |
| Failure isolation | Yes—projection failures don’t block writes | No—projection failure fails command |
| Complexity | Higher—handle duplicates, ordering, lag | Lower—transactional guarantees |
Decision Framework
Snapshot Strategies
Replaying thousands of events for every command becomes prohibitive. Snapshots store materialized state at a point in time, enabling replay from snapshot + recent events.
When to Snapshot
From the EventStoreDB team: “Do not add snapshots right away. They are not needed at the beginning.”
Add snapshots when:
- Aggregate event counts exceed hundreds (measure first)
- Load times impact user experience or SLAs
- Cold start times are unacceptable
Snapshot Approaches
| Strategy | Trigger | Pros | Cons |
|---|---|---|---|
| Event-count | Every N events (e.g., 100) | Predictable load times | May snapshot unnecessarily |
| Time-based | Every N hours/days | Operational simplicity | Variable event counts between snapshots |
| State-triggered | On significant transitions (draft→published) | Snapshots at natural boundaries | Requires domain knowledge |
| On-demand | When load time exceeds threshold | Only when needed | First slow load before snapshot exists |
Implementation
```typescript
// Snapshot-aware aggregate loading
interface Snapshot<TState> {
  state: TState
  version: number
  schemaVersion: number // For schema evolution
}

async function loadWithSnapshot<TState, TEvent>(
  streamId: string,
  eventStore: EventStore,
  snapshotStore: SnapshotStore,
  evolve: (state: TState, event: TEvent) => TState,
  initialState: TState,
): Promise<{ state: TState; version: number }> {
  // Try loading snapshot first
  const snapshot = await snapshotStore.load<TState>(streamId)

  const startVersion = snapshot?.version ?? 0
  const startState = snapshot?.state ?? initialState

  // Replay only events after snapshot
  const events = await eventStore.readStream(streamId, { fromVersion: startVersion + 1 })
  const state = events.reduce(evolve, startState)

  return { state, version: startVersion + events.length }
}

// Snapshot after command if threshold exceeded
async function maybeSnapshot<TState>(
  streamId: string,
  state: TState,
  version: number,
  snapshotStore: SnapshotStore,
  threshold: number = 100,
): Promise<void> {
  const lastSnapshot = await snapshotStore.getVersion(streamId)
  if (version - lastSnapshot > threshold) {
    await snapshotStore.save(streamId, { state, version, schemaVersion: 1 })
  }
}
```
Schema Evolution for Snapshots
When the snapshot schema changes:
- Increment `schemaVersion` in new snapshots
- On load, check `schemaVersion`
- If it is outdated, discard the snapshot and rebuild state from events
This is simpler than migrating snapshots—events are the source of truth; snapshots are an ephemeral optimization.
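A minimal sketch of that check, reusing the `Snapshot`/`SnapshotStore` shapes from the snippet above; `CURRENT_SNAPSHOT_SCHEMA` is an assumed constant, not from the original.

```typescript
// Discard stale snapshots rather than migrating them; the caller falls back to full replay
const CURRENT_SNAPSHOT_SCHEMA = 2 // illustrative value

async function loadUsableSnapshot<TState>(
  streamId: string,
  snapshotStore: SnapshotStore,
): Promise<Snapshot<TState> | null> {
  const snapshot = await snapshotStore.load<TState>(streamId)
  if (!snapshot) return null
  // Outdated schema: ignore the snapshot; a fresh one will be written after the next command
  return snapshot.schemaVersion === CURRENT_SNAPSHOT_SCHEMA ? snapshot : null
}
```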
Event Schema Evolution
Events are immutable and stored forever. Schema changes must be backward-compatible or handled via transformation.
Core Principle
From Greg Young (EventStoreDB creator):
“A new version of an event must be convertible from the old version. If not, it is not a new version but a new event.”
Evolution Strategies
Strategy 1: Additive Changes (Preferred)
Add optional fields; never remove or rename.
```typescript
// Version 1
interface OrderPlacedV1 {
  orderId: string
  customerId: string
  items: OrderItem[]
}

// Version 2: Added optional field
interface OrderPlacedV2 {
  orderId: string
  customerId: string
  items: OrderItem[]
  discountCode?: string // New optional field
}

// Projection handles both
function projectOrder(event: OrderPlacedV1 | OrderPlacedV2): Order {
  return {
    id: event.orderId,
    customerId: event.customerId,
    items: event.items,
    discountCode: "discountCode" in event ? event.discountCode : undefined,
  }
}
```
Strategy 2: Upcasting (Transform on Read)
Transform old events to new schema when reading.
```typescript
// Upcaster transforms old schema to current schema
type Upcaster<TOld, TNew> = (old: TOld) => TNew

const orderPlacedUpcasters = new Map<number, Upcaster<any, OrderPlacedV3>>([
  [
    1,
    (v1: OrderPlacedV1) => ({
      ...v1,
      discountCode: undefined,
      source: "unknown", // Default for old events
    }),
  ],
  [
    2,
    (v2: OrderPlacedV2) => ({
      ...v2,
      source: "unknown",
    }),
  ],
])

function upcast(event: StoredEvent): DomainEvent {
  const upcaster = orderPlacedUpcasters.get(event.schemaVersion)
  return upcaster ? upcaster(event.data) : event.data
}
```
When to use: Schema changes are frequent; you want projections to only handle the latest version.
Trade-off: CPU cost on every read; upcaster chain grows over time.
Strategy 3: Stream Transformation (Migrate Data)
Rewrite the event stream with transformed events during a release window.
When to use: Breaking changes that upcasting cannot handle; reducing technical debt from long upcaster chains.
Trade-off: Requires downtime or careful blue-green deployment; must preserve event IDs and ordering.
Warning from Greg Young: “Updating an existing event can cause large problems.” Prefer stream transformations (copy with transform) over in-place modification.
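A copy-with-transform migration might look like the sketch below. The `readAll`/`appendAll` callbacks are illustrative stand-ins for whatever the event store exposes; preserving event IDs and switching readers to the new stream are the deployment-specific parts.

```typescript
// Sketch: copy a stream into a new one, transforming events along the way
async function transformStream(
  readAll: (streamId: string) => Promise<StoredEvent[]>,
  appendAll: (streamId: string, events: StoredEvent[]) => Promise<void>,
  sourceStreamId: string,
  targetStreamId: string,
  transform: (event: StoredEvent) => StoredEvent,
): Promise<void> {
  const events = await readAll(sourceStreamId)

  // Preserve original order (and ideally event IDs); only the payload/schema changes
  const transformed = events.map(transform)

  await appendAll(targetStreamId, transformed)
  // After verification, switch readers to the new stream and archive the old one
}
```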
Schema Registry
For production systems, use a schema registry:
- Store event schemas with versions
- Validate events against schema on write
- Reject events that don’t conform
- Generate types from schemas (Avro, Protobuf, JSON Schema)
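A hand-rolled, in-process sketch of validate-on-write is shown below; a production system would more likely generate validators from Avro/Protobuf/JSON Schema. The `SchemaRegistry` class here is purely illustrative.

```typescript
// Sketch: reject events whose payload does not match the registered schema version
type Validator = (data: unknown) => boolean

class SchemaRegistry {
  private schemas = new Map<string, Validator>() // key: "eventType:version"

  register(eventType: string, version: number, validate: Validator): void {
    this.schemas.set(`${eventType}:${version}`, validate)
  }

  // Throw before append if the payload does not conform to the registered schema
  assertValid(eventType: string, version: number, data: unknown): void {
    const validate = this.schemas.get(`${eventType}:${version}`)
    if (!validate) throw new Error(`No schema registered for ${eventType} v${version}`)
    if (!validate(data)) throw new Error(`${eventType} v${version} payload failed schema validation`)
  }
}

// Usage: register a validator and check on the write path
const registry = new SchemaRegistry()
registry.register("OrderPlaced", 2, (d) => {
  const e = d as { orderId?: unknown; customerId?: unknown; items?: unknown }
  return typeof e.orderId === "string" && typeof e.customerId === "string" && Array.isArray(e.items)
})
```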
Projections and Read Models
Projections transform event streams into query-optimized read models.
Projection Lifecycle (Marten Model)
| Type | Execution | Consistency | Use Case |
|---|---|---|---|
| Inline | Same transaction as events | Strong | Critical reads; simple projections |
| Async | Background process | Eventual | Complex aggregations; high throughput |
| Live | On-demand, not persisted | None (computed) | One-time analytics; debugging |
Building Projections
```typescript
// Projection as left-fold over events
interface Projection<TState> {
  initialState: TState
  apply: (state: TState, event: DomainEvent) => TState
}

// Order list projection for dashboard
const orderListProjection: Projection<OrderListState> = {
  initialState: { orders: [], totalRevenue: 0 },

  apply(state, event) {
    switch (event.type) {
      case "OrderPlaced":
        return {
          ...state,
          orders: [...state.orders, { id: event.data.orderId, status: "placed", total: event.data.total }],
          totalRevenue: state.totalRevenue + event.data.total,
        }
      case "OrderShipped":
        return {
          ...state,
          orders: state.orders.map((o) => (o.id === event.data.orderId ? { ...o, status: "shipped" } : o)),
        }
      case "OrderRefunded":
        return {
          ...state,
          orders: state.orders.map((o) => (o.id === event.data.orderId ? { ...o, status: "refunded" } : o)),
          totalRevenue: state.totalRevenue - event.data.refundAmount,
        }
      default:
        return state
    }
  },
}
```
Rebuilding Projections
Since events are the source of truth, projections can be rebuilt at any time:
- Drop existing projection state
- Replay all events through projection logic
- New projection reflects current business logic
Use cases:
- Bug fix in projection logic
- New read model for new query pattern
- Schema change in read database
Warning: Rebuilding from millions of events takes time. For Netflix Downloads, “re-runs took DAYS to complete” during development—mitigated by snapshots and event archival.
Cross-Projection Dependencies
Problem: Projection A depends on Projection B’s state. During replay, B may not be at the same position as A.
From Dennis Doomen’s production experience:
“Rebuilding projection after schema upgrade caused projector to read from another projection that was at a different state in the event store’s history.”
Solutions:
- Single-pass projections: Project to denormalized models; avoid joins at projection time
- Explicit dependencies: Projections declare dependencies; rebuild engine ensures correct ordering
- Position tracking: Each projection tracks which events it has seen; dependent projections wait
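One way to make the position-tracking option concrete: each projection records the last event position it has applied, and a dependent projection waits until its dependency has caught up to the position it is about to process. The `CheckpointStore` shape below is an assumption for illustration.

```typescript
// Sketch: a dependent projection waits for its dependency's checkpoint to catch up
interface CheckpointStore {
  getPosition(projectionName: string): Promise<number>
}

async function waitForDependency(
  checkpoints: CheckpointStore,
  dependencyName: string,
  requiredPosition: number,
  pollIntervalMs = 100,
): Promise<void> {
  // Block until the dependency projection has processed at least `requiredPosition`
  while ((await checkpoints.getPosition(dependencyName)) < requiredPosition) {
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs))
  }
}
```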
Event Store Technologies
EventStoreDB (KurrentDB)
Purpose-built for event sourcing.
Performance characteristics:
- 15,000+ writes/second
- 50,000+ reads/second
- Quorum-based replication (majority must acknowledge)
Strengths:
- Native event sourcing primitives (streams, subscriptions, projections)
- Optimistic concurrency per stream
- Persistent subscriptions with checkpointing
- Built-in projections in JavaScript
Limitations:
- No built-in sharding (must implement at application level)
- Performance bounded by disk I/O
- Read-only replicas require cluster connection
Apache Kafka
Designed for event streaming, not event sourcing.
Why Kafka is problematic for event sourcing (from Serialized.io):
- Topic scalability: Not designed for millions of topics (aggregates can easily reach millions)
- Entity loading: No practical way to load events for specific entity by ID within a topic
- No optimistic concurrency: Cannot do “save this event only if version is still X”
- Compaction destroys history: Log compaction keeps only latest value per key—antithetical to event sourcing
Where Kafka fits: Transport layer for events to downstream consumers. Use dedicated event store for source of truth, Kafka for integration.
PostgreSQL-Backed Stores
Use relational database as event store.
Implementation patterns:
- Transactional outbox: Events + state change in same transaction
- LISTEN/NOTIFY: PostgreSQL async notifications for real-time subscriptions
- Advisory locks: Application-level locking for projection synchronization
Libraries: Marten (.NET), pg-event-store (Node.js), Eventide (Ruby)
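A sketch of the transactional-outbox pattern from the list above: the event and an outbox row are inserted in the same transaction, and a separate relay process publishes unsent outbox rows to the broker. The `events`/`outbox` tables and the `Tx` shape are assumptions for illustration (the query signature follows node-postgres).

```typescript
// Sketch: append event + outbox row atomically; a relay publishes and marks rows as sent
interface Tx {
  query(sql: string, params?: unknown[]): Promise<{ rows: unknown[] }>
}

async function appendEventWithOutbox(tx: Tx, streamId: string, eventType: string, data: object): Promise<void> {
  // Same transaction: the event and its outbox entry either both exist or neither does
  await tx.query(
    `INSERT INTO events (stream_id, event_type, data) VALUES ($1, $2, $3)`,
    [streamId, eventType, JSON.stringify(data)],
  )
  await tx.query(
    `INSERT INTO outbox (stream_id, event_type, data, published) VALUES ($1, $2, $3, false)`,
    [streamId, eventType, JSON.stringify(data)],
  )
}
```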
Trade-offs:
| Aspect | PostgreSQL | EventStoreDB |
|---|---|---|
| Operational familiarity | High (existing expertise) | New technology to learn |
| Event sourcing primitives | Must build or use library | Native |
| Transactions | Full ACID with application data | Event store separate from app DB |
| Scaling | Traditional DB scaling | Purpose-built but limited sharding |
Cloud Services
AWS EventBridge: Event routing and orchestration; not an event store. 90+ AWS service integrations. Use for serverless event-driven architectures, not as source of truth.
Azure Event Hubs: High-volume event ingestion (millions/second). Kafka-compatible. Geo-disaster recovery. Similar caveats as Kafka for event sourcing.
Production Implementations
LMAX: 6 Million Orders/Second
Context: Financial exchange requiring ultra-low latency and complete audit trail.
Architecture:
- Business Logic Processor (BLP): Single-threaded, in-memory, event-sourced
- Disruptor: Lock-free ring buffer for I/O (25M+ messages/second, sub-50ns latency)
- Replication: Three BLPs process all input events (two in primary DC, one DR)
Specific details:
- Ring buffers: 20M slots (input), 4M slots (output each)
- Nightly snapshots; BLPs restart nightly with zero downtime
- Cryptographic operations offloaded from core logic
- Validation split: state-independent checks before BLP, state-dependent in BLP
What worked: In-memory event sourcing eliminated database as bottleneck.
What was hard: Debugging distributed replay; ensuring determinism in business logic.
Source: Martin Fowler - LMAX Architecture
Netflix Downloads: Six-Month Build
Context: November 2016 feature launch requiring stateful service (download tracking, licensing) built in months.
Architecture:
- Cassandra-backed event store
- Three components: Event Store, Aggregate Repository, Aggregate Service
- Async projections for query optimization
Specific details:
- Horizontal scaling: only disk space was the constraint
- Projections rebuilt during debugging/development
- Full auditing enabled rapid issue diagnosis
What worked: Event sourcing flexibility enabled pivoting requirements during development.
What was hard: Rebuild times during development (“re-runs took DAYS”)—mitigated by snapshotting.
Source: InfoQ - Scaling Event Sourcing for Netflix Downloads
Uber Cadence: Durable Execution at Scale
Context: Workflow orchestration for hundreds of applications across Uber.
Architecture:
- Event sourcing for workflow state (deterministic replay)
- Components: Front End, History Service, Matching Service
- Cassandra backend
Specific details:
- Multitenant clusters with hundreds of domains each
- Single service supports 100+ applications
- Worker Goroutines reduced from 16K to 100 per history host (95%+ reduction)
Key insight: Workflows are event-sourced—state reconstructed by replaying historical events. Cadence (and its fork, Temporal) demonstrates event sourcing for durable execution, not just data storage.
Source: Uber Engineering - Cadence Overview
Implementation Comparison
| Aspect | LMAX | Netflix Downloads | Uber Cadence |
|---|---|---|---|
| Domain | Financial exchange | Media licensing | Workflow orchestration |
| Scale | 6M orders/second | Not disclosed | Hundreds of domains per cluster |
| Consistency | Single-threaded | Eventual | Per-workflow |
| Storage | In-memory + disk | Cassandra | Cassandra |
| Snapshot strategy | Nightly | As needed | Per-workflow history |
| Team size | Small, specialized | Small (six-month build) | Platform team |
Common Pitfalls
1. Unbounded Event Growth
The mistake: Not planning for event retention or archival.
Example: Financial app storing every price tick (millions/day). After one year, rebuilding account balances requires replaying 3TB of events. Queries take minutes; cold starts are impossible.
Solutions:
- Snapshots: Every N events or time period
- Archival: Move events older than N days to cold storage (S3, Glacier)
- Temporal streams: Segment streams by time period (orders-2024-Q1)
2. Assuming Event Order Guarantees
The mistake: Building logic that assumes “event A always comes before event B.”
Reality (from production war stories): “Events are duplicated, delayed, reordered, partially processed, or replayed long after the context that created them has changed.”
Example: Projection assumes OrderPlaced arrives before OrderShipped. During backfill, events arrive out of order; projection crashes.
Solutions:
- Treat each event as independent input
- Build projections that handle any arrival order
- Use event timestamps for ordering within projections, not arrival order
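A sketch of an arrival-order-tolerant projection: each handler upserts partial state instead of assuming a prior event has already been applied. The `OrderRow` shape and event payloads are illustrative.

```typescript
// Sketch: the projection works whether OrderPlaced or OrderShipped arrives first
interface OrderRow {
  id: string
  status?: "placed" | "shipped"
  items?: OrderItem[]
  shippedAt?: Date
}

function applyOutOfOrder(rows: Map<string, OrderRow>, event: DomainEvent): void {
  const row = rows.get(event.streamId) ?? { id: event.streamId } // upsert: create a stub if unseen
  switch (event.eventType) {
    case "OrderPlaced":
      row.items = (event.data as { items: OrderItem[] }).items
      row.status = row.status ?? "placed" // don't clobber "shipped" if the shipment event arrived first
      break
    case "OrderShipped":
      row.status = "shipped"
      row.shippedAt = (event.data as { shippedAt: Date }).shippedAt
      break
  }
  rows.set(event.streamId, row)
}
```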
3. Dual-Write to Event Store and Downstream
The mistake: Writing event to store, then publishing to message queue in separate operations.
Example: appendEvent() succeeds, publishToKafka() fails. Event exists but downstream never sees it.
Solutions:
- Transactional outbox: Write event + outbox entry in same transaction; separate process polls outbox
- Change Data Capture (CDC): Database-level capture of event table changes
- Event store with built-in pub/sub: EventStoreDB subscriptions
4. Projection Complexity Explosion
The mistake: Building projections with complex joins, aggregations, and cross-stream queries.
Example: “Order revenue by region by product category by month” projection requires joining order events, product catalog, and region data. Rebuild takes hours.
Solutions:
- Denormalize aggressively at projection time
- Accept data duplication in read models
- Build multiple focused projections vs one complex one
- Consider whether query belongs in analytics layer (data warehouse)
5. Schema Migration Nightmares
The mistake: Changing event schemas without migration strategy.
Example (from production): Team added discount_code field to OrderPlaced events. Old projections ignored it. Replay applied 2024 logic to 2022 events, giving customers unintended discounts.
Solutions:
- Additive changes only: New optional fields with defaults
- Versioned projections: Tag projections with compatible schema versions
- Code version in events: Every event carries code version that created it
- Upcasters: Transform old events to current schema on read
6. “Time Travel Debugging” Oversold
The mistake: Expecting event sourcing to make debugging trivial.
Reality (from Chris Kiehl’s experience): “99% of the time ‘bad states’ were bad events caused by standard run-of-the-mill human error. Having a ledger provided little value over normal debugging intuition.”
Practical questions:
- How do you apply time-travel on production data?
- How do you inspect intermediate states if events are in binary format?
- How do you fix bugs for users already impacted?
Example: “Order stuck in pending for 3 hours” took 6 hours to debug—what would’ve been a 20-minute fix in CRUD required understanding the full event history.
Solution: Event sourcing provides audit capability, not magic debugging. Invest in:
- Event visualization tooling
- Projection debugging (show state at any event)
- Compensating event workflows for fixing bad state
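Fixing bad state means appending a compensating event rather than editing history. A minimal sketch, with a hypothetical `OrderStatusCorrected` event and an assumed `append` helper:

```typescript
// Sketch: correct a stuck order by appending a compensating event
// History stays intact; projections apply the correction like any other event
async function compensateStuckOrder(
  append: (streamId: string, events: DomainEvent[]) => Promise<void>,
  orderId: string,
  reason: string,
): Promise<void> {
  await append(orderId, [
    {
      eventId: crypto.randomUUID(),
      streamId: orderId,
      eventType: "OrderStatusCorrected", // hypothetical compensating event type
      data: { newStatus: "confirmed", reason },
      metadata: { timestamp: new Date(), version: 0 /* actual version assigned by the store on append */ },
    },
  ])
}
```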
GDPR and Data Deletion
Event sourcing’s immutability conflicts with “right to be forgotten.”
Crypto Shredding Pattern
This pattern was described by Mathias Verraes.
How it works:
- Store personal data encrypted in event store (not plaintext)
- Store encryption key separately (different database/filesystem)
- When user requests deletion, delete only the encryption key
- Encrypted data becomes permanently unreadable
Implementation:
```typescript
// Events store only encrypted personal data
interface OrderPlacedEvent {
  orderId: string
  customerRef: string           // Reference to customer, not PII
  encryptedCustomerData: string // AES-encrypted name, address
  items: OrderItem[]
}

// Key storage separate from event store
interface CustomerKeyStore {
  getKey(customerRef: string): Promise<CryptoKey | null>
  deleteKey(customerRef: string): Promise<void> // "Forget" customer
}

// Projection decrypts only if key exists
async function projectOrder(event: OrderPlacedEvent, keyStore: CustomerKeyStore): Promise<OrderReadModel> {
  const key = await keyStore.getKey(event.customerRef)

  const customerData = key
    ? await decrypt(event.encryptedCustomerData, key)
    : { name: "[deleted]", address: "[deleted]" }

  return {
    orderId: event.orderId,
    customerName: customerData.name,
    // ...
  }
}
```
Benefits:
- Event stream remains intact (immutable)
- Deleting key effectively removes data from all backups
- Existing events still queryable for non-personal data
Legal caveat: GDPR states encrypted personal information is still personal information. Some legal interpretations argue crypto shredding may not be fully compliant. Consult legal counsel.
Forgettable Payloads Alternative
Store personal data in separate database; events contain only references.
Trade-off: Requires joins at read time; personal data store becomes another system to maintain.
Conclusion
Event sourcing replaces mutable state with an immutable log of domain events. Current state is derived, not stored. This enables complete audit trails, temporal queries, and projection flexibility—at the cost of increased complexity, eventual consistency, and schema evolution challenges.
Key takeaways:
- Not a silver bullet: CQRS/ES adds complexity. Use only for bounded contexts where audit trails or temporal queries justify the cost.
- Design decisions multiply: Stream granularity, projection strategy, snapshot policy, and schema evolution must all be decided.
- Production realities differ from theory: Real systems handle out-of-order events, duplicate delivery, projection lag, and schema migrations.
- Start simple: No snapshots initially. Synchronous projections if consistency matters. Add complexity only when measured.
- Kafka is not an event store: Purpose-built stores (EventStoreDB) or database-backed solutions (Marten, PostgreSQL) are better fits.
Appendix
Prerequisites
- Understanding of domain-driven design (aggregates, bounded contexts)
- Familiarity with distributed systems consistency models (eventual vs strong)
- Basic understanding of CQRS (Command Query Responsibility Segregation)
Terminology
| Term | Definition |
|---|---|
| Event | Immutable fact representing something that happened in the domain |
| Stream | Ordered sequence of events for an aggregate |
| Projection | Read model derived by applying events to initial state |
| Snapshot | Materialized state at a point in time; optimization for replay |
| Upcasting | Transforming old event schema to current schema on read |
| Aggregate | DDD concept; consistency boundary containing related entities |
| Left-fold | Reducing a sequence to a single value by applying a function cumulatively |
| Optimistic concurrency | Conflict detection by checking version hasn’t changed since read |
| Transactional outbox | Pattern for reliable event publishing within database transaction |
Summary
- Core pattern: State = replay(events). Events are immutable facts; current state is derived.
- Design paths: Pure ES vs hybrid; sync vs async projections. Choose based on consistency and throughput requirements.
- Snapshots: Add when replay time becomes problematic. Event-count or time-based triggers.
- Schema evolution: Additive changes preferred. Upcasting for breaking changes. Never modify existing events in place.
- Production reality: Handle out-of-order events, duplicates, projection lag, and unbounded growth.
- GDPR: Crypto shredding pattern separates encryption keys from encrypted data.
References
Foundational Sources
- Martin Fowler - Event Sourcing - Original definition and patterns
- Martin Fowler - CQRS - CQRS relationship and warnings
- Greg Young - CQRS Documents - Comprehensive CQRS/ES guide from the pattern’s creator
- Microsoft - Exploring CQRS and Event Sourcing - Reference implementation with Greg Young foreword
Production Case Studies
- Martin Fowler - LMAX Architecture - 6M orders/second event sourcing
- Netflix Tech Blog - Scaling Event Sourcing for Downloads - Six-month build case study
- Walmart Global Tech - Inventory Availability System - Hybrid event sourcing at scale
- Uber Engineering - Cadence Overview - Event sourcing for durable execution
Implementation Patterns
- Event-Driven.io - Projections and Read Models - Projection lifecycle patterns
- Event-Driven.io - Event Versioning - Schema evolution strategies
- Domain Centric - Snapshotting - Snapshot strategy decision framework
- Marten Documentation - PostgreSQL-backed event sourcing for .NET
Failure Modes and Lessons
- Dennis Doomen - The Ugly of Event Sourcing - Production war stories
- Chris Kiehl - Event Sourcing is Hard - Debugging reality check
- Serialized.io - Why Kafka is Not for Event Sourcing - Technology selection guidance
GDPR and Compliance
- Mathias Verraes - Crypto Shredding Pattern - Data deletion strategy
- Event-Driven.io - GDPR in Event-Driven Architecture - Compliance patterns
Tools and Frameworks
- EventStoreDB Documentation - Purpose-built event store
- LMAX Disruptor - High-performance ring buffer
- Axon Framework - Java CQRS/ES toolkit (70M+ downloads)
- Temporal.io - Durable execution platform (Cadence fork)