Core Distributed Patterns
21 min read

Event Sourcing

A deep-dive into event sourcing: understanding the core pattern, implementation variants, snapshot strategies, schema evolution, and production trade-offs across different architectures.

Figure: event sourcing architecture. Commands are validated on the command side and produce events stored immutably in the event store; projections derive read models on the read side; state is rebuilt by replaying events from snapshots.

Event sourcing stores application state as a sequence of immutable events rather than current state. The core insight: events are facts that happened—they cannot be deleted or modified, only appended. State is derived by replaying events (a “left fold” over the event stream).

Key mental model:

  • Write path: Commands → Validation → Event(s) → Append to stream
  • Read path: Subscribe to events → Project into read-optimized models
  • Recovery: Snapshot + Replay events since snapshot = Current state
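To make the fold concrete, here is a minimal sketch with hypothetical bank-account events (the event types and amounts are illustrative, not from any library):

```typescript
// Hypothetical account events; state is derived, never stored
type AccountEvent =
  | { type: "Deposited"; amount: number }
  | { type: "Withdrawn"; amount: number }

// The "left fold": apply one event to the current state
function applyEvent(balance: number, event: AccountEvent): number {
  switch (event.type) {
    case "Deposited":
      return balance + event.amount
    case "Withdrawn":
      return balance - event.amount
  }
}

const stream: AccountEvent[] = [
  { type: "Deposited", amount: 600 },
  { type: "Withdrawn", amount: 100 },
]

const balance = stream.reduce(applyEvent, 0) // 500
```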

Design decisions after choosing event sourcing:

  1. Stream granularity: Per-aggregate (common) vs shared streams (rare)
  2. Projection strategy: Synchronous (strong consistency) vs async (eventual)
  3. Snapshot policy: Event-count vs time-based vs state-triggered
  4. Schema evolution: Upcasting (transform on read) vs stream transformation (migrate data)

Event sourcing is not the same as CQRS (Command Query Responsibility Segregation), but event sourcing effectively forces CQRS: reads come from projections, not the event store.

Consider the alternatives event sourcing replaces, and why each falls short:

Approach 1: Mutable state in a database (CRUD)

  • Fails because: Current state overwrites history. You cannot answer “what was the balance at 3pm yesterday?” without separate audit logging.
  • Example: A bank account shows $500. A customer disputes a charge. Without event history, proving what transactions occurred requires mining application logs—if they exist.

Approach 2: Adding audit tables alongside CRUD

  • Fails because: Audit tables and current state can diverge. The audit is a second-class citizen—queries hit current state, audits are bolted on.
  • Example: A bug updates the balance without writing to the audit table. Now the audit and state disagree. Which is correct?

Approach 3: Database triggers for audit logging

  • Fails because: Triggers capture “what changed,” not “why it changed.” You see “balance changed from 600 to 500” but not “customer withdrew $100 at ATM #4421.”
  • Example: Debugging why an order was cancelled. Trigger logs show “status changed to CANCELLED” but not “cancelled by customer due to shipping delay.”

The fundamental tension: optimized writes vs queryable history. CRUD optimizes for writes (overwrite current state) but sacrifices history. Audit logging preserves history but creates dual-write complexity and semantic loss.

Event sourcing exists to make events the single source of truth—current state becomes a derived view that can be rebuilt at any time.

Every state change is captured as an immutable event appended to an event stream. Current state is derived by replaying events from the beginning (or from a snapshot).

From Martin Fowler’s foundational definition:

“Capture all changes to an application state as a sequence of events.”

Events are structured facts:

```typescript
// Event structure
interface DomainEvent {
  eventId: string          // Unique identifier
  streamId: string         // Aggregate this event belongs to
  eventType: string        // e.g., "OrderPlaced", "ItemAdded"
  data: unknown            // Event-specific payload
  metadata: {
    timestamp: Date
    version: number        // Position in stream
    causationId?: string   // What caused this event
    correlationId?: string // Request/saga identifier
  }
}
```

State reconstruction follows a deterministic left-fold:

```typescript
currentState = events.reduce(applyEvent, initialState)
```

Core invariants:

  1. Append-only: Events are never updated or deleted. The event store is an immutable log.
  2. Deterministic replay: Given the same events in the same order, replay produces identical state.
  3. Event ordering: Events within a stream are totally ordered by version number.
  4. Idempotent projections: Projections must handle duplicate event delivery gracefully.

Failure modes and mitigations:

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Event store unavailable | Commands blocked; reads from projections may continue | Multi-region replication; read replicas |
| Projection lag | Stale read models | Monitor lag; circuit breaker on staleness threshold |
| Event schema mismatch | Projection crashes or produces incorrect state | Schema registry; upcasting; versioned projections |
| Unbounded event growth | Replay becomes prohibitively slow | Snapshots; event archival to cold storage |
| Concurrent writes to same stream | Optimistic concurrency violation | Retry with conflict resolution; smaller aggregates |
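To illustrate the last row, a sketch of retrying on an optimistic-concurrency conflict; `ConcurrencyError` and the command-handling callback are assumptions for illustration, not a specific library API:

```typescript
// Thrown by the event store when the expected version no longer matches
class ConcurrencyError extends Error {}

// Re-run the whole read-decide-append cycle so each retry sees the new version
async function withOptimisticRetry(
  attemptCommand: () => Promise<void>,
  maxAttempts = 3,
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await attemptCommand()
    } catch (err) {
      if (!(err instanceof ConcurrencyError) || attempt >= maxAttempts) throw err
      // Another writer won the race; reload state and try again
    }
  }
}
```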

Path 1: Pure event sourcing

When to choose this path:

  • Audit trail is a regulatory requirement (finance, healthcare)
  • Temporal queries are core functionality (“what was the state on date X?”)
  • Domain naturally expresses state changes as events
  • You need to rebuild projections for new query patterns

Key characteristics:

  • Events are the only source of truth
  • All read models are projections derived from events
  • No direct database writes outside the event store
  • State can be fully rebuilt from events at any time

Implementation approach:

```typescript
// Pure event sourcing: aggregate loaded from events
interface Aggregate<TState, TEvent> {
  state: TState
  version: number
}

// Command handler pattern
async function handle<TState, TEvent>(
  command: Command,
  loadAggregate: (id: string) => Promise<Aggregate<TState, TEvent>>,
  decide: (state: TState, command: Command) => TEvent[],
  appendEvents: (streamId: string, expectedVersion: number, events: TEvent[]) => Promise<void>,
): Promise<void> {
  const aggregate = await loadAggregate(command.aggregateId)
  const newEvents = decide(aggregate.state, command)
  await appendEvents(command.aggregateId, aggregate.version, newEvents)
}

// Loading aggregate from event store
async function loadAggregate<TState, TEvent>(
  streamId: string,
  eventStore: EventStore,
  evolve: (state: TState, event: TEvent) => TState,
  initialState: TState,
): Promise<Aggregate<TState, TEvent>> {
  const events = await eventStore.readStream(streamId)
  const state = events.reduce(evolve, initialState)
  return { state, version: events.length }
}
```

Real-world example:

LMAX Exchange processes 6 million orders per second using pure event sourcing. Their Business Logic Processor runs entirely in-memory, single-threaded. Events are the recovery mechanism—on restart, they replay from last snapshot plus subsequent events to reconstruct state in milliseconds.

Key implementation details:

  • Ring buffer (Disruptor) handles I/O concurrency: 20 million slots for input
  • Latency: sub-50 nanosecond event processing (mean latency 3 orders of magnitude below queue-based approaches)
  • Nightly snapshots; BLP restarts every night with zero downtime via replication

Trade-offs vs other paths:

| Aspect | Pure Event Sourcing | Hybrid (ES + CRUD) |
| --- | --- | --- |
| Audit completeness | Full history guaranteed | Partial—only ES domains audited |
| Query flexibility | Build any projection from events | Some data may not be projectable |
| Complexity | Higher—must project all reads | Lower—CRUD for simple domains |
| Migration cost | Very high (all-or-nothing) | Incremental |
| Team expertise required | Event-driven thinking throughout | Can isolate ES to specific bounded contexts |

Path 2: Hybrid event sourcing + CRUD

When to choose this path:

  • Some domains require audit trails; others don’t
  • Team has mixed experience with event-driven architecture
  • Migrating from existing CRUD system incrementally
  • Read-heavy workloads where projection lag is unacceptable for some data

Key characteristics:

  • Core business domains use event sourcing (orders, transactions, inventory)
  • Supporting domains use CRUD (user preferences, product catalog)
  • Clear boundaries between ES and CRUD contexts
  • Events may be published from CRUD systems for integration

Implementation approach:

```typescript
// Hybrid: Event-sourced orders, CRUD product catalog

// Order aggregate (event-sourced)
class OrderAggregate {
  status: "new" | "placed" | "shipped" = "new"
  items: OrderItem[] = []

  apply(event: OrderEvent): void {
    switch (event.type) {
      case "OrderPlaced":
        this.status = "placed"
        this.items = event.data.items
        break
      case "OrderShipped":
        this.status = "shipped"
        break
    }
  }
}

// Product service (CRUD with event publishing)
class ProductService {
  constructor(private db: Database, private eventBus: EventBus) {}

  async updatePrice(productId: string, newPrice: number): Promise<void> {
    // CRUD update
    await this.db.products.update(productId, { price: newPrice })
    // Publish event for downstream consumers (not source of truth)
    await this.eventBus.publish({ type: "PriceUpdated", productId, newPrice })
  }
}
```

Real-world example:

Walmart’s Inventory Availability System uses hybrid event sourcing. The inventory domain is fully event-sourced to track every stock movement. Product catalog data (descriptions, images) remains in traditional databases. Events from inventory updates feed projections that power real-time availability queries for millions of customers.

Key details:

  • Events partitioned by product-node ID for scalability
  • Command processor validates business rules before event emission
  • Read and write sides use different tech stacks, scaled independently

Trade-offs vs pure ES:

| Aspect | Hybrid | Pure ES |
| --- | --- | --- |
| Migration path | Gradual, bounded context by context | Big-bang or strangler pattern |
| Team adoption | Easier—ES expertise not required everywhere | Requires organization-wide buy-in |
| Consistency | Mixed—ES domains eventual, CRUD immediate | Eventual consistency throughout |
| Operational complexity | Higher—two paradigms to operate | Single paradigm (but complex) |

Path 3: Synchronous projections

When to choose this path:

  • Strong consistency between writes and reads required
  • Projection latency is unacceptable
  • Lower throughput acceptable for consistency guarantee
  • Single-node or low-scale deployments

Key characteristics:

  • Projections updated in same transaction as event append
  • Read model immediately reflects write
  • No eventual consistency concerns
  • Limited scalability—projection work blocks command processing

Implementation approach:

```typescript
// Synchronous projection: update read model in same transaction
async function handleCommand(command: PlaceOrderCommand): Promise<void> {
  await db.transaction(async (tx) => {
    // 1. Load aggregate
    const order = await loadFromEvents(tx, command.orderId)
    // 2. Execute business logic
    const events = order.place(command.items)
    // 3. Append events
    await appendEvents(tx, command.orderId, events)
    // 4. Update projection synchronously
    for (const event of events) {
      await updateOrderListProjection(tx, event)
      await updateInventoryProjection(tx, event)
    }
  })
  // Transaction commits: events + projections atomically
}
```

Real-world example:

Marten (for .NET/PostgreSQL) supports inline projections that run within the same database transaction as event appends. This guarantees that when a command succeeds, the read model is immediately consistent.

From Marten documentation: inline projections are “run as part of the same transaction as the events being captured and updated in the underlying database as well.”

Trade-offs:

| Aspect | Synchronous Projections | Async Projections |
| --- | --- | --- |
| Consistency | Strong—reads reflect writes immediately | Eventual—lag between write and read |
| Throughput | Lower—projection work blocks writes | Higher—writes return without waiting |
| Scalability | Limited by projection cost | Projections scale independently |
| Failure isolation | Projection failure fails the write | Write succeeds; projection can retry |

Path 4: Asynchronous projections

When to choose this path:

  • High write throughput required
  • Eventual consistency acceptable
  • Projections are expensive (aggregations, joins)
  • Need to scale reads independently of writes

Key characteristics:

  • Events appended to store; command returns immediately
  • Background processes subscribe to event streams
  • Projections catch up asynchronously
  • Must handle out-of-order delivery, duplicates, and lag

Implementation approach:

```typescript
// Async projection with checkpointing
interface ProjectionState {
  lastProcessedPosition: number
  // projection-specific state
}

async function runProjection(
  subscription: EventSubscription,
  project: (state: ProjectionState, event: DomainEvent) => ProjectionState,
  checkpoint: (position: number) => Promise<void>,
): Promise<void> {
  let state = await loadProjectionState()
  for await (const event of subscription.fromPosition(state.lastProcessedPosition)) {
    state = project(state, event)
    // Checkpoint periodically (not every event—batching for performance)
    if (event.position % 100 === 0) {
      await checkpoint(event.position)
    }
  }
}

// Idempotent projection handling duplicates
function projectOrderPlaced(state: OrderListState, event: OrderPlacedEvent): OrderListState {
  // Skip if already processed (idempotency)
  if (state.processedEventIds.has(event.eventId)) {
    return state
  }
  return {
    ...state,
    orders: [...state.orders, { id: event.data.orderId, status: "placed" }],
    // Copy the set so the previous state is not mutated in place
    processedEventIds: new Set(state.processedEventIds).add(event.eventId),
  }
}
```

Real-world example:

Netflix’s Downloads feature (launched November 2016) uses Cassandra-backed event sourcing with async projections. The team built it in six months, turning a stateless service into a stateful one.

Key implementation details:

  • Three components: Event Store, Aggregate Repository, Aggregate Service
  • Projections rebuild on-demand for debugging
  • Horizontal scaling limited only by disk space

Trade-offs:

| Aspect | Async Projections | Sync Projections |
| --- | --- | --- |
| Write latency | Low—return immediately | Higher—wait for projections |
| Read freshness | Eventual (seconds to minutes lag) | Immediate |
| Failure isolation | Yes—projection failures don’t block writes | No—projection failure fails command |
| Complexity | Higher—handle duplicates, ordering, lag | Lower—transactional guarantees |

Decision guide:

  • Audit/temporal queries critical? No → consider CRUD instead.
  • Yes → event source the entire system (Pure ES) or only a bounded context (Hybrid ES + CRUD)?
  • Consistency requirement: strong → Sync Projections.
  • Eventual consistency OK: high write throughput → Async Projections; moderate throughput → sync or async based on projection complexity.

Replaying thousands of events for every command becomes prohibitive. Snapshots store materialized state at a point in time, enabling replay from snapshot + recent events.

From the EventStoreDB team: “Do not add snapshots right away. They are not needed at the beginning.”

Add snapshots when:

  • Aggregate event counts exceed hundreds (measure first)
  • Load times impact user experience or SLAs
  • Cold start times are unacceptable

Snapshot strategies:

| Strategy | Trigger | Pros | Cons |
| --- | --- | --- | --- |
| Event-count | Every N events (e.g., 100) | Predictable load times | May snapshot unnecessarily |
| Time-based | Every N hours/days | Operational simplicity | Variable event counts between snapshots |
| State-triggered | On significant transitions (draft→published) | Snapshots at natural boundaries | Requires domain knowledge |
| On-demand | When load time exceeds threshold | Only when needed | First slow load before snapshot exists |
```typescript
// Snapshot-aware aggregate loading
interface Snapshot<TState> {
  state: TState
  version: number
  schemaVersion: number // For schema evolution
}

async function loadWithSnapshot<TState, TEvent>(
  streamId: string,
  eventStore: EventStore,
  snapshotStore: SnapshotStore,
  evolve: (state: TState, event: TEvent) => TState,
  initialState: TState,
): Promise<{ state: TState; version: number }> {
  // Try loading snapshot first
  const snapshot = await snapshotStore.load<TState>(streamId)
  const startVersion = snapshot?.version ?? 0
  const startState = snapshot?.state ?? initialState
  // Replay only events after snapshot
  const events = await eventStore.readStream(streamId, { fromVersion: startVersion + 1 })
  const state = events.reduce(evolve, startState)
  return { state, version: startVersion + events.length }
}

// Snapshot after command if threshold exceeded
async function maybeSnapshot<TState>(
  streamId: string,
  state: TState,
  version: number,
  snapshotStore: SnapshotStore,
  threshold: number = 100,
): Promise<void> {
  const lastSnapshot = await snapshotStore.getVersion(streamId)
  if (version - lastSnapshot > threshold) {
    await snapshotStore.save(streamId, { state, version, schemaVersion: 1 })
  }
}
```

When snapshot schema changes:

  1. Increment schemaVersion in new snapshots
  2. On load, check schemaVersion
  3. If outdated, discard snapshot and rebuild from events

This is simpler than migrating snapshots—events are the source of truth; snapshots are ephemeral optimization.
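A minimal sketch of that discard-and-rebuild check, assuming the `Snapshot` and `SnapshotStore` shapes used in the code above:

```typescript
const CURRENT_SNAPSHOT_SCHEMA = 2 // illustrative: bump on snapshot schema change

async function loadUsableSnapshot<TState>(
  streamId: string,
  snapshotStore: SnapshotStore,
): Promise<Snapshot<TState> | null> {
  const snapshot = await snapshotStore.load<TState>(streamId)
  // Missing or outdated snapshot: fall back to full replay from events
  if (!snapshot || snapshot.schemaVersion !== CURRENT_SNAPSHOT_SCHEMA) return null
  return snapshot
}
```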

Events are immutable and stored forever. Schema changes must be backward-compatible or handled via transformation.

From Greg Young (EventStoreDB creator):

“A new version of an event must be convertible from the old version. If not, it is not a new version but a new event.”

Strategy 1: Additive changes. Add optional fields; never remove or rename.

```typescript
// Version 1
interface OrderPlacedV1 {
  orderId: string
  customerId: string
  items: OrderItem[]
}

// Version 2: Added optional field
interface OrderPlacedV2 {
  orderId: string
  customerId: string
  items: OrderItem[]
  discountCode?: string // New optional field
}

// Projection handles both
function projectOrder(event: OrderPlacedV1 | OrderPlacedV2): Order {
  return {
    id: event.orderId,
    customerId: event.customerId,
    items: event.items,
    discountCode: "discountCode" in event ? event.discountCode : undefined,
  }
}
```

Strategy 2: Upcasting. Transform old events to the new schema when reading.

```typescript
// Upcaster transforms old schema to current schema.
// Version 3 is assumed here to add a required `source` field.
interface OrderPlacedV3 extends OrderPlacedV2 {
  source: string
}

type Upcaster<TOld, TNew> = (old: TOld) => TNew

const orderPlacedUpcasters = new Map<number, Upcaster<any, OrderPlacedV3>>([
  [
    1,
    (v1: OrderPlacedV1) => ({
      ...v1,
      discountCode: undefined,
      source: "unknown", // Default for old events
    }),
  ],
  [
    2,
    (v2: OrderPlacedV2) => ({
      ...v2,
      source: "unknown",
    }),
  ],
])

function upcast(event: StoredEvent): DomainEvent {
  const upcaster = orderPlacedUpcasters.get(event.schemaVersion)
  return (upcaster ? upcaster(event.data) : event.data) as DomainEvent
}
```

When to use: Schema changes are frequent; you want projections to only handle the latest version.

Trade-off: CPU cost on every read; upcaster chain grows over time.

Strategy 3: Stream transformation. Rewrite the event stream with transformed events during a release window.

When to use: Breaking changes that upcasting cannot handle; reducing technical debt from long upcaster chains.

Trade-off: Requires downtime or careful blue-green deployment; must preserve event IDs and ordering.

Warning from Greg Young: “Updating an existing event can cause large problems.” Prefer stream transformations (copy with transform) over in-place modification.
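A sketch of that copy-with-transform approach; `readStream` and `appendToNewStream` are assumed helpers, not a specific event store API:

```typescript
// Copy events to a new stream, transforming as we go; never modify in place.
async function transformStream(
  readStream: (id: string) => Promise<StoredEvent[]>,
  appendToNewStream: (id: string, events: StoredEvent[]) => Promise<void>,
  oldStreamId: string,
  newStreamId: string,
  transform: (e: StoredEvent) => StoredEvent,
): Promise<void> {
  const events = await readStream(oldStreamId)
  // Preserve event IDs and ordering; only the payload/schema changes
  const transformed = events.map(transform)
  await appendToNewStream(newStreamId, transformed)
  // Then cut readers and writers over to newStreamId and archive the old stream
}
```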

For production systems, use a schema registry:

  • Store event schemas with versions
  • Validate events against schema on write
  • Reject events that don’t conform
  • Generate types from schemas (Avro, Protobuf, JSON Schema)
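A hand-rolled sketch of validate-on-write; a real system would use Avro, Protobuf, or JSON Schema tooling, so the registry shape here is purely illustrative:

```typescript
// Illustrative registry: "eventType:version" -> required top-level fields
const schemaRegistry = new Map<string, string[]>([
  ["OrderPlaced:1", ["orderId", "customerId", "items"]],
  ["OrderPlaced:2", ["orderId", "customerId", "items"]], // discountCode stays optional
])

function validateOnWrite(eventType: string, version: number, data: Record<string, unknown>): void {
  const required = schemaRegistry.get(`${eventType}:${version}`)
  if (!required) throw new Error(`Unregistered schema: ${eventType} v${version}`)
  for (const field of required) {
    // Reject non-conforming events before they reach the immutable log
    if (!(field in data)) throw new Error(`${eventType} v${version}: missing ${field}`)
  }
}
```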

Projections transform event streams into query-optimized read models.

Projection types:

| Type | Execution | Consistency | Use Case |
| --- | --- | --- | --- |
| Inline | Same transaction as events | Strong | Critical reads; simple projections |
| Async | Background process | Eventual | Complex aggregations; high throughput |
| Live | On-demand, not persisted | None (computed) | One-time analytics; debugging |
```typescript
// Projection as left-fold over events
interface Projection<TState> {
  initialState: TState
  apply: (state: TState, event: DomainEvent) => TState
}

// Order list projection for dashboard
const orderListProjection: Projection<OrderListState> = {
  initialState: { orders: [], totalRevenue: 0 },
  apply(state, event) {
    switch (event.type) {
      case "OrderPlaced":
        return {
          ...state,
          orders: [...state.orders, { id: event.data.orderId, status: "placed", total: event.data.total }],
          totalRevenue: state.totalRevenue + event.data.total,
        }
      case "OrderShipped":
        return {
          ...state,
          orders: state.orders.map((o) => (o.id === event.data.orderId ? { ...o, status: "shipped" } : o)),
        }
      case "OrderRefunded":
        return {
          ...state,
          orders: state.orders.map((o) => (o.id === event.data.orderId ? { ...o, status: "refunded" } : o)),
          totalRevenue: state.totalRevenue - event.data.refundAmount,
        }
      default:
        return state
    }
  },
}
```

Since events are the source of truth, projections can be rebuilt at any time:

  1. Drop existing projection state
  2. Replay all events through projection logic
  3. New projection reflects current business logic
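Those three steps reduce to a small loop; a sketch assuming an async event feed and a persistence callback:

```typescript
// Rebuild: start from scratch and re-fold every event through current logic
async function rebuildProjection<TState>(
  readAllEvents: () => AsyncIterable<DomainEvent>, // assumed event feed
  initialState: TState,
  apply: (state: TState, event: DomainEvent) => TState,
  save: (state: TState) => Promise<void>,
): Promise<void> {
  let state = initialState // step 1: old projection state dropped
  for await (const event of readAllEvents()) {
    state = apply(state, event) // step 2: replay through current logic
  }
  await save(state) // step 3: persist the rebuilt read model
}
```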

Use cases:

  • Bug fix in projection logic
  • New read model for new query pattern
  • Schema change in read database

Warning: Rebuilding from millions of events takes time. For Netflix Downloads, “re-runs took DAYS to complete” during development—mitigated by snapshots and event archival.

Problem: Projection A depends on Projection B’s state. During replay, B may not be at the same position as A.

From Dennis Doomen’s production experience:

“Rebuilding projection after schema upgrade caused projector to read from another projection that was at a different state in the event store’s history.”

Solutions:

  1. Single-pass projections: Project to denormalized models; avoid joins at projection time
  2. Explicit dependencies: Projections declare dependencies; rebuild engine ensures correct ordering
  3. Position tracking: Each projection tracks which events it has seen; dependent projections wait

EventStoreDB

Purpose-built for event sourcing.

Performance characteristics:

  • 15,000+ writes/second
  • 50,000+ reads/second
  • Quorum-based replication (majority must acknowledge)

Strengths:

  • Native event sourcing primitives (streams, subscriptions, projections)
  • Optimistic concurrency per stream
  • Persistent subscriptions with checkpointing
  • Built-in projections in JavaScript

Limitations:

  • No built-in sharding (must implement at application level)
  • Performance bounded by disk I/O
  • Read-only replicas require cluster connection

Apache Kafka

Designed for event streaming, not event sourcing.

Why Kafka is problematic for event sourcing (from Serialized.io):

  1. Topic scalability: Not designed for millions of topics (aggregates can easily reach millions)
  2. Entity loading: No practical way to load events for specific entity by ID within a topic
  3. No optimistic concurrency: Cannot do “save this event only if version is still X”
  4. Compaction destroys history: Log compaction keeps only latest value per key—antithetical to event sourcing

Where Kafka fits: transport layer for events to downstream consumers. Use a dedicated event store as the source of truth and Kafka for integration.

Use relational database as event store.

Implementation patterns:

  • Transactional outbox: Events + state change in same transaction
  • LISTEN/NOTIFY: PostgreSQL async notifications for real-time subscriptions
  • Advisory locks: Application-level locking for projection synchronization
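A sketch of an append with optimistic concurrency on plain PostgreSQL, using node-postgres; the `events` table schema is an assumption, and none of this is taken from the libraries listed below:

```typescript
import { Pool } from "pg"

// Assumed table: events(stream_id text, version int, type text, data jsonb,
//                       PRIMARY KEY (stream_id, version))
const pool = new Pool()

async function appendEvents(
  streamId: string,
  expectedVersion: number,
  events: { type: string; data: unknown }[],
): Promise<void> {
  const client = await pool.connect()
  try {
    await client.query("BEGIN")
    let version = expectedVersion
    for (const e of events) {
      version += 1
      // The composite primary key enforces optimistic concurrency:
      // a concurrent writer inserting the same version fails the transaction.
      await client.query(
        "INSERT INTO events (stream_id, version, type, data) VALUES ($1, $2, $3, $4)",
        [streamId, version, e.type, JSON.stringify(e.data)],
      )
    }
    await client.query("COMMIT")
  } catch (err) {
    await client.query("ROLLBACK")
    throw err // a unique violation here signals a version conflict
  } finally {
    client.release()
  }
}
```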

Libraries: Marten (.NET), pg-event-store (Node.js), Eventide (Ruby)

Trade-offs:

| Aspect | PostgreSQL | EventStoreDB |
| --- | --- | --- |
| Operational familiarity | High (existing expertise) | New technology to learn |
| Event sourcing primitives | Must build or use library | Native |
| Transactions | Full ACID with application data | Event store separate from app DB |
| Scaling | Traditional DB scaling | Purpose-built but limited sharding |

AWS EventBridge: Event routing and orchestration; not an event store. 90+ AWS service integrations. Use for serverless event-driven architectures, not as source of truth.

Azure Event Hubs: High-volume event ingestion (millions/second). Kafka-compatible. Geo-disaster recovery. Similar caveats as Kafka for event sourcing.

LMAX Exchange

Context: Financial exchange requiring ultra-low latency and complete audit trail.

Architecture:

  • Business Logic Processor (BLP): Single-threaded, in-memory, event-sourced
  • Disruptor: Lock-free ring buffer for I/O (25M+ messages/second, sub-50ns latency)
  • Replication: Three BLPs process all input events (two in primary DC, one DR)

Specific details:

  • Ring buffers: 20M slots (input), 4M slots (output each)
  • Nightly snapshots; BLPs restart nightly with zero downtime
  • Cryptographic operations offloaded from core logic
  • Validation split: state-independent checks before BLP, state-dependent in BLP

What worked: In-memory event sourcing eliminated database as bottleneck.

What was hard: Debugging distributed replay; ensuring determinism in business logic.

Source: Martin Fowler - LMAX Architecture

Netflix Downloads

Context: November 2016 feature launch requiring a stateful service (download tracking, licensing) built in months.

Architecture:

  • Cassandra-backed event store
  • Three components: Event Store, Aggregate Repository, Aggregate Service
  • Async projections for query optimization

Specific details:

  • Horizontal scaling: only disk space was the constraint
  • Projections rebuilt during debugging/development
  • Full auditing enabled rapid issue diagnosis

What worked: Event sourcing flexibility enabled pivoting requirements during development.

What was hard: Rebuild times during development (“re-runs took DAYS”)—mitigated by snapshotting.

Source: InfoQ - Scaling Event Sourcing for Netflix Downloads

Uber Cadence

Context: Workflow orchestration for hundreds of applications across Uber.

Architecture:

  • Event sourcing for workflow state (deterministic replay)
  • Components: Front End, History Service, Matching Service
  • Cassandra backend

Specific details:

  • Multitenant clusters with hundreds of domains each
  • Single service supports 100+ applications
  • Worker Goroutines reduced from 16K to 100 per history host (95%+ reduction)

Key insight: Workflows are event-sourced—state reconstructed by replaying historical events. Cadence (and its fork, Temporal) demonstrates event sourcing for durable execution, not just data storage.

Source: Uber Engineering - Cadence Overview

Case study comparison:

| Aspect | LMAX | Netflix Downloads | Uber Cadence |
| --- | --- | --- | --- |
| Domain | Financial exchange | Media licensing | Workflow orchestration |
| Events/second | 6M orders | Not disclosed | Hundreds of domains |
| Consistency | Single-threaded | Eventual | Per-workflow |
| Storage | In-memory + disk | Cassandra | Cassandra |
| Snapshot strategy | Nightly | As needed | Per-workflow history |
| Team size | Small, specialized | 6-month team | Platform team |

Pitfall: Unbounded event growth

The mistake: Not planning for event retention or archival.

Example: Financial app storing every price tick (millions/day). After one year, rebuilding account balances requires replaying 3TB of events. Queries take minutes; cold starts are impossible.

Solutions:

  • Snapshots: Every N events or time period
  • Archival: Move events older than N days to cold storage (S3, Glacier)
  • Temporal streams: Segment streams by time period (orders-2024-Q1)

Pitfall: Assuming strict event ordering

The mistake: Building logic that assumes “event A always comes before event B.”

Reality (from production war stories): “Events are duplicated, delayed, reordered, partially processed, or replayed long after the context that created them has changed.”

Example: Projection assumes OrderPlaced arrives before OrderShipped. During backfill, events arrive out of order; projection crashes.

Solutions:

  • Treat each event as independent input
  • Build projections that handle any arrival order
  • Use event timestamps for ordering within projections, not arrival order
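A sketch of an arrival-order-tolerant projection: every handler upserts, and status is derived from accumulated facts rather than arrival order (the row shape is illustrative):

```typescript
interface OrderRow {
  id: string
  status?: "placed" | "shipped"
  placedAt?: Date
  shippedAt?: Date
}

// Create-or-fetch so any event can arrive first
function upsert(orders: Map<string, OrderRow>, id: string): OrderRow {
  const row = orders.get(id) ?? { id }
  orders.set(id, row)
  return row
}

function applyOrderEvent(orders: Map<string, OrderRow>, event: DomainEvent): void {
  const row = upsert(orders, event.streamId)
  switch (event.eventType) {
    case "OrderPlaced":
      row.placedAt = event.metadata.timestamp
      break
    case "OrderShipped":
      // Safe even if OrderPlaced hasn't been processed yet
      row.shippedAt = event.metadata.timestamp
      break
  }
  // Status derived from what has been seen, not from processing order
  row.status = row.shippedAt ? "shipped" : row.placedAt ? "placed" : undefined
}
```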

Pitfall: Dual writes

The mistake: Writing the event to the store, then publishing to a message queue as a separate operation.

Example: appendEvent() succeeds, publishToKafka() fails. Event exists but downstream never sees it.

Solutions:

  • Transactional outbox: Write event + outbox entry in same transaction; separate process polls outbox
  • Change Data Capture (CDC): Database-level capture of event table changes
  • Event store with built-in pub/sub: EventStoreDB subscriptions
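A sketch of the transactional outbox variant, reusing the illustrative `pool` and `events` table from the PostgreSQL section above, plus an assumed `outbox` table:

```typescript
// Event and outbox row commit atomically; a separate poller publishes outbox
// rows and marks them sent, so downstream consumers never miss an event.
async function appendWithOutbox(
  streamId: string,
  version: number,
  type: string,
  data: unknown,
): Promise<void> {
  const client = await pool.connect()
  try {
    await client.query("BEGIN")
    await client.query(
      "INSERT INTO events (stream_id, version, type, data) VALUES ($1, $2, $3, $4)",
      [streamId, version, type, JSON.stringify(data)],
    )
    await client.query(
      "INSERT INTO outbox (stream_id, version, type, data, published) VALUES ($1, $2, $3, $4, false)",
      [streamId, version, type, JSON.stringify(data)],
    )
    await client.query("COMMIT") // both rows or neither: no dual-write gap
  } catch (err) {
    await client.query("ROLLBACK")
    throw err
  } finally {
    client.release()
  }
}
```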

Pitfall: Overly complex projections

The mistake: Building projections with complex joins, aggregations, and cross-stream queries.

Example: “Order revenue by region by product category by month” projection requires joining order events, product catalog, and region data. Rebuild takes hours.

Solutions:

  • Denormalize aggressively at projection time
  • Accept data duplication in read models
  • Build multiple focused projections vs one complex one
  • Consider whether query belongs in analytics layer (data warehouse)

Pitfall: Unversioned schema changes

The mistake: Changing event schemas without a migration strategy.

Example (from production): Team added discount_code field to OrderPlaced events. Old projections ignored it. Replay applied 2024 logic to 2022 events, giving customers unintended discounts.

Solutions:

  • Additive changes only: New optional fields with defaults
  • Versioned projections: Tag projections with compatible schema versions
  • Code version in events: Every event carries code version that created it
  • Upcasters: Transform old events to current schema on read

Pitfall: Expecting debugging to be easy

The mistake: Expecting event sourcing to make debugging trivial.

Reality (from Chris Kiehl’s experience): “99% of the time ‘bad states’ were bad events caused by standard run-of-the-mill human error. Having a ledger provided little value over normal debugging intuition.”

Practical questions:

  • How do you apply time-travel on production data?
  • How do you inspect intermediate states if events are in binary format?
  • How do you fix bugs for users already impacted?

Example: “Order stuck in pending for 3 hours” took 6 hours to debug—what would’ve been a 20-minute fix in CRUD required understanding the full event history.

Solution: Event sourcing provides audit capability, not magic debugging. Invest in:

  • Event visualization tooling
  • Projection debugging (show state at any event)
  • Compensating event workflows for fixing bad state
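For the second item, “show state at any event” is just a truncated fold; a sketch assuming the `evolve`/`initialState` shapes from earlier examples:

```typescript
// Debugging helper: reconstruct state as of a given stream version
async function stateAt<TState, TEvent>(
  streamId: string,
  version: number,
  readStream: (id: string) => Promise<TEvent[]>,
  evolve: (state: TState, event: TEvent) => TState,
  initialState: TState,
): Promise<TState> {
  const events = await readStream(streamId)
  // Fold only the prefix up to the requested version ("time travel")
  return events.slice(0, version).reduce(evolve, initialState)
}
```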

Event sourcing’s immutability conflicts with “right to be forgotten.”

From Mathias Verraes comes the crypto-shredding pattern: store personal data encrypted and, when a user must be forgotten, destroy only the key.

How it works:

  1. Store personal data encrypted in event store (not plaintext)
  2. Store encryption key separately (different database/filesystem)
  3. When user requests deletion, delete only the encryption key
  4. Encrypted data becomes permanently unreadable

Implementation:

```typescript
// Events store only encrypted personal data
interface OrderPlacedEvent {
  orderId: string
  customerRef: string // Reference to customer, not PII
  encryptedCustomerData: string // AES-encrypted name, address
  items: OrderItem[]
}

// Key storage separate from event store
interface CustomerKeyStore {
  getKey(customerRef: string): Promise<CryptoKey | null>
  deleteKey(customerRef: string): Promise<void> // "Forget" customer
}

// Projection decrypts only if key exists
async function projectOrder(event: OrderPlacedEvent, keyStore: CustomerKeyStore): Promise<OrderReadModel> {
  const key = await keyStore.getKey(event.customerRef)
  const customerData = key
    ? await decrypt(event.encryptedCustomerData, key)
    : { name: "[deleted]", address: "[deleted]" }
  return {
    orderId: event.orderId,
    customerName: customerData.name,
    // ...
  }
}
```

Benefits:

  • Event stream remains intact (immutable)
  • Deleting key effectively removes data from all backups
  • Existing events still queryable for non-personal data

Legal caveat: GDPR states encrypted personal information is still personal information. Some legal interpretations argue crypto shredding may not be fully compliant. Consult legal counsel.

An alternative pattern: store personal data in a separate database; events contain only references.

Trade-off: Requires joins at read time; personal data store becomes another system to maintain.

Event sourcing replaces mutable state with an immutable log of domain events. Current state is derived, not stored. This enables complete audit trails, temporal queries, and projection flexibility—at the cost of increased complexity, eventual consistency, and schema evolution challenges.

Key takeaways:

  1. Not a silver bullet: CQRS/ES adds complexity. Use only for bounded contexts where audit trails or temporal queries justify the cost.
  2. Design decisions multiply: Stream granularity, projection strategy, snapshot policy, and schema evolution must all be decided.
  3. Production realities differ from theory: Real systems handle out-of-order events, duplicate delivery, projection lag, and schema migrations.
  4. Start simple: No snapshots initially. Synchronous projections if consistency matters. Add complexity only when measured.
  5. Kafka is not an event store: Purpose-built stores (EventStoreDB) or database-backed solutions (Marten, PostgreSQL) are better fits.

Prerequisites assumed by this article:

  • Understanding of domain-driven design (aggregates, bounded contexts)
  • Familiarity with distributed systems consistency models (eventual vs strong)
  • Basic understanding of CQRS (Command Query Responsibility Segregation)
Glossary:

| Term | Definition |
| --- | --- |
| Event | Immutable fact representing something that happened in the domain |
| Stream | Ordered sequence of events for an aggregate |
| Projection | Read model derived by applying events to initial state |
| Snapshot | Materialized state at a point in time; optimization for replay |
| Upcasting | Transforming old event schema to current schema on read |
| Aggregate | DDD concept; consistency boundary containing related entities |
| Left-fold | Reducing a sequence to a single value by applying a function cumulatively |
| Optimistic concurrency | Conflict detection by checking version hasn’t changed since read |
| Transactional outbox | Pattern for reliable event publishing within database transaction |

At a glance:

  • Core pattern: State = replay(events). Events are immutable facts; current state is derived.
  • Design paths: Pure ES vs hybrid; sync vs async projections. Choose based on consistency and throughput requirements.
  • Snapshots: Add when replay time becomes problematic. Event-count or time-based triggers.
  • Schema evolution: Additive changes preferred. Upcasting for breaking changes. Never modify existing events in place.
  • Production reality: Handle out-of-order events, duplicates, projection lag, and unbounded growth.
  • GDPR: Crypto shredding pattern separates encryption keys from encrypted data.
