Design a Time Series Database

A system design for a metrics-and-monitoring time-series database (TSDB) that has to absorb millions of samples per second, answer dashboard queries in tens of milliseconds, keep months of history at affordable cost, and survive the cardinality explosions that come with modern dimensional telemetry. The design lands on an LSM-style block engine modeled after Prometheus and InfluxDB’s TSM engine, with Gorilla-style compression inside the chunks and a horizontally fanned query layer in the spirit of M3 and Cortex.

TSDB architecture overview: agents push to a stateless distributor, ingesters write a WAL and a memtable, a compaction service merges blocks into TSM/index files, the query engine fans out across the index and chunks, and the lifecycle service handles downsampling, retention, and migration to object storage. — End-to-end TSDB: ingest, write path, storage engine, query path, and lifecycle.

Abstract

Time-series databases optimize for an append-only workload with time-ordered reads. Samples arrive in near-monotonic timestamp order, are written once, and are read back as range scans grouped by labels. Exploiting that pattern is what unlocks ~10x compression and the order-of-magnitude query gap over general-purpose engines.

The storage engine is an LSM-style block store. Writes hit a Write-Ahead Log (WAL) for durability, accumulate in an in-memory head, then flush to immutable, time-partitioned blocks. Prometheus crystallized the canonical layout: 2-hour blocks, 128 MB WAL segments, and background compaction up to 31 days or 10% of retention, whichever is smaller¹.

Compression lives inside the chunks. Facebook’s Gorilla algorithm uses delta-of-delta encoding for timestamps and XOR-encoded floats for values; on Facebook’s production workload it averaged ~1.37 bytes/sample, a ~12x reduction over a naive 16-byte (timestamp + double) layout, with ~96% of timestamp delta-of-deltas collapsing to a single bit because samples arrive at regular scrape intervals.

Cardinality — the number of unique (metric name, label set) series — is the primary scaling constraint. A single metric like http_requests{method, status, route} can explode into millions of series when a label is unbounded (request id, raw URL, user id). The inverted index has to map label terms to series IDs efficiently while staying inside a known memory budget, and the system has to refuse cardinality bombs at the door rather than silently OOM the head block.

Requirements

Functional Requirements

Feature	Priority	Scope
Metric ingestion (push model)	Core	Full
Range queries by time	Core	Full
Tag-based filtering	Core	Full
Aggregations (sum, avg, max, p99)	Core	Full
Downsampling and rollups	Core	Full
Retention policies	Core	Full
Distributed queries	Core	Full
Cardinality management	Core	Full
Alerting integration	High	Overview
Multi-tenancy	High	Overview
Exemplar storage (traces)	Medium	Brief
Anomaly detection	Low	Out of scope

Non-Functional Requirements

Requirement	Target	Rationale
Write throughput	1M+ samples/sec per node	Modern infrastructure generates massive telemetry
Read latency (hot data)	p99 < 100 ms	Dashboard refresh and alerting requirements
Read latency (cold data)	p99 < 5 s	Historical analysis is acceptable with more delay
Storage efficiency	< 2 bytes/sample compressed	Cost-effective retention of years of data
Availability	99.9%	Monitoring must survive partial production outages
Data durability	99.999%	Metrics are forensic evidence after incidents
Retention	Raw: 15 days, Aggregated: 2 years	Balance granularity vs. storage cost

Scale Estimation

Metric Sources

Hosts: 100K servers, containers, and VMs
Metrics per host: 500 (CPU, memory, disk, network, application)
Collection interval: 10 seconds

Traffic

Samples/sec:
Daily samples: per day
Peak multiplier: 2x during deployments → ~10M samples/sec burst

Storage (raw)

Sample size: 16 bytes (8-byte timestamp + 8-byte double)
Daily raw: TB/day
With Gorilla-class compression (~12x): ~575 GB/day
15-day raw retention: ~8.6 TB
2-year 1-minute downsampled rollup: ~2 TB

Cardinality

Active series (metric + tags): ~50M
Series per host (avg): ~500
High-cardinality labels (per-request id, per-user id, raw URL): must be blocked or pre-aggregated

Design Paths

Path A: In-memory with disk spillover (Gorilla / Atlas style)

Best when: the most valuable data is the last few hours, the memory budget is bounded, and you need sub-millisecond reads on hot data. This is what Facebook described in the original Gorilla paper, and similar in spirit to Netflix Atlas.

Architecture:

All active series in memory, each backed by a fixed-size ring buffer
Disk used only for overflow and crash recovery
Cold tier handed off to a separate system

Trade-offs:

Sub-millisecond query latency for recent windows
Predictable memory usage per series
Limited retention (typically hours to a couple of days)
Memory cost scales linearly with cardinality
Cold queries require a second stack

Real-world example: Gorilla holds the most recent 26 hours in memory at Facebook². Atlas keeps a rolling window for live operational dashboards.

Path B: LSM-style with tiered storage (Prometheus / InfluxDB)

Best when: retention has to stretch from weeks to months, queries hit a mix of hot and cold ranges, and storage cost matters.

Architecture:

Write-ahead log for durability
In-memory head block with a 2-hour window
Background compaction to immutable, time-partitioned blocks
Object storage for cold blocks

Trade-offs:

Cost-effective long retention
Strong compression (Gorilla-class chunks reach ~1.37 bytes/sample²)
Handles variable cardinality within a known budget
Cold-tier query latency is higher
Background compaction adds CPU and write amplification

Real-world example: Prometheus uses 2-hour head blocks and compacts upward to 31 days¹. InfluxDB’s TSM engine groups data into shards whose duration scales with retention — 7 days for the default infinite-retention autogen policy, smaller for shorter retentions³.

Path C: Distributed LSM with global index (M3 / Cortex / Mimir)

Best when: a single node cannot keep up, you need multi-tenant isolation, or queries need to federate across regions.

Architecture:

Distributors hash by series ID onto a ring of stateful ingesters
Per-tenant isolated WAL, head, and index
Query frontends fan out, then merge and aggregate
Centralized or per-tenant indexing

Trade-offs:

Horizontal scale into the billions of samples/sec
Multi-tenant isolation
Geographic distribution becomes feasible
Operational complexity (rings, replication, rebalancing)
Network overhead per query
Consistency surprises during scale-up / scale-down

Real-world example: Uber’s M3 is documented as ingesting over a billion datapoints per second in production, with internal Uber posts citing ~8.5B datapoints/sec at the query layer in 2018⁴. Cortex and its successor Grafana Mimir provide multi-tenant, Prometheus-compatible storage along the same lines.

Path Comparison

Factor	Path A (In-Memory)	Path B (LSM)	Path C (Distributed)
Query latency (hot)	Sub-ms	10-100 ms	50-500 ms
Retention	Hours	Weeks-Months	Years
Storage efficiency	Compressed in RAM (~1.37 bytes/sample)	Same compression on disk + tiering	Same plus per-tenant tiers
Cardinality limit	RAM-bound	Disk + index-bound	Per-tenant configurable
Operational complexity	Low	Medium	High
Best for	Real-time dashboards	Single-cluster monitoring	Multi-tenant SaaS

This Article’s Focus

The rest of this article implements Path B (LSM-style) as the core, with the distributor / query-fanout extensions from Path C. That matches the shape of most production deployments today (Prometheus, InfluxDB, VictoriaMetrics, Mimir) and gives a foundation for both single-node and distributed reasoning.

High-Level Design

Service Architecture

Ingestion Service (Distributor)

Receives metrics from agents and routes to storage nodes:

Parse and validate the wire format (Prometheus remote-write, InfluxDB line protocol, OpenTelemetry OTLP)
Extract metric name, labels, timestamp, value
Hash the canonical series ID (metric name + sorted labels) for consistent routing
Forward to one or more replica ingesters
Apply backpressure via load shedding

Storage Engine (Ingester)

Owns the local write path and durable state:

Append every batch to the Write-Ahead Log (WAL) before acknowledging
Buffer samples in an in-memory head per series
Maintain an in-memory inverted index for active series
Flush head to immutable blocks every 2 hours
Trigger compaction to merge small blocks into larger ones

Compaction Service

Background process that keeps the on-disk shape sane:

Merge small blocks into larger ones
Re-encode chunks with the chosen compression
Drop deleted or expired data
Rebuild indexes for the merged block
Enforce retention by deleting blocks past the cutoff

Query Engine

Executes queries across hot and cold tiers:

Parse the query language (PromQL, InfluxQL, SQL)
Plan execution: which blocks must be scanned given the time range
Resolve label selectors via the inverted index
Fetch and decompress chunks, then aggregate
Return results with metadata (resolution, source blocks, exemplars)

Lifecycle Manager

Owns retention and downsampling:

Enforce per-tenant time-based retention
Run downsampling jobs (raw → 1-minute → 1-hour aggregates)
Move blocks across tiers (hot → warm → cold)
Track storage usage per tenant for quota and billing

Data Flow: Write Path

Write path sequence: agent posts a batch to a distributor, which validates and hashes the series, then routes to one or more ingesters; each ingester appends to its WAL, acknowledges, then buffers in the memtable; after the head window, the memtable flushes to a compressed, indexed block. — Write path: distributor fan-out, WAL append, memtable buffer, periodic block flush.

Data Flow: Query Path

Query path sequence: client sends a range query, the query engine checks a query cache, then resolves matching series via the inverted index, fetches compressed chunks, decompresses and aggregates, and stores the result in the cache before responding. — Query path: cache lookup, label-to-series resolution, chunk fetch, and aggregation.

API Design

Write Endpoint

Endpoint: POST /api/v1/write

Accepts metrics in Prometheus remote-write (Snappy-compressed protobuf) or InfluxDB line protocol:

1# Prometheus exposition format (text)2http_requests_total{method="GET",status="200"} 1234 17040672000003http_requests_total{method="POST",status="500"} 5 17040672000004cpu_usage{host="server-1",core="0"} 0.75 170406720000056# InfluxDB line protocol7http_requests,method=GET,status=200 total=1234 17040672000000000008cpu,host=server-1,core=0 usage=0.75 1704067200000000000

Request (Prometheus remote-write protobuf):

1message WriteRequest {2  repeated TimeSeries timeseries = 1;3  repeated MetricMetadata metadata = 2;4}56message TimeSeries {7  repeated Label labels = 1;  // [{name: "__name__", value: "http_requests_total"}, ...]8  repeated Sample samples = 2;9}1011message Sample {12  double value = 1;13  int64 timestamp = 2;  // Unix ms14}

Response:

204 No Content: success
400 Bad Request: invalid format, missing required labels
429 Too Many Requests: rate-limited (backpressure)
503 Service Unavailable: ingester overloaded

Design Decision: Push vs. Pull

Prometheus famously uses a pull / scrape model, while most TSDBs accept push. For a general-purpose TSDB:

Push (chosen here): simpler client integration, traverses firewalls naturally, scales with stateless distributors.
Pull: better for service discovery, guarantees scrape intervals, but requires bidirectional connectivity to every target.

Remote-write enables a hybrid: a local Prometheus scrapes nearby targets and pushes to a central, multi-tenant TSDB.

Query Endpoint

Endpoint: GET /api/v1/query_range

Query Parameters:

Parameter	Type	Description
`query`	string	Query expression (PromQL, InfluxQL)
`start`	int64	Start timestamp (Unix seconds or RFC3339)
`end`	int64	End timestamp
`step`	duration	Query resolution (e.g., “1m”, “5m”)

Example Query:

1GET /api/v1/query_range?query=rate(http_requests_total{status="500"}[5m])&start=1704067200&end=1704153600&step=60

Response:

1{2  "status": "success",3  "data": {4    "resultType": "matrix",5    "result": [6      {7        "metric": {8          "__name__": "http_requests_total",9          "method": "GET",10          "status": "500",11          "instance": "server-1:9090"12        },13        "values": [14          [1704067200, "0.5"],15          [1704067260, "0.7"],16          [1704067320, "0.3"]17        ]18      }19    ]20  }21}

Rate Limits: 100 queries/min per tenant, 10 concurrent queries.

Label Endpoints

Get label names: GET /api/v1/labels

Get label values: GET /api/v1/label/{label_name}/values

These query the inverted index and are essential for dashboard tools that build dynamic dropdowns.

Series Metadata

Endpoint: GET /api/v1/series

Returns series matching a label selector:

1GET /api/v1/series?match[]=http_requests_total{status=~"5.."}

Used for cardinality analysis and debugging high-cardinality labels.

Data Modeling

Sample Format

Primary unit: a sample is a (timestamp, value) pair attached to a specific series.

1interface Sample {2  timestamp: number // Unix milliseconds3  value: number // IEEE 754 double (64-bit)4}56interface Series {7  labels: Map<string, string> // Metric name + tags8  samples: Sample[]9}1011// Series identifier (sorted label pairs)12// http_requests_total{method="GET",status="200"}13// → __name__=http_requests_total,method=GET,status=20014// → Hash: fnv64a("__name__\xff http_requests_total\xff method\xff GET\xff ...")

Design Decision: Label Ordering

Labels are sorted alphabetically before hashing. This ensures the same logical series always hashes to the same ID regardless of label order in the write request — critical for both deduplication on the distributor and consistent routing across replicas.

Block Format

A block stores a contiguous, immutable time window (2 hours by default in Prometheus¹):

1block-<ulid>/2├── meta.json           # Block metadata3├── index               # Inverted index (labels → series)4├── chunks/5│   ├── 000001          # Compressed chunk file6│   ├── 0000027│   └── ...8└── tombstones          # Deleted-series markers

meta.json:

1{2  "ulid": "01HQGJ5P1XXXXXXXXX",3  "minTime": 1704067200000,4  "maxTime": 1704074400000,5  "stats": {6    "numSamples": 50000000,7    "numSeries": 100000,8    "numChunks": 1500009  },10  "compaction": {11    "level": 1,12    "sources": ["01HQGJ..."]13  },14  "version": 115}

Index Structure

The on-disk index is an inverted index keyed by (label name, label value):

1Label → Series IDs2──────────────────3method=GET    → [1, 3, 5, 7, 9, ...]4method=POST   → [2, 4, 6, 8, ...]5status=200    → [1, 2, 3, 4, 5, ...]6status=500    → [6, 7, 8, ...]78Series ID → Chunk Refs9──────────────────────101 → [(chunk_001, offset=0, len=4096), (chunk_002, offset=0, len=2048)]112 → [(chunk_001, offset=4096, len=3072)]

Posting list intersection for queries:

1http_requests{method="GET",status="500"}231. Lookup method=GET → [1, 3, 5, 7, 9]42. Lookup status=500 → [7, 8, 9, 10]53. Intersect → [7, 9]64. Fetch chunks for series 7, 9

Posting lists are sorted by series ID, so intersecting two of them is a linear merge in . The query planner picks the smallest list first to keep intermediate results tight.

Schema Design (SQL Representation)

This is not the on-disk format — TSDBs do not store samples in a relational table. The SQL below is a reference model only, useful for reasoning about cardinality and indexing:

1-- Series registry (the inverted index in memory/disk)2CREATE TABLE series (3    id BIGSERIAL PRIMARY KEY,4    labels_hash BIGINT UNIQUE NOT NULL,5    labels JSONB NOT NULL,6    created_at TIMESTAMPTZ DEFAULT NOW()7);89CREATE INDEX idx_series_labels ON series USING GIN (labels);1011-- Samples (conceptual — stored in compressed chunks, not SQL)12CREATE TABLE samples (13    series_id BIGINT NOT NULL,14    timestamp TIMESTAMPTZ NOT NULL,15    value DOUBLE PRECISION NOT NULL,16    PRIMARY KEY (series_id, timestamp)17);1819-- Block metadata20CREATE TABLE blocks (21    ulid TEXT PRIMARY KEY,22    min_time TIMESTAMPTZ NOT NULL,23    max_time TIMESTAMPTZ NOT NULL,24    num_samples BIGINT NOT NULL,25    num_series INTEGER NOT NULL,26    level INTEGER DEFAULT 1,27    created_at TIMESTAMPTZ DEFAULT NOW()28);2930CREATE INDEX idx_blocks_time ON blocks(min_time, max_time);

Storage Selection Matrix

Data Type	Storage	Rationale
Active samples	Memory (memtable)	Sub-ms writes; recent data is hottest
WAL	Local SSD (sequential)	Durability; sequential write pattern
Hot blocks	Local SSD (random read)	Frequent queries; benefit from cache
Warm blocks	Networked SSD	Less frequent access; cost efficiency
Cold blocks	Object storage (S3/GCS)	Long retention at lowest $/GB
Inverted index	Memory + SSD	Fast label lookups, memory-mapped
Downsampled data	Separate retention tier	Skip raw scan for long ranges

Low-Level Design

Compression: Gorilla Algorithm

Compression is the core of TSDB efficiency. Facebook’s Gorilla paper (VLDB 2015) introduced delta-of-delta for timestamps and XOR for values, and reported ~1.37 bytes/sample on production traffic — roughly a 12x reduction over the naive (int64, double) layout². The same scheme (or a close variant) is what Prometheus, InfluxDB TSM, M3, and VictoriaMetrics use inside their chunks.

Timestamp Compression (Delta-of-Delta)

Samples typically arrive at regular intervals (e.g., every 10 seconds). Instead of storing absolute timestamps:

1// Delta-of-delta encoding2// Timestamps: [1000, 1010, 1020, 1030, 1040]3// Deltas:     [  -,   10,   10,   10,   10]4// Delta-of-delta: [-, -, 0, 0, 0]56// Encoding rules (Gorilla paper, Section 4.1.1):7// - If delta-of-delta = 0: write '0' (1 bit)8// - If fits in [-63, 64]: write '10' + 7 bits9// - If fits in [-255, 256]: write '110' + 9 bits10// - If fits in [-2047, 2048]: write '1110' + 12 bits11// - Otherwise: write '1111' + 32 bits1213function encodeTimestamps(timestamps: number[]): BitStream {14  const stream = new BitStream()1516  // First timestamp: stored as-is (64 bits)17  stream.writeBits(timestamps[0], 64)1819  if (timestamps.length < 2) return stream2021  // Second timestamp: delta from first22  let prevDelta = timestamps[1] - timestamps[0]23  stream.writeBits(prevDelta, 14)2425  // Remaining: delta-of-delta26  for (let i = 2; i < timestamps.length; i++) {27    const delta = timestamps[i] - timestamps[i - 1]28    const deltaOfDelta = delta - prevDelta2930    if (deltaOfDelta === 0) {31      stream.writeBit(0)32    } else if (deltaOfDelta >= -63 && deltaOfDelta <= 64) {33      stream.writeBits(0b10, 2)34      stream.writeBits(deltaOfDelta + 63, 7)35    } else if (deltaOfDelta >= -255 && deltaOfDelta <= 256) {36      stream.writeBits(0b110, 3)37      stream.writeBits(deltaOfDelta + 255, 9)38    } else if (deltaOfDelta >= -2047 && deltaOfDelta <= 2048) {39      stream.writeBits(0b1110, 4)40      stream.writeBits(deltaOfDelta + 2047, 12)41    } else {42      stream.writeBits(0b1111, 4)43      stream.writeBits(deltaOfDelta, 32)44    }4546    prevDelta = delta47  }4849  return stream50}

On Facebook’s workload, ~96% of timestamps land on the single-bit branch because scrapers fire on a fixed cadence².

Value Compression (XOR Encoding)

Consecutive metric values usually share most of their bits (e.g., CPU at 0.75 then 0.76):

1// XOR encoding for IEEE 754 doubles2// Values: [0.75, 0.76, 0.76, 0.77]3// XOR with previous: [0.75, XOR(0.75,0.76), XOR(0.76,0.76), XOR(0.76,0.77)]45// The XOR of similar floats has many leading and trailing zeros.6// We encode (leading-zeros count, significant-bit count, significant bits).78interface XORState {9  prevValue: bigint10  prevLeadingZeros: number11  prevTrailingZeros: number12}1314function encodeValues(values: number[]): BitStream {15  const stream = new BitStream()1617  const firstBits = floatToBits(values[0])18  stream.writeBits(firstBits, 64)1920  let state: XORState = {21    prevValue: firstBits,22    prevLeadingZeros: 0,23    prevTrailingZeros: 0,24  }2526  for (let i = 1; i < values.length; i++) {27    const currentBits = floatToBits(values[i])28    const xor = currentBits ^ state.prevValue2930    if (xor === 0n) {31      stream.writeBit(0)32    } else {33      stream.writeBit(1)3435      const leadingZeros = countLeadingZeros(xor)36      const trailingZeros = countTrailingZeros(xor)37      const significantBits = 64 - leadingZeros - trailingZeros3839      if (leadingZeros >= state.prevLeadingZeros && trailingZeros >= state.prevTrailingZeros) {40        // Fits the previous window: control '0' + significant bits41        stream.writeBit(0)42        const shift = 64 - state.prevLeadingZeros - significantBits43        stream.writeBits(xor >> BigInt(shift), significantBits)44      } else {45        // New window: control '1' + leading zeros + length + bits46        stream.writeBit(1)47        stream.writeBits(leadingZeros, 5)48        stream.writeBits(significantBits - 1, 6)49        stream.writeBits(xor >> BigInt(trailingZeros), significantBits)5051        state.prevLeadingZeros = leadingZeros52        state.prevTrailingZeros = trailingZeros53      }54    }5556    state.prevValue = currentBits57  }5859  return stream60}

Gorilla’s measured value bucket distribution²:

~51% of values: 1 bit (XOR is zero — value unchanged)
~30% of values: ~26.6 bits (control '10' path — XOR fits the previous leading/trailing-zero window)
~19% of values: ~36.9 bits (control '11' path — new window, 13 bits of overhead)

Note

The exact bucket percentages depend on workload (Facebook’s production gauges and counters). Workloads with noisier doubles compress less; integer-valued counters compress better.

Combined compression: Facebook’s published average across both timestamps and values is ~1.37 bytes/sample, versus 16 bytes raw — roughly a 12x reduction². Plan for a less rosy ratio in your own workload until you measure: ratios of 6-10x are common in practice once you mix in churning gauge values, irregular scrape intervals, and per-chunk overhead.

Write-Ahead Log

The WAL guarantees durability before acknowledging a write. Prometheus, for example, stores its WAL in 128 MB segments and keeps a minimum of three around so it can replay at least the head window after a crash¹:

1// WAL segment structure2// Segments are 128MB each, kept for minimum 3 files34interface WALSegment {5  id: number6  path: string7  size: number8  firstTimestamp: number9  lastTimestamp: number10}1112interface WALRecord {13  type: "series" | "samples" | "tombstone"14  data: Uint8Array15}1617class WriteAheadLog {18  private currentSegment: WALSegment19  private segments: WALSegment[] = []20  private readonly maxSegmentSize = 128 * 1024 * 1024 // 128MB21  private readonly minSegmentsKept = 32223  async append(records: WALRecord[]): Promise<void> {24    const encoded = this.encodeRecords(records)2526    if (this.currentSegment.size + encoded.length > this.maxSegmentSize) {27      await this.rotateSegment()28    }2930    await this.currentSegment.write(encoded)31    await this.currentSegment.sync() // fsync32  }3334  async truncate(checkpointSegmentId: number): Promise<void> {35    const toRemove = this.segments.filter(36      (s) => s.id < checkpointSegmentId && this.segments.length - 1 >= this.minSegmentsKept,37    )3839    for (const segment of toRemove) {40      await fs.unlink(segment.path)41      this.segments = this.segments.filter((s) => s.id !== segment.id)42    }43  }4445  async replay(): Promise<WALRecord[]> {46    const records: WALRecord[] = []47    for (const segment of this.segments) {48      const segmentRecords = await this.readSegment(segment)49      records.push(...segmentRecords)50    }51    return records52  }53}

Important

WAL replay is the longest single phase of cold-start recovery. Prometheus added optional WAL compression (default since v2.20) to keep replay time bounded. Plan capacity for WAL size based on the head window, not just steady-state ingestion.

In-Memory Index (Series Registry)

The in-memory index maps labels to series IDs for fast query planning:

1// Inverted index for label lookups.2// Posting lists are sorted, deduplicated arrays of series IDs.34type SeriesID = number5type PostingList = SeriesID[]67class InMemoryIndex {8  private postings: Map<string, Map<string, PostingList>> = new Map()9  private seriesLabels: Map<SeriesID, Map<string, string>> = new Map()10  private hashToSeries: Map<bigint, SeriesID> = new Map()1112  private nextSeriesID: SeriesID = 11314  getOrCreateSeries(labels: Map<string, string>): SeriesID {15    const hash = this.hashLabels(labels)1617    const existing = this.hashToSeries.get(hash)18    if (existing !== undefined) {19      return existing20    }2122    const id = this.nextSeriesID++23    this.hashToSeries.set(hash, id)24    this.seriesLabels.set(id, labels)2526    for (const [name, value] of labels) {27      if (!this.postings.has(name)) {28        this.postings.set(name, new Map())29      }30      const valueMap = this.postings.get(name)!31      if (!valueMap.has(value)) {32        valueMap.set(value, [])33      }34      valueMap.get(value)!.push(id)35    }3637    return id38  }3940  query(matchers: LabelMatcher[]): SeriesID[] {41    if (matchers.length === 0) {42      return Array.from(this.seriesLabels.keys())43    }4445    // Start with the smallest posting list (most selective)46    const sorted = [...matchers].sort((a, b) => this.getPostingListSize(a) - this.getPostingListSize(b))4748    let result = this.getPostingList(sorted[0])4950    for (let i = 1; i < sorted.length; i++) {51      const other = this.getPostingList(sorted[i])52      result = this.intersect(result, other)53      if (result.length === 0) break // Early exit54    }5556    return result57  }5859  private intersect(a: PostingList, b: PostingList): PostingList {60    // Sorted-merge intersection: O(n + m)61    const result: SeriesID[] = []62    let i = 0,63      j = 06465    while (i < a.length && j < b.length) {66      if (a[i] === b[j]) {67        result.push(a[i])68        i++69        j++70      } else if (a[i] < b[j]) {71        i++72      } else {73        j++74      }75    }7677    return result78  }79}

Block Compaction

Compaction merges small blocks into larger ones, lowering per-block overhead and improving query efficiency. Prometheus caps the result at 31 days or 10% of the configured retention, whichever is smaller¹:

Compaction levels: four 2-hour level-0 blocks merge into one 8-hour level-1 block, which then merges with peers into a 24-hour or longer level-2 block, capped at 31 days or 10% of retention. — Compaction merges small blocks into larger ones, capped to keep block deletion granular.

1// Compaction strategy2// - Level 0: 2-hour blocks (from memtable flush)3// - Level 1: 8-hour blocks (4 level-0 merged)4// - Level 2: 24-hour blocks5// - Max block size: 31 days OR 10% of retention (whichever is smaller)67interface CompactionPlan {8  sources: Block[]9  targetLevel: number10  estimatedSize: number11}1213class Compactor {14  private readonly maxBlockDuration = 31 * 24 * 60 * 60 * 1000 // 31 days1516  async planCompaction(blocks: Block[]): Promise<CompactionPlan[]> {17    const plans: CompactionPlan[] = []18    const byLevel = this.groupByLevel(blocks)1920    for (const [level, levelBlocks] of byLevel) {21      const sorted = levelBlocks.sort((a, b) => a.minTime - b.minTime)2223      for (let i = 0; i < sorted.length - 1; i++) {24        const candidates = [sorted[i]]25        let j = i + 12627        while (j < sorted.length && sorted[j].minTime === candidates[candidates.length - 1].maxTime) {28          candidates.push(sorted[j])29          j++3031          const duration = candidates[j - 1].maxTime - candidates[0].minTime32          if (duration > this.maxBlockDuration) break3334          if (candidates.length >= 4) break35        }3637        if (candidates.length >= 2) {38          plans.push({39            sources: candidates,40            targetLevel: level + 1,41            estimatedSize: candidates.reduce((sum, b) => sum + b.size, 0) * 0.9,42          })43          i = j - 144        }45      }46    }4748    return plans49  }5051  async compact(plan: CompactionPlan): Promise<Block> {52    const iterators = plan.sources.map((b) => b.createIterator())53    const mergedSamples = this.mergeSortIterators(iterators)54    const newBlock = await this.writeBlock(mergedSamples, plan.targetLevel)5556    await this.swapBlocks(plan.sources, newBlock)57    for (const source of plan.sources) {58      await source.delete()59    }60    return newBlock61  }62}

Caution

Capping the largest block size matters operationally: deletes happen at block granularity. A 90-day mega-block wastes disk on a 7-day delete request, so the 10%-of-retention cap is not arbitrary.

Downsampling and Retention Tiers

Long-range queries cannot afford to scan years of raw 10-second samples. Downsampling pre-computes lower-resolution rollups and routes queries to the right tier:

Retention lifecycle: raw 10-second samples on hot tier (15 days) downsample into 1-minute aggregates on warm tier (90 days) and 1-hour aggregates on cold tier (2 years); each tier discards data past its retention window. — Raw, 1-minute, and 1-hour tiers with retention windows and downsampling transitions.

1// Downsampling tiers2// - Raw: 10-second resolution, 15 days3// - 1-minute: min/max/sum/count/avg, 90 days4// - 1-hour: min/max/sum/count/avg, 2 years56interface DownsampleConfig {7  sourceResolution: string8  targetResolution: string9  aggregations: string[]10  retention: number11}1213interface DownsampledSample {14  timestamp: number // Bucket start time15  min: number16  max: number17  sum: number18  count: number19  // Derived: avg = sum / count20}2122class Downsampler {23  async downsample(seriesId: number, sourceBlock: Block, config: DownsampleConfig): Promise<DownsampledSample[]> {24    const bucketMs = this.parseDuration(config.targetResolution)25    const buckets = new Map<number, DownsampledSample>()2627    const iterator = sourceBlock.createIterator(seriesId)2829    for await (const sample of iterator) {30      const bucketStart = Math.floor(sample.timestamp / bucketMs) * bucketMs3132      if (!buckets.has(bucketStart)) {33        buckets.set(bucketStart, {34          timestamp: bucketStart,35          min: sample.value,36          max: sample.value,37          sum: sample.value,38          count: 1,39        })40      } else {41        const bucket = buckets.get(bucketStart)!42        bucket.min = Math.min(bucket.min, sample.value)43        bucket.max = Math.max(bucket.max, sample.value)44        bucket.sum += sample.value45        bucket.count++46      }47    }4849    return Array.from(buckets.values())50  }51}5253// Query routing: pick the cheapest resolution that still answers the question.54function selectResolution(queryRange: number): string {55  if (queryRange <= 24 * 60 * 60 * 1000) return "raw"            // ≤ 1 day56  if (queryRange <= 30 * 24 * 60 * 60 * 1000) return "1m"        // ≤ 30 days57  return "1h"58}

Cardinality Management

High cardinality is the single most common operational fire in production TSDBs. Each unique (metric name, label set) combination is a series with its own posting-list entry and its own in-memory state; an unbounded label like request_id can multiply series by orders of magnitude in minutes.

1interface CardinalityLimits {2  maxSeriesPerMetric: number       // e.g., 100,0003  maxTotalSeries: number           // e.g., 10,000,0004  maxLabelValueCardinality: number // e.g., 10,000 per label name5}67class CardinalityTracker {8  private seriesPerMetric: Map<string, Set<number>> = new Map()9  private seriesPerLabel: Map<string, Map<string, number>> = new Map()10  private totalSeries: number = 01112  checkLimits(labels: Map<string, string>, limits: CardinalityLimits): CardinalityError | null {13    const metricName = labels.get("__name__")!1415    const metricSeries = this.seriesPerMetric.get(metricName)?.size ?? 016    if (metricSeries >= limits.maxSeriesPerMetric) {17      return {18        type: "metric_cardinality_exceeded",19        metric: metricName,20        current: metricSeries,21        limit: limits.maxSeriesPerMetric,22      }23    }2425    for (const [name, value] of labels) {26      const labelValues = this.seriesPerLabel.get(name)27      if (labelValues && labelValues.size >= limits.maxLabelValueCardinality) {28        return {29          type: "label_cardinality_exceeded",30          label: name,31          current: labelValues.size,32          limit: limits.maxLabelValueCardinality,33        }34      }35    }3637    if (this.totalSeries >= limits.maxTotalSeries) {38      return {39        type: "total_cardinality_exceeded",40        current: this.totalSeries,41        limit: limits.maxTotalSeries,42      }43    }4445    return null46  }4748  // Surface labels that are blowing up cardinality so operators can act.49  getHighCardinalityLabels(threshold: number): { label: string; cardinality: number }[] {50    const results: { label: string; cardinality: number }[] = []51    for (const [label, values] of this.seriesPerLabel) {52      if (values.size > threshold) {53        results.push({ label, cardinality: values.size })54      }55    }56    return results.sort((a, b) => b.cardinality - a.cardinality)57  }58}

Warning

Common high-cardinality antipatterns: request_id, trace_id, user_id, raw URLs with query strings, IP addresses, UUIDs, free-form error messages. Move these to exemplars or logs/traces (e.g., Prometheus exemplars) instead of raising them as labels. Reject offending series at the distributor — silent dropping is harder to debug than a clear 429.

Frontend Considerations

Dashboard Query Optimization

A dashboard with 20 panels, each pulling 1 week of data at 1-minute resolution, generates datapoints — most of which the browser cannot meaningfully render anyway.

Working levers:

Step alignment — query at the dashboard refresh resolution, not the storage resolution.
Parallel queries — fan out panel fetches concurrently.
Query caching — cache identical queries server-side and CDN-side.
Streaming updates — WebSocket / SSE for live tails, HTTP for historical scrolls.

1interface DashboardQuery {2  panelId: string3  expr: string4  range: { start: number; end: number }5  step: number6}78async function fetchDashboard(queries: DashboardQuery[]): Promise<Map<string, QueryResult>> {9  // Group queries by time range (panels typically share one range)10  const byRange = groupBy(queries, (q) => `${q.range.start}-${q.range.end}`)1112  const results = new Map<string, QueryResult>()1314  await Promise.all(15    Object.entries(byRange).map(async ([, rangeQueries]) => {16      // Batch multiple expressions if the backend supports it17      const batchResult = await fetchBatch(rangeQueries)18      for (const [panelId, result] of batchResult) {19        results.set(panelId, result)20      }21    }),22  )2324  return results25}

Time Range Selection

UX considerations:

Relative ranges (Last 1 hour, Last 7 days) for monitoring
Absolute ranges for incident investigation
Auto-refresh interval matching the query range
Timezone displayed in the user’s local time, queried in UTC

Graph Rendering

Performance patterns:

Downsample on the client for display — a <canvas> cannot meaningfully render 10K points per panel
Use uPlot or WebGL for high-density graphs (Grafana ships uPlot for many panel types)
Lazy-load panels as they scroll into view
Debounce zoom and pan operations

Infrastructure Design

Cloud-Agnostic Concepts

Component	Requirement	Options
Ingest / Query	Stateless, auto-scaling	Kubernetes Deployment, ECS Service
Storage Engine	Stateful, local SSD	StatefulSet with local volumes
Block Storage	High IOPS, low latency	Local NVMe, EBS gp3, Persistent Disks
Object Storage	Cold tier, cheap, durable	S3, GCS, MinIO
Coordination	Leader election, config	etcd, Consul, ZooKeeper

AWS Reference Architecture

Component	AWS Service	Configuration
Load Balancer	NLB	TCP passthrough, cross-AZ
Distributors	EKS (Fargate)	3-10 pods, 2 vCPU / 4 GB each
Ingesters	EKS (EC2)	i3/i4i.xlarge (NVMe), 3+ nodes, StatefulSet
Query Engines	EKS (Fargate)	2-10 pods, 4 vCPU / 8 GB each
Cold Storage	S3 Standard-IA	Lifecycle to Glacier after 90 days
Query Cache	ElastiCache Redis	cache.r6g.large, cluster mode
Coordination	EKS control plane	Built-in etcd

Self-Hosted Alternatives

Managed Service	Self-Hosted	When to Self-Host
ElastiCache	Redis on EC2	Cost, specific modules
S3	MinIO	On-premises, data sovereignty
EKS	k3s/k8s on EC2	Cost, full control

High Availability

Replication strategy for ingesters:

Distributor writes to N ingesters (N=3 typical)
Query engine reads from any one replica, dedupes
Consistency: eventual — recently-written samples may lag a replica
Durability: WAL replicated, blocks uploaded to object storage

Failure scenarios:

Ingester crash — WAL is replayed on the replacement; recent samples can be back-filled from peers.
Query engine crash — stateless, the load balancer routes to the next healthy instance.
Object storage unavailable — serve hot data from ingesters, fail cold queries gracefully with a clear error.
Distributor partition — drop samples for the partitioned tenant rather than blocking the cluster; alert on increased 429/503 rate.

Conclusion

The design prioritizes write efficiency (LSM head + WAL), storage efficiency (Gorilla-class chunks), and query performance (inverted index + time partitioning). The same shape scales from a single Prometheus node up to a distributed Cortex/Mimir/M3 cluster.

Key architectural decisions:

LSM-style storage with 2-hour blocks — balances write amplification against query efficiency. Compaction merges blocks up to the smaller of 31 days or 10% of retention¹.
Gorilla compression (delta-of-delta + XOR) — averages ~1.37 bytes/sample on Facebook’s workload, with ~96% of timestamps collapsing to 1 bit at regular scrape intervals².
Inverted index with posting-list intersection — label queries are linear merges over sorted posting lists; memory-mapped to handle large cardinalities.
Tiered retention with downsampling — raw 15 days → 1-minute 90 days → 1-hour 2 years. Queries auto-select the cheapest resolution that still answers the question.
Cardinality limits as a first-class feature — per-metric, per-label, and total-series budgets prevent the head block from OOMing on a single bad label.

Limitations and future directions:

Cross-series queries — aggregations over many series remain expensive; pre-computed recording rules and continuous aggregates help.
Exemplars and traces — linking a metric sample to a trace is increasingly table stakes; budget for exemplar storage and ingestion.
Multi-tenancy — proper isolation requires per-tenant rate limiting, query queueing, and chargeback metering.

Appendix

Prerequisites

Distributed-systems fundamentals (consistency, availability, partitioning)
Database internals (LSM trees, B-trees, write-ahead logs)
Data compression concepts
Familiarity with metrics and monitoring

Terminology

Sample — a (timestamp, value) pair for a specific series.
Series — a unique metric identified by its name and label set.
Cardinality — the number of unique series.
TSDB — Time Series Database, optimized for time-indexed, append-only data.
LSM Tree — Log-Structured Merge Tree, a storage structure using sorted runs and background compaction.
WAL — Write-Ahead Log, a sequential log used for durability before in-memory commit.
Gorilla — Facebook’s compression algorithm for time-series (VLDB 2015).
PromQL — Prometheus Query Language, a functional query language for metrics.
Posting list — sorted list of series IDs matching a label value.

Summary

TSDBs exploit append-only, time-ordered access for ~10x compression and fast range queries.
Gorilla averages ~1.37 bytes/sample with ~96% of timestamps compressing to 1 bit on regular scrape intervals².
LSM-style storage with 2-hour blocks balances write throughput against query efficiency; compaction caps at 31 days or 10% of retention¹.
Inverted index maps labels to series IDs; sorted posting-list intersection drives label queries.
Cardinality management is critical — high-cardinality labels (request_id, user_id) must be blocked or aggregated.
Tiered retention with downsampling (raw → 1m → 1h) keeps multi-year storage affordable.
A single node can handle 1M+ samples/sec with proper tuning; distributed designs (M3, Mimir, Cortex) reach 1B+⁴.

References

Gorilla: A Fast, Scalable, In-Memory Time Series Database — Facebook, VLDB 2015. Compression algorithm and 26-hour in-memory architecture.
Prometheus storage documentation — block format, WAL, compaction, retention.
Prometheus TSDB index format — on-disk index spec.
InfluxDB Storage Engine (TSM) — TSM and TSI architecture.
M3: Uber’s Open Source, Large-scale Metrics Platform — distributed TSDB at scale.
The Billion Data Point Challenge — Uber engineering on M3 query scale.
Cortex Architecture — multi-tenant, Prometheus-compatible storage.
Grafana Mimir — Cortex’s spiritual successor; horizontally scalable Prometheus.
VictoriaMetrics vs Prometheus — performance and architecture comparison.
TimescaleDB hypertables — relational TSDB with hypertables and continuous aggregates.
QuestDB storage model — column-oriented TSDB.
PromQL documentation — query language reference.

Prometheus project, Storage documentation. Defines 2-hour head blocks, 128 MB WAL segments, and the “31 days or 10% of retention, whichever is smaller” compaction cap. ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷
Pelkonen et al., Gorilla: A Fast, Scalable, In-Memory Time Series Database, VLDB 2015. Reports ~1.37 bytes/sample after delta-of-delta + XOR compression, ~96% of timestamps as a single bit, and a 26-hour in-memory window in production at Facebook. ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷ ↩⁸
InfluxData, InfluxDB v1 Storage Engine documentation and Database management — shard group duration. Shard-group duration defaults vary with retention: ≤2 days → 1 hour, >2 days and <6 months → 1 day, ≥6 months (and the default autogen infinite-retention policy) → 7 days. ↩
Uber Engineering, The Billion Data Point Challenge. Reports an internal M3 query engine handling ~8.5 billion data points per second as of 2018. ↩ ↩²