Design a Time Series Database

A comprehensive system design for a metrics and monitoring time-series database (TSDB) handling high-velocity writes, efficient compression, and long-term retention. This design addresses write throughput of millions of samples per second, low-latency queries over billions of datapoints (p99 under 100 ms on hot data), cardinality management for dimensional data, and multi-tier storage for cost-effective retention.

Figure: High-level architecture. Metrics flow through collectors to a write-ahead log, buffer in memory, then compact to disk. Queries traverse an inverted index to locate compressed blocks. Lifecycle services handle downsampling and retention.

Abstract

Time-series databases optimize for append-only workloads with time-ordered reads. The fundamental insight: metrics arrive in near-monotonic timestamp order and are queried by time range—exploiting this pattern enables compression ratios of 10-12x and query performance orders of magnitude better than general-purpose databases.

The storage engine uses a Log-Structured Merge Tree (LSM) variant. Writes append to a Write-Ahead Log (WAL), buffer in memory, then flush to immutable sorted blocks. Time-based partitioning (typically 2-hour blocks) enables efficient range queries—scanning only relevant blocks instead of the entire dataset.

Compression is the critical differentiator. Facebook’s Gorilla algorithm achieves 12x compression (16 bytes → 1.37 bytes per sample) using delta-of-delta for timestamps and XOR encoding for values. 96% of timestamps compress to a single bit when samples arrive at regular intervals.

Cardinality (unique combinations of metric name + tags) is the primary scaling constraint. A metric like http_requests{method, status, endpoint} explodes to millions of series with high-cardinality labels. The inverted index must efficiently map tag combinations to series IDs without consuming unbounded memory.

Requirements

Functional Requirements

Feature	Priority	Scope
Metric ingestion (push model)	Core	Full
Range queries by time	Core	Full
Tag-based filtering	Core	Full
Aggregations (sum, avg, max, p99)	Core	Full
Downsampling and rollups	Core	Full
Retention policies	Core	Full
Alerting integration	High	Overview
Multi-tenancy	High	Overview
Distributed queries	High	Full
Cardinality management	High	Full
Exemplar storage (traces)	Medium	Brief
Anomaly detection	Low	Out of scope

Non-Functional Requirements

Requirement	Target	Rationale
Write throughput	1M+ samples/sec per node	Modern infrastructure generates massive telemetry
Read latency (hot data)	p99 < 100ms	Dashboard refresh and alerting requirements
Read latency (cold data)	p99 < 5s	Historical analysis acceptable with higher latency
Storage efficiency	< 2 bytes/sample compressed	Cost-effective retention of years of data
Availability	99.9%	Monitoring systems must be highly available
Data durability	99.999%	Metrics are valuable for incident investigation
Retention	Raw: 15 days, Aggregated: 2 years	Balance granularity vs. storage cost

Scale Estimation

Metric Sources:

Hosts: 100K servers, containers, and VMs
Metrics per host: 500 metrics (CPU, memory, disk, network, application)
Collection interval: 10 seconds

Traffic:

Samples/second: 100K hosts × 500 metrics / 10s = 5M samples/sec
Daily samples: 5M × 86,400 = 432B samples/day
Peak multiplier: 2x during deployments → 10M samples/sec burst

Storage (before compression):

Sample size: 16 bytes (8-byte timestamp + 8-byte value)
Daily raw: 432B × 16 bytes = 6.9TB/day
With Gorilla compression (12x): 6.9TB / 12 = 575GB/day
15-day retention: ~8.6TB
2-year downsampled (1-hour aggregates): ~2TB

Cardinality:

Unique series (metric + tags): 50M active series
Series per host average: 500
High-cardinality metrics (by request_id): Must be blocked or aggregated
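
The arithmetic in this section can be reproduced with a small calculator. This is a minimal sketch rather than part of any TSDB: `ScaleInputs`, `ScaleEstimate`, and `estimateScale` are names assumed here for illustration, and the inputs are simply the figures listed above.

```typescript
// Back-of-envelope capacity model using the figures from this section.
interface ScaleInputs {
  hosts: number
  metricsPerHost: number
  intervalSeconds: number
  bytesPerRawSample: number
  compressionRatio: number
  rawRetentionDays: number
}

interface ScaleEstimate {
  samplesPerSecond: number
  samplesPerDay: number
  rawBytesPerDay: number
  compressedBytesPerDay: number
  rawTierBytes: number
}

function estimateScale(i: ScaleInputs): ScaleEstimate {
  const samplesPerSecond = (i.hosts * i.metricsPerHost) / i.intervalSeconds
  const samplesPerDay = samplesPerSecond * 86400
  const rawBytesPerDay = samplesPerDay * i.bytesPerRawSample
  const compressedBytesPerDay = rawBytesPerDay / i.compressionRatio
  return {
    samplesPerSecond,
    samplesPerDay,
    rawBytesPerDay,
    compressedBytesPerDay,
    rawTierBytes: compressedBytesPerDay * i.rawRetentionDays,
  }
}

const estimate = estimateScale({
  hosts: 100000,
  metricsPerHost: 500,
  intervalSeconds: 10,
  bytesPerRawSample: 16,
  compressionRatio: 12,
  rawRetentionDays: 15,
})

// 5M samples/sec, ~6.9 TB/day raw, ~576 GB/day compressed, ~8.6 TB for 15 days
console.log(estimate)
```

Changing a single input (for example a 30-second collection interval) shows how sensitive the storage budget is to scrape frequency.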

Design Paths

Path A: In-Memory with Disk Spillover (Gorilla/Atlas)

Best when:

Recent data is most valuable (last few hours)
Predictable memory budget
Sub-millisecond query latency required

Architecture:

All active series in memory with ring buffers
Fixed memory allocation per series
Disk used only for overflow and persistence

Trade-offs:

✅ Sub-millisecond query latency for recent data
✅ Simple query path (no disk I/O for hot queries)
✅ Predictable memory usage
❌ Limited retention (hours to days)
❌ Memory cost scales linearly with cardinality
❌ Cold queries require separate system

Real-world example: Netflix Atlas uses in-memory storage with rolling windows for real-time operational dashboards. Facebook Gorilla keeps 26 hours in memory with ~2GB per 700K series.

Path B: LSM-Based with Tiered Storage (Prometheus/InfluxDB)

Best when:

Long retention required (weeks to months)
Mixed hot/cold query patterns
Storage cost is a concern

Architecture:

Write-ahead log for durability
In-memory buffer (2-hour blocks)
Background compaction to disk blocks
Object storage for cold tier

Trade-offs:

✅ Cost-effective long retention
✅ Excellent compression (12x+)
✅ Handles variable cardinality
❌ Higher query latency for cold data
❌ Compaction CPU overhead
❌ Write amplification during compaction

Real-world example: Prometheus TSDB uses 2-hour blocks, achieving 1-2 bytes/sample. InfluxDB’s TSM engine follows similar principles with 7-day shards.

Path C: Distributed LSM with Global Index (M3/Cortex)

Best when:

Scale beyond single node
Multi-tenant SaaS offering
Global query federation required

Architecture:

Distributed hash ring for series assignment
Per-tenant isolated storage
Query fanout with aggregation
Centralized or per-tenant indexing

Trade-offs:

✅ Horizontal scalability (1B+ samples/sec)
✅ Multi-tenant isolation
✅ Geographic distribution possible
❌ Operational complexity
❌ Network overhead for distributed queries
❌ Consistency challenges during rebalancing

Real-world example: Uber M3 ingests 1B+ datapoints/sec across their fleet. Cortex provides multi-tenant Prometheus-compatible storage.

Path Comparison

Factor	Path A (In-Memory)	Path B (LSM)	Path C (Distributed)
Query latency (hot)	Sub-ms	10-100ms	50-500ms
Retention	Hours	Weeks-Months	Years
Storage efficiency	Low (in-memory)	High (12x compression)	High
Cardinality limit	Memory-bound	Disk-bound	Configurable
Operational complexity	Low	Medium	High
Best for	Real-time dashboards	Single-cluster monitoring	Multi-tenant SaaS

This Article’s Focus

This article implements Path B (LSM-Based) as the core, with distributed extensions from Path C. This matches most production deployments (Prometheus, InfluxDB, VictoriaMetrics) and provides a foundation for understanding both single-node and distributed architectures.

High-Level Design

Service Architecture

Ingestion Service (Distributor)

Receives metrics from agents and routes to storage nodes:

Parse and validate metric format (Prometheus exposition, InfluxDB line protocol, OpenTelemetry)
Extract metric name, tags, timestamp, value
Hash series ID (metric + sorted tags) for consistent routing
Forward to appropriate storage node(s)
Handle backpressure via load shedding
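
The hashing and routing steps above could look roughly like the following. This is a sketch under stated assumptions: `seriesKey`, `fnv1a32`, `HashRing`, the ingester names, and the choice of 64 virtual nodes are all hypothetical, and a 32-bit FNV-1a over ASCII label text stands in for whatever hash a real distributor uses.

```typescript
// Canonical series key: label names sorted so that label order in the
// request does not affect routing.
function seriesKey(labels: Record<string, string>): string {
  return Object.keys(labels)
    .sort()
    .map((name) => `${name}=${labels[name]}`)
    .join("\xff")
}

// 32-bit FNV-1a. Only the low byte of each UTF-16 code unit is hashed,
// which is sufficient for ASCII labels in this sketch.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i) & 0xff
    hash = Math.imul(hash, 0x01000193)
  }
  return hash >>> 0
}

class HashRing {
  private readonly tokens: { token: number; node: string }[] = []

  constructor(nodes: string[], virtualNodes = 64) {
    if (nodes.length === 0) throw new Error("ring needs at least one node")
    for (const node of nodes) {
      for (let v = 0; v < virtualNodes; v++) {
        this.tokens.push({ token: fnv1a32(`${node}#${v}`), node })
      }
    }
    this.tokens.sort((a, b) => a.token - b.token)
  }

  // Binary search for the first token >= hash(key), wrapping to the start.
  lookup(key: string): string {
    const h = fnv1a32(key)
    let lo = 0
    let hi = this.tokens.length
    while (lo < hi) {
      const mid = (lo + hi) >>> 1
      if (this.tokens[mid].token < h) lo = mid + 1
      else hi = mid
    }
    return this.tokens[lo % this.tokens.length].node
  }
}

const ring = new HashRing(["ingester-0", "ingester-1", "ingester-2"])
const keyA = seriesKey({ __name__: "http_requests_total", method: "GET", status: "200" })
const keyB = seriesKey({ status: "200", method: "GET", __name__: "http_requests_total" })
console.log(ring.lookup(keyA) === ring.lookup(keyB)) // same series, same ingester
```

Virtual nodes smooth out the load per ingester and limit how many series move when a node joins or leaves.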

Storage Engine (Ingester)

Manages the write path and local storage:

Append to Write-Ahead Log (WAL) for durability
Buffer samples in memory (memtable per series)
Maintain in-memory index for active series
Flush to immutable blocks every 2 hours
Trigger compaction for block merging

Compaction Service

Background process optimizing storage:

Merge small blocks into larger ones
Apply compression algorithms
Remove deleted/expired data
Rebuild indexes for merged blocks
Enforce retention by deleting old blocks

Query Engine

Executes queries across storage tiers:

Parse query language (PromQL, InfluxQL, SQL)
Plan execution (which blocks to scan)
Fetch data from index → blocks
Decompress and aggregate
Return results with metadata

Lifecycle Manager

Handles data retention and downsampling:

Enforce time-based retention policies
Trigger downsampling jobs (raw → 1-minute → 1-hour aggregates)
Manage tiered storage migration (hot → warm → cold)
Track storage usage per tenant

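Data Flow: Write Path

Agent → Distributor (parse, validate, hash series ID) → Ingester (append to WAL, buffer in memtable) → flush to an immutable 2-hour block → compaction → tiered storage.

Data Flow: Query Path

Client → Query Engine (parse, plan) → inverted index (label matchers → series IDs) → fetch chunks from the matching blocks → decompress and aggregate → response.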

API Design

Write Endpoint

Endpoint: POST /api/v1/write

Accepts metrics in Prometheus remote-write format (protobuf) or line protocol:

# Prometheus exposition format
http_requests_total{method="GET",status="200"} 1234 1704067200000
http_requests_total{method="POST",status="500"} 5 1704067200000
cpu_usage{host="server-1",core="0"} 0.75 1704067200000

# InfluxDB line protocol
http_requests,method=GET,status=200 total=1234 1704067200000000000
cpu,host=server-1,core=0 usage=0.75 1704067200000000000

Request (Prometheus remote-write protobuf):


message WriteRequest {
  repeated TimeSeries timeseries = 1;
  // Field 2 is reserved in the upstream Prometheus definition
  repeated MetricMetadata metadata = 3;
}

message TimeSeries {
  repeated Label labels = 1;  // [{name: "__name__", value: "http_requests_total"}, ...]
  repeated Sample samples = 2;
}

message Sample {
  double value = 1;
  int64 timestamp = 2;  // Unix ms
}

Response:

204 No Content: Success
400 Bad Request: Invalid format, missing required labels
429 Too Many Requests: Rate limited (backpressure)
503 Service Unavailable: Ingester overloaded

Design Decision: Push vs. Pull

Prometheus uses a pull model (scraping targets), while most TSDBs accept push. For a general-purpose TSDB:

Push (chosen): Simpler client integration, works behind firewalls, scales with stateless distributors
Pull: Better for service discovery, guarantees scrape intervals, but requires connectivity to all targets

Remote-write enables hybrid: Prometheus scrapes locally, pushes to central TSDB.

Query Endpoint

Endpoint: GET /api/v1/query_range

Query Parameters:

Parameter	Type	Description
`query`	string	Query expression (PromQL, InfluxQL)
`start`	int64	Start timestamp (Unix seconds or RFC3339)
`end`	int64	End timestamp
`step`	duration	Query resolution (e.g., “1m”, “5m”)

Example Query:

GET /api/v1/query_range?query=rate(http_requests_total{status="500"}[5m])&start=1704067200&end=1704153600&step=60
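
The example request is shown unencoded for readability; on the wire, the braces, quotes, and brackets in the expression have to be percent-encoded. A minimal client-side sketch follows, where `buildQueryRangeUrl`, `RangeQuery`, and the host `tsdb.example.com` are placeholders assumed for illustration:

```typescript
interface RangeQuery {
  query: string // Query expression, e.g. PromQL
  start: number // Unix seconds
  end: number // Unix seconds
  step: string // e.g. "60" or "1m"
}

// URLSearchParams percent-encodes reserved characters in the expression.
function buildQueryRangeUrl(baseUrl: string, q: RangeQuery): string {
  const params = new URLSearchParams({
    query: q.query,
    start: String(q.start),
    end: String(q.end),
    step: q.step,
  })
  return `${baseUrl}/api/v1/query_range?${params.toString()}`
}

const exampleUrl = buildQueryRangeUrl("https://tsdb.example.com", {
  query: 'rate(http_requests_total{status="500"}[5m])',
  start: 1704067200,
  end: 1704153600,
  step: "60",
})

console.log(exampleUrl)
```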

Response:


{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "method": "GET",
          "status": "500",
          "instance": "server-1:9090"
        },
        "values": [
          [1704067200, "0.5"],
          [1704067260, "0.7"],
          [1704067320, "0.3"]
        ]
      }
    ]
  }
}

Rate Limits: 100 queries/min per tenant, 10 concurrent queries

Label Endpoints

Get label names: GET /api/v1/labels

Get label values: GET /api/v1/label/{label_name}/values

These endpoints query the inverted index and are essential for building dashboards with dynamic dropdowns.

Series Metadata

Endpoint: GET /api/v1/series

Returns series matching a label selector:

GET /api/v1/series?match[]=http_requests_total{status=~"5.."}

Used for cardinality analysis and debugging high-cardinality labels.

Data Modeling

Sample Format

Primary data unit: A sample is a (timestamp, value) pair for a specific series.

interface Sample {
  timestamp: number // Unix milliseconds
  value: number // IEEE 754 double (64-bit)
}

interface Series {
  labels: Map<string, string> // Metric name + tags
  samples: Sample[]
}

// Series identifier (sorted label pairs)
// http_requests_total{method="GET",status="200"}
// → __name__=http_requests_total,method=GET,status=200
// → Hash: fnv64a("__name__\xff http_requests_total\xff method\xff GET\xff ...")

Design Decision: Label Ordering

Labels are sorted lexicographically by label name before hashing. This ensures the same series always hashes to the same ID regardless of label order in the write request.

Block Format

Block structure (2-hour time window):

block-<ulid>/
├── meta.json           # Block metadata
├── index               # Inverted index (labels → series)
├── chunks/
│   ├── 000001          # Compressed chunk file
│   ├── 000002
│   └── ...
└── tombstones          # Deleted series markers

meta.json:


{
  "ulid": "01HQGJ5P1XXXXXXXXX",
  "minTime": 1704067200000,
  "maxTime": 1704074400000,
  "stats": {
    "numSamples": 50000000,
    "numSeries": 100000,
    "numChunks": 150000
  },
  "compaction": {
    "level": 1,
    "sources": ["01HQGJ..."]
  },
  "version": 1
}

Index Structure

Inverted index mapping:

Label → Series IDs
──────────────────
method=GET    → [1, 3, 5, 7, 9, ...]
method=POST   → [2, 4, 6, 8, ...]
status=200    → [1, 2, 3, 4, 5, ...]
status=500    → [6, 7, 8, ...]

Series ID → Chunk Refs
──────────────────────
1 → [(chunk_001, offset=0, len=4096), (chunk_002, offset=0, len=2048)]
2 → [(chunk_001, offset=4096, len=3072)]

Posting list intersection for queries:

http_requests{method="GET",status="500"}

1. Lookup method=GET → [1, 3, 5, 7, 9]
2. Lookup status=500 → [7, 8, 9, 10]
3. Intersect → [7, 9]
4. Fetch chunks for series 7, 9

Schema Design (SQL Representation)

For understanding, here’s how the data model maps to relational concepts:


-- Series registry (the inverted index in memory/disk)
CREATE TABLE series (
    id BIGSERIAL PRIMARY KEY,
    labels_hash BIGINT UNIQUE NOT NULL,
    labels JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_series_labels ON series USING GIN (labels);

-- Samples (conceptual - stored in compressed chunks, not SQL)
CREATE TABLE samples (
    series_id BIGINT NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    value DOUBLE PRECISION NOT NULL,
    PRIMARY KEY (series_id, timestamp)
);

-- Block metadata
CREATE TABLE blocks (
    ulid TEXT PRIMARY KEY,
    min_time TIMESTAMPTZ NOT NULL,
    max_time TIMESTAMPTZ NOT NULL,
    num_samples BIGINT NOT NULL,
    num_series INTEGER NOT NULL,
    level INTEGER DEFAULT 1,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_blocks_time ON blocks(min_time, max_time);

Storage Selection Matrix

Data Type	Storage	Rationale
Active samples	Memory (memtable)	Sub-ms writes, recent data hottest
WAL	Local SSD (sequential)	Durability, sequential writes
Hot blocks	Local SSD (random read)	Frequent queries, compression
Warm blocks	Networked SSD	Less frequent access, cost efficiency
Cold blocks	Object storage (S3)	Long retention, lowest cost
Inverted index	Memory + SSD	Fast label lookups, memory-mapped
Downsampled data	Separate retention tier	Query without scanning raw data

Low-Level Design

Compression: Gorilla Algorithm

The compression algorithm is the core of TSDB efficiency. Facebook’s Gorilla paper (VLDB 2015) introduced delta-of-delta for timestamps and XOR for values.

Timestamp Compression (Delta-of-Delta)

Samples typically arrive at regular intervals (e.g., every 10 seconds). Instead of storing absolute timestamps:


// Delta-of-delta encoding
// Timestamps: [1000, 1010, 1020, 1030, 1040]
// Deltas:     [  -,   10,   10,   10,   10]
// Delta-of-delta: [-, -, 0, 0, 0]

// Encoding rules:
// - If delta-of-delta = 0: write '0' (1 bit)
// - If fits in [-63, 64]: write '10' + 7 bits
// - If fits in [-255, 256]: write '110' + 9 bits
// - If fits in [-2047, 2048]: write '1110' + 12 bits
// - Otherwise: write '1111' + 32 bits

// BitStream is an assumed append-only bit writer exposing writeBit(bit) and
// writeBits(value, width); negative values are written in two's complement.
function encodeTimestamps(timestamps: number[]): BitStream {
  const stream = new BitStream()
  if (timestamps.length === 0) return stream

  // First timestamp: stored as-is (64 bits)
  stream.writeBits(timestamps[0], 64)

  if (timestamps.length < 2) return stream

  // Second timestamp: delta from first.
  // 14 bits hold deltas up to 16,383 units. That covers second-resolution
  // timestamps within a 2-hour block, but millisecond timestamps with
  // intervals above ~16 s need a wider field.
  let prevDelta = timestamps[1] - timestamps[0]
  stream.writeBits(prevDelta, 14)

  // Remaining: delta-of-delta
  for (let i = 2; i < timestamps.length; i++) {
    const delta = timestamps[i] - timestamps[i - 1]
    const deltaOfDelta = delta - prevDelta

    if (deltaOfDelta === 0) {
      stream.writeBit(0) // 1 bit for regular intervals
    } else if (deltaOfDelta >= -63 && deltaOfDelta <= 64) {
      stream.writeBits(0b10, 2)
      stream.writeBits(deltaOfDelta + 63, 7) // Offset to positive
    } else if (deltaOfDelta >= -255 && deltaOfDelta <= 256) {
      stream.writeBits(0b110, 3)
      stream.writeBits(deltaOfDelta + 255, 9)
    } else if (deltaOfDelta >= -2047 && deltaOfDelta <= 2048) {
      stream.writeBits(0b1110, 4)
      stream.writeBits(deltaOfDelta + 2047, 12)
    } else {
      stream.writeBits(0b1111, 4)
      stream.writeBits(deltaOfDelta, 32)
    }

    prevDelta = delta
  }

  return stream
}

// Result (Gorilla paper): about 96% of timestamps compress to 1 bit.
// Every other case costs at least 9 bits, so the average is at least
// 0.96 × 1 + 0.04 × 9 ≈ 1.3 bits per timestamp (vs 64 bits uncompressed).
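
To see what the encoding rules mean in practice, the bit cost of a timestamp sequence can be computed without building an actual bit stream. The following is an illustrative sketch only: `dodBits` and `timestampBits` are hypothetical helpers that mirror the rule table above, including the 64-bit first timestamp and 14-bit first delta.

```typescript
// Bits needed for one delta-of-delta value, following the rule table above.
function dodBits(dod: number): number {
  if (dod === 0) return 1
  if (dod >= -63 && dod <= 64) return 2 + 7
  if (dod >= -255 && dod <= 256) return 3 + 9
  if (dod >= -2047 && dod <= 2048) return 4 + 12
  return 4 + 32
}

// Total encoded size in bits for a series of timestamps.
function timestampBits(timestamps: number[]): number {
  if (timestamps.length === 0) return 0
  let bits = 64 // first timestamp stored in full
  if (timestamps.length === 1) return bits
  bits += 14 // first delta
  let prevDelta = timestamps[1] - timestamps[0]
  for (let i = 2; i < timestamps.length; i++) {
    const delta = timestamps[i] - timestamps[i - 1]
    bits += dodBits(delta - prevDelta)
    prevDelta = delta
  }
  return bits
}

const regular = timestampBits([1000, 1010, 1020, 1030, 1040])
const jittery = timestampBits([1000, 1010, 1021, 1030, 1040])

// A 2-hour block scraped every 10 seconds holds 720 samples.
const twoHourBlock = Array.from({ length: 720 }, (_, i) => 1704067200 + i * 10)
const blockBits = timestampBits(twoHourBlock)

console.log(regular, jittery, blockBits / twoHourBlock.length)
```

A single second of jitter costs 9 bits for that sample and usually 9 more for the sample that follows, which is why collectors that scrape on a steady schedule compress so much better.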

Value Compression (XOR Encoding)

Consecutive metric values often share many bits (e.g., CPU at 0.75 then 0.76):


// XOR encoding for IEEE 754 doubles
// Values: [0.75, 0.76, 0.76, 0.77]
// XOR with previous: [0.75, XOR(0.75,0.76), XOR(0.76,0.76), XOR(0.76,0.77)]

// The XOR of similar floats has many leading and trailing zeros
// We encode: (leading zeros count, significant bits count, significant bits)

// Assumed helpers: floatToBits (double → 64-bit pattern as bigint),
// countLeadingZeros / countTrailingZeros (over 64 bits), and BitStream.

interface XORState {
  prevValue: bigint // Previous value as 64-bit integer
  prevLeadingZeros: number
  prevTrailingZeros: number
}

const NO_WINDOW = 255 // Sentinel: no bit window has been established yet

function encodeValues(values: number[]): BitStream {
  const stream = new BitStream()
  if (values.length === 0) return stream

  // First value: stored as-is (64 bits)
  const firstBits = floatToBits(values[0])
  stream.writeBits(firstBits, 64)

  const state: XORState = {
    prevValue: firstBits,
    prevLeadingZeros: NO_WINDOW, // Forces the first differing value to open a window
    prevTrailingZeros: 0,
  }

  for (let i = 1; i < values.length; i++) {
    const currentBits = floatToBits(values[i])
    const xor = currentBits ^ state.prevValue

    if (xor === 0n) {
      // Identical value: write '0' (1 bit)
      stream.writeBit(0)
    } else {
      stream.writeBit(1) // Values differ

      // The leading-zero count is stored in 5 bits, so clamp it to 31
      const leadingZeros = Math.min(countLeadingZeros(xor), 31)
      const trailingZeros = countTrailingZeros(xor)

      const fitsPreviousWindow =
        state.prevLeadingZeros !== NO_WINDOW &&
        leadingZeros >= state.prevLeadingZeros &&
        trailingZeros >= state.prevTrailingZeros

      if (fitsPreviousWindow) {
        // Fits in previous window: write '0' + the bits of the PREVIOUS window.
        // The decoder only knows the previous window's width, so the current
        // significant-bit count cannot be used here.
        stream.writeBit(0)
        const windowBits = 64 - state.prevLeadingZeros - state.prevTrailingZeros
        stream.writeBits(xor >> BigInt(state.prevTrailingZeros), windowBits)
      } else {
        // New window: write '1' + leading zeros + length + bits
        const significantBits = 64 - leadingZeros - trailingZeros
        stream.writeBit(1)
        stream.writeBits(leadingZeros, 5)
        stream.writeBits(significantBits - 1, 6) // -1 because min is 1
        stream.writeBits(xor >> BigInt(trailingZeros), significantBits)

        state.prevLeadingZeros = leadingZeros
        state.prevTrailingZeros = trailingZeros
      }
    }

    state.prevValue = currentBits
  }

  return stream
}

// Result (reported in the Gorilla paper):
// - 51% of values compress to 1 bit (identical to previous)
// - 30% compress to ~27 bits (control '10' path)
// - 19% compress to ~37 bits (control '11' path)
// Overall, the paper reports an average of 1.37 bytes per sample
// (timestamp + value) versus 16 bytes uncompressed
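
The value encoder relies on three bit-level helpers that the listing does not define. One possible way to write them is sketched below; the names match those used in the listing, but the implementations are assumptions made for illustration and favor clarity over speed.

```typescript
// Reinterpret a double as its 64-bit IEEE 754 pattern (big-endian view).
function floatToBits(value: number): bigint {
  const view = new DataView(new ArrayBuffer(8))
  view.setFloat64(0, value)
  return view.getBigUint64(0)
}

// Number of zero bits above the highest set bit in a 64-bit pattern.
function countLeadingZeros(x: bigint): number {
  if (x === 0n) return 64
  return 64 - x.toString(2).length
}

// Number of zero bits below the lowest set bit in a 64-bit pattern.
function countTrailingZeros(x: bigint): number {
  if (x === 0n) return 64
  let n = 0
  let v = x
  while ((v & 1n) === 0n) {
    v >>= 1n
    n++
  }
  return n
}

// 12.0 and 24.0 differ only in one exponent bit, so the XOR has a single
// significant bit surrounded by zeros.
const xor12v24 = floatToBits(12) ^ floatToBits(24)
console.log(xor12v24.toString(16), countLeadingZeros(xor12v24), countTrailingZeros(xor12v24))
```

Values that change slowly tend to share sign, exponent, and high mantissa bits, which is what keeps the significant window narrow.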

Combined compression ratio:

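Timestamp: at least ~1.3 bits on average (96% × 1 bit, plus 4% × 9 bits or more), vs 64 bits raw
Value: 1 bit when unchanged, a few bytes when it changes; the average depends on the workload
Total: 1.37 bytes/sample as reported by Gorilla (vs 16 bytes raw ≈ 12x compression)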

Write-Ahead Log

The WAL ensures durability before acknowledging writes:


// WAL segment structure
// Segments are 128MB each; at least 3 segments are always kept

import { promises as fs } from "node:fs"

interface WALSegment {
  id: number
  path: string
  size: number
  firstTimestamp: number
  lastTimestamp: number
  write(data: Uint8Array): Promise<void>
  sync(): Promise<void> // fsync
}

interface WALRecord {
  type: "series" | "samples" | "tombstone"
  data: Uint8Array
}

// Record encoding, segment rotation, and segment reading are left abstract
abstract class WriteAheadLog {
  protected currentSegment!: WALSegment
  protected segments: WALSegment[] = [] // Ordered oldest → newest
  private readonly maxSegmentSize = 128 * 1024 * 1024 // 128MB
  private readonly minSegmentsKept = 3

  protected abstract encodeRecords(records: WALRecord[]): Uint8Array
  protected abstract rotateSegment(): Promise<void>
  protected abstract readSegment(segment: WALSegment): Promise<WALRecord[]>

  async append(records: WALRecord[]): Promise<void> {
    // Batch encode records
    const encoded = this.encodeRecords(records)

    // Check if current segment is full
    if (this.currentSegment.size + encoded.length > this.maxSegmentSize) {
      await this.rotateSegment()
    }

    // Sync write to disk (critical for durability)
    await this.currentSegment.write(encoded)
    await this.currentSegment.sync() // fsync
    this.currentSegment.size += encoded.length
  }

  async truncate(checkpointSegmentId: number): Promise<void> {
    // Remove segments before the checkpoint, oldest first, but never
    // drop below minSegmentsKept
    while (this.segments.length > this.minSegmentsKept && this.segments[0].id < checkpointSegmentId) {
      const segment = this.segments.shift()!
      await fs.unlink(segment.path)
    }
  }

  async replay(): Promise<WALRecord[]> {
    // On startup, replay all records from WAL
    const records: WALRecord[] = []
    for (const segment of this.segments) {
      const segmentRecords = await this.readSegment(segment)
      // Push one by one: spreading a very large array can overflow the stack
      for (const record of segmentRecords) records.push(record)
    }
    return records
  }
}

In-Memory Index (Series Registry)

The in-memory index maps labels to series IDs for fast query planning:


// Inverted index for label lookups
// Uses posting lists (sorted arrays of series IDs)

type SeriesID = number
type PostingList = SeriesID[] // Sorted, deduplicated

interface LabelMatcher {
  name: string
  value: string // Equality matcher only; regex matchers are not covered here
}

class InMemoryIndex {
  // Label name → label value → posting list
  private postings: Map<string, Map<string, PostingList>> = new Map()

  // Series ID → labels
  private seriesLabels: Map<SeriesID, Map<string, string>> = new Map()

  // Labels hash → Series ID (for deduplication).
  // A production index must also compare labels to rule out hash collisions.
  private hashToSeries: Map<bigint, SeriesID> = new Map()

  private nextSeriesID: SeriesID = 1

  getOrCreateSeries(labels: Map<string, string>): SeriesID {
    const hash = this.hashLabels(labels)

    const existing = this.hashToSeries.get(hash)
    if (existing !== undefined) {
      return existing
    }

    // Create new series
    const id = this.nextSeriesID++
    this.hashToSeries.set(hash, id)
    this.seriesLabels.set(id, labels)

    // Update posting lists. IDs only grow, so appending keeps each list sorted.
    for (const [name, value] of labels) {
      if (!this.postings.has(name)) {
        this.postings.set(name, new Map())
      }
      const valueMap = this.postings.get(name)!
      if (!valueMap.has(value)) {
        valueMap.set(value, [])
      }
      valueMap.get(value)!.push(id)
    }

    return id
  }

  // Query: find series matching label matchers
  query(matchers: LabelMatcher[]): SeriesID[] {
    if (matchers.length === 0) {
      return Array.from(this.seriesLabels.keys())
    }

    // Start with smallest posting list (most selective)
    const sorted = [...matchers].sort((a, b) => this.getPostingListSize(a) - this.getPostingListSize(b))

    let result = this.getPostingList(sorted[0])

    // Intersect with remaining matchers
    for (let i = 1; i < sorted.length; i++) {
      const other = this.getPostingList(sorted[i])
      result = this.intersect(result, other)

      if (result.length === 0) break // Early exit
    }

    return result
  }

  private getPostingList(matcher: LabelMatcher): PostingList {
    return this.postings.get(matcher.name)?.get(matcher.value) ?? []
  }

  private getPostingListSize(matcher: LabelMatcher): number {
    return this.getPostingList(matcher).length
  }

  // FNV-1a (64-bit) over the sorted "name\xffvalue" pairs.
  // Hashes UTF-16 code units rather than UTF-8 bytes, for brevity.
  private hashLabels(labels: Map<string, string>): bigint {
    const key = [...labels]
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([name, value]) => `${name}\xff${value}`)
      .join("\xff")
    let hash = 0xcbf29ce484222325n
    for (let i = 0; i < key.length; i++) {
      hash ^= BigInt(key.charCodeAt(i))
      hash = (hash * 0x100000001b3n) & 0xffffffffffffffffn
    }
    return hash
  }

  private intersect(a: PostingList, b: PostingList): PostingList {
    // Sorted merge intersection: O(n + m)
    const result: SeriesID[] = []
    let i = 0,
      j = 0

    while (i < a.length && j < b.length) {
      if (a[i] === b[j]) {
        result.push(a[i])
        i++
        j++
      } else if (a[i] < b[j]) {
        i++
      } else {
        j++
      }
    }

    return result
  }
}
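
The index above resolves equality matchers; selectors such as `status=~"5.."` need one more step. A common approach is to test every known value of the label against the pattern and take the union of the matching posting lists. The sketch below assumes that approach: `union` and `regexPostings` are hypothetical helpers, and JavaScript's `RegExp` stands in for the RE2 syntax that PromQL uses. PromQL regex matchers are fully anchored, which the wrapper pattern reproduces.

```typescript
type PostingList = number[] // Sorted, deduplicated series IDs

// Sorted merge union: O(n + m), output stays sorted and deduplicated.
function union(a: PostingList, b: PostingList): PostingList {
  const out: PostingList = []
  let i = 0
  let j = 0
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(a[i])
      i++
      j++
    } else if (a[i] < b[j]) {
      out.push(a[i++])
    } else {
      out.push(b[j++])
    }
  }
  while (i < a.length) out.push(a[i++])
  while (j < b.length) out.push(b[j++])
  return out
}

// Resolve a regex matcher against all known values of one label name.
function regexPostings(values: Map<string, PostingList>, pattern: string): PostingList {
  const re = new RegExp(`^(?:${pattern})$`) // anchor at both ends
  let result: PostingList = []
  for (const [value, list] of values) {
    if (re.test(value)) result = union(result, list)
  }
  return result
}

const statusPostings = new Map<string, PostingList>([
  ["200", [1, 2, 3]],
  ["404", [4]],
  ["500", [6, 7]],
  ["503", [7, 9]],
])

const serverErrors = regexPostings(statusPostings, "5..")
console.log(serverErrors)
```

The cost grows with the number of distinct values for the label, which is another reason to keep label cardinality bounded.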

Block Compaction

Compaction merges small blocks into larger ones, improving query efficiency:


// Compaction strategy
// - Level 1: 2-hour blocks (from memtable flush)
// - Level 2: 8-hour blocks (4 level-1 blocks merged)
// - Level 3: 32-hour blocks (4 level-2 blocks merged)
// - Max block duration: 31 days or 10% of retention, whichever is smaller

// Block and the helpers groupByLevel, mergeSortIterators, writeBlock, and
// swapBlocks are assumed to be defined elsewhere.

interface CompactionPlan {
  sources: Block[]
  targetLevel: number
  estimatedSize: number
}

class Compactor {
  private readonly maxBlockDuration = 31 * 24 * 60 * 60 * 1000 // 31 days

  async planCompaction(blocks: Block[]): Promise<CompactionPlan[]> {
    const plans: CompactionPlan[] = []

    // Group blocks by level
    const byLevel = this.groupByLevel(blocks)

    for (const [level, levelBlocks] of byLevel) {
      // Find adjacent blocks that can be merged (sort a copy, not the input)
      const sorted = [...levelBlocks].sort((a, b) => a.minTime - b.minTime)

      for (let i = 0; i < sorted.length - 1; i++) {
        const candidates = [sorted[i]]
        let j = i + 1

        // Typical: merge up to 4 adjacent blocks at a time
        while (j < sorted.length && candidates.length < 4) {
          const next = sorted[j]

          // Blocks must be contiguous in time
          if (next.minTime !== candidates[candidates.length - 1].maxTime) break

          // Stop before the merged duration would exceed the max
          if (next.maxTime - candidates[0].minTime > this.maxBlockDuration) break

          candidates.push(next)
          j++
        }

        if (candidates.length >= 2) {
          plans.push({
            sources: candidates,
            targetLevel: level + 1,
            estimatedSize: candidates.reduce((sum, b) => sum + b.size, 0) * 0.9,
          })
          i = j - 1 // Skip merged blocks
        }
      }
    }

    return plans
  }

  async compact(plan: CompactionPlan): Promise<Block> {
    // 1. Open iterators for all source blocks
    const iterators = plan.sources.map((b) => b.createIterator())

    // 2. Merge-sort by (seriesID, timestamp)
    const mergedSamples = this.mergeSortIterators(iterators)

    // 3. Write to new block with fresh compression
    const newBlock = await this.writeBlock(mergedSamples, plan.targetLevel)

    // 4. Update block references atomically
    await this.swapBlocks(plan.sources, newBlock)

    // 5. Delete source blocks
    for (const source of plan.sources) {
      await source.delete()
    }

    return newBlock
  }
}
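
The merge step in `compact` is where samples from several source blocks are combined into one ordered stream. For a single series, that could be sketched as follows. `mergeSeries` is a hypothetical helper, and the rule that a later source wins on a duplicate timestamp is an assumption of this sketch, not something the design above specifies.

```typescript
interface Sample {
  timestamp: number
  value: number
}

// k-way merge of per-block sample arrays for ONE series.
// Each input array must already be sorted by timestamp.
function mergeSeries(sources: Sample[][]): Sample[] {
  const cursors = sources.map(() => 0)
  const out: Sample[] = []

  for (;;) {
    // Pick the source whose next sample has the smallest timestamp.
    // Strict `<` means the earliest source is taken first on ties.
    let best = -1
    for (let s = 0; s < sources.length; s++) {
      if (cursors[s] >= sources[s].length) continue
      if (best === -1 || sources[s][cursors[s]].timestamp < sources[best][cursors[best]].timestamp) {
        best = s
      }
    }
    if (best === -1) return out // every source is exhausted

    const next = sources[best][cursors[best]++]
    const last = out[out.length - 1]
    if (last !== undefined && last.timestamp === next.timestamp) {
      out[out.length - 1] = next // duplicate timestamp: later source overwrites
    } else {
      out.push(next)
    }
  }
}

const merged = mergeSeries([
  [
    { timestamp: 10, value: 1 },
    { timestamp: 20, value: 2 },
  ],
  [
    { timestamp: 20, value: 9 },
    { timestamp: 30, value: 3 },
  ],
])
console.log(merged)
```

The linear scan over sources is fine for the handful of blocks in one compaction plan; a min-heap would be the usual choice if many sources were merged at once.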

Downsampling

Downsampling creates aggregated views for long-term queries:


// Downsampling tiers
// - Raw: 10-second resolution, 15 days
// - 1-minute: min/max/sum/count/avg, 90 days
// - 1-hour: min/max/sum/count/avg, 2 years

// Block is assumed to expose createIterator(seriesId), an async iterator
// over raw samples.

interface DownsampleConfig {
  sourceResolution: string // "raw", "1m", "1h"
  targetResolution: string
  aggregations: string[] // ["min", "max", "sum", "count"]
  retention: number // milliseconds
}

interface DownsampledSample {
  timestamp: number // Bucket start time
  min: number
  max: number
  sum: number
  count: number
  // Derived: avg = sum / count
}

class Downsampler {
  // Handles raw → aggregate. Rolling 1m → 1h must combine the stored
  // min/max/sum/count fields instead of re-reading raw values.
  async downsample(seriesId: number, sourceBlock: Block, config: DownsampleConfig): Promise<DownsampledSample[]> {
    const bucketMs = this.parseDuration(config.targetResolution)
    const buckets = new Map<number, DownsampledSample>()

    const iterator = sourceBlock.createIterator(seriesId)

    for await (const sample of iterator) {
      const bucketStart = Math.floor(sample.timestamp / bucketMs) * bucketMs

      if (!buckets.has(bucketStart)) {
        buckets.set(bucketStart, {
          timestamp: bucketStart,
          min: sample.value,
          max: sample.value,
          sum: sample.value,
          count: 1,
        })
      } else {
        const bucket = buckets.get(bucketStart)!
        bucket.min = Math.min(bucket.min, sample.value)
        bucket.max = Math.max(bucket.max, sample.value)
        bucket.sum += sample.value
        bucket.count++
      }
    }

    return Array.from(buckets.values())
  }
}

// Query routing: select appropriate resolution based on query range.
// A complete router must also fall back to a coarser tier when the requested
// window is older than a tier's retention (raw: 15 days, 1m: 90 days).
function selectResolution(queryRange: number): string {
  if (queryRange <= 24 * 60 * 60 * 1000) {
    // ≤ 1 day
    return "raw"
  } else if (queryRange <= 30 * 24 * 60 * 60 * 1000) {
    // ≤ 30 days
    return "1m"
  } else {
    return "1h"
  }
}
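
Storing `sum` and `count` rather than a precomputed average matters when buckets are rolled up again (1-minute → 1-hour) or averaged at query time: an average of averages is wrong whenever buckets hold different numbers of samples. A minimal sketch of the roll-up follows, with `rollUp` as a hypothetical helper name and made-up bucket values:

```typescript
interface DownsampledSample {
  timestamp: number // Bucket start time (ms)
  min: number
  max: number
  sum: number
  count: number
}

// Combine fine-grained buckets into coarser ones of width targetMs.
function rollUp(buckets: DownsampledSample[], targetMs: number): DownsampledSample[] {
  const out = new Map<number, DownsampledSample>()
  for (const b of buckets) {
    const start = Math.floor(b.timestamp / targetMs) * targetMs
    const acc = out.get(start)
    if (acc === undefined) {
      out.set(start, { ...b, timestamp: start }) // copy so inputs are not mutated
    } else {
      acc.min = Math.min(acc.min, b.min)
      acc.max = Math.max(acc.max, b.max)
      acc.sum += b.sum
      acc.count += b.count
    }
  }
  return Array.from(out.values()).sort((a, b) => a.timestamp - b.timestamp)
}

const minuteBuckets: DownsampledSample[] = [
  { timestamp: 0, min: 10, max: 10, sum: 10, count: 1 }, // avg 10
  { timestamp: 60000, min: 2, max: 8, sum: 20, count: 4 }, // avg 5
]

const hourBuckets = rollUp(minuteBuckets, 60 * 60 * 1000)
const correctAvg = hourBuckets[0].sum / hourBuckets[0].count // 30 / 5
const avgOfAvgs = (10 + 5) / 2 // what a stored "avg" column would give

console.log(correctAvg, avgOfAvgs)
```

The same reasoning explains why percentiles such as p99 cannot be rolled up from min/max/sum/count alone; they need a mergeable summary such as a histogram.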

Cardinality Management

High cardinality is the primary operational challenge in TSDBs:


// Cardinality tracking and enforcement

interface CardinalityLimits {
  maxSeriesPerMetric: number // e.g., 100,000
  maxTotalSeries: number // e.g., 10,000,000
  maxLabelValueCardinality: number // e.g., 10,000 per label name
}

interface CardinalityError {
  type: "metric_cardinality_exceeded" | "label_cardinality_exceeded" | "total_cardinality_exceeded"
  current: number
  limit: number
  metric?: string
  label?: string
}

class CardinalityTracker {
  private seriesPerMetric: Map<string, Set<number>> = new Map()
  // Label name → (label value → number of series using it)
  private seriesPerLabel: Map<string, Map<string, number>> = new Map()
  private totalSeries: number = 0

  // Call only for series that do not exist yet; samples for existing
  // series never increase cardinality and should always be accepted.
  checkLimits(labels: Map<string, string>, limits: CardinalityLimits): CardinalityError | null {
    const metricName = labels.get("__name__")!

    // Check per-metric cardinality
    const metricSeries = this.seriesPerMetric.get(metricName)?.size ?? 0
    if (metricSeries >= limits.maxSeriesPerMetric) {
      return {
        type: "metric_cardinality_exceeded",
        metric: metricName,
        current: metricSeries,
        limit: limits.maxSeriesPerMetric,
      }
    }

    // Check per-label cardinality (detect high-cardinality labels).
    // Only a value not seen before would grow the label's cardinality.
    for (const [name, value] of labels) {
      const labelValues = this.seriesPerLabel.get(name)
      if (labelValues && !labelValues.has(value) && labelValues.size >= limits.maxLabelValueCardinality) {
        return {
          type: "label_cardinality_exceeded",
          label: name,
          current: labelValues.size,
          limit: limits.maxLabelValueCardinality,
        }
      }
    }

    // Check total cardinality
    if (this.totalSeries >= limits.maxTotalSeries) {
      return {
        type: "total_cardinality_exceeded",
        current: this.totalSeries,
        limit: limits.maxTotalSeries,
      }
    }

    return null
  }

  // Identify high-cardinality labels for alerting
  getHighCardinalityLabels(threshold: number): { label: string; cardinality: number }[] {
    const results: { label: string; cardinality: number }[] = []

    for (const [label, values] of this.seriesPerLabel) {
      if (values.size > threshold) {
        results.push({ label, cardinality: values.size })
      }
    }

    return results.sort((a, b) => b.cardinality - a.cardinality)
  }
}

// Common high-cardinality antipatterns to detect:
// - request_id, trace_id, user_id as labels (use exemplars instead)
// - IP addresses, URLs with query strings
// - Timestamps or UUIDs in labels
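
The reason a single unbounded label is so damaging is multiplicative: the upper bound on series for one metric is the product of the distinct values of each label. The short sketch below makes that concrete. `worstCaseSeries` is a hypothetical helper and the label counts are invented for illustration.

```typescript
// Upper bound on series for one metric: product of per-label cardinalities.
// Real counts are usually lower because not every combination occurs.
function worstCaseSeries(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1)
}

const bounded = worstCaseSeries({ method: 5, status: 20, endpoint: 200 })
const withUserId = worstCaseSeries({ method: 5, status: 20, endpoint: 200, user_id: 1000000 })

console.log(bounded) // 20000 potential series
console.log(withUserId) // 20000000000 potential series
```

With the example limit of 100,000 series per metric, the first shape fits comfortably while the second exceeds it by several orders of magnitude, which is why such labels are better carried as exemplars.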

Frontend Considerations

Dashboard Query Optimization

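Problem: A dashboard with 20 panels, each querying 1 week of data at 1-minute resolution, generates 20 × 10,080 = 201,600 datapoints per refresh, even with only one series per panel. That is far more points than the panels have pixels to draw.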

Solutions:

Step size alignment: Derive the step from the time range and the panel width (roughly one point per pixel), not from the maximum stored resolution
Parallel queries: Fetch all panels concurrently
Query caching: Cache results for identical queries
Streaming updates: WebSocket for live data, HTTP for historical


// Dashboard query batching
// QueryResult, groupBy, and fetchBatch are assumed to be defined elsewhere.
interface DashboardQuery {
  panelId: string
  expr: string
  range: { start: number; end: number }
  step: number
}

async function fetchDashboard(queries: DashboardQuery[]): Promise<Map<string, QueryResult>> {
  // Group queries by time range (likely the same for most panels)
  const byRange = groupBy(queries, (q) => `${q.range.start}-${q.range.end}`)

  const results = new Map<string, QueryResult>()

  // Fetch each range group in parallel
  await Promise.all(
    Object.entries(byRange).map(async ([rangeKey, rangeQueries]) => {
      // Batch multiple expressions if backend supports it
      const batchResult = await fetchBatch(rangeQueries)

      for (const [panelId, result] of batchResult) {
        results.set(panelId, result)
      }
    }),
  )

  return results
}
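
The step size alignment item in the list above can be sketched as follows. This is a minimal sketch under stated assumptions: `alignQuery`, the `maxDataPoints` budget (roughly the panel width in pixels), and the `minStepSec` floor (the stored resolution) are hypothetical names, not part of the original design. Snapping `start` and `end` to multiples of the step also means that refreshes within the same step produce identical query parameters, which helps the query cache.

```typescript
// Hypothetical helper: names and defaults are illustrative assumptions.
interface AlignedRange {
  start: number // seconds since epoch, multiple of step
  end: number // seconds since epoch, multiple of step
  step: number // seconds between returned points
}

function alignQuery(
  startSec: number,
  endSec: number,
  maxDataPoints: number,
  minStepSec = 60,
): AlignedRange {
  if (endSec <= startSec) throw new RangeError("end must be after start")
  if (maxDataPoints < 1) throw new RangeError("maxDataPoints must be >= 1")

  // Coarsest step needed to stay near the point budget,
  // but never finer than the stored resolution.
  const rawStep = Math.ceil((endSec - startSec) / maxDataPoints)
  const step = Math.max(rawStep, minStepSec)

  // Snap outward to step boundaries so the requested window stays covered.
  const start = Math.floor(startSec / step) * step
  const end = Math.ceil(endSec / step) * step
  return { start, end, step }
}

// One week of data on a panel with a 1,000-point budget.
const week = 7 * 24 * 3600
const aligned = alignQuery(1_700_000_000, 1_700_000_000 + week, 1000)
console.log(aligned.step, (aligned.end - aligned.start) / aligned.step)
```

For the one-week example above, a 1,000-point budget gives a step of 605 seconds, so each series returns roughly 1,000 points instead of 10,080.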

Time Range Selection

UX considerations:

Relative ranges (“Last 1 hour”, “Last 7 days”) for monitoring
Absolute ranges for incident investigation
Auto-refresh intervals matching query range
Timezone handling (display in user’s local time)
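
Relative ranges and range-matched refresh intervals can be sketched as below. This is an illustrative sketch only: the `resolveRelative` and `refreshIntervalMs` helpers, the accepted `m`/`h`/`d` suffixes, and the 1/120 ratio with its 5-second and 5-minute clamps are assumptions rather than values from the design.

```typescript
interface TimeRange {
  start: number // epoch milliseconds
  end: number // epoch milliseconds
}

const UNIT_MS = { m: 60_000, h: 3_600_000, d: 86_400_000 } as const

// Turn a relative spec such as "1h" or "7d" into absolute timestamps.
function resolveRelative(spec: string, now: number): TimeRange {
  const match = /^(\d+)([mhd])$/.exec(spec)
  if (!match) throw new Error(`unsupported range: ${spec}`)
  const span = Number(match[1]) * UNIT_MS[match[2] as keyof typeof UNIT_MS]
  return { start: now - span, end: now }
}

// Assumed policy: refresh about 120 times per window, clamped to [5 s, 5 min].
function refreshIntervalMs(range: TimeRange): number {
  const span = range.end - range.start
  return Math.min(300_000, Math.max(5_000, Math.round(span / 120)))
}

const lastHour = resolveRelative("1h", Date.now())
console.log(refreshIntervalMs(lastHour)) // 30000 (30 seconds)
```

Keeping timestamps in UTC epoch milliseconds and converting to the user's local time only at render time avoids timezone ambiguity in shared dashboard links.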

Graph Rendering

Performance patterns:

Downsample client-side for display (a panel only about a thousand pixels wide cannot show 10K distinct points)
Use Canvas or WebGL rendering for high-density graphs (Grafana uses uPlot, which draws to a 2D canvas)
Lazy load panels as they scroll into view
Debounce zoom/pan operations
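
Client-side downsampling from the list above can be sketched with min/max decimation, which keeps the lowest and highest sample in each bucket so that short spikes remain visible. This is one possible technique chosen for illustration, and `downsampleMinMax` and the `Point` shape are hypothetical names; other algorithms such as LTTB serve the same purpose.

```typescript
interface Point {
  t: number // timestamp in milliseconds
  v: number // sample value
}

// Keep the min and max of each bucket so spikes survive decimation.
function downsampleMinMax(points: Point[], buckets: number): Point[] {
  if (buckets < 1) throw new RangeError("buckets must be >= 1")
  const n = points.length
  if (n <= buckets * 2) return points.slice() // already small enough

  const out: Point[] = []
  for (let b = 0; b < buckets; b++) {
    // Integer bucket boundaries that cover every index exactly once.
    const from = Math.floor((b * n) / buckets)
    const to = Math.floor(((b + 1) * n) / buckets)
    let min = points[from]
    let max = points[from]
    for (let i = from + 1; i < to; i++) {
      if (points[i].v < min.v) min = points[i]
      if (points[i].v > max.v) max = points[i]
    }
    // Emit in time order; a flat bucket contributes a single point.
    if (min === max) out.push(min)
    else if (min.t <= max.t) out.push(min, max)
    else out.push(max, min)
  }
  return out
}

// 10,000 samples with one spike, reduced to at most 1,000 points.
const raw: Point[] = Array.from({ length: 10_000 }, (_, i) => ({ t: i * 1000, v: Math.sin(i / 50) }))
raw[5000].v = 100
const shown = downsampleMinMax(raw, 500)
console.log(shown.length <= 1000, shown.some((p) => p.v === 100))
```

Choosing `buckets` as about half the panel width in pixels yields at most one min/max pair per two pixel columns.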

Infrastructure Design

Cloud-Agnostic Concepts

Component	Requirement	Options
Ingest/Query	Stateless, auto-scaling	Kubernetes Deployment, ECS Service
Storage Engine	Stateful, local SSD	StatefulSet with local volumes
Block Storage	High IOPS, low latency	Local NVMe, EBS gp3, Persistent Disks
Object Storage	Cold tier, cheap, durable	S3, GCS, MinIO
Coordination	Leader election, config	etcd, Consul, ZooKeeper

AWS Reference Architecture

Component	AWS Service	Configuration
Load Balancer	NLB	TCP passthrough, cross-AZ
Distributors	EKS (Fargate)	3-10 pods, 2 vCPU / 4GB each
Ingesters	EKS (EC2)	i3.xlarge (NVMe), 3+ nodes, StatefulSet
Query Engines	EKS (Fargate)	2-10 pods, 4 vCPU / 8GB each
Cold Storage	S3 Standard-IA	Lifecycle to Glacier after 90 days
Query Cache	ElastiCache Redis	cache.r6g.large, cluster mode
Coordination	Kubernetes API (Leases)	Backed by the EKS-managed etcd, which workloads cannot access directly

Self-Hosted Alternatives

Managed Service	Self-Hosted	When to Self-Host
ElastiCache	Redis on EC2	Cost, specific modules
S3	MinIO	On-premises, data sovereignty
EKS	k3s/k8s on EC2	Cost, full control

High Availability

Replication strategy for ingesters:

Write to N ingesters (N=3 typical)
Query reads from any replica
Consistency: eventual (samples may appear with delay)
Durability: WAL replicated, blocks uploaded to S3
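
How a distributor picks the N ingesters for a series is not specified above. The sketch below uses rendezvous (highest-random-weight) hashing as an assumed stand-in; Cortex-style systems typically use a token-based hash ring instead. `pickReplicas`, `fnv1a`, and the ingester names are hypothetical.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units; any stable hash would do here.
function fnv1a(s: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  return h >>> 0
}

// Rendezvous hashing: score every ingester for the series and take the top n.
// Removing an ingester only reassigns the series that were placed on it.
function pickReplicas(seriesKey: string, ingesters: string[], n = 3): string[] {
  return ingesters
    .map((id) => ({ id, score: fnv1a(`${seriesKey}|${id}`) }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id))
    .slice(0, n)
    .map((x) => x.id)
}

const ingesters = ["ingester-0", "ingester-1", "ingester-2", "ingester-3", "ingester-4"]
const replicas = pickReplicas('http_requests{method="GET",status="200"}', ingesters)
console.log(replicas) // three distinct ingester names
```

Because each series maps to the same replica set on every distributor, no coordination is needed on the write path beyond agreeing on the list of live ingesters.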

Failure scenarios:

Ingester failure: WAL replayed on replacement, recent samples recovered from peers
Query engine failure: Stateless, load balancer routes to healthy instances
S3 unavailable: Query hot data from ingesters, fail cold queries gracefully

Conclusion

This design prioritizes write efficiency (LSM + WAL), storage efficiency (Gorilla compression), and query performance (inverted index + time partitioning). The architecture scales from a single node (Prometheus-style) to a distributed cluster (Cortex/M3-style).

Key architectural decisions:

LSM-based storage with 2-hour blocks: Balances write amplification against query efficiency. Compaction merges blocks up to 31 days.
Gorilla compression (delta-of-delta + XOR): Achieves 12x compression (16 bytes → 1.37 bytes per sample). 96% of timestamps compress to 1 bit with regular intervals.
Inverted index with posting list intersection: Label queries use sorted posting lists for O(n+m) intersection. Memory-mapped for large cardinalities.
Tiered retention with downsampling: Raw data (15 days) → 1-minute aggregates (90 days) → 1-hour aggregates (2 years). Queries auto-select appropriate resolution.
Cardinality limits as first-class feature: Per-metric, per-label, and total series limits prevent unbounded memory growth.
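
As a reminder of why sorted posting lists give O(n+m) intersection, here is a minimal two-pointer sketch. The function names and the example series IDs are illustrative assumptions, not taken from a specific implementation.

```typescript
// Intersect two ascending lists of series IDs in a single pass.
function intersectPostings(a: number[], b: number[]): number[] {
  const out: number[] = []
  let i = 0
  let j = 0
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(a[i])
      i++
      j++
    } else if (a[i] < b[j]) {
      i++ // a is behind, advance it
    } else {
      j++ // b is behind, advance it
    }
  }
  return out
}

// Intersect several lists, starting from the shortest to keep intermediates small.
function intersectAll(lists: number[][]): number[] {
  if (lists.length === 0) return []
  const sorted = [...lists].sort((x, y) => x.length - y.length)
  return sorted.slice(1).reduce((acc, list) => intersectPostings(acc, list), sorted[0])
}

// Hypothetical postings for method="GET" and status="500".
console.log(intersectPostings([1, 3, 5, 8, 13], [3, 4, 8, 9, 13, 21])) // [3, 8, 13]
```

Each pointer only moves forward, so the loop performs at most n + m comparisons.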

Limitations and future improvements:

Cross-series queries: Aggregations across many series are expensive. Pre-computed recording rules help.
Exemplars and traces: Linking metrics to traces requires additional storage (not covered in depth).
Multi-tenancy isolation: Resource limits, query prioritization, and billing require additional infrastructure.

Appendix

Prerequisites

Distributed systems fundamentals (consistency, availability, partitioning)
Database internals (LSM trees, B-trees, write-ahead logs)
Data compression concepts
Basic familiarity with metrics and monitoring

Terminology

Sample: A (timestamp, value) pair for a specific series
Series: A unique metric identified by its name and label set
Cardinality: The number of unique series (metric name + tag combinations)
TSDB: Time Series Database—optimized for time-indexed, append-only data
LSM Tree: Log-Structured Merge Tree—storage structure using sorted runs and compaction
WAL: Write-Ahead Log—sequential log for durability before in-memory commit
Gorilla: Facebook’s compression algorithm for time-series (VLDB 2015)
PromQL: Prometheus Query Language—functional query language for metrics
Posting List: Sorted list of series IDs matching a label value

Summary

TSDBs exploit append-only, time-ordered access patterns for 10x+ compression and fast range queries
Gorilla compression (delta-of-delta + XOR) achieves 1.37 bytes/sample, with 96% of timestamps compressing to 1 bit
LSM storage with 2-hour blocks balances write throughput against query efficiency; compaction merges up to 31-day blocks
Inverted index maps labels to series IDs; posting list intersection enables efficient label queries
Cardinality management is critical—high-cardinality labels (request_id, user_id) must be blocked or aggregated
Tiered retention with downsampling (raw → 1m → 1h) enables cost-effective multi-year storage
Scale to 1M+ samples/sec per node with proper tuning; distributed architectures (M3, Cortex) reach 1B+ samples/sec

References

Gorilla: A Fast, Scalable, In-Memory Time Series Database - Facebook VLDB 2015 paper on compression
Prometheus TSDB Documentation - Block format, WAL, compaction
InfluxDB Storage Engine (TSM) - TSM and TSI architecture
M3: Uber’s Open Source, Large-scale Metrics Platform - Distributed TSDB at scale
Cortex Architecture - Multi-tenant Prometheus
TimescaleDB: SQL for Time-Series - Hypertables and continuous aggregates
VictoriaMetrics vs Prometheus - Performance comparison
QuestDB Architecture - Column-oriented TSDB
PromQL Documentation - Query language reference