16 min read Last updated on Feb 3, 2026

Distributed Monitoring Systems

Designing observability infrastructure for metrics, logs, and traces: understanding time-series databases, collection architectures, sampling strategies, and alerting systems that scale to billions of data points.

Mermaid diagram
Overview of distributed monitoring architecture showing collection methods, telemetry types, and storage backends

Distributed monitoring is fundamentally about trade-offs between resolution, cost, and queryability. The core tensions:

  • Cardinality vs. cost: Every unique label combination creates a time series. A single metric with user_id labels across 1M users generates 1M series—each consuming memory and storage.
  • Sampling vs. completeness: At Google’s scale, Dapper samples 1 in 1,024 traces. You’ll miss rare issues but avoid drowning in data.
  • Aggregation vs. detail: Metrics provide fast, cheap aggregates but lose individual request context. Traces preserve full context but cost 100-1000× more per request.

The key mental model:

SignalCardinalityCost/EventQuery SpeedBest For
MetricsMust be low (<10K series/metric)~0.001¢MillisecondsDashboards, alerts, SLOs
LogsUnlimited~0.1¢SecondsDebugging, audit trails
TracesUnlimited (but sample)~1¢SecondsRequest flow analysis

Modern observability links these signals: exemplars connect metric spikes to specific traces; trace IDs in logs enable correlation. The goal is starting from a metric alert, drilling into relevant traces, and finding the specific log line—without pre-aggregating everything.

A counter is a monotonically increasing value that resets to zero on restart. Use for: request counts, bytes transferred, errors.

Critical behavior: Counters must be queried with rate() or increase(), not raw values. A counter showing 1,000,000 tells you nothing—rate(requests_total[5m]) showing 500/sec is actionable.

Why monotonic? If counters could decrease, detecting resets would be ambiguous. The monotonic constraint means any decrease signals a restart, allowing monitoring systems to handle resets correctly.

# Bad: querying raw counter
requests_total{service="api"} # Returns 847293847
# Good: rate over time
rate(requests_total{service="api"}[5m]) # Returns 127.3/sec

A gauge is a value that can increase or decrease. Use for: temperature, queue depth, active connections, memory usage.

Common mistake: Using gauges for values that should be counters. If you’re tracking “requests since startup” as a gauge, you lose the ability to detect resets and calculate rates correctly.

A histogram samples observations and counts them in configurable buckets. Use for: request latency, response sizes.

Prometheus histogram:

  • Cumulative buckets: le="0.1" includes all requests ≤100ms
  • Enables percentile calculation via histogram_quantile()
  • Trade-off: Bucket boundaries are static; wrong boundaries lose precision
# Request latency histogram with buckets at 10ms, 50ms, 100ms, 500ms, 1s
http_request_duration_seconds_bucket{le="0.01"} 2451
http_request_duration_seconds_bucket{le="0.05"} 8924
http_request_duration_seconds_bucket{le="0.1"} 12847
http_request_duration_seconds_bucket{le="0.5"} 15234
http_request_duration_seconds_bucket{le="1"} 15401
http_request_duration_seconds_bucket{le="+Inf"} 15523
http_request_duration_seconds_sum 892.47
http_request_duration_seconds_count 15523

OpenTelemetry exponential histogram: Uses base-2 exponential bucket boundaries computed from a scale factor. At scale 0, boundaries are powers of 2: (0.25, 0.5], (0.5, 1], (1, 2], (2, 4]. The default 160 buckets cover 1ms to 100s with <5% relative error.

Why exponential? Explicit histograms require N-1 boundary values (8 bytes each). Exponential histograms need only scale + offset (2 values), reducing storage while maintaining precision across wide ranges.

A summary calculates streaming quantiles client-side. Use when: exact percentiles are critical and aggregation across instances is not needed.

Critical limitation: Summaries cannot be aggregated. The p99 of three services’ p99s is mathematically meaningless. For multi-instance services, use histograms.

FeatureHistogramSummary
Aggregation Can aggregate across instances Cannot aggregate
Percentile accuracyApproximate (bucket-dependent)Exact (within time window)
Server costLow (just increment counters)Higher (maintains streaming state)
ConfigurationBucket boundariesQuantile targets, time window

Cardinality = unique combinations of metric name × label values. A metric with no labels = 1 series. Add labels:

http_requests_total{method, status, endpoint, user_id}

With 5 methods × 50 statuses × 100 endpoints × 1M users = 25 billion series. This will crash any time-series database.

Design rules:

  1. Never use unbounded labels: user_id, request_id, trace_id, email, IP address
  2. Target <10 cardinality per label: If a label could grow to 100+ values, reconsider
  3. Move high-cardinality data to logs/traces: They’re designed for it

Cardinality explosion symptoms:

  • Prometheus OOM crashes during compaction
  • Query timeouts on dashboards
  • Scrape duration exceeding scrape interval
  • Memory usage growing faster than data volume

Prometheus scrapes HTTP endpoints at configured intervals (default: 15s-60s). Targets expose metrics at /metrics.

Design rationale (Brian Brazil, Prometheus creator): “Pull is very slightly better… from an engineering standpoint, push vs pull largely doesn’t matter. But pull gives you target health visibility—you immediately know when a target is down.”

Advantages:

  • Target health for free: Failed scrapes surface immediately via up metric
  • Debuggable: curl the /metrics endpoint manually
  • Centralized control: Change scrape intervals without redeploying apps
  • Natural rate limiting: Prometheus controls load on targets
  • Graceful degradation: If Prometheus is down, apps are unaffected

Disadvantages:

  • Firewall challenges: Prometheus must reach targets (problematic for NAT, serverless)
  • Short-lived jobs: If a job dies before the next scrape, metrics are lost
  • Service discovery required: Must know all targets

Scale reference: A single Prometheus server handles ~800,000 samples/second with 10-second scrape intervals across 10,000+ machines.

Applications push metrics to a central collector.

Advantages:

  • Works across firewalls: Outbound connections only
  • Supports ephemeral workloads: Serverless, batch jobs, short-lived containers
  • Event-driven metrics: Can push immediately on significant events

Disadvantages:

  • No automatic target health: Must implement health checks separately
  • Backpressure complexity: What happens when the collector is slow?
  • Thundering herd: All apps pushing simultaneously after restart

For batch jobs or serverless functions that can’t be scraped:

# Job pushes metrics before exiting
echo "batch_job_duration_seconds 47.3" | curl --data-binary @- \
http://pushgateway:9091/metrics/job/nightly_report

Critical limitation: Pushgateway never forgets. Metrics persist until manually deleted. A job that ran once months ago still appears in queries.

When to use: Only for truly short-lived jobs (<1 scrape interval). Not for long-running services that happen to be behind a firewall—use service mesh or federation instead.

A vendor-neutral pipeline for receiving, processing, and exporting telemetry.

receivers: # OTLP, Prometheus, Jaeger, etc.
processors: # Batching, sampling, filtering, enrichment
exporters: # Prometheus, Jaeger, vendor backends

Deployment patterns:

PatternUse CaseTrade-off
SidecarPer-pod in KubernetesMaximum isolation, highest resource overhead
DaemonSetPer-node agentBalanced resource use, node-level correlation
GatewayCentralized clusterLowest overhead, single point of failure

Tail sampling architecture: To make sampling decisions after trace completion, you need two collector tiers:

  1. Load-balancing tier: Routes all spans with same trace_id to same collector
  2. Sampling tier: Buffers complete traces, applies policies, samples

This requires consistent hashing on trace_id—if spans scatter across collectors, you can’t make trace-level decisions.

The Gorilla paper introduced compression achieving 1.37 bytes per sample (down from 16 bytes for timestamp + float64).

Timestamp compression (delta-of-delta):

Most metrics arrive at fixed intervals. If scraping every 15 seconds:

Timestamps: 1000, 1015, 1030, 1045
Deltas: 15, 15, 15
Delta-of-delta: 0, 0

When delta-of-delta = 0, store a single bit. Result: 96% of timestamps compress to 1 bit.

Value compression (XOR):

Consecutive metric values are often similar. XOR reveals only changed bits:

Value 1: 0x4059000000000000 (100.0)
Value 2: 0x4059100000000000 (100.25)
XOR: 0x0000100000000000

Store only the position and length of meaningful bits. 51% of values compress to 1 bit (identical to previous).

Write path:

  1. Head block (in-memory): Last ~2 hours of data
  2. WAL (Write-Ahead Log): 128MB segments for crash recovery
  3. Memory-mapped chunks: Full chunks flushed to disk, memory-mapped back (since v2.19)
  4. Compaction: Head block → persistent block every 2 hours

Block structure:

01BKGTZQ1WNWHNJNAC82 (block directory)
├── meta.json # Block metadata, time range
├── index # Inverted index: labels → series IDs
├── chunks/ # Compressed time-series data
│ └── 000001
└── tombstones # Deletion markers

Index design: Inverted index using roaring bitmaps. Query {job="api", status="500"} intersects the bitmap for job=api with status=500 to find matching series IDs.

Compaction trade-offs:

  • Merges blocks for better compression and query performance
  • Uses significant CPU/memory during compaction
  • Can cause OOM on high-cardinality setups

VictoriaMetrics achieves 7× less storage than Prometheus for the same data through:

  • Better compression: Custom algorithms beyond Gorilla
  • Optimized inverted index: Handles high cardinality more gracefully
  • Memory efficiency: 10× less RAM than InfluxDB for millions of series

Cluster architecture (shared-nothing):

  • vminsert: Stateless, receives data, distributes via consistent hashing
  • vmstorage: Stateful, stores data, handles queries for its partition
  • vmselect: Stateless, queries all vmstorage nodes, merges results

vmstorage nodes don’t communicate with each other—this simplifies operations but means queries must fan out to all nodes.

Netflix’s Atlas processes 1 billion+ metrics per minute with:

  • In-memory storage for the query hot path (last 2 weeks)
  • Inverted index using roaring bitmaps (like Lucene)
  • Streaming evaluation for alerts (20× more alert queries than polling could support)

Dimensional explosion example: One Netflix service with dimensions for device (~1,000 types) × country (~50) = 50,000 series per node × 100 nodes = 5 million series for one metric. This is normal at Netflix’s scale.

A span represents a single operation: an HTTP request handler, a database query, a function call. Spans contain:

  • Trace ID (shared across all spans in the trace)
  • Span ID (unique to this span)
  • Parent span ID (forms the tree structure)
  • Start time, duration
  • Tags/attributes (key-value metadata)
  • Events/logs (timestamped annotations)

A trace is the tree of spans representing a complete request flow across services.

The W3C Trace Context specification defines propagation headers:

traceparent (required):

00-0af7651916cd43dd8448eb211c80319c-b9c7c989f97918e1-01
│ │ │ │
│ │ │ └─ Trace flags (sampled)
│ │ └─ Parent span ID (16 hex chars)
│ └─ Trace ID (32 hex chars)
└─ Version (always 00)

tracestate (optional): Vendor-specific key-value pairs, max 32 entries:

tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7

Design rationale: No hyphens in header names (traceparent not trace-parent) because trace context propagates beyond HTTP—through message queues, databases, etc.—where hyphenated names cause issues.

At scale, tracing everything is impossible. Google’s Dapper samples 1 in 1,024 traces by default.

Head-based sampling: Decision made at trace start.

  • Simple, stateless
  • No buffering required
  • Might miss interesting traces (errors, slow requests)

Tail-based sampling: Decision made after trace completion.

  • Can keep all errors, slow traces, specific paths
  • More intelligent sampling policies
  • Requires buffering complete traces
  • Needs consistent routing (all spans to same collector)

Tail sampling policies (OpenTelemetry Collector supports 13+):

PolicyKeep When
LatencyTrace duration > threshold
Status codeAny span has error status
Rate limitingN traces per second
ProbabilisticRandom % of traces
String attributeSpecific attribute values match

Hybrid approach: For very high volume, use head-based sampling first (keep 10%), then tail-based sampling on that subset for intelligent filtering.

Dapper’s design principles (from the 2010 paper):

  1. Ubiquitous deployment: Instrument ~1,500 lines of C++ in the RPC library—no application changes
  2. Low overhead: Sampling keeps CPU impact negligible
  3. Application transparency: Developers don’t need to think about tracing

Data flow:

  1. Spans written to local log files
  2. Daemons collect logs, write to BigTable
  3. Each row = one trace, each column = one span
  4. Median latency: 15 seconds from span creation to queryable

Two-level sampling:

  1. Trace sampling (in application): 1/1,024 traces instrumented
  2. Collection sampling (in daemon): Additional filtering before BigTable

“They leverage the fact that all spans for a given trace share a common trace id. For each span seen, they hash the trace id… either sample or discard entire traces rather than individual spans.”

Alertmanager processes alerts from Prometheus through:

  1. Dispatcher: Routes alerts based on label matchers
  2. Inhibition: Suppresses alerts when related alerts are firing
  3. Silencing: Mutes alerts during maintenance windows
  4. Grouping: Batches related alerts into single notifications
  5. Notification: Sends to receivers (Slack, PagerDuty, email)

Grouping example: During an outage affecting 100 pods, you want one page “100 pods down in cluster-east” not 100 separate pages.

route:
group_by: ["alertname", "cluster"]
group_wait: 30s # Wait for more alerts before sending
group_interval: 5m # Time between grouped notifications
repeat_interval: 4h # How often to re-send

Inhibition example: If the entire cluster is unreachable, suppress all per-pod alerts:

inhibit_rules:
- source_matchers: [severity="critical", alertname="ClusterDown"]
target_matchers: [severity="warning"]
equal: ["cluster"]

Static thresholds: Alert when error_rate > 1%

  • Predictable, easy to understand
  • Doesn’t adapt to traffic patterns
  • Requires manual tuning per service

Anomaly detection: Alert when value deviates from learned baseline

  • Adapts to patterns (daily cycles, growth)
  • Black box—hard to explain why it alerted
  • Cold start problem (needs historical data)
  • High false positive rates in practice

Production reality: Most teams use static thresholds with percentile-based targets. Anomaly detection works for specific use cases (fraud detection, capacity planning) but not general alerting.

Traditional alerting on symptoms (“error rate > 1%”) has problems:

  • What’s the right threshold? 1%? 0.1%?
  • A 1% error rate for 1 minute is different from 1% for 1 hour

Error budget = 1 - SLO. For 99.9% availability: 0.1% error budget per month (~43 minutes).

Burn rate = how fast you’re consuming error budget. Burn rate 1 = exactly on pace. Burn rate 10 = will exhaust budget in 3 days instead of 30.

Multi-window, multi-burn-rate alerts (Google’s recommendation):

Burn RateLong WindowShort WindowSeverityBudget Consumed
14.41 hour5 minPage2% in 1 hour
66 hours30 minPage5% in 6 hours
13 days6 hoursTicket10% in 3 days

Why two windows? The long window catches sustained issues. The short window prevents alerting on issues that already recovered.

Low-traffic caveat: If a service gets 10 requests/hour, 1 failure = 10% error rate = 1,000× burn rate. For low-traffic services, use longer evaluation windows or count-based thresholds.

An exemplar is a trace ID attached to a specific metric sample. When investigating a latency spike, click through to see an actual slow trace.

How it works:

# Prometheus metric with exemplar
http_request_duration_seconds_bucket{le="0.5"} 24054 # {trace_id="abc123"} 0.47

The metric records 24,054 requests ≤500ms. The exemplar says “here’s one specific request at 470ms with trace_id abc123.”

Requirements:

  • Client library must support exemplars (OpenTelemetry does)
  • TSDB must store exemplars (Prometheus 2.25+)
  • Visualization must display and link them (Grafana)

Include trace_id and span_id in structured logs:

{
"timestamp": "2024-01-15T10:23:45Z",
"level": "error",
"message": "Database connection timeout",
"trace_id": "0af7651916cd43dd8448eb211c80319c",
"span_id": "b9c7c989f97918e1",
"service": "order-service"
}

Now Grafana can show “all logs for this trace” with a single click from the trace view.

Service metrics (RED): Rate, Errors, Duration computed from spans

Benefits:

  • No separate instrumentation—traces generate metrics automatically
  • Consistent definitions across services
  • Enables drill-down from metric to trace

Trade-off: Trace-derived metrics have higher latency (spans must be processed) and depend on sampling rates for accuracy.

FactorPull (Prometheus)Push (OTLP)Hybrid (Collector)
Target health visibility Built-in up metric Requires separate checks Via receiver health
Works through firewalls Needs network access Outbound only Configurable
Short-lived jobs May miss Push before exit Via Pushgateway
Configuration locationCentralizedPer-applicationCentralized

Default recommendation: Pull for long-running services (simpler, health built-in). Push via OpenTelemetry Collector for serverless/ephemeral workloads.

FactorPrometheusVictoriaMetricsManaged (Datadog, etc.)
CostInfrastructure onlyInfrastructure onlyPer-metric pricing
Operational burdenModerateLow-moderateNone
High availabilityFederation/ThanosBuilt-in clusteringBuilt-in
Cardinality limits~10M series~100M seriesSoft limits + overage

Scale thresholds:

  • <1M active series: Single Prometheus
  • 1-10M series: Prometheus + remote write to long-term storage
  • 10M+ series: VictoriaMetrics cluster or managed service
ApproachOverheadCapture RateComplexity
Head 100%HighestAll tracesSimple
Head 1%Low1% randomSimple
Tail-basedMediumAll errors + sampledHigh (requires routing)
HybridLowBest of bothHighest

Default recommendation: Start with head-based 10-100% sampling. Move to tail-based when you need to capture all errors or have volume requiring <1% sampling.

ApproachFalse PositivesActionabilitySetup Effort
Static thresholdsMediumHigh (clear trigger)Low
Anomaly detectionHighMedium (why alert?)Medium
SLO burn rateLowHigh (budget impact clear)High

Default recommendation: SLO-based alerting with multi-window burn rates for critical services. Static thresholds for simpler cases where SLO definition is unclear.

The mistake: Adding request_id, user_id, or IP address as metric labels.

Why it happens: Developers want to query metrics by these dimensions. It works in development with 100 users.

The consequence: Production with 1M users creates 1M series per metric. Prometheus OOMs, queries timeout, alerts stop working.

The fix: High-cardinality dimensions belong in logs and traces, not metrics. If you need user-level metrics, use a logging pipeline with columnar storage (ClickHouse, BigQuery).

The mistake: alert: requests_total > 1000000

Why it happens: The counter is visible in Grafana, the number is big, seems like a problem.

The consequence: Alert fires based on how long the service has been running, not actual request rate. Restarts “fix” the alert.

The fix: Always use rate() or increase() with counters:

# Alert on request rate, not total
alert: HighRequestRate
expr: rate(requests_total[5m]) > 1000

The mistake: Head-based 1% sampling drops 99% of errors.

Why it happens: Error rate is 0.1%. Of that 0.1%, you keep 1% = 0.001% of requests are error traces you can analyze.

The consequence: When debugging production issues, no error traces are available.

The fix: Tail-based sampling with error retention policy, or hybrid approach:

# OpenTelemetry Collector tail sampling
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: probabilistic
type: probabilistic
probabilistic: { sampling_percentage: 1 }

The mistake: Alerting on every metric deviation without proper thresholds or grouping.

Why it happens: Starting with “alert on everything” seems safe. Better to over-alert than miss issues.

The consequence: On-call ignores alerts. Real incidents get lost in noise. Team burns out.

The fix:

  1. SLO-based alerts: Only page when error budget is burning
  2. Aggressive grouping: One alert per incident, not per pod
  3. Proper inhibition: Suppress symptoms when root cause is alerting
  4. Track signal-to-noise ratio: Target >50% actionable alerts

Distributed monitoring is not about collecting more data—it’s about collecting the right data at the right resolution for the right cost.

Key principles:

  1. Metrics for dashboards and alerts (low cardinality, high aggregation)
  2. Logs for debugging (high cardinality, full detail)
  3. Traces for request flow (sampled, correlated)
  4. Correlation is the multiplier: Exemplars, trace IDs in logs, metrics from traces

Start simple:

  • Pull-based metrics with Prometheus
  • Head-based trace sampling at 10%
  • SLO-defined alert thresholds

Scale when needed:

  • Tail-based sampling when error capture matters
  • VictoriaMetrics or managed service when cardinality exceeds 10M series
  • OpenTelemetry Collector when multi-backend or complex processing required

The 1.37 bytes/sample compression from Gorilla and the 1/1,024 sampling rate from Dapper aren’t arbitrary—they represent battle-tested trade-offs from systems handling billions of data points. Understand these trade-offs, and you can design monitoring that scales with your system.

  • Understanding of distributed systems concepts (latency, throughput, consistency)
  • Familiarity with basic statistics (percentiles, rates, distributions)
  • Experience with at least one monitoring system (Prometheus, Datadog, etc.)
  • Cardinality: Number of unique time series (metric name × label value combinations)
  • Scrape: Prometheus pulling metrics from a target’s /metrics endpoint
  • Span: Single operation in a distributed trace (has trace_id, span_id, parent_id)
  • Exemplar: A trace ID attached to a metric sample for correlation
  • Burn rate: Speed of error budget consumption relative to SLO (burn rate 1 = on pace)
  • Head block: In-memory portion of Prometheus TSDB containing recent samples
  • WAL (Write-Ahead Log): Durability mechanism that logs writes before applying them
  • OTLP (OpenTelemetry Protocol): Standard protocol for transmitting telemetry data
  • Cardinality is the primary scaling constraint for metrics—keep labels bounded (<10 unique values per label)
  • Gorilla compression achieves 1.37 bytes/sample through delta-of-delta timestamps and XOR-encoded values
  • Pull-based collection provides target health visibility; push works better for ephemeral workloads
  • Tail-based sampling captures all errors but requires consistent routing and buffering
  • SLO burn-rate alerts with multi-window evaluation reduce false positives and false negatives
  • Exemplars and trace IDs connect metrics, logs, and traces for end-to-end debugging
Continue Reading