
Logging, Metrics, and Tracing Fundamentals

Observability in distributed systems rests on three complementary signals: logs capture discrete events with full context, metrics quantify system behavior over time, and traces reconstruct request paths across service boundaries. Each signal answers different questions, and choosing wrong defaults for cardinality, sampling, or retention can render your observability pipeline either useless or prohibitively expensive. This article covers the design reasoning behind each signal type, OpenTelemetry’s unified data model, and the operational trade-offs that determine whether your system remains debuggable at scale.

[Figure] OpenTelemetry's unified instrumentation model: the application's OpenTelemetry SDK emits three signals (Logs: events + context; Metrics: aggregated measurements; Traces: request flow) through the export layer over the OTLP protocol to dedicated storage backends: a log store (Elasticsearch, Loki), a metric store (Prometheus, InfluxDB), and a trace store (Jaeger, Tempo). TraceId/SpanId and exemplars link the three pillars together.

Observability signals differ fundamentally in what they optimize:

  • Logs preserve full event context at the cost of storage; use them for debugging specific incidents
  • Metrics sacrifice detail for efficient aggregation; use them for alerting and dashboards
  • Traces reconstruct causality across services; use them to understand distributed request flow

The critical design decisions:

  1. Cardinality determines cost: metrics with unbounded label values (user IDs, request IDs) will crash your time-series database
  2. Sampling trades completeness for cost: head sampling is predictable but blind to errors; tail sampling captures interesting traces but requires buffering
  3. Context propagation (W3C Trace Context) links all three signals—without consistent TraceId propagation, correlation is manual and error-prone

OpenTelemetry (OTel) provides the current standard for instrumentation, unifying all signals under a single SDK with vendor-neutral export via OTLP (OpenTelemetry Protocol).

The three observability signals answer fundamentally different questions:

Signal    Question Answered                   Cardinality   Storage Cost   Query Pattern
Logs      "What happened in this request?"    Unbounded     Highest        Full-text search, filter
Metrics   "How is the system behaving?"       Bounded       Lowest         Aggregation, rate
Traces    "How did this request flow?"        Sampled       Medium         Trace ID lookup, DAG

Each signal represents a different trade-off between detail and scalability:

Logs preserve everything—the full request body, stack traces, variable values—but this completeness makes them expensive to store and slow to query. You cannot aggregate logs efficiently; searching requires scanning text.

Metrics discard detail to enable aggregation. A counter tracking http_requests_total can be summed across instances, compared over time windows, and stored in constant space per label combination. But you cannot reconstruct individual requests from metrics.

Traces bridge the gap by preserving request structure (which services were called, in what order, how long each took) while sampling to control cost. A trace lets you understand causality—why a request was slow—but sampling means you might miss the specific request causing a user complaint.

As of OpenTelemetry 1.x (stable since 2023), all three signals share common context:

Traces consist of Spans organized into a directed acyclic graph (DAG). Each Span contains:

  • TraceId (16 bytes): Unique identifier for the entire trace
  • SpanId (8 bytes): Unique identifier for this span
  • ParentSpanId: Links to parent span (empty for root)
  • Name: Operation name (e.g., HTTP GET /users)
  • StartTime, EndTime: Microsecond timestamps
  • Status: OK, Error, or Unset
  • Attributes: Key-value metadata (bounded cardinality)
  • Events: Timestamped annotations within the span
  • Links: References to related spans (e.g., batch job linking to triggering requests)

Metrics use a temporal data model with three primary types:

  • Counter: Monotonically increasing value (requests, bytes, errors)
  • Gauge: Point-in-time measurement (CPU%, queue depth, active connections)
  • Histogram: Distribution across buckets (latency percentiles)

Logs in OpenTelemetry include:

  • Timestamp: When the event occurred
  • SeverityNumber/SeverityText: Log level
  • Body: The log message (string or structured)
  • Attributes: Additional context
  • TraceId, SpanId: Links to active trace (automatic correlation)

The critical design choice: logs automatically inherit TraceId and SpanId from the active span context. This enables trace-log correlation without manual instrumentation.
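
Under the hood, log appenders read the active span's context; here is a minimal sketch of that lookup using the OpenTelemetry JavaScript API. The helper name and field layout are illustrative assumptions, not part of the original article.

import { trace } from "@opentelemetry/api";

// Hypothetical helper: emit a structured log line enriched with the active
// span's TraceId/SpanId so the log backend can correlate it with the trace.
function logWithTraceContext(
  level: string,
  message: string,
  fields: Record<string, unknown> = {}
): void {
  const spanContext = trace.getActiveSpan()?.spanContext();
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields,
    // Present only when called inside an active span
    trace_id: spanContext?.traceId,
    span_id: spanContext?.spanId,
  };
  console.log(JSON.stringify(entry));
}

// Usage inside instrumented request-handling code:
logWithTraceContext("ERROR", "Payment processing failed", { "error.type": "TimeoutException" });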

Structured logging replaces free-form text with machine-parseable records, typically JSON. The design reasoning: logs must be queryable at scale, and structured formats enable indexing, filtering, and aggregation that text logs cannot support.

A well-designed log schema includes:

// Example structured log entry
{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "message": "Payment processing failed",
  "error.type": "TimeoutException",
  "error.message": "Gateway timeout after 30s",
  "http.method": "POST",
  "http.url": "/api/v1/payments",
  "payment.amount": 99.99,
  "payment.currency": "USD"
}

Schema principles:

  1. Flat structure: Nested JSON (more than 2-3 levels) degrades query performance in most log backends
  2. Consistent naming: Use OpenTelemetry Semantic Conventions (http.method, error.type) for cross-service correlation
  3. Bounded values: Fields like http.status_code have finite cardinality; fields like user_id do not

Log levels communicate actionability, not importance:

Level   Semantics                                                Production Use
ERROR   Actionable failure requiring investigation               Alert, page on-call
WARN    Degraded condition, retry succeeded, missing config      Monitor, investigate if frequent
INFO    Business events, state transitions, request lifecycle    Normal operations
DEBUG   Diagnostic detail, variable values                       Disabled in production
TRACE   Extremely verbose diagnostics                            Never in production

Design reasoning: If every ERROR log does not warrant investigation, your log levels are miscalibrated. Teams that log expected conditions as ERROR (e.g., user validation failures) train themselves to ignore errors.

Log cardinality matters less than metric cardinality for storage—logs are already unbounded by nature—but it critically affects query performance and indexing costs.

High-cardinality fields (user IDs, request IDs, session tokens):

  • Store but do not index by default
  • Use full-text search for ad-hoc queries
  • Index only if frequently filtered

Indexing strategy:

Index: timestamp, level, service, trace_id, http.status_code
No index: user_id, request_body, stack_trace

Logging overhead in production applications:

  • Synchronous logging: 0.5-2ms per entry (disk I/O bound)
  • Asynchronous buffered logging: 0.01-0.1ms per entry (memory buffer, batch flush)
  • Network logging: 0.1-0.5ms per entry (depends on batching)

For high-throughput services (>10K requests/second), synchronous logging is not viable. Use async logging (a minimal sketch follows the list below) with:

  • Ring buffer for backpressure (drop oldest on overflow)
  • Batch size of 100-1000 entries
  • Flush interval of 100-500ms
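
A minimal sketch of such an async logger, assuming a Node.js process and an injectable sink; the class name, capacity, and thresholds are illustrative, not prescriptive.

// Illustrative async logger: ring buffer plus batched flush (names are hypothetical).
class AsyncLogger {
  private buffer: string[] = [];

  constructor(
    private sink: (batch: string[]) => void,   // e.g., write to stdout or a log shipper
    private capacity = 8192,                   // ring buffer size
    private batchSize = 500,                   // flush when this many entries are queued
    flushIntervalMs = 200                      // flush at least this often
  ) {
    setInterval(() => this.flush(), flushIntervalMs);
  }

  log(entry: object): void {
    if (this.buffer.length >= this.capacity) {
      this.buffer.shift();                     // backpressure: drop oldest on overflow
    }
    this.buffer.push(JSON.stringify(entry));
    if (this.buffer.length >= this.batchSize) {
      this.flush();
    }
  }

  private flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer.splice(0, this.buffer.length);
    this.sink(batch);                          // one batched write instead of per-entry I/O
  }
}

const logger = new AsyncLogger((batch) => process.stdout.write(batch.join("\n") + "\n"));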

Metrics enable the aggregation that logs cannot support. The fundamental constraint: metrics work because they have bounded cardinality. Each unique label combination creates a new time series, and time-series databases (TSDBs) store each series independently.

Counter: Cumulative value that only increases (or resets to zero on restart).

http_requests_total{method="GET", status="200"} 150432
http_requests_total{method="GET", status="500"} 42
http_requests_total{method="POST", status="201"} 8923

Query pattern: rate(http_requests_total[5m]) gives requests per second.

Gauge: Point-in-time measurement that can increase or decrease.

active_connections{pool="primary"} 47
cpu_usage_percent{core="0"} 72.3
queue_depth{queue="orders"} 128

Query pattern: instant value or average over time window.

Histogram: Distribution of observations across predefined buckets.

http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.25"} 32847
http_request_duration_seconds_bucket{le="0.5"} 34012
http_request_duration_seconds_bucket{le="1.0"} 34500
http_request_duration_seconds_bucket{le="+Inf"} 34567
http_request_duration_seconds_count 34567
http_request_duration_seconds_sum 8234.56

Query pattern: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) gives p95 latency.

Design reasoning for histograms: Buckets are cumulative (le = “less than or equal”) because cumulative buckets enable accurate percentile calculation during aggregation. Non-cumulative buckets lose information when summed across instances.
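
To illustrate why cumulative buckets aggregate cleanly, here is a rough sketch of the interpolation that histogram_quantile performs, written against plain arrays; the function is illustrative, not the Prometheus implementation.

// Cumulative histogram buckets: counts[i] = observations with value <= upperBounds[i].
// Because buckets are cumulative, counts from many instances can simply be
// summed element-wise before computing the quantile.
function estimateQuantile(upperBounds: number[], counts: number[], q: number): number {
  const total = counts[counts.length - 1];     // the +Inf bucket holds the total count
  const rank = q * total;                      // target observation rank
  for (let i = 0; i < counts.length; i++) {
    if (counts[i] >= rank) {
      const lowerBound = i === 0 ? 0 : upperBounds[i - 1];
      const lowerCount = i === 0 ? 0 : counts[i - 1];
      const bucketCount = counts[i] - lowerCount;
      if (bucketCount === 0 || upperBounds[i] === Infinity) return lowerBound;
      // Linear interpolation within the target bucket
      return lowerBound + (upperBounds[i] - lowerBound) * ((rank - lowerCount) / bucketCount);
    }
  }
  return upperBounds[upperBounds.length - 1];
}

// p95 from the bucket series shown above: roughly 0.25s
const p95 = estimateQuantile([0.1, 0.25, 0.5, 1.0, Infinity], [24054, 32847, 34012, 34500, 34567], 0.95);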

Histogram bucket boundaries should align with your SLOs (Service Level Objectives):

# Latency histogram for an API with a 200ms p99 SLO
buckets: [0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5, 5.0, 10.0]
#         |--fine-grained---|  |--SLO region--|  |---tail latency----|

Trade-off: More buckets = better resolution but more time series (one per bucket per label combination). A histogram with 5 labels averaging 10 values each could in the worst case produce 10⁵ label combinations; even at a realistic 500 observed combinations, 15 buckets yield 500 × 15 = 7,500 time series for a single metric.

The most common operational failure in metrics systems: unbounded label values crash your TSDB.

Example of cardinality explosion:

# DANGEROUS: user_id has unbounded cardinality
http_requests_total{user_id="u123", endpoint="/api/orders"}
# With 1M users and 50 endpoints:
# 1,000,000 × 50 = 50,000,000 time series

Prometheus holds time series in memory. At ~3KB per series, 50M series requires 150GB RAM—exceeding most deployments.

Symptoms of cardinality problems:

  • Prometheus memory usage grows unboundedly
  • Query latency increases from seconds to minutes
  • Scrape targets timeout
  • Alerting rules fail to evaluate

Mitigation strategies:

  1. Never use IDs as labels: User IDs, request IDs, transaction IDs belong in logs and traces, not metrics
  2. Pre-aggregate in application: Count users per tier (premium, standard) rather than per user; see the instrumentation sketch after this list
  3. Cardinality budgets: Set limits per metric (e.g., max 1000 series per metric)
  4. Metric relabeling: Drop high-cardinality labels at scrape time
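
A rough illustration of the first two mitigations using the OpenTelemetry metrics API: record only bounded attributes and a pre-aggregated tier instead of user IDs. The instrument and attribute names are assumptions, not prescribed values.

import { metrics } from "@opentelemetry/api";

// Assumes a MeterProvider has been registered elsewhere in the process.
const meter = metrics.getMeter("checkout-service");

const requestCounter = meter.createCounter("http.server.requests", {
  description: "Total HTTP requests",
});
const requestDuration = meter.createHistogram("http.server.duration", {
  description: "HTTP request duration",
  unit: "s",
});

export function recordRequest(method: string, statusCode: number, userTier: string, durationSeconds: number): void {
  // Bounded attributes only: method, status code, and a pre-aggregated tier.
  // The user ID itself belongs in logs and traces, never in metric labels.
  const attributes = {
    "http.method": method,
    "http.status_code": statusCode,
    "user.tier": userTier, // e.g., "premium" or "standard"
  };
  requestCounter.add(1, attributes);
  requestDuration.record(durationSeconds, attributes);
}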

RED Method (for services): Request-oriented metrics aligned with user experience.

  • Rate: Requests per second (rate(http_requests_total[5m]))
  • Errors: Error rate (rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]))
  • Duration: Latency distribution (histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))

USE Method (for resources): Resource-oriented metrics for infrastructure.

  • Utilization: Percent time resource is busy (avg(cpu_usage_percent))
  • Saturation: Queue depth, waiting work (avg(runnable_tasks_count))
  • Errors: Hardware errors, I/O failures (rate(disk_errors_total[5m]))

Design reasoning: RED tells you when users are impacted; USE tells you why. A latency spike (RED) might be caused by CPU saturation (USE). Use both methods together.

The SRE book’s monitoring framework:

  1. Latency: Time to service a request (distinguish success vs error latency)
  2. Traffic: Demand on the system (requests/sec, transactions/sec)
  3. Errors: Rate of failed requests (explicit 5xx, implicit timeouts)
  4. Saturation: How “full” the system is (CPU, memory, queue depth)

Golden Signals overlap with RED+USE but emphasize saturation as a leading indicator. A system at 95% CPU is not yet failing (RED looks fine) but cannot absorb traffic spikes.

Tracing reconstructs request flow across service boundaries. The core abstraction: a Span represents a unit of work with defined start/end times, and Traces are directed acyclic graphs of spans sharing a TraceId.

Trace: abc123
├── Span: 001 (root) "HTTP GET /checkout" [0ms - 250ms]
│   ├── Span: 002 "Query user" [10ms - 30ms]
│   ├── Span: 003 "HTTP POST /payment-service" [35ms - 200ms]
│   │   └── Span: 004 "Process payment" [40ms - 195ms]
│   │       ├── Span: 005 "Validate card" [45ms - 60ms]
│   │       └── Span: 006 "Charge gateway" [65ms - 190ms]
│   └── Span: 007 "Update inventory" [205ms - 245ms]
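
A minimal sketch of how nested spans like these are produced with the OpenTelemetry JavaScript tracer, assuming an SDK and context manager are configured elsewhere; the service and operation names are illustrative.

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout-service");

async function checkout(): Promise<void> {
  // Root span; spans started inside the callback are parented automatically
  await tracer.startActiveSpan("HTTP GET /checkout", async (root) => {
    try {
      await tracer.startActiveSpan("Query user", async (span) => {
        // ... database call ...
        span.end();
      });
      await tracer.startActiveSpan("HTTP POST /payment-service", async (span) => {
        span.setAttribute("http.method", "POST");
        // ... remote call; the payment service continues the trace via traceparent ...
        span.end();
      });
    } catch (err) {
      root.recordException(err as Error);
      root.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      root.end();
    }
  });
}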

Each span includes:

  • Operation name: What work this span represents
  • Timing: Start and end timestamps (microsecond precision)
  • Attributes: Key-value metadata (e.g., http.status_code: 200)
  • Events: Timestamped annotations within the span (e.g., “Retry attempt 2”)
  • Status: Success, error, or unset

Trace context must cross process boundaries (HTTP calls, message queues, RPC). The W3C Trace Context standard defines two headers:

traceparent: 00-{trace-id}-{span-id}-{flags}

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
             |  trace-id (16 bytes)              span-id (8 bytes) flags (01 = sampled)
             version

tracestate: Vendor-specific extensions

tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7

Design reasoning: W3C Trace Context separates trace identity (traceparent) from vendor extensions (tracestate). This enables multi-vendor environments—your Datadog instrumentation can coexist with New Relic.

Prior to W3C Trace Context (standardized 2020): Multiple incompatible formats existed. Zipkin used B3 headers (X-B3-TraceId, X-B3-SpanId), Jaeger used uber-trace-id, and AWS X-Ray used X-Amzn-Trace-Id. Cross-vendor correlation required custom bridging.
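
For illustration, a sketch of carrying W3C Trace Context across an HTTP hop with the OpenTelemetry propagation API (whose default propagator in the SDK writes traceparent/tracestate); the plain-object header carrier is an assumption for brevity.

import { context, propagation } from "@opentelemetry/api";

// Client side: inject the current trace context into outgoing request headers.
const outgoingHeaders: Record<string, string> = {};
propagation.inject(context.active(), outgoingHeaders);
// outgoingHeaders now contains "traceparent" (and "tracestate" if set)

// Server side: extract the incoming context so new spans join the caller's trace.
function handleRequest(incomingHeaders: Record<string, string>): void {
  const extracted = propagation.extract(context.active(), incomingHeaders);
  context.with(extracted, () => {
    // Spans started here become children of the caller's span
  });
}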

Span Attributes attach to a single span and are not propagated:

span.setAttribute("user.tier", "premium")
span.setAttribute("feature.flag.checkout_v2", true)

Baggage propagates across service boundaries:

// Service A: set baggage on the current context
const withTier = propagation.setBaggage(
  context.active(),
  propagation.createBaggage({ "user.tier": { value: "premium" } })
)

// Service B: read baggage (propagated automatically in the `baggage` header)
const tier = propagation.getBaggage(context.active())?.getEntry("user.tier")?.value

// Must explicitly copy to a span attribute if it should be queryable in the backend
span.setAttribute("user.tier", tier)

Design reasoning: Baggage is a transport mechanism, not storage. It travels with requests but does not automatically appear in trace backends. You must explicitly copy baggage entries to span attributes if you want them queryable.

Use cases for baggage:

  • User ID for downstream services to log/trace
  • Feature flags affecting request handling
  • Deployment version for canary analysis
  • Tenant ID in multi-tenant systems

Warning: Baggage is sent in HTTP headers. Keep it small (<8KB total) and never include sensitive data.

At scale, 100% trace collection is economically infeasible. A service handling 100K requests/second with 5 spans per request generates roughly 43 billion spans per day. At typical trace storage costs ($1-5 per million spans), that's $43K-$215K per day.

Sampling reduces cost but trades off completeness.

Decision made at trace start, before any spans are generated:

// Probabilistic head sampler: sample ~10% of new traces, chosen by TraceId.
// ParentBasedSampler makes each service respect an incoming sampling decision
// (the sampled flag in the traceparent header) instead of re-deciding.
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
})

Characteristics:

  • Low overhead: decision before span generation
  • Predictable cost: fixed percentage of traces
  • Blind to content: cannot sample based on errors or latency (unknown at decision time)
  • Efficient: non-sampled requests generate no spans

When to use: High-volume systems where cost predictability matters more than capturing every error.

Decision made after trace completion, based on trace content:

// Tail sampling policy: keep errors and slow traces
const policies = [
  { name: "errors", type: "status_code", status_codes: ["ERROR"] },
  { name: "slow", type: "latency", threshold_ms: 1000 },
  { name: "baseline", type: "probabilistic", sampling_percentage: 5 },
]

Characteristics:

  • Higher overhead: all spans buffered until trace completes
  • Variable cost: depends on how many traces match policies
  • Content-aware: can sample based on errors, latency, attributes
  • Complex infrastructure: requires centralized collector with buffering

When to use: When capturing all errors is more important than cost predictability.

Production systems typically combine both:

  1. Head sampling at 10%: Reduces span generation by 90%
  2. Tail sampling on remaining 10%: Keeps errors, slow traces, and 50% random sample

Effective rate: ~5% of normal traces, ~10% of errors/slow traces.

100% requests
▼ Head sampling (10%)
10% span generation
▼ Tail sampling
~5% normal traces stored
~10% error traces stored (boosted)

Sampling breaks trace-log-metric correlation when inconsistent:

Problem: If traces are sampled at 10% but logs are 100%, 90% of logs have no corresponding trace.

Solutions:

  1. Always log TraceId: Even non-sampled traces have valid TraceIds
  2. Sample logs with traces: Use the sampled flag to gate verbose logging (sketched after this list)
  3. Exemplars: Attach trace IDs to metric samples for drill-down
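
A small sketch of the second solution, gating verbose logging on the active span's sampled flag via the OpenTelemetry JavaScript API; the helper is illustrative.

import { trace, TraceFlags } from "@opentelemetry/api";

// Emit DEBUG-level detail only for requests whose trace was sampled,
// so verbose logs always have a stored trace to correlate with.
function isSampled(): boolean {
  const spanContext = trace.getActiveSpan()?.spanContext();
  return spanContext !== undefined && (spanContext.traceFlags & TraceFlags.SAMPLED) !== 0;
}

if (isSampled()) {
  console.debug(JSON.stringify({ level: "DEBUG", message: "Cart contents", items: ["sku-1", "sku-2"] }));
}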

Exemplars bridge metrics and traces:

http_request_duration_seconds_bucket{le="0.5"} 1234 # {trace_id="abc123"} 0.32

This metric sample includes a reference to trace abc123, enabling drill-down from an aggregated metric to a specific trace.

Observability is only useful if it drives action. Dashboards visualize system state; alerts trigger investigation.

Level-of-detail hierarchy:

  1. Overview dashboard: SLO status, error budget, traffic. 5-10 panels max.
  2. Service dashboard: RED metrics per service. Drill-down from overview.
  3. Infrastructure dashboard: USE metrics per resource. Linked from service dashboard.
  4. Debug dashboard: Detailed metrics for incident investigation. Not for routine monitoring.

Anti-pattern: Wall of graphs that no one looks at. If a panel does not drive action, remove it.

Alert on symptoms, not causes:

  • Good: Error rate > 1% for 5 minutes (user impact)
  • Bad: CPU > 80% (might be normal, might not affect users)

Multi-window alerting (SLO-based):

# Alert when burning error budget too fast:
# 2% of a 30-day budget consumed in 1 hour = 14.4x burn rate (0.02 × 720)
# Threshold below assumes a 99.9% SLO, i.e., an error budget of 0.001
- alert: HighErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)   # burn rate × error budget
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)   # short window confirms the burn is still happening
  for: 5m

Alert fatigue mitigation:

  • Require sustained condition (for: 5m) to avoid flapping
  • Page only on user-impacting symptoms
  • Ticket/warning for leading indicators (saturation)
  • Review and prune alerts quarterly

Every alert should link to a runbook covering:

  1. What this alert means: Symptom description
  2. Impact assessment: User-facing? Which users?
  3. Investigation steps: Which dashboards? What queries?
  4. Remediation options: Scaling? Rollback? Restart?
  5. Escalation path: Who to contact if stuck

Effective observability requires understanding the fundamental trade-offs:

Logs provide complete context but are expensive and slow to query. Use them for debugging specific incidents, not routine monitoring. Structure them for queryability and keep high-cardinality fields out of indexes.

Metrics enable aggregation and alerting but require bounded cardinality. Never use unbounded values (IDs, emails) as metric labels. Design histograms with SLO-aligned buckets.

Traces reconstruct distributed request flow but require sampling. Choose sampling strategy based on whether you prioritize cost predictability (head) or error capture (tail). Hybrid approaches offer balance.

Context propagation (W3C Trace Context) links all three signals. Without consistent TraceId propagation, correlation is manual and unreliable.

The cost hierarchy—metrics cheapest, traces middle, logs expensive—should guide your instrumentation strategy. Capture metrics for everything, sample traces for request flow, and log detailed context only when investigating.

Prerequisites

  • Familiarity with distributed systems concepts (services, RPC, message queues)
  • Basic understanding of time-series data and aggregation
  • Experience with at least one monitoring tool (Prometheus, Datadog, etc.)

Glossary

  • Cardinality: Number of unique values a field can take; in metrics, the number of unique label combinations
  • OTLP: OpenTelemetry Protocol, the wire format for exporting telemetry
  • SLO: Service Level Objective, a target for service reliability (e.g., 99.9% availability)
  • TSDB: Time-Series Database, optimized for timestamped data (e.g., Prometheus, InfluxDB)
  • Exemplar: A sample metric data point with a trace ID attached for drill-down

Key Takeaways

  • Logs, metrics, and traces answer different questions: logs for “what happened,” metrics for “how is it behaving,” traces for “how did the request flow”
  • Cardinality is the critical constraint for metrics; unbounded labels crash TSDBs
  • Head sampling is predictable but blind; tail sampling captures errors but costs more
  • W3C Trace Context (traceparent, tracestate) is the standard for context propagation
  • RED (Rate, Errors, Duration) measures user experience; USE (Utilization, Saturation, Errors) diagnoses infrastructure
  • Alert on symptoms (error rate), not causes (CPU usage)
