Observability Stack — Compute Reference

A reference for understanding modern production observability: the three (now four) signal pillars, the open-source and commercial stacks that move and store them, the reliability discipline (SLOs/SLIs/SLAs, error budgets) that gives the signals meaning, and the operational pitfalls that turn an observability program into a cost-and-noise sink.


1. At a glance — the pillars

Observability is the property of a system that lets you ask new questions about its internal state from its external outputs, without shipping new code. It is distinct from monitoring, which is the act of watching predefined dashboards and firing alerts on predefined conditions. Monitoring answers known questions about known failure modes; observability lets you debug failure modes you did not anticipate.

The pillars (Majors, Fong-Jones, Miranda, Observability Engineering, 2022):

  • Metrics — numeric time series, aggregated, low-cardinality. Cheap to store, fast to query, lose information per event. The dashboard substrate.
  • Logs — discrete events, usually text or structured JSON, high volume. Verbose but per-event detail.
  • Traces — causally linked spans across services for a single request. The only signal that captures how services interact under a single logical operation.
  • Profiles — sampled stack traces over time, attributed to functions/lines, used for CPU, memory allocation, off-CPU, and lock contention. The fourth pillar, treated as first-class by OpenTelemetry as of 2024.

Two collection modes complement each other:

  • Black-box — probe the system externally (HTTP health checks, synthetic transactions). Sees the system as a user sees it; cannot explain why.
  • White-box — emitted by instrumented code inside the system (counters, traces, structured logs). Explains internal causation.

Modern stacks blend all four pillars and both modes, with correlation IDs stitching them together so a single user complaint can be expanded from a black-box failure into the failing span, the related logs, and the function consuming CPU at that instant.


2. SLO, SLI, SLA, and the error budget

The discipline that gives observability its purpose comes from Google’s SRE practice, codified in Site Reliability Engineering (Beyer et al., 2016) and its sequels.

  • SLI — Service Level Indicator. A measurement of one aspect of service health. E.g., p99 latency of /checkout over 5 minutes, fraction of HTTP requests returning 2xx or 3xx, bytes successfully written and acknowledged per second.
  • SLO — Service Level Objective. A target for an SLI over a window. E.g., p99 latency < 200 ms over rolling 30-day window, success rate ≥ 99.9% over 28 days. The SLO is internal; the team commits to it.
  • SLA — Service Level Agreement. A customer-facing contract with consequences (credits, refunds). Usually weaker than the SLO so the team has internal slack. E.g., SLO 99.95% with SLA 99.9%.

The error budget is the complement of the SLO (1 − SLO). If the SLO is 99.9% over 28 days, the budget is 0.1% of 28 days ≈ 40 minutes 19 seconds of allowed failure. While the budget has remaining balance, the team can ship features aggressively; when the budget is depleted, the policy is to halt feature work and spend engineering effort on reliability until the budget regenerates. This converts reliability from a debate into a numeric resource.
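
The budget arithmetic can be sketched in a few lines. This is a minimal illustration for a time-based SLO; the function name error_budget is hypothetical, not taken from any library.

```python
from datetime import timedelta

def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time for a time-based SLO over the given window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # Budget is the complement of the SLO applied to the window length.
    return timedelta(seconds=(1.0 - slo) * window.total_seconds())

budget_28d = error_budget(0.999, timedelta(days=28))  # about 40 min 19 s
budget_30d = error_budget(0.999, timedelta(days=30))  # about 43 min 12 s
print(budget_28d, budget_30d)
```

For a request-based SLO the same complement applies to request counts rather than to wall-clock time.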

Burn-rate alerts (Google SRE Workbook, 2018) are the modern multi-window approach: alert when the budget is burning fast enough that, sustained, it would exhaust in a small fraction of the window. A typical configuration checks each burn-rate threshold over a long and a short window (e.g., 14.4× over 1 hour and 5 minutes for the fast page, 6× over 6 hours and 30 minutes for the slow page) so that brief spikes do not page but sustained problems do.

A small set of well-chosen SLIs (typically the “Four Golden Signals” — latency, traffic, errors, saturation, from the SRE book) outperforms many vanity metrics.

Choosing good SLIs

Not every metric makes a good SLI. The criteria:

  • User-meaningful. A latency SLI on a backend internal endpoint matters only insofar as it composes into a user-perceived latency. Page on user-visible symptoms; instrument internals to explain them.
  • Aggregable. SLIs that combine cleanly across instances and shards (request count, error count, latency histograms) work in distributed systems. Gauges (memory usage at this instant) generally do not aggregate well.
  • Implementable cheaply. An SLI you sample 10× per second is worth more than one you can only measure end-of-day from a batch job.
  • Ratio-based for SLOs. “Fraction of requests successful in last 28 days” is operable; “average latency” is not (averages hide tail behavior). Prefer success-ratio SLOs and percentile-based latency SLOs.
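
To see why an average is a poor latency SLI, the sketch below compares the mean, a nearest-rank p99, and a "fraction under threshold" ratio on a made-up sample. The data and the 200 ms threshold are assumptions chosen for illustration.

```python
import math

def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list; q in (0, 100]."""
    rank = math.ceil(q * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]

def fraction_under(values, threshold):
    """Share of values strictly below the threshold (a ratio-style SLI)."""
    return sum(1 for v in values if v < threshold) / len(values)

# Hypothetical sample: 98 fast requests at 50 ms and 2 slow ones at 5 s.
latencies_ms = sorted([50.0] * 98 + [5000.0] * 2)

mean_ms = sum(latencies_ms) / len(latencies_ms)    # 149.0
p99_ms = percentile(latencies_ms, 99)              # 5000.0
good_ratio = fraction_under(latencies_ms, 200.0)   # 0.98
print(mean_ms, p99_ms, good_ratio)
```

The mean sits comfortably under 200 ms even though 2% of requests take five seconds; the percentile and the ratio both expose the tail.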

A common rough sketch for a request-oriented service: one availability SLI (success ratio), one latency SLI (p99 or distribution-based “fraction of requests under threshold”), and one freshness SLI for any data the service publishes (e.g., 99% of pipeline rows arrive within 10 min of source). For a job system: a throughput SLI and a correctness SLI (rows reconciled with source within tolerance).

Composite and dependency SLOs

A product SLO (“checkout works”) is the product of upstream service SLOs. If checkout depends on five services each at 99.9%, the worst-case composed SLO is ~99.5%, which is a useful sanity check on whether upstream commitments are achievable. The SRE Workbook chapter on “Implementing SLOs” walks through this composition explicitly.
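
The composition is a plain product when every dependency is on the critical path and failures are independent. Both are simplifying assumptions, and the function name is hypothetical.

```python
import math

def composed_availability(dependency_slos):
    """Serial composition: every dependency must succeed, failures independent."""
    return math.prod(dependency_slos)

composed = composed_availability([0.999] * 5)
print(f"{composed:.4%}")  # roughly 99.5%
```

Correlated failures, retries, and graceful degradation all move the real number, so treat the product as a sanity check rather than a prediction.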


3. Metrics — Prometheus and its ecosystem

Prometheus (originated at SoundCloud in 2012, donated to CNCF, graduated 2018) is the de facto open-source metrics system for cloud-native infrastructure. Its design choices are the canonical ones now:

  • Pull-based scraping. Prometheus polls HTTP endpoints exposing /metrics in the Prometheus exposition format. Targets are discovered (not configured statically) via service discovery. Pull simplifies failure detection (if the target is gone, the scrape fails — a positive signal) and avoids the global-write problem of push systems.
  • Local time-series database. Two-hour blocks on disk; head block in memory; compaction; WAL for crash safety. Designed for single-node operation; long-term storage is a separate concern.
  • PromQL. A functional language over labeled time series. rate(), irate(), histogram_quantile(), sum by (label)(...), topk(), recording rules, and alerting rules.
  • Labels are the cardinality knob. Each unique combination of label values creates a new time series. High-cardinality labels (user IDs, request IDs, full URL paths) explode storage and query cost — the canonical operator footgun.
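
A rough upper bound on series count for one metric is the product of the distinct values of each label. The label names and counts below are invented for illustration.

```python
import math

def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return math.prod(label_cardinalities.values())

# Hypothetical label sets for an HTTP request counter.
bounded = {"method": 5, "status": 8, "route": 40}
with_user_id = {**bounded, "user_id": 100_000}

print(max_series(bounded))       # 1600
print(max_series(with_user_id))  # 160000000
```

Real counts are usually lower because not every combination occurs, but one unbounded label is enough to multiply the total by orders of magnitude.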

Exporters

Prometheus does not push agents into your code; the Prometheus client libraries (Go, Python, Java, Ruby, .NET, Rust) expose /metrics. For things you cannot modify, exporters translate native telemetry to the Prometheus format:

  • node_exporter — Linux/Unix host metrics: CPU, memory, disk, network, filesystem.
  • windows_exporter — equivalent for Windows.
  • blackbox_exporter — HTTP/HTTPS/TCP/ICMP/DNS probes. The standard black-box monitoring source.
  • kube-state-metrics — Kubernetes object state (pod phases, deployment status, replica counts). Distinct from the Kubernetes metrics-server (which serves resource usage for the HPA).
  • cAdvisor — container-level CPU/memory/network, embedded in kubelet.
  • mysqld_exporter, postgres_exporter, redis_exporter, mongodb_exporter — datastores.
  • JMX exporter — JVM metrics via JMX (Kafka, Cassandra, Elasticsearch all expose JMX).
  • snmp_exporter — networking gear.

Service discovery

Static configuration does not scale past tiny environments. Prometheus supports SD for:

  • Kubernetes — pods, services, endpoints, nodes, ingresses via the K8s API.
  • Consul, etcd, ZooKeeper — service registries.
  • EC2, GCE, Azure, DigitalOcean, Hetzner, Linode, Scaleway, OpenStack — cloud SDs.
  • DNS SD — SRV/A records.
  • file_sd — JSON/YAML files written by external automation. The fallback when nothing else fits.

Long-term and horizontal storage

Prometheus is single-node by design. Production deployments use one of:

  • Thanos (Improbable, 2017; CNCF incubating) — sidecar that ships blocks to object storage (S3, GCS, Azure Blob, Swift), querier that fans out to multiple Prometheuses, store gateway for historical reads, compactor for downsampling.
  • Cortex (Weaveworks → CNCF incubating) — multi-tenant horizontally scalable Prometheus-compatible service. The basis of Grafana Cloud’s hosted Prometheus.
  • Grafana Mimir (Grafana Labs, 2022 — fork from Cortex with AGPL relicensing) — horizontally scalable, multi-tenant, object-storage-backed; the current Grafana-stack recommendation.
  • VictoriaMetrics — high-performance Prometheus-compatible TSDB, often deployed standalone instead of Prometheus + remote storage; aggressive on compression and ingest throughput.
  • M3DB (Uber, 2017) — high-throughput TSDB, less common today than the above.

PromQL idioms

# Per-second rate of HTTP requests over 5 minutes, summed by status code.
sum by (status) (rate(http_requests_total[5m]))
 
# p99 latency from a histogram, bucketed by le.
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
 
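# Burn-rate alert (SLO 99.9%, fast burn): both the 1h and 5m windows must exceed 14.4x.
(sum(rate(http_errors_total[1h])) / sum(rate(http_requests_total[1h])) > 14.4 * (1 - 0.999))
and
(sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 14.4 * (1 - 0.999))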
 
# Saturation: percentage of disk space used.
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

Alertmanager and the push gateway

  • Alertmanager — separate process; receives alerts from Prometheus, performs grouping (collapsing similar alerts), inhibition (silencing dependent alerts when a parent fails), silencing (manual mutes), deduplication across HA Prometheus pairs, and routing to receivers (PagerDuty, Slack, Opsgenie, email, webhook).
  • Pushgateway — receives metrics pushed from short-lived jobs (cron jobs, batch jobs) so they can be scraped by Prometheus before they exit. Use sparingly; the canonical anti-pattern is pushing service-level metrics through it (which then live forever past the service’s lifetime). For true batch jobs only.

OpenMetrics

OpenMetrics (CNCF, 2017) is the standardization of the Prometheus text exposition format. Adds protobuf encoding, exemplars (links from a metric data point to a trace ID), and clearer semantics. Prometheus, OpenTelemetry, and most exporters now support OpenMetrics natively.

Native histograms

Classic Prometheus histograms emit fixed-bucket counters (le="0.005", le="0.01", …), which forces a guess at bucket boundaries up-front: too narrow misses the distribution, too wide makes histogram_quantile() lie. Native histograms (Prometheus 2.40, 2022; still experimental but production-used) store an exponential-bucket structure where bucket boundaries are derived from a schema; storage is dense, queries are exact within the resolution, and no upfront bucket choice is needed. OpenTelemetry’s exponential histogram signal is the same idea and bridges into Prometheus via remote-write. Adopting native histograms is the easiest large win on latency-SLO accuracy.
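
The exponential-bucket idea can be sketched as follows. The indexing follows the convention described in the OpenTelemetry exponential histogram data model as I understand it (bucket i covers (base^i, base^(i+1)] with base = 2^(2^-scale)); Prometheus uses the same schema idea with its own bucket numbering. The function names are hypothetical and only positive values are handled.

```python
import math

def bucket_index(value: float, scale: int) -> int:
    """Index i such that base**i < value <= base**(i + 1), base = 2**(2**-scale)."""
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    return math.ceil(math.log2(value) * 2.0 ** scale) - 1

def bucket_bounds(index: int, scale: int) -> tuple:
    """(exclusive lower, inclusive upper) boundary of bucket `index`."""
    log2_base = 2.0 ** -scale
    return 2.0 ** (index * log2_base), 2.0 ** ((index + 1) * log2_base)

# At scale 3 each bucket is about 9% wider than the previous one,
# so relative error stays bounded without choosing boundaries up front.
idx = bucket_index(0.123, scale=3)
low, high = bucket_bounds(idx, scale=3)
print(idx, low, high)
```

A production implementation would guard against floating-point error at exact bucket boundaries; this sketch does not.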


4. Logs

Aggregation agents (collectors that ship logs to a store)

  • Fluent Bit (CNCF, written in C, ~640 KB image) — the modern lightweight agent of choice for Kubernetes and edge. Replaces Fluentd in most new deployments.
  • Fluentd (CNCF, written in Ruby) — older, plugin-rich (>1000 plugins), heavier.
  • Vector (Datadog OSS, written in Rust, 2019) — single binary, very fast, treats logs/metrics/traces as a unified pipeline with a VRL transform language.
  • Filebeat (Elastic) — part of the Elastic Beats family, tightly integrated with Elasticsearch/OpenSearch and Logstash.
  • Logstash (Elastic, JRuby) — heavier transform engine; often replaced by Vector or pipeline-side processing.
  • Splunk Universal Forwarder — Splunk’s agent; sends to Splunk indexers.
  • OpenTelemetry Collector — increasingly used for logs too, unifying the agent across all three signals.

Stores

  • Loki (Grafana Labs, 2018) — indexes labels only, not log content; chunks compressed and shipped to object storage. Cheap by an order of magnitude vs Elasticsearch; queried with LogQL, deliberately similar to PromQL. Cannot do full-text search efficiently: the design tradeoff is that you keep labels few and low-cardinality and filter (grep) log content at query time.
  • Elasticsearch (Elastic NV) — inverted-index full-text search via Lucene. The most flexible query model for logs; expensive at scale.
  • OpenSearch — AWS-led fork of Elasticsearch 7.10 after Elastic’s 2021 license change. Same architecture, Apache 2.0.
  • Splunk Enterprise / Splunk Cloud — proprietary, mature, expensive (priced per GB ingested per day historically; workload pricing more recently). SPL is the query language.
  • Datadog Logs, New Relic Logs, Sumo Logic, Logz.io, Better Stack / Logtail — managed SaaS.
  • AWS CloudWatch Logs, Google Cloud Logging, Azure Monitor Logs — cloud-provider native.
  • Quickwit, OpenObserve — newer entrants that, like Loki on S3, lean on object-storage economics.

Structured logging

Unstructured printf-style logs ("User 12345 made a request") require regex extraction at query time and are brittle. Structured logging emits JSON (or logfmt) with explicit fields: {"ts":"...", "level":"info", "user_id":"12345", "route":"/checkout", "trace_id":"abc..."}. Every modern logging library supports it (zap, zerolog, slog in Go; structlog in Python; Serilog in .NET; SLF4J + logstash-logback-encoder in JVM).

Correlation IDs — every log line should carry a trace_id and span_id from the active OpenTelemetry context, plus an HTTP request ID propagated from the load balancer. This is what lets you pivot from a log entry to the parent trace and back.
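
A minimal structured-logging sketch using only the Python standard library. The field names mirror the example record above; the trace and span IDs are hard-coded placeholders here, whereas a real service would read them from the active OpenTelemetry span context.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with explicit fields."""
    CONTEXT_KEYS = ("trace_id", "span_id", "user_id", "route")

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout_example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "request served",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
           "span_id": "00f067aa0ba902b7",
           "user_id": "12345", "route": "/checkout"},
)
line = stream.getvalue().strip()
print(line)
```

Because trace_id is a top-level field, a log store can filter on it directly instead of extracting it with a regex.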

Log levels and discipline

Standard levels in most logger libraries: TRACE, DEBUG, INFO, WARN, ERROR, FATAL (syslog defines its own eight severities, from emerg down to debug). The convention worth enforcing:

  • ERROR means a human should look at this. If your codebase emits ERROR routinely (auth failures from a brute-force scanner, expected DB constraint violations) you have diluted the level until it is useless. Move expected failures to WARN or INFO with a structured field; reserve ERROR for genuine “this needs attention.”
  • WARN means a degradation or a deprecated path — visible in dashboards, not paging.
  • INFO is the runtime narrative: lifecycle events, request summaries, decisions.
  • DEBUG is opt-in detail for debugging; off in prod for most services or sampled at 1%.
  • TRACE is per-call internal detail; almost always off in prod.

Dynamic log-level controls (level overrides per-logger via a config endpoint, feature flag, or signal) let you turn on DEBUG for one service for 15 minutes during a live debug without redeploying. Java SLF4J + Logback, Python logging, and most Go loggers support this; OpenTelemetry has a draft “dynamic logging” spec.
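
A time-boxed level override can be sketched with the standard library alone. The helper name override_level is hypothetical; in practice it would sit behind an admin endpoint, feature flag, or signal handler.

```python
import logging
import threading

def override_level(logger_name: str, level: int, seconds: float) -> threading.Timer:
    """Set a logger's level now and restore the previous level after `seconds`."""
    logger = logging.getLogger(logger_name)
    previous = logger.level
    logger.setLevel(level)
    # The timer thread calls logger.setLevel(previous) when it fires.
    timer = threading.Timer(seconds, logger.setLevel, args=(previous,))
    timer.daemon = True
    timer.start()
    return timer

payments_log = logging.getLogger("payments_example")
payments_log.setLevel(logging.INFO)

# Turn on DEBUG for 15 minutes without redeploying.
restore_timer = override_level("payments_example", logging.DEBUG, 15 * 60)
print(payments_log.isEnabledFor(logging.DEBUG))
restore_timer.cancel()  # cancelled here only so the example exits promptly
payments_log.setLevel(logging.INFO)
```

The automatic restore matters as much as the override: a DEBUG level left on by accident is a common source of surprise log bills.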

Sampling

For high-volume services, logging every event is expensive. Two strategies:

  • Head-based — decide at log time, often probabilistically (1% of debug logs). Cheap, but you lose context for events you will later regret discarding.
  • Tail-based — buffer logs for the duration of a trace, then decide whether to keep based on the trace outcome (error? slow?). Implemented in the OpenTelemetry Collector tail-sampling processor for traces; rarer for logs because of the buffering cost.
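
One way to make head-based sampling less lossy is to key the decision on the trace ID, so every log line of a sampled request is kept together. This is a sketch of that variant, not a description of any particular agent; keep_log is a hypothetical name.

```python
import hashlib

def keep_log(trace_id: str, sample_rate: float) -> bool:
    """Deterministic head-based decision: same trace ID, same answer everywhere."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a position in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < sample_rate

kept = sum(keep_log(f"trace-{i}", 0.01) for i in range(10_000))
print(kept)  # close to 1% of 10,000
```

Because the decision depends only on the trace ID, independent services reach the same verdict without coordinating.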

5. Traces

A trace is a tree of spans. Each span represents one unit of work (a function call, an RPC, a DB query). Spans have a trace_id (shared across the whole trace), a span_id (unique to this span), a parent_span_id, a start time, a duration, and a bag of attributes and events.
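
The span fields above map naturally onto a small record type, and the tree is recovered by grouping on parent_span_id. A minimal sketch with invented span names and timings:

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]   # None marks the root span
    name: str
    start_ms: float
    duration_ms: float
    attributes: dict = field(default_factory=dict)

def build_tree(spans):
    """Return (root, children) where children maps span_id -> list of child spans."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_span_id is None:
            root = span
        else:
            children[span.parent_span_id].append(span)
    return root, children

def render(span, children, depth=0):
    """Indented outline of the trace, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms} ms)"]
    for child in sorted(children[span.span_id], key=lambda c: c.start_ms):
        lines.extend(render(child, children, depth + 1))
    return lines

spans = [
    Span("t1", "a", None, "POST /checkout", 0.0, 120.0),
    Span("t1", "c", "a", "charge-card", 40.0, 70.0),
    Span("t1", "b", "a", "load-cart", 5.0, 30.0),
    Span("t1", "d", "c", "SELECT payment_methods", 45.0, 12.0),
]
root, children = build_tree(spans)
print("\n".join(render(root, children)))
```

Spans can arrive out of order and from different services, which is why the tree is rebuilt from IDs rather than from arrival order.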

W3C Trace Context

The W3C Trace Context specification (W3C Recommendation, February 2020) defines two HTTP headers that propagate trace context across service boundaries:

  • traceparent: 00-{trace-id}-{parent-id}-{trace-flags} — version, 16-byte trace ID, 8-byte parent span ID, flags (sampled bit).
  • tracestate: vendor1=opaqueValue,vendor2=opaqueValue — vendor-specific extensions in a comma-separated list.

This replaced the proliferation of Zipkin’s B3 headers and Jaeger’s uber-trace-id. Every modern OpenTelemetry SDK propagates W3C Trace Context by default; B3 is still supported for legacy interop.
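
A minimal parser for the version-00 traceparent layout described above. It is a sketch only: real SDKs also handle tracestate and forward-compatible parsing of future versions, and parse_traceparent is a hypothetical name.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

class TraceParent(NamedTuple):
    version: str
    trace_id: str    # 16 bytes as 32 lowercase hex characters
    parent_id: str   # 8 bytes as 16 lowercase hex characters
    sampled: bool    # lowest bit of trace-flags

def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))

parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(parsed)
```

A service that receives an invalid header should start a new trace rather than propagate the bad value.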

OpenTelemetry

OpenTelemetry (OTel) is the CNCF vendor-agnostic instrumentation standard, formed in 2019 by merging OpenTracing (the tracing API standard) and OpenCensus (Google’s stats + tracing library). It is currently the second-most-active CNCF project by contributor count (Kubernetes is first).

OTel comprises:

  • API — language-specific surface a developer uses to create spans and metrics. Stable in nearly every major language.
  • SDK — the reference implementation of the API.
  • Auto-instrumentation libraries — wrap common libraries (HTTP servers, databases, message queues, RPC frameworks) and emit spans without code changes. The Java and .NET agents are particularly mature; Python, Node.js, Go, and Ruby are well-supported.
  • Collector — the standalone process that receives, processes, and exports telemetry.
  • Semantic conventions — agreed attribute names (http.request.method, db.system, messaging.destination.name, gen_ai.system) so dashboards and alerts work across services and vendors.

OpenTelemetry Collector

A pipeline of receivers → processors → exporters:

  • Receivers: OTLP (gRPC + HTTP), Jaeger, Zipkin, Prometheus scrape, Kafka, FluentForward, syslog, statsd, host metrics, K8s events, AWS CloudWatch.
  • Processors: batch (efficient export), memory_limiter, attributes (rename/redact), resource detection, tail_sampling (decide based on trace outcome), probabilistic_sampler, transform (OTTL — OpenTelemetry Transformation Language), filter, k8sattributes (enrich with pod metadata).
  • Exporters: OTLP (forward to another collector or backend), Jaeger, Zipkin, Prometheus, Loki, Tempo, Mimir, Datadog, New Relic, Honeycomb, Lightstep, AWS X-Ray, GCP Cloud Trace, Azure Monitor, Elasticsearch.

Deployment patterns: agent (one per node, DaemonSet in K8s), gateway (cluster-wide aggregation point that handles tail sampling and vendor fan-out), or a two-tier mix. The collector is the integration substrate that lets you switch backends without changing application code.

Trace stores

  • Jaeger (Uber 2016, CNCF graduated 2019) — the open-source reference. Cassandra, Elasticsearch, or BadgerDB backends; the v2 architecture uses OpenTelemetry Collector as its data plane.
  • Tempo (Grafana Labs, 2020) — object-storage-backed (S3, GCS, Azure Blob), with no full index; it relies on Loki/Mimir labels and trace ID lookup, which makes it very cheap. Queried via TraceQL (GA in 2023).
  • Zipkin (Twitter 2012) — the original distributed tracing system; still supported but largely superseded by Jaeger and Tempo.
  • Grafana Cloud Traces, Datadog APM, New Relic Distributed Tracing, Honeycomb, Lightstep / ServiceNow Cloud Observability, Elastic APM, Dynatrace, AppDynamics, AWS X-Ray, GCP Cloud Trace, Azure Application Insights — managed.

Honeycomb deserves a special note: its model is “high-cardinality wide events” — a single structured event per request with hundreds of attributes — rather than separate metrics, logs, and traces. Queries pivot on any attribute (group by user_id where error = true) at any cardinality, with sub-second response. Its BubbleUp feature surfaces attribute distributions distinguishing slow/error events from normal ones automatically.

Comparing Jaeger, Tempo, and the commercial backends

The practical tradeoff between the leading open-source trace stores:

  • Jaeger has an inverted index over span attributes; queries like “find traces where http.status_code = 500 and user.id = X” are fast. The index storage is expensive (Cassandra or Elasticsearch behind it). Good when your queries pivot frequently on arbitrary attributes.
  • Tempo keeps no per-attribute index by default; lookups by trace ID are O(1) via object-storage manifests, but attribute searches degrade unless you pre-define a small set of “search tags” or front Tempo with a Loki-style label index. The savings are an order of magnitude in storage cost. Good when you arrive at traces via metrics exemplars or logs (you already know the trace ID by the time you hit the trace store).
  • Datadog APM, Honeycomb, New Relic index attributes aggressively and let you ask arbitrary high-cardinality questions; you pay per event ingested and per attribute.

Exemplars are the bridge that makes “trace-store as ID lookup” workable: a Prometheus histogram bucket can carry trace IDs of representative requests, so a Grafana panel showing latency surfaces a clickable “→ trace” link directly from the metric. OpenTelemetry, Prometheus (with exemplar storage enabled), and Grafana 8+ all support this end-to-end.

Span structure

  • Attributes — key/value pairs on the span. http.response.status_code, db.query.text (db.statement in older semantic-convention versions), user.id.
  • Events — timestamped log messages within a span. Cheaper than a separate log line because they share context.
  • Links — references from one span to another span in a different trace; useful for async/batch jobs where one request spawns N downstream that should reference back.
  • Status — OK, ERROR, UNSET.

Service maps

Most APM products derive a service dependency graph from traces: which services call which, with traffic volume, error rate, and latency on each edge. Auto-discovered, more accurate than hand-maintained architecture diagrams.
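
Deriving the graph is conceptually simple: every parent/child span pair that crosses a service boundary is one call on an edge. A minimal sketch, assuming each span is a dict with a hypothetical "service" key (in OpenTelemetry this would come from the service.name resource attribute):

```python
from collections import Counter

def service_edges(spans):
    """Count caller -> callee calls between distinct services."""
    by_id = {span["span_id"]: span for span in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_span_id"))
        # Only parent/child pairs that cross a service boundary form an edge.
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges

spans = [
    {"span_id": "1", "parent_span_id": None, "service": "frontend"},
    {"span_id": "2", "parent_span_id": "1", "service": "checkout"},
    {"span_id": "3", "parent_span_id": "2", "service": "checkout"},
    {"span_id": "4", "parent_span_id": "3", "service": "payments"},
    {"span_id": "5", "parent_span_id": "2", "service": "payments"},
]
edges = service_edges(spans)
print(dict(edges))
```

Error rate and latency per edge follow the same pattern: aggregate the child span's status and duration under the same (caller, callee) key.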


6. Profiling — the fourth pillar

Continuous profiling means running a low-overhead sampling profiler in production all the time, attributing CPU, memory allocations, off-CPU time, and lock contention to specific lines of code, by service, by deploy, by tag.

Foundations

  • perf_events — the Linux kernel performance subsystem; provides sampled stack traces with very low overhead.
  • eBPF — programmable kernel virtual machine; lets profilers attribute stacks without any application instrumentation.
  • FlameGraph (Brendan Gregg, 2011) — visualization of stack samples; width = time, height = call depth. The standard visualization for sampled CPU profiles.

Profiling tools

  • Pyroscope (Pyroscope.io, acquired by Grafana 2023; now Grafana Pyroscope) — open-source continuous profiling, multi-language (Go, Python, Ruby, .NET, Java, Rust, Node.js), eBPF agent for unmodified binaries. Queried via the Grafana UI.
  • Parca (Polar Signals, 2021) — eBPF-based continuous profiler, pprof-compatible storage, no language SDK needed.
  • Datadog Continuous Profiler — bundled with Datadog APM.
  • Google Cloud Profiler, AWS CodeGuru Profiler — cloud-provider native.
  • py-spy (Ben Frederickson, 2018) — Python sampling profiler that reads the target process’s memory from outside (process_vm_readv on Linux) instead of attaching with ptrace; standard for Python.
  • async-profiler (Andrei Pangin) — JVM sampling profiler using AsyncGetCallTrace; standard for Java/Kotlin/Scala.
  • rbspy — equivalent for Ruby.
  • perf + bcc + bpftrace — the underlying Linux tooling.
  • Pyflame — deprecated.

Tradeoffs

  • Sampling (default) — record stack at fixed interval (e.g., 100 Hz). Statistical, low overhead (<1% typical), can miss short-lived events.
  • Instrumentation — add hooks at function entry/exit. Exact, much higher overhead; reserved for development.
  • Differential profiles — compare profile between two builds, or between baseline and an outage, to find regressions. Grafana Pyroscope, Polar Signals, and Datadog all support diff views.
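
Sampled profiles reduce to counts per unique stack, which is also the "folded" text form that flame graph tooling consumes. The sketch below folds samples and computes a simple differential view; the function names and sample data are made up.

```python
from collections import Counter

def fold(samples):
    """Collapse stack samples (root-first frame tuples) into folded-stack counts."""
    return Counter(";".join(stack) for stack in samples)

def diff(baseline: Counter, candidate: Counter) -> dict:
    """Change in each stack's share of samples (candidate minus baseline)."""
    base_total = sum(baseline.values()) or 1
    cand_total = sum(candidate.values()) or 1
    stacks = set(baseline) | set(candidate)
    return {
        stack: candidate[stack] / cand_total - baseline[stack] / base_total
        for stack in stacks
    }

# Hypothetical samples from two builds of the same service.
before = fold([("main", "handle", "parse")] * 8 + [("main", "handle", "render")] * 2)
after = fold([("main", "handle", "parse")] * 5 + [("main", "handle", "render")] * 5)

delta = diff(before, after)
for stack, change in sorted(delta.items()):
    print(f"{stack} {change:+.0%}")
```

Comparing shares rather than raw counts keeps the diff meaningful when the two profiles cover different durations or sample counts.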

7. OpenTelemetry deeper

Signal types and stability (as of 2024–25)

  • Traces — stable in all major SDKs (Go, Java, Python, .NET, Node.js, Ruby, Rust, C++, PHP, Swift). The first signal to stabilize.
  • Metrics — stable in Go, Java, .NET, Python, Node.js; varies elsewhere. OTLP metrics support is universal in backends.
  • Logs — stable in most SDKs; the integration is via existing logging libraries (e.g., a slog handler in Go, a Logger provider in Java). Bridges write logs into the OTel pipeline preserving the trace context.
  • Profiles — in development; OTLP profile signal stabilizing late 2024. Pyroscope, Parca, and the OTel collector have early adopter support.

Auto-instrumentation

Auto-instrumentation libraries hook the runtime (Java Instrumentation API, .NET CLR profiling, Python monkeypatching, Node.js require hooks, eBPF for Go and C/C++) and emit spans for HTTP servers and clients, databases, RPC frameworks (gRPC), and message queues and brokers (Kafka, RabbitMQ, SQS, Pub/Sub); coverage is typically dozens of libraries per language.

Practical pattern: enable auto-instrumentation first, get a working trace pipeline with semantic-conventional attributes, then add manual spans for business-meaningful boundaries (order.process, payment.authorize, cart.checkout).

Resource detection

A resource is the entity producing telemetry: a service instance in a container in a pod on a node in a region. Resource detectors auto-populate attributes:

  • service.name, service.version, service.instance.id, service.namespace.
  • host.name, host.id, host.arch, os.type.
  • k8s.pod.name, k8s.namespace.name, k8s.deployment.name, k8s.node.name.
  • cloud.provider, cloud.region, cloud.availability_zone, cloud.account.id.
  • container.id, container.image.name, container.image.tag.

Resource attributes apply to every span/metric/log emitted by that process; the k8sattributes processor in the collector adds the k8s.* attributes from the K8s API without any application changes.

Semantic conventions

The OTel project maintains a registry of attribute names with agreed semantics so dashboards generalize across services and vendors. Some examples:

  • HTTP: http.request.method, http.response.status_code, url.path, url.scheme, server.address, server.port, user_agent.original.
  • Database: db.system, db.namespace, db.operation.name, db.query.text.
  • Messaging: messaging.system, messaging.destination.name, messaging.operation.type.
  • RPC: rpc.system, rpc.service, rpc.method, rpc.grpc.status_code.
  • Generative AI: gen_ai.system, gen_ai.request.model, gen_ai.response.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens (added 2024).
  • FaaS: faas.invocation_id, faas.trigger, faas.coldstart.

Context propagation

Trace context propagates across services via:

  • HTTP: traceparent and tracestate headers (W3C).
  • gRPC: metadata.
  • Kafka: record headers.
  • AWS SQS / SNS: message attributes.
  • Pub/Sub: message attributes.

Custom propagators handle anything else. Baggage is a separate W3C concept — propagated key/value pairs visible to all downstream services for the duration of the trace; useful for tenant ID, region, debug flags. Use sparingly: every baggage key inflates every header.

Sampling

  • Probabilistic / head-based — decide at root span (e.g., sample 1% of traces). Cheap, simple, biased against rare expensive traces.
  • Parent-based — propagate the sampling decision from the parent; one decision per trace.
  • Tail-based (collector-side) — buffer the whole trace for some window (e.g., 30 s), then decide based on outcome: always sample errors, always sample slow traces, sample by some percentage of normal. Tradeoffs: collector memory cost, fanout complexity (all spans of a trace must hit the same collector instance — usually via consistent hashing on trace ID).
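
The tail-based policy and the routing constraint can be sketched together. Two assumptions to note: spans are plain dicts with hypothetical keys, and the routing uses simple hash-modulo as a stand-in for consistent hashing (hash-modulo reshuffles most traces when the pool size changes, which true consistent hashing avoids).

```python
import hashlib
import random

def keep_trace(spans, latency_threshold_ms=500.0, baseline_rate=0.05,
               rng=random.random):
    """Decide after the trace is complete: keep errors, keep slow, sample the rest."""
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    root = next((s for s in spans if s.get("parent_span_id") is None), None)
    if root is not None and root["duration_ms"] >= latency_threshold_ms:
        return True
    return rng() < baseline_rate

def collector_for(trace_id: str, collectors: list) -> str:
    """Send every span of a trace to the same collector instance."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return collectors[int.from_bytes(digest[:8], "big") % len(collectors)]

pool = ["collector-0", "collector-1", "collector-2"]
print(collector_for("4bf92f3577b34da6a3ce929d0e0e4736", pool))
```

The buffering window is the main memory cost: every in-flight trace is held until the decision can be made.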

8. Dashboards and visualization

  • Grafana — the open-source dashboard standard. Multi-datasource (Prometheus, Loki, Tempo, Mimir, Elasticsearch, InfluxDB, MySQL, Postgres, CloudWatch, BigQuery, dozens more), templated variables, alerting (Grafana Alerting subsumes a lot of historical Alertmanager use), playlists, public dashboards, scenes. Grafana 11+ (2024) brought scenes-based dashboards as the default.
  • Kibana — the Elastic/OpenSearch UI; dashboards, Discover (log search), Lens (visualization builder), Maps, Machine Learning (Elastic).
  • Datadog dashboards, New Relic One, Honeycomb boards, Splunk dashboards — vendor UIs.
  • Perses (CNCF sandbox, 2023) — Kubernetes-native, GitOps-friendly dashboard system; an alternative to Grafana when GitOps is paramount.

Operational discipline: dashboards should be derived from SLOs, not the other way around. The four-golden-signals dashboard per service, plus an SLO-burn-rate dashboard per SLO, plus a deploy-comparison dashboard per service, covers most needs. Beyond that lies dashboard sprawl.


9. APM and RUM

Application Performance Monitoring (APM) is the server-side pillar: traces + metrics + logs, with code-level attribution and service maps, packaged as a turnkey product.

  • Datadog APM, New Relic APM, Dynatrace, AppDynamics (Cisco), Elastic APM, Instana (IBM), Honeycomb, Lightstep / ServiceNow Cloud Observability, Sentry Performance, Splunk APM (formerly SignalFx).

Real User Monitoring (RUM) instruments the browser or mobile app, capturing:

  • Core Web Vitals — Largest Contentful Paint (LCP), Interaction to Next Paint (INP — replaced FID in 2024), Cumulative Layout Shift (CLS).
  • Page load timing breakdown (TTFB, DOMContentLoaded, load, first paint).
  • JS errors and unhandled rejections.
  • User session replays (Sentry, Datadog, LogRocket, FullStory, Hotjar).
  • Network requests with timings.

RUM stitches the user experience to the backend trace via W3C traceparent on outbound XHR/fetch.

  • Sentry RUM (formerly Sentry Performance — Browser), Datadog RUM, New Relic Browser, Honeycomb (via OTel JS SDK), Cloudflare Browser Insights, Elastic RUM, Splunk RUM, AWS CloudWatch RUM.

10. Error tracking

A purpose-built lane within observability for exceptions and unhandled errors:

  • Sentry — open-core, self-hostable, multi-language, deduplicates errors into “issues” by stack trace fingerprint, captures breadcrumbs (recent events leading to the error), release tracking, source maps for minified JS, performance monitoring.
  • Rollbar — similar feature set, hosted-only.
  • Bugsnag (SmartBear) — hosted, mobile-strong.
  • Honeybadger — hosted, Ruby-strong.
  • Raygun, Airbrake — older entrants.
  • Datadog Error Tracking, New Relic Errors Inbox — bundled.

Key features: fingerprinting (collapse 10,000 occurrences of the same NullPointerException into a single issue), breadcrumbs (a ring buffer of preceding events), release health (crash-free session percentage per release), assignment + workflow (assign issue to engineer, resolve in release N, regression alerts), source map upload for minified frontend code.
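
Fingerprinting can be illustrated with a hash over the exception type and the normalized stack. This is a simplified sketch, not any vendor's actual grouping algorithm; the frame tuple layout and the function name are assumptions.

```python
import hashlib

def fingerprint(exception_type: str, frames: list) -> str:
    """Stable issue key from the exception type and (module, function, line) frames."""
    # Line numbers are dropped so small edits to a file do not split one issue in two.
    normalized = [f"{module}:{function}" for module, function, _line in frames]
    key = exception_type + "|" + "|".join(normalized)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

first = fingerprint("NullPointerException",
                    [("com.shop.Cart", "total", 41), ("com.shop.Checkout", "submit", 88)])
after_edit = fingerprint("NullPointerException",
                         [("com.shop.Cart", "total", 43), ("com.shop.Checkout", "submit", 90)])
other_path = fingerprint("NullPointerException",
                         [("com.shop.Cart", "discount", 17), ("com.shop.Checkout", "submit", 88)])
print(first == after_edit, first == other_path)
```

Real products layer more normalization on top (stripping framework frames, handling minified or generated names), which is where most of the grouping quality comes from.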


11. Alerting and on-call

  • Alertmanager (Prometheus) — routing, grouping, inhibition, silencing, deduplication.
  • Grafana Alerting (since Grafana 8) — unified alerting across all Grafana datasources; replaced separate “legacy Grafana alerts” and increasingly competes with Alertmanager.
  • PagerDuty — the original on-call platform; rotations, escalation policies, response automation, post-incident reviews, integrations everywhere.
  • Opsgenie (Atlassian) — similar feature set, deep Atlassian integration. (Note: Atlassian announced Opsgenie EOL April 2027, migrating customers to Jira Service Management.)
  • Splunk On-Call (formerly VictorOps) — Splunk-integrated.
  • Squadcast, incident.io, FireHydrant, Rootly, Jeli (PagerDuty) — modern incident-response platforms; some bundle alerting + war-room + post-mortem.
  • Grafana OnCall — open-source, built on Amixr, which Grafana Labs acquired.

Severity ladder

A typical (organization-specific) severity ladder:

  • SEV-1 / P1 — customer-impacting major outage; page everyone, war room.
  • SEV-2 / P2 — degraded service, paging on-call.
  • SEV-3 / P3 — bug or partial degradation, business-hours response.
  • SEV-4 / P4 — minor, ticket-only.

Burn-rate alerts

The modern SRE pattern (SRE Workbook, ch. 5) for SLO alerts is multi-window, multi-burn-rate:

  • Fast burn: 14.4× burn rate sustained over 5 min and 1 h — pages immediately.
  • Slow burn: 6× burn rate sustained over 30 min and 6 h — pages slower.
  • Page only when both windows are firing — eliminates flapping on transient spikes.

This replaces the older “alert when X > Y” model and dramatically reduces noise. Alert fatigue is a common cause of missed real incidents.
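
The paging rule can be sketched in a few lines. This is a minimal sketch that assumes the per-window error ratios have already been computed elsewhere (for example by recording rules); the names BurnRule, firing_rules, and the window keys are hypothetical, and the 99.9% target is only an example.

```python
from dataclasses import dataclass
from typing import Dict, List

SLO_TARGET = 0.999
ERROR_BUDGET = 1.0 - SLO_TARGET  # allowed error ratio over the SLO window


@dataclass(frozen=True)
class BurnRule:
    name: str
    threshold: float   # burn-rate multiple that must be exceeded
    long_window: str   # window that proves the burn is significant
    short_window: str  # window that proves the burn is still happening


RULES = [
    BurnRule("fast", 14.4, "1h", "5m"),
    BurnRule("slow", 6.0, "6h", "30m"),
]


def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / ERROR_BUDGET


def firing_rules(error_ratio_by_window: Dict[str, float]) -> List[str]:
    """Return the rules whose long AND short windows both exceed the threshold."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratio_by_window[rule.long_window])
        short_burn = burn_rate(error_ratio_by_window[rule.short_window])
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append(rule.name)
    return fired


# Hypothetical measurements (fraction of failed requests per window).
brief_spike = {"5m": 0.05, "30m": 0.009, "1h": 0.005, "6h": 0.001}
real_outage = {"5m": 0.03, "30m": 0.03, "1h": 0.02, "6h": 0.004}
slow_leak = {"5m": 0.008, "30m": 0.008, "1h": 0.008, "6h": 0.007}
```

With these inputs, brief_spike fires nothing because only the short windows are hot, real_outage fires the fast rule, and slow_leak fires only the slow rule.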

Worked example with a 99.9% SLO over a 30-day window:

  • Budget = 0.1% × 30 days = 43.2 minutes.
  • A 14.4× burn rate consumes 2% of the budget per hour (14.4 / 720 h). After 1 hour at this rate, ~52 seconds of the 43.2-minute budget are gone; sustained for 50 hours, the full budget is gone. Page immediately.
  • A 6× burn rate consumes ~0.83% per hour. After 6 hours sustained, 5% of the budget is gone and the full budget would last 5 days (720 h / 6); tolerable to investigate within business hours rather than at 3 a.m.

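The multi-window requirement (both 5 min AND 1 h for the fast page) does two jobs: the long window keeps a brief blip from waking anyone unless it burns a meaningful share of the budget, and the short window confirms the burn is still happening, so the alert clears soon after recovery. The slow rate’s longer windows catch sustained low-grade degradation that the fast rate would miss.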
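
The arithmetic in the worked example can be checked with a short script. This is a sketch under the same assumptions as the text (99.9% target, 30-day window); the helper names are made up for illustration.

```python
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24  # 30-day SLO window = 720 hours

# Total error budget expressed as minutes of full outage.
budget_minutes = (1 - SLO_TARGET) * WINDOW_HOURS * 60


def budget_fraction_consumed(burn_rate: float, hours: float) -> float:
    """Share of the whole budget spent after `hours` at a constant burn rate."""
    return burn_rate * hours / WINDOW_HOURS


def hours_to_exhaustion(burn_rate: float) -> float:
    """How long the full budget lasts at a constant burn rate."""
    return WINDOW_HOURS / burn_rate


fast_per_hour = budget_fraction_consumed(14.4, 1)   # about 0.02 (2%)
fast_seconds = fast_per_hour * budget_minutes * 60   # about 52 seconds
slow_six_hours = budget_fraction_consumed(6, 6)      # about 0.05 (5%)

print(round(budget_minutes, 1), round(fast_seconds), round(hours_to_exhaustion(14.4)))
```

The same two helpers give the thresholds for any other SLO target or window length by changing the two constants at the top.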

Incident lifecycle and roles

A mature on-call setup separates concerns explicitly during a SEV-1:

  • Incident Commander (IC) — coordinates, sets priorities, owns the timeline; does not debug.
  • Operations Lead / Subject Matter Experts — investigate and apply mitigations.
  • Communications Lead — updates status page (Atlassian Statuspage, Better Stack Status, Instatus), customer support, internal stakeholders.
  • Scribe — keeps the running timeline in the incident channel for the post-mortem.

The roles are sized to the incident; a SEV-3 might be one engineer wearing all four hats. The point is that during a SEV-1 nobody is doing two of these at once because each is full-time.

Runbooks

Every alert should link to a runbook with: what this means, what the symptoms look like, the diagnostic steps, the mitigation steps, and the rollback. Modern runbooks live in markdown in the repo (or in incident.io / Runbook Service) and are version-controlled. Runbook-less alerts are tech debt.
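
One way to keep runbook-less alerts from accumulating is a lint step in CI. The sketch below assumes alert rules have already been parsed into dictionaries shaped like Prometheus alerting rules, and that the runbook link lives in an annotation named runbook_url; that annotation name is a common convention, not something this reference mandates, and the sample rules are hypothetical.

```python
from typing import Dict, List


def alerts_missing_runbook(rules: List[Dict]) -> List[str]:
    """Return the names of alert rules that carry no runbook link."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        link = str(annotations.get("runbook_url", "")).strip()
        if not link:
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


# Hypothetical rules; the expressions are placeholders, not real queries.
rules = [
    {
        "alert": "HighErrorBudgetBurn",
        "expr": "placeholder_burn_rate_expression",
        "annotations": {
            "summary": "Error budget burning fast",
            "runbook_url": "https://runbooks.example.internal/high-error-budget-burn",
        },
    },
    {
        "alert": "DiskSpaceHigh",
        "expr": "placeholder_disk_expression",
        "annotations": {"summary": "Disk filling up"},
    },
]

offenders = alerts_missing_runbook(rules)
```

Failing the build when offenders is non-empty turns the "every alert links to a runbook" rule into something enforced rather than remembered.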


12. High-cardinality and structured events

Honeycomb’s house philosophy, increasingly mainstream:

Emit one wide event per request — a single structured record with as many attributes as possible (user ID, tenant ID, route, response code, latency, build SHA, region, feature flags in effect, downstream dependency status, etc.). Replace metrics + logs + traces (where feasible) with arbitrary slice-and-dice queries on the event store.
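
A minimal sketch of the wide-event idea, assuming a service that builds one flat record per request and ships it as a JSON line. The field names and values are illustrative, not a required schema, and count_by stands in for the slice-and-dice query an event store would run.

```python
import json
from collections import Counter
from typing import Callable, Dict, Iterable


def wide_event(started_s: float, ended_s: float, **attributes) -> Dict:
    """Build one flat record for a request; every attribute becomes a column."""
    event = {"duration_ms": round((ended_s - started_s) * 1000, 3)}
    event.update(attributes)
    return event


def count_by(events: Iterable[Dict], key: str,
             predicate: Callable[[Dict], bool] = lambda e: True) -> Counter:
    """Group-by on any attribute, including high-cardinality ones."""
    return Counter(e.get(key) for e in events if predicate(e))


events = [
    wide_event(100.000, 100.042, user_id="u-81723", tenant_id="t-12",
               route="/orders/:order_id", status_code=500, build_sha="3f9c2ab",
               region="eu-west-1", feature_flags=["new-checkout"]),
    wide_event(101.000, 101.010, user_id="u-555", tenant_id="t-12",
               route="/orders/:order_id", status_code=200, build_sha="3f9c2ab",
               region="eu-west-1", feature_flags=[]),
    wide_event(102.000, 102.300, user_id="u-9", tenant_id="t-40",
               route="/cart", status_code=500, build_sha="3f9c2ab",
               region="us-east-1", feature_flags=["new-checkout"]),
]

# One JSON line per request is what would be shipped to the event store.
line = json.dumps(events[0], sort_keys=True)

# "Which tenants are seeing 5xx responses?" asked after the fact.
errors_by_tenant = count_by(events, "tenant_id", lambda e: e["status_code"] >= 500)
```

The point of the shape is that the question in the last line did not have to be anticipated when the event was emitted.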

The defining capability is high cardinality — being able to group by user ID, request ID, or any other dimension with millions of unique values. Prometheus collapses under this load (each label combination is its own time series); a wide-event store (Honeycomb, Datadog, Splunk to a lesser degree) handles it.

The classic Prometheus footgun: never use unbounded-cardinality values as label values. Typical offenders: user_id, request_id, full URL paths with embedded IDs (/orders/12345), session tokens. Sanitize routes (/orders/:order_id), bucket numeric values, and emit per-user data only as events (in logs or a wide-event store), never as metric labels.
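
As a fallback when the framework's matched route template is not available, route sanitization and value bucketing can be approximated as below. This is a sketch: the regexes, the ":id" placeholder, and the bucket bounds are assumptions for illustration, and a real service should prefer the router's own template.

```python
import re

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)
_NUMERIC = re.compile(r"^\d+$")


def route_label(url: str) -> str:
    """Turn a concrete URL into a bounded-cardinality route label."""
    # Query strings and fragments never belong in a metric label.
    path = url.split("?", 1)[0].split("#", 1)[0]
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment) or _UUID.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


def latency_bucket(ms: float) -> str:
    """Map a raw number onto a small fixed set of label values."""
    for bound in (10, 50, 100, 500, 1000):
        if ms <= bound:
            return "le_%dms" % bound
    return "gt_1000ms"


label = route_label("/orders/12345?utm_source=mail")
```

Every order now lands in the same series, so the label's cardinality is bounded by the number of routes rather than the number of orders.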


13. eBPF observability

eBPF (extended Berkeley Packet Filter) is the Linux kernel’s programmable VM that lets safe, verified programs run in kernel context. It enabled a wave of zero-instrumentation observability tools:

  • Cilium Hubble — network observability for the Cilium CNI; service-to-service flows, L7 protocol inspection (HTTP, gRPC, DNS, Kafka), policy decision visibility.
  • Pixie (open source, CNCF sandbox; contributed by New Relic) — auto-instruments K8s clusters via eBPF: HTTP, MySQL, Redis, Postgres, and Kafka traffic captured without code changes; scriptable query interface (PxL, a Python-like query language).
  • Tetragon (Isovalent / Cisco) — eBPF-based security observability and runtime enforcement; sister project to Cilium.
  • Parca, Grafana Pyroscope eBPF agent — continuous profiling without language SDKs.
  • bcc — BPF Compiler Collection; Python wrappers around BPF programs; the classic toolkit (Gregg).
  • bpftrace — high-level tracing language; the awk for kernel and userspace tracing.
  • Inspektor Gadget, Falco — eBPF-driven Kubernetes inspection (Inspektor Gadget) and runtime security detection (Falco).

The shared pitch: observability without instrumenting your application. The limits: eBPF sees system calls, packets, kernel events; it does not see “what happened logically inside a function” the way a manually placed span does. The two layers complement each other.


14. Distributed-system observability patterns

  • Correlation ID propagation end-to-end — at the ingress (load balancer or gateway) generate a request ID, propagate as X-Request-ID or via W3C Trace Context, log on every hop. The single most valuable practice for debugging distributed failures.
  • SLOs per service, not just per product — every service owner has SLIs and SLOs; the product SLO composes them.
  • Black-box synthetic monitoring — periodic synthetic transactions against production endpoints from external probes. Tools: Datadog Synthetics, Grafana k6 Cloud / k6 OSS, Cloudflare Health Checks, Pingdom, UptimeRobot, Checkly, Catchpoint. They catch problems metrics miss (DNS misconfiguration, certificate expiration, broken third-party dependency from a user’s perspective).
  • Chaos engineering — deliberately inject failure to verify the system recovers. Tools: Gremlin (commercial), Chaos Mesh (CNCF, Kubernetes-native), LitmusChaos (CNCF, Kubernetes-native), AWS Fault Injection Simulator, Azure Chaos Studio. Tied to observability: you cannot meaningfully run a game day without traces and SLOs to measure the blast radius.
  • Game days — scheduled exercises to test runbooks, on-call response, and the observability stack itself. Find broken alerts before a real outage does.
  • Load testing as observability validation — k6, Locust, Gatling, JMeter, Vegeta driving production-shaped load against staging environments. The point is partly to find perf regressions, but as much to verify that traces, metrics, and dashboards behave intelligibly under load. An observability stack that never sees realistic load is untested infrastructure.
  • Blameless post-mortems — Google SRE foundational practice. Document the timeline, the root causes (often plural), the contributing factors, the action items, and the lessons learned. Blame the system, not the individual; otherwise people hide information.
  • Incident command — for SEV-1, a single Incident Commander coordinates; subject matter experts investigate; communications lead updates stakeholders. Roles separate so they do not interrupt each other.

15. Pricing pitfalls

Modern observability vendors price on data volume. The cost can dwarf compute spend when ignored:

  • Datadog — per-host APM and Infrastructure, plus custom metrics (billed by the number of unique time series), per-GB log ingestion, and per-million indexed events. The notorious failure mode: a new custom metric with a high-cardinality label (e.g., per-customer breakdown) suddenly emits millions of unique time series and the next month’s bill multiplies. The widely cited ~$65M Datadog bill attributed to Coinbase (2022) crystallized this.
  • Splunk — historically per-GB-ingested-per-day; reducing Splunk ingest has become a cottage industry. Recent shift to “Workload Pricing.”
  • New Relic — moved to per-GB-ingest plus per-user pricing in 2020; better economics for some workloads, worse for others.
  • Honeycomb — per-event pricing, flat regardless of how many attributes each event carries; designed for the wide-event model.
  • Sentry — per-event with reserved + on-demand tiers.

Mitigations:

  • Filter at the source — drop debug logs in prod, drop health-check spans, drop high-volume noisy metrics with no SLO bound to them.
  • Sample aggressively — tail-sample traces; keep 100% of errors and slow traces, sample 1% of normal ones.
  • Downsample old data — Thanos compactor can downsample to 5-minute and 1-hour resolution for older blocks.
  • Retention tiers — hot (7 days, full resolution), warm (30 days, downsampled), cold (1 year, archive).
  • Cost dashboards — monitor your monitoring spend; Datadog and Splunk both have built-in dashboards for cost attribution.
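
The "keep all errors and slow traces, sample 1% of the rest" policy can be sketched as a decision function applied once a trace is complete. This is an illustration, not a collector configuration: the function name, the 1000 ms slow threshold, and the hash-based bucketing are assumptions.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, duration_ms: float,
               slow_threshold_ms: float = 1000.0,
               baseline_rate: float = 0.01) -> bool:
    """Tail-sampling decision made after the whole trace has been seen."""
    # Interesting traces are always kept.
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    # For the rest, hash the trace ID into [0, 1) so that every collector
    # instance makes the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return bucket < baseline_rate


# Hypothetical stream of healthy, fast traces: roughly 1% should survive.
kept = sum(keep_trace("trace-%d" % i, False, 20.0) for i in range(20000))
kept_fraction = kept / 20000
```

Because the decision is a pure function of the trace ID, replaying the same traffic yields the same kept set, which makes the cost effect of a rate change easy to estimate before rollout.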

16. Common pitfalls

  • PII and secrets in logs/spans — credit card numbers, passwords, JWTs, API keys leaking into observability storage that has weaker access controls than the primary database. Use redaction processors (OTel attribute processor, Vector VRL, Splunk SEDCMD) at the collector. Never log full request bodies in payment flows.
  • High-cardinality metric labels — described above; the most common cause of Prometheus OOM and Datadog bill shock.
  • Time skew across services — distributed traces depend on aligned clocks. NTP / chrony / PTP keep services within tens of milliseconds; clock skew larger than the span duration produces nonsensical traces (a child span apparently completes before its parent starts). Some backends compensate; better to fix the source.
  • Sampling tail-event loss — head-based sampling at 1% means 99% of errors are lost. Use always-keep-errors sampling policies and tail-based sampling for slow traces.
  • Dashboard sprawl and drift — hundreds of dashboards, half are stale, the alert links go to dashboards no one maintains. Periodic dashboard review and “owned by a team” tagging is the only durable answer.
  • Alerting on causes vs symptoms — “CPU > 95%” is a cause; “p99 latency > 200 ms” is a symptom. Cause-alerts page on conditions that may not affect users (high CPU on a non-bottleneck service); symptom-alerts page on what users feel. Page on symptoms, dashboard on causes.
  • Untested alerts — alerts that never fire (broken query, deleted metric) are silent. Periodically test by inducing the condition or by inspecting alert evaluation history.
  • Missing context in alerts — “DiskSpaceHigh on prod-db-7” is unactionable at 3 a.m. Include the runbook link, the current value, the SLO impacted, the deploy that preceded.
  • Logging at the wrong level — debug in prod blows up volume; error should be reserved for “needs human attention”, otherwise it dilutes the error-tracking signal.
  • Tracing without W3C propagation — services that strip headers between proxies break trace continuity. Audit every hop.
  • Per-service observability silos — service A on Datadog, service B on New Relic, service C on Splunk. The trace stops at the boundary. Standardize on OpenTelemetry as the wire format; vendors become interchangeable endpoints.
  • Treating dashboards as documentation — dashboards drift, panels break, queries hardcode service names. The runbook in the alert and the README in the service repo are the durable artifacts; dashboards are a view, not the source of truth.
  • Over-instrumentation early. Hundreds of bespoke spans before the service has shipped its first user trace. Auto-instrumentation first, manual spans where they pay for themselves (business boundaries, third-party calls).
  • Cardinality from query parameters. Routing every value of ?utm_source= into a label. The fix is to strip or allowlist query parameters and to template routes at the framework level (/users/:id, not /users/12345) before metrics emission.
  • Trace context dropped at message queues. A service hands off to Kafka and the downstream consumer has no traceparent; the trace appears to end at the producer. Fix: always inject context into message headers (Kafka headers, SQS message attributes, NATS metadata) and extract on consume.
  • Forgetting outbound dependencies. Auto-instrumentation covers HTTP and DB clients; many teams forget Redis, the in-house RPC framework, or the cloud SDK calls. Each gap is a black hole in the trace.
  • Confusing histogram quantiles. histogram_quantile(0.99, rate(bucket[5m])) over a metric with too-wide buckets gives lies: the p99 falls somewhere inside a bucket, and the function can only interpolate linearly between that bucket's boundaries, so the estimate can be off by up to the bucket width. Bucket selection matters; native histograms (experimental since Prometheus 2.40) and OTel exponential histograms largely remove the problem by choosing fine-grained buckets automatically.
  • Reporting MTTR without distinguishing detection vs mitigation vs full-recovery. “MTTR” conflates several intervals; better to track Time to Detect, Time to Acknowledge, Time to Mitigate (impact ends), Time to Resolve (root cause fixed) as separate numbers.
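
The histogram-quantile pitfall is easier to see with numbers. The sketch below re-implements, in simplified form, the linear interpolation that Prometheus documents for classic (cumulative, le-labelled) histograms; it is not the Prometheus source, and the bucket layouts and the 1.1-second workload are invented for the example.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by bound and end with +Inf; q must be in (0, 1].
    """
    assert 0 < q <= 1
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Nothing to interpolate against: fall back to the highest
                # finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            share = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * share
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical workload: 1000 requests that all took 1.1 s (true p99 = 1.1 s).
wide = [(0.1, 0), (1.0, 0), (10.0, 1000), (math.inf, 1000)]
fine = [(0.1, 0), (1.0, 0), (1.25, 1000), (2.5, 1000), (math.inf, 1000)]

p99_wide = histogram_quantile(0.99, wide)  # about 9.91 s, far from 1.1 s
p99_fine = histogram_quantile(0.99, fine)  # about 1.25 s, much closer
```

The data is identical in both cases; only the bucket boundaries changed, which is why bucket layout belongs in the same review as the alert threshold that consumes the quantile.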

17. Cross-references


18. Citations

  • Beyer, Jones, Petoff, Murphy (eds.), Site Reliability Engineering, O’Reilly, 2016 — the founding SRE text; SLI/SLO/SLA/error-budget chapters.
  • Beyer, Murphy, Rensin, Kawahara, Thorne (eds.), The Site Reliability Workbook, O’Reilly, 2018 — SLO engineering, burn-rate alerts, on-call.
  • Adkins, Beyer, Blankinship, Lewandowski, Oprea, Stubblefield, Building Secure and Reliable Systems, O’Reilly, 2020 — third in the Google SRE trilogy.
  • Majors, Fong-Jones, Miranda, Observability Engineering, O’Reilly, 2022 — high-cardinality wide-event approach.
  • Gregg, Systems Performance: Enterprise and the Cloud, 2nd ed., Addison-Wesley, 2020 — eBPF, perf, flame graphs.
  • W3C Trace Context, W3C Recommendation, Level 1 (Feb 2020), Level 2 in development.
  • OpenTelemetry Specification, opentelemetry.io/docs/specs/.
  • OpenMetrics Specification, openmetrics.io.
  • CNCF Cloud Native Landscape — Observability and Analysis category, landscape.cncf.io.
  • Prometheus documentation, prometheus.io/docs/.
  • Grafana Labs Loki, Tempo, Mimir, Pyroscope documentation, grafana.com/docs/.
  • Polar Signals Parca documentation, parca.dev.
  • Honeycomb engineering blog — the canonical advocacy for wide events and high cardinality.
  • Liz Fong-Jones talks on “Observability-Driven Development” — applying the wide-events approach during development, not only debugging.
  • Cindy Sridharan, Distributed Systems Observability, O’Reilly Report, 2018 — the early synthesis that popularized the three-pillars framing.
  • Charity Majors blog (charity.wtf) — many of the foundational essays on observability vs monitoring.
  • Brendan Gregg’s USE method (Utilization, Saturation, Errors) for resource analysis — complementary to Four Golden Signals for service analysis.
  • Tom Wilkie’s RED method (Rate, Errors, Duration) — the request-oriented sibling of Four Golden Signals.

19. Appendix — quick reference

A condensed mental model:

Question | Signal | Tool example
Is the service up and how fast? | Metrics | Prometheus + Grafana
What did the service do for this user? | Logs | Loki / Elasticsearch / Splunk
Why was this request slow / failing? | Traces | Tempo / Jaeger / Honeycomb
Which function was burning CPU? | Profiles | Pyroscope / Parca
Is the user still happy right now? | RUM | Sentry / Datadog RUM
Did the deploy break something? | Diff (traces, metrics, profiles) before/after | Grafana, Honeycomb, Pyroscope diff
Is my SLO healthy? | Burn-rate over SLO window | Prometheus + multi-window alerts
Did somebody do something they shouldn’t? | Security observability (eBPF, audit logs) | Tetragon / Falco / CloudTrail

The unifying thread: every signal carries the same trace_id and resource attributes (service.name, k8s.pod.name, deploy.version), so any starting point pivots to any other.

End of reference.