Observability Tools Catalog
The map of the modern observability stack circa 2026 — metrics, logs, traces, profiles, RUM, synthetics, eBPF, incident management, status pages, and the AIOps / FinOps layers that have layered on top. The space is shaped by three forces in tension: the three-pillars model (Cindy Sridharan 2017 — metrics, logs, traces as distinct telemetry types), the OpenTelemetry consolidation (CNCF 2019 onward — vendor-neutral wire format and SDKs), and vendor SaaS economics (per-host, per-ingested-GB, per-event pricing models that drive most observability spend and most observability angst). On top of those, eBPF has supplanted in-process agents for many use cases since 2022, and AI/LLM-assisted observability (Datadog Bits AI, Grafana AI, New Relic AI, Honeycomb Query Assistant) has moved from demo to default in 2025-2026.
The selection axes that matter when picking an observability vendor or stack:
- Cardinality limits — high-cardinality dimensions (user_id, request_id, span_id) bankrupt traditional metrics systems but are the entire point of modern tracing/event-based tools.
- Retention — hot vs cold; trace sampling vs full-fidelity event capture; log rollup vs raw.
- Query model — PromQL (Prometheus, Mimir, Cortex, Thanos, VictoriaMetrics), KQL (Kusto / Azure Monitor; Splunk SPL is similar in spirit), LogQL (Loki), SQL (ClickHouse, Honeycomb, Axiom), DSL (Datadog tags, NRQL, Elasticsearch DSL).
- OSS vs SaaS — self-hosted Prometheus+Grafana+Loki+Tempo+OpenTelemetry vs Datadog/New Relic/Honeycomb.
- Pricing model — per host (Datadog, New Relic, Dynatrace), per ingested GB (Splunk, Sumo, Datadog Logs, OpenSearch Service), per event (Honeycomb, Axiom), per query (Athena-style — rare in observability).
- OpenTelemetry support — native OTLP ingest is now table-stakes; proprietary agent lock-in is a red flag in 2026.
- Integration breadth — number of supported tech stacks (Datadog leads with 800+).
- AI assistance maturity — query-from-natural-language, root-cause hints, anomaly summarization.
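The cardinality axis is easiest to see with arithmetic. The sketch below is illustrative only: the label names and their counts are hypothetical, and it computes a worst-case upper bound (every label combination actually occurring) rather than what any particular TSDB would store.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of distinct time series for one metric name.

    Each unique combination of label values is its own series, so the
    upper bound is the product of the per-label cardinalities.
    """
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-counter metric.
low_cardinality = {"service": 50, "status_code": 5, "region": 4}
high_cardinality = {**low_cardinality, "user_id": 100_000}

print(series_count(low_cardinality))   # 1000
print(series_count(high_cardinality))  # 100000000
```

Adding one user-level label multiplies the series count by the number of users, which is why such dimensions tend to live in traces or wide events rather than in metric labels.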
The three pillars
Metrics — numerical time-series; cheap to store; aggregate views (CPU%, request rate, error rate, latency p99). Best for dashboards and alerting; bad for “why did this one request fail?”
Logs — semi-structured textual events; expensive to store at scale; full-fidelity but unwieldy at petabyte scale. Best for forensic search and audit trails.
Traces — distributed request lineage; one trace = one user-action’s full call graph across microservices; cardinality-rich (every span has unique IDs).
The “unified observability” view (Honeycomb 2016 onward, popularized by Charity Majors + Christine Yen and the observability vs monitoring distinction) argues these three should converge into arbitrarily-wide structured events with high-cardinality dimensions queryable interactively. “Observability 2.0” framing in 2024+ pushes events-first; the Distributed Nosy Debugging (DnD) mode in Honeycomb is the canonical implementation.
OpenTelemetry now treats logs + metrics + traces + profiles (added 2024 as fourth signal) as a unified telemetry data model with OTLP as the wire format.
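To make "one trace = one call graph" concrete, here is a minimal sketch that rebuilds a call tree from a flat list of spans. The `Span` fields and the sample service names are assumptions for illustration; real span models (OTLP, Jaeger, Zipkin) carry many more attributes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    duration_ms: float


def render_tree(spans: list[Span]) -> list[str]:
    """Return indented lines showing the parent/child structure of a trace."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children[parent_id]:
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical spans from a single checkout request.
trace = [
    Span("a", None, "GET /checkout", 120.0),
    Span("b", "a", "auth.verify", 15.0),
    Span("c", "a", "cart.load", 90.0),
    Span("d", "c", "db.query", 80.0),
]
print("\n".join(render_tree(trace)))
```

Every span carries its own ID plus its parent's ID, which is what makes traces cardinality-rich and also what lets a backend reassemble the graph after spans arrive out of order from different services.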
Metrics — Prometheus and the compatible-ecosystem
- Prometheus — started at SoundCloud in 2012 (Matt Proud + Julius Volz); joined CNCF 2016; CNCF graduated 2018; ~70k GitHub stars by 2026 — one of the most-deployed OSS projects on Earth. Pull-based scrape model; local TSDB; PromQL query language; Alertmanager; service discovery (Kubernetes, Consul, EC2, etc.); text-based exposition format; remote write to long-term storage. Limitations: single-node TSDB; high cardinality kills it (per-pod-per-status-code metrics blow up); 15-day default retention; no HA out of the box.
- VictoriaMetrics — VictoriaMetrics Inc 2018 (Aliaksandr Valialkin, lead author of fasthttp Go library); Apache 2.0; drop-in Prometheus replacement with better cardinality, lower memory, longer retention; MetricsQL (PromQL superset); cluster mode (vmagent + vmselect + vminsert + vmstorage); commercial Enterprise.
- Grafana Mimir — Grafana Labs 2022 (formerly Cortex; Grafana hard-forked Cortex Mar 2022 after license dispute). AGPL; horizontally scalable Prometheus-compatible TSDB; multi-tenant; billion-active-series scale; replaces self-hosted Cortex in most new deployments.
- Thanos — Improbable 2017 → CNCF incubating; multi-cluster + long-term Prometheus storage; query federation; Apache 2.0; widely deployed at scale.
- Cortex — formerly OSS Prometheus-as-a-service; Grafana → Mimir fork divergence Mar 2022; Cortex continues at CNCF but Mimir now leads adoption.
- M3 — Uber 2018 → M3 Inc; Prometheus-compatible; high-cardinality target.
- Promscale — Timescale 2020-2022; deprecated 2023.
- AWS Managed Prometheus (AMP) — managed Cortex.
- GCP Managed Service for Prometheus — managed.
- Azure Managed Prometheus — managed.
- Chronosphere — Martin Mao + Rob Skillington (ex-Uber M3) 2019; SaaS Prometheus-compatible at scale; Series C 2022 at a $1.6B valuation; acquired Calyptia (Fluent Bit team) 2023.
- Levitate (Last9) — Indian SaaS Prometheus.
Push-based alternative metrics protocols:
- StatsD — Etsy 2011; UDP; ancient but ubiquitous.
- Carbon / Graphite — push-based; legacy.
- OpenTelemetry metrics — push-or-pull OTLP; modern unified protocol.
- InfluxDB line protocol — push; widely supported as Telegraf input/output.
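For contrast with the pull model, a push-based StatsD client just formats a short text line and fires it over UDP. The sketch below assumes the common `bucket:value|type` line shape with an optional `|@rate` suffix and the conventional port 8125; the bucket names are made up for illustration.

```python
import socket
from typing import Optional


def statsd_line(bucket: str, value, metric_type: str,
                sample_rate: Optional[float] = None) -> str:
    """Format one StatsD datagram, e.g. 'checkout.requests:1|c|@0.1'."""
    line = f"{bucket}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget over UDP: no connection, no acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


print(statsd_line("checkout.requests", 1, "c", sample_rate=0.1))
print(statsd_line("checkout.latency", 320, "ms"))
```

Because UDP gives no delivery guarantee, a dropped packet is a silently lost data point. That trade-off is part of why the protocol is cheap enough to be ubiquitous.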
Grafana — the visualization gravity well
- Grafana — Torkel Ödegaard 2014 (Stockholm); fork of Kibana 3.x. Apache 2.0 → AGPLv3 Apr 2021 (license change). Open-source; supports 100+ data sources via plugins (Prometheus, Loki, Tempo, Mimir, Pyroscope, Elasticsearch, InfluxDB, ClickHouse, Postgres, MySQL, BigQuery, Snowflake, Datadog, Splunk, CloudWatch, Stackdriver, Azure Monitor — and many more). Defining dashboard tool of the OSS observability era. Grafana Cloud SaaS launched 2020.
- Grafana Labs — Raj Dutt + Torkel 2014; “stack” strategy: Grafana (viz) + Loki (logs 2018) + Tempo (traces 2020) + Mimir (metrics 2022) + Pyroscope (profiles, acquired 2023) + OnCall (paging, acquired Amixr 2021) + k6 (load test, acquired k6.io 2021) + Beyla (zero-config eBPF auto-instrumentation 2023). Series D 2022 at a $6B valuation; ~$300M ARR 2025 (estimated).
- Grafana plugins ecosystem: data source plugins, panel plugins, app plugins.
- Grafana Mimir — see metrics.
Datadog — the SaaS giant
- Datadog — Olivier Pomel + Alexis Lê-Quôc 2010 (NYC; founders ex-Wireless Generation); NASDAQ IPO Sep 2019; market cap ~$54B in 2021; ~$3.1B ARR, ~25% YoY growth. Platform breadth: APM, infrastructure metrics, logs, RUM, synthetics, security (Cloud SIEM + CSPM + CWPP + CIEM), network monitoring (NPM + DNS), continuous profiler, CI Visibility, Test Optimization, Database Monitoring, LLM Observability (2024), workflows / case management, incident management, Cloud Cost Management, Watchdog AI anomaly detection, Bits AI co-pilot. Acquisitions: Madumbo 2017, Logmatic.io 2017 (logs core), MetaPlane 2024 (data observability), Quickwit Jan 2025 ~$200M reported (Rust log search → became Datadog Flex Logs basis), Variance 2024, Eppo 2024 (experimentation), Lyrid 2024, Codiga 2022, Sqreen 2021, Timber 2021 (Vector + Timber.io agents), Hdiv Security 2022, CoScreen 2021. Rumored but unconfirmed: StackState (parts), Snorkel AI.
- Watchdog AI — anomaly detection across metrics + logs + RUM since 2018; Bits AI assistant 2023+.
- Vector (built by Timber.io, which Datadog acquired in 2021) — Datadog’s open-source telemetry agent / pipeline; Rust; high-perf log/metric/trace shaping. Observability Pipelines product 2023.
New Relic, Dynatrace, AppDynamics — APM elder statesmen
- New Relic — Lew Cirne 2008 (the company name is an anagram of his name); NYSE IPO 2014; taken private Aug 2023 by Francisco Partners + TPG for $6.5B. New Relic One platform: APM + Infrastructure + Browser + Mobile + Synthetics + Logs + Network + Distributed Tracing + Errors Inbox + Vulnerability Management + Pixie + Codestream. NRQL query language (SQL-ish); user-priced tiered model 2020 reform.
- Dynatrace — Bernd Greifeneder 2005 (Linz, Austria); acquired by Compuware in 2011, then spun out in 2014 under Thoma Bravo; NYSE IPO Aug 2019; ~$14B market cap 2026; FY2025 ARR ~$1.7B. Davis AI causal-AI engine since 2019 (different paradigm than ML-anomaly — graph-based root cause). OneAgent auto-instrumentation; SaaS-only.
- AppDynamics — Jyoti Bansal 2008; Cisco acquired Jan 2017 for $3.7B (announced day before planned IPO). JVM/CLR/PHP/Python/.NET agents; APM heritage. Integrated into Cisco Full-Stack Observability platform; rebranded Cisco AppDynamics 2024.
- Cisco Full-Stack Observability — Cisco’s unified observability suite combining AppDynamics + ThousandEyes + Splunk (post-acquisition); built on OpenTelemetry.
Splunk and the SIEM-adjacent logs giants
- Splunk — Michael Baum + Rob Das + Erik Swan 2003 (San Francisco); IPO 2012; Cisco acquired Mar 2024 for $28B (largest tech acquisition in 2024). SPL (Search Processing Language); index-time + search-time schema; expensive per-GB-ingested pricing (historic complaint); broad enterprise SIEM presence; Splunk Observability Cloud (combines SignalFx + VictorOps + Plumbr + Omnition + Rigor acquisitions). Now Cisco-integrated with AppDynamics.
- SignalFx — 2013 (Karthik Rau ex-VMware vSAN); streaming analytics; Splunk acquired Sep 2019 $1.05B; integrated as Splunk Infrastructure Monitoring.
- VictorOps — 2012 (Dan Jones + Todd Vernon); Splunk acquired Jun 2018 $120M; integrated as Splunk On-Call.
- Sumo Logic — Bruno Kurtic + Christian Beedgen + Kumar Saurabh 2010 (Redwood City); IPO Sep 2020; taken private by Francisco Partners in 2023 for ~$1.7B. Cloud-native log analytics + metrics + security; ingest-priced.
- Elastic Observability — Elastic Stack + APM agents; OSS + commercial. Logs + metrics + APM + RUM + synthetics + uptime; built on Elasticsearch.
- InfluxData / TICK stack — Telegraf + InfluxDB + Chronograf + Kapacitor; legacy bundle; InfluxData pivoting to InfluxDB 3.x IOx (Rust + Parquet) with broader observability play.
- Wavefront / VMware Tanzu Observability / Aria Operations for Applications — VMware 2017 acquisition $325M; rebranded under Aria 2022; now Broadcom (post VMware acquisition).
- LogicMonitor — SaaS hybrid observability; PE-owned (Vista Equity).
- ScienceLogic — IT operations management; PE-owned.
- OpsRamp — HPE acquired 2023.
Honeycomb, Lightstep, Charity Majors and the events-first wave
- Honeycomb — Charity Majors + Christine Yen + Ben Hartshorne 2016 (founders ex-Parse Facebook). Arbitrarily-wide structured events model; high-cardinality dimensions; BubbleUp root-cause analysis (anomaly explanation); Distributed Nosy Debugging (DnD) mode; tracing-first design. Series D May 2024 at a $700M valuation; ~$50M ARR 2025 (estimated).
- Lightstep — Ben Sigelman + Ben Cronin + Daniel “Spoons” Spoonhower 2015; Sigelman co-authored the original Google Dapper paper (2010); ServiceNow acquired May 2021 (undisclosed; reportedly $300M+ range); integrated as ServiceNow Cloud Observability.
- Cribl — Clint Sharp + Ledion Bitincka + Dritan Bitincka 2018 (founders ex-Splunk); telemetry pipelines category creator; Cribl Stream + Edge + Search + Lake; Series E 2024 at a $3.5B valuation; ARR ~$200M+.
- Edge Delta — Ozan Unlu 2019; agent-side observability — analyze + filter + rollup at the edge before sending; Series B 2024 $63M.
- HyperDX — 2023; OSS observability + LLM assistant; YC W23.
- Coroot — open-source eBPF observability + cost monitoring.
- Quickwit — Datadog acquired Jan 2025; OSS Rust log search.
- Axiom — 2020 (London); event-based; ClickHouse-style columnar storage; Apache Parquet backing; SQL query; per-event pricing model; raised Series A 2022 $20M.
- OpenObserve — 2022; OSS observability stack; Rust + columnar; Datadog/Loki alternative.
Cloud-native managed observability
- AWS CloudWatch — metrics + logs + alarms + dashboards + Container Insights + Synthetics + RUM + Application Insights + Internet Monitor + Network Monitor; CloudWatch Logs Insights query language. Pricing per metric + per ingested-GB-logs + per query-scanned-GB.
- AWS X-Ray — distributed tracing; now overshadowed by AWS Distro for OpenTelemetry (ADOT).
- AWS Managed Service for Prometheus (AMP) — managed Cortex.
- AWS Managed Service for Grafana (AMG) — managed Grafana.
- AWS CloudTrail — audit logging; not traditionally observability but adjacent.
- GCP Cloud Operations Suite (formerly Stackdriver — 2014 Stackdriver acquisition → rebranded 2020): Cloud Monitoring + Cloud Logging + Cloud Trace + Cloud Profiler + Cloud Debugger (deprecated 2023) + Error Reporting + Service Monitoring.
- GCP Managed Prometheus + Managed Grafana — 2022.
- Azure Monitor — umbrella for Application Insights (APM, from Microsoft Visual Studio acquisition lineage) + Log Analytics workspaces (Kusto KQL backend) + Metrics + Network Watcher + Container Insights + VM Insights.
- Azure Managed Grafana / Managed Prometheus — 2023.
- Oracle Cloud Observability — OCI Logging + Monitoring + APM.
- IBM Instana — APM; IBM acquired 2020.
Logs — the ELK stack and beyond
- Elastic Stack (ELK) — Elasticsearch + Logstash + Kibana + Beats (Filebeat, Metricbeat, Heartbeat, Packetbeat, Auditbeat, Functionbeat, Winlogbeat). Foundational logs platform 2010-2022. Licensing: Apache 2.0 → SSPL/ELv2 Jan 2021 → triple-licensed (added AGPLv3) Sep 2024.
- OpenSearch — AWS-led fork Apr 2021; moved under the OpenSearch Software Foundation (Linux Foundation) in Sep 2024. Apache 2.0; available as an AWS-managed service.
- Grafana Loki — Grafana Labs 2018; label-based like Prometheus (not full-text-indexed); cheap storage on S3; LogQL query language; Promtail agent; complementary to Mimir + Tempo.
- Fluentd — Treasure Data 2011; CNCF graduated; Ruby + C; ubiquitous log shipper.
- Fluent Bit — Treasure Data 2015; C; lightweight Fluentd; embedded in containers; default container log forwarder in Kubernetes ecosystem; Calyptia commercial (acquired by Chronosphere 2023).
- Vector — Timber.io → Datadog; Rust; high-throughput agent/pipeline.
- Logstash — Elastic 2009 (Jordan Sissel); JRuby; pluggable input/filter/output; heavy memory.
- Filebeat — Elastic; lightweight log shipper.
- rsyslog / syslog-ng — classic syslog daemons; still everywhere in non-K8s estates.
- Datadog Logs — ingestion-priced ($1.27/M events processed at standard pricing).
- Splunk — see above.
- Sumo Logic — see above.
- Loggly — SolarWinds 2018; legacy SaaS.
- Papertrail — SolarWinds; legacy tail-and-search.
- Better Stack Logs (formerly Logtail) — 2021; SaaS; flat-rate pricing.
- Mezmo (formerly LogDNA) — pivoted 2022 to telemetry pipeline + logs.
- Coralogix — 2015 Tel Aviv; SaaS; Streama architecture (analyze at ingest); Series C 2022 at a $1B valuation.
- Axiom — see above; event-based; Apache Parquet backing.
- HyperDX — see above.
- Edge Delta — see above.
- Cribl Stream — telemetry pipeline rather than log destination; routes to many backends.
- Logz.io — 2014 Tel Aviv; managed ELK + traces; Series D 2022 $145M.
- Graylog — open-source SIEM-adjacent; commercial Graylog Operations.
- Quickwit — see above.
Traces — distributed tracing and OpenTelemetry
- Distributed tracing genesis — Google Dapper paper 2010 (Benjamin Sigelman + Luiz André Barroso + Mike Burrows); inspired everything since.
- OpenTracing — CNCF 2016; vendor-neutral API; merged into OpenTelemetry 2019.
- OpenCensus — Google 2018; merged into OpenTelemetry 2019.
- OpenTelemetry — CNCF 2019 merger of OpenTracing + OpenCensus; now CNCF graduated 2024 (graduated in metrics + traces; logs still incubating; profiles signal added 2024). Standards: OTLP wire format (HTTP + gRPC); OTel SDK per language (Java, Go, Python, JS/Node, .NET, Ruby, PHP, Rust, C++, Swift, Erlang, Elixir); OTel Collector (vendor-agnostic receiver + processor + exporter pipeline). Second-most-active CNCF project by contributor count, behind Kubernetes.
- OTel auto-instrumentation — Java + Python + .NET + Node have official auto-instrumentation agents (no code changes).
- Jaeger — Uber 2015 (Yuri Shkuro et al; inspired by Zipkin + Dapper); CNCF graduated 2019; backends: Cassandra, Elasticsearch, OpenSearch, ClickHouse; Apache 2.0; Jaeger v2 2024 rebuilt on OTel Collector.
- Zipkin — Twitter 2012; Java; the original OSS tracer; Apache 2.0; less actively developed.
- Grafana Tempo — Grafana Labs 2020; high-volume cheap-storage trace backend; object-storage backed; pairs with Loki + Mimir.
- Datadog APM — proprietary; integrates with OTel via OTLP intake.
- New Relic APM — proprietary + OTel via OTLP.
- Dynatrace — OneAgent + OTel via OTLP intake.
- Honeycomb — events-first with full trace context.
- AWS X-Ray — and ADOT (AWS Distro for OpenTelemetry).
- GCP Cloud Trace — Stackdriver lineage; OTel support.
- Azure Application Insights — OTel support.
- SigNoz — 2021 OSS APM (ClickHouse backend); $6.5M seed 2022.
- Uptrace — 2021 OSS APM (ClickHouse).
- Tracetest — 2022; trace-based testing (assertions against trace spans); part of Kubeshop.
Profiling — continuous profiling category
- Pyroscope — Pyroscope Inc 2020 → Grafana Labs acquired 2023; integrated as Grafana Pyroscope. Continuous profiling; flame graphs; eBPF-based language-agnostic + JVM/Go/Python/Ruby/Node specific.
- Parca — Polar Signals 2021 (Frederic Branczyk; ex-Red Hat Prometheus team); eBPF-based; OSS Apache 2.0; Polar Signals Cloud managed.
- Polar Signals Cloud — managed Parca; Series A 2022 $4M.
- Pixie — Pixie Labs → New Relic acquired 2020; eBPF auto-instrumentation on-cluster; donated to CNCF as sandbox project; tied closely to NR but OSS Apache 2.0.
- Datadog Continuous Profiler — JVM/Python/Go/Node/Ruby/.NET continuous profiling integrated with APM.
- Sentry Profiling — 2023 GA; mobile + backend.
- Granulate — gProfiler OSS; Intel acquired 2022 $650M; performance auto-optimization.
- Brendan Gregg’s tools — flamegraph.pl + perf + bpftrace + BCC; the canonical profiling toolkit ancestors.
- eBPF parca-agent — Polar Signals’ agent.
- async-profiler — JVM-specific; popular sampler.
- OpenTelemetry profiles signal — 2024 added as fourth signal; specs still stabilizing.
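Flame graphs are built from aggregated stack samples. A common intermediate is the "folded" text form consumed by flamegraph.pl: one line per unique stack, frames joined by semicolons, followed by a sample count. The sketch below shows that aggregation step; the sample stacks are invented, and a real profiler would collect them from perf, eBPF, or a language runtime.

```python
from collections import Counter


def fold(samples: list[list[str]]) -> list[str]:
    """Collapse sampled stacks (root frame first) into folded-stack lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Hypothetical stacks captured by a sampling profiler.
samples = [
    ["main", "handle", "parse"],
    ["main", "handle", "parse"],
    ["main", "handle", "db_query"],
    ["main", "gc"],
]
for line in fold(samples):
    print(line)
```

The width of each box in the final flame graph is proportional to these counts, so a stack that shows up in half the samples takes half the width.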
RUM, frontend, and session replay
- Sentry — David Cramer + Chris Jennings 2008 (originally Disqus internal); Functional Software Inc (Sentry parent); errors + performance + session replay + crons + feedback + LLM observability; source-available under BSL since 2019 (relicensed from BSD-3). Raised $90M at a $3B valuation.
- Bugsnag — James Smith + Simon Maynard 2013; SmartBear acquired 2021 (undisclosed); error monitoring.
- Raygun — Mindscape (NZ); error + APM + RUM.
- Rollbar — Brian Rue + Cory Virok 2012; error monitoring; Series B 2019 $11M.
- Honeybadger — Indie / bootstrapped; small team; loyal user base.
- AppSignal — Amsterdam; Ruby/Elixir/Node APM + error + host.
- Datadog RUM — frontend monitoring + session replay launched 2020.
- New Relic Browser — JS agent for SPA telemetry.
- Dynatrace RUM — OneAgent + JS agent.
- Akamai mPulse — CDN-tied RUM.
- SpeedCurve — 2014 Mark Zeman; performance budgets + RUM; Catchpoint acquired 2024.
- Calibre — Karolina Szczur + Ben Schwarz 2017; performance budgets; site-monitoring; Catchpoint acquired 2024 as part of SpeedCurve.
- RUMVision — Annelinde van Diemen + Wouter Postma 2023 (Netherlands); SaaS RUM specialized for Core Web Vitals.
- Web Vitals — Google 2020 standard: LCP (Largest Contentful Paint), FID (First Input Delay, replaced by INP Interaction to Next Paint in Mar 2024), CLS (Cumulative Layout Shift), plus TTFB (Time to First Byte) and FCP (First Contentful Paint).
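As a worked example of how Core Web Vitals are scored, the sketch below rates a metric against Google's published "good" and "poor" boundaries, evaluated at the 75th percentile of page loads. The thresholds are the publicly documented ones; the nearest-rank percentile method and the sample values are simplifying assumptions.

```python
from math import ceil

# (good upper bound, poor lower bound); LCP and INP in ms, CLS unitless.
THRESHOLDS = {"LCP": (2500, 4000), "INP": (200, 500), "CLS": (0.1, 0.25)}


def p75(values: list[float]) -> float:
    """75th percentile using the nearest-rank method."""
    ordered = sorted(values)
    return ordered[ceil(0.75 * len(ordered)) - 1]


def rate(metric: str, value: float) -> str:
    """Classify a single value as good / needs improvement / poor."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs improvement"
    return "poor"


# Hypothetical LCP samples (ms) collected by a RUM agent.
lcp_samples = [1800, 2100, 2400, 3900]
print(p75(lcp_samples), rate("LCP", p75(lcp_samples)))  # 2400 good
```

Scoring at the 75th percentile rather than the mean keeps a handful of very slow sessions from dominating while still reflecting what most users experience.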
Session replay specifically:
- FullStory — Scott Voigt 2014; Permira PE recap 2022; ~$130M ARR (~2022).
- Hotjar — David Darmanin 2014 (Malta); Contentsquare acquired 2021 $400M+; heatmaps + recordings + surveys.
- Contentsquare — 2012 Jonathan Cherki (France); digital experience analytics; raised Series F 2022 at a $5.6B valuation; acquired Hotjar 2021.
- LogRocket — Matthew Arbesfeld + Ben Edelstein 2016; session replay + error + performance + product analytics; Series C 2021 $35M.
- PostHog — James Hawkins + Tim Glaser 2020 (London); OSS product analytics + session replay + feature flags + experiments + LLM observability + error tracking; ClickHouse-backed; YC W20; Series C 2024 at a $750M valuation.
- Microsoft Clarity — Microsoft 2020; free session replay + heatmaps; integrated with Bing/Edge data.
- Smartlook — Czech; ~$10M ARR; Cisco acquired 2023.
- Mouseflow — Copenhagen; heatmaps + replay.
- Heap — product analytics + auto-capture; Contentsquare acquired Sep 2024 $590M.
- Mixpanel — product analytics with replay since 2023.
Synthetic monitoring — proactive uptime + perf
- Pingdom — 1998 (Sweden); pioneer; SolarWinds acquired 2014 $40M; simple HTTP pings + transaction checks.
- Datadog Synthetics — 2019 launch; HTTP/multi-step/browser via Puppeteer.
- New Relic Synthetics — multi-step + browser checks.
- Catchpoint — 2008 (Akamai veterans Mehdi Daoudi + others); enterprise synthetic + RUM + DNS; private growth-equity backed; ~$100M+ ARR; acquired SpeedCurve + Calibre 2024.
- ThousandEyes — Mohit Lad + Ricardo Oliveira 2010; network path visibility; Cisco acquired 2020 $1B; integrated as Cisco ThousandEyes.
- UptimeRobot — 2010; bootstrapped; free tier; Idera acquired ~2020.
- Better Stack Uptime (formerly Better Uptime) — 2021; status pages + uptime + Logtail (logs); Estonia.
- Checkly — Tim Nolet + Hannes Lenke 2018; Playwright-based synthetic; monitoring-as-code; Series A 2022 $10M.
- AlertSite / SmartBear AlertSite — synthetic SaaS.
- StatusCake — 2012; SaaS uptime + page-speed; TPG / Idera owned.
- Site24x7 — Zoho’s monitoring suite.
- WebPageTest — Patrick Meenan 2008 (AOL engineer); free + commercial via Catchpoint (acquired 2020). Per-page synthetic loads with detailed waterfall + filmstrip + lighthouse + Web Vitals.
- Lighthouse CI — Google 2019; lab-mode page perf in CI/CD.
- k6 — Grafana Labs 2017 (acquired k6.io 2021); load-testing + synthetic-monitoring k6 Cloud; JS scripted scenarios.
- Gatling — load testing; Scala/Java + open source + Enterprise.
- Locust — Python; OSS.
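At their simplest, the uptime and transaction checks above reduce to: fetch, assert on status and content, and time the whole thing. Here is a minimal sketch using only the standard library. `run_check`, `http_fetcher`, and the `CheckResult` fields are hypothetical names, and injecting the fetch callable is a design choice that lets the check logic be exercised without a network.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CheckResult:
    ok: bool
    status: int
    latency_ms: float
    reason: str


def http_fetcher(url: str, timeout: float = 5.0) -> Callable[[], Tuple[int, str]]:
    """Build a fetch callable backed by urllib (raises on 4xx/5xx and timeouts)."""
    def fetch() -> Tuple[int, str]:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    return fetch


def run_check(fetch: Callable[[], Tuple[int, str]], expect_status: int = 200,
              expect_text: str = "", max_latency_ms: float = 1000.0) -> CheckResult:
    """Run one synthetic check and report the first failed assertion."""
    start = time.perf_counter()
    try:
        status, body = fetch()
    except Exception as exc:  # DNS failure, timeout, TLS error, HTTP error status
        latency = (time.perf_counter() - start) * 1000
        return CheckResult(False, 0, latency, f"request failed: {exc}")
    latency = (time.perf_counter() - start) * 1000
    if status != expect_status:
        return CheckResult(False, status, latency, f"unexpected status {status}")
    if expect_text not in body:
        return CheckResult(False, status, latency, "expected text missing")
    if latency > max_latency_ms:
        return CheckResult(False, status, latency, "too slow")
    return CheckResult(True, status, latency, "ok")


# Exercise the logic with a stubbed response instead of a live endpoint.
print(run_check(lambda: (200, "Welcome back"), expect_text="Welcome"))
```

Against a live endpoint the call would look like `run_check(http_fetcher("https://example.com/"))`. Commercial tools add multi-region probes, browser automation, and alert routing on top of this core loop.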
eBPF — the kernel-instrumentation revolution
eBPF (extended Berkeley Packet Filter) since Linux 3.18 (2014) allows safe in-kernel programs without modules; observability industry adopted from 2018 with Brendan Gregg / BCC tools popularization and Cilium + Falco + Pixie + Parca + Beyla + Tetragon building product layers on top.
- Cilium — Isovalent (Thomas Graf + Daniel Borkmann; project started 2015, originally as a Linux netfilter project; company founded 2017); CNCF graduated 2023; Cisco acquired Isovalent Apr 2024 ~$700M reported; eBPF-based Kubernetes networking + observability + security; Hubble (observability), Tetragon (runtime security).
- Pixie — see profiling; on-cluster eBPF + DataScript.
- Parca — see profiling.
- Beyla — Grafana Labs 2023; lightweight zero-config eBPF for HTTP/gRPC + auto-instrumentation; OSS Apache 2.0; pairs with Tempo/Mimir/Loki.
- Coroot — 2022 OSS eBPF observability + cost; one-installer Kubernetes APM alternative.
- Falco — Sysdig 2016 → CNCF graduated 2024; runtime security + audit via eBPF rules; ~6k stars.
- Sysdig — Loris Degioanni 2013 (Wireshark co-creator); container security + observability; Series H 2022 at a $2.5B valuation.
- Tetragon — Isovalent → Cisco; eBPF runtime security policy enforcement.
- Inspektor Gadget — Kinvolk → Microsoft 2021 acquisition; Kubernetes-native eBPF toolkit.
- eBPF Foundation — Linux Foundation 2021; governance; Meta + Google + Isovalent + Microsoft + Netflix + Red Hat founding.
- BCC tools — IO Visor Project; ~150 ready-made eBPF tools (execsnoop, opensnoop, biolatency, tcptracer, etc.); from Brendan Gregg + colleagues at Netflix.
- bpftrace — DTrace-like high-level language for eBPF; Brendan Gregg + Alastair Robertson.
- libbpf — modern C library for eBPF program loading.
- eunomia-bpf — Wasm + eBPF combination.
Incident management and paging
- PagerDuty — Alex Solomon + Andrew Miklas + Baskar Puvanathasan 2009 (founders ex-Amazon); NYSE IPO Apr 2019; ~$475M annual revenue. Schedules + escalations + integrations (~700+); Event Intelligence ML; AIOps add-on; Process Automation via the Rundeck acquisition (2020, ~$100M) plus Catalytic (2022); Jeli acquired 2023 (post-incident reviews / incident analysis).
- Opsgenie — 2012; Atlassian acquired Sep 2018 $295M; bundled with Jira Service Management.
- Atlassian Statuspage — see status pages.
- VictorOps — see Splunk; now Splunk On-Call.
- xMatters — 2003; Everbridge acquired 2021 $245M.
- Squadcast — 2017 (Bengaluru); SaaS incident response; Series A 2024 $10M.
- Rootly — Quentin Rousseau + JJ Tang 2020 (Slack-native incident management); YC W21; raised Series B 2024 $20M.
- FireHydrant — Robert Ross (known online as “Bobby Tables”) 2019; incident management + retros; Series B 2022 $23M.
- incident.io — Stephen Whitworth + Pete Hamilton + Chris Evans 2021 (London); Slack-native; Series B 2024 at a $400M valuation; runs on Slack + MS Teams; ~$15M ARR 2024.
- Grafana OnCall — Grafana 2021 (acquired Amixr 2021); OSS + cloud; integrates Grafana alerting → escalations.
- Spike.sh — Indian SaaS incident response.
- Zenduty — Indian incident management.
- GitHub Incident Management — built-in IM for GitHub; basic.
- Datadog Incident Management — 2022 GA; ties into Watchdog + RUM context.
- AWS Systems Manager Incident Manager — AWS-native; integrates CloudWatch alarms.
- Jira Service Management Incidents — Atlassian’s ITSM incident workflow.
SRE foundations:
- SRE Book — Beyer + Jones + Petoff + Murphy (Google 2016, O’Reilly).
- SRE Workbook — Beyer + Murphy + Rensin + Kawahara + Thorne (Google 2018, O’Reilly).
- SLI / SLO / SLA terminology; error budgets as the SRE-Dev contract.
- Nines culture (99.9% / 99.99% / 99.999%).
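The nines translate directly into an error budget. The sketch below works through the arithmetic for a 30-day window and for a request-based SLO. The function names and the traffic numbers are illustrative assumptions, not taken from the SRE books.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates within the window."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def budget_remaining(slo_percent: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if slo_percent >= 100:
        raise ValueError("a 100% SLO leaves no error budget to measure against")
    allowed_failures = total_requests * (1 - slo_percent / 100)
    return 1 - failed_requests / allowed_failures


for slo in (99.9, 99.99, 99.999):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.3f} min per 30 days")

# Hypothetical month: 1M requests, 250 failures, 99.9% SLO.
print(f"{budget_remaining(99.9, 1_000_000, 250):.2f} of budget left")
```

Each added nine cuts the budget by a factor of ten: roughly 43 minutes per month at three nines, about 4.3 minutes at four, and under half a minute at five.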
Status pages
- Atlassian Statuspage — was StatusPage.io; founders Steve Klein + Danny Olinsky + Scott Klein 2013; Atlassian acquired 2016 $50M; dominant enterprise status-page tool.
- Better Stack Status — 2022; bundled with Better Stack Uptime + Logs.
- Instatus — Mert Bulan 2019 (Turkey); affordable status pages.
- Statushub / status.io / Hund — niche status-page SaaS.
- Cachet — OSS PHP status page; in maintenance.
- UptimeRobot Status Pages — bundled with UptimeRobot monitoring.
- GitHub Status / AWS Service Health Dashboard / GCP Status — vendor-managed examples.
AIOps and anomaly detection
- Moogsoft — 2012 Phil Tee (ex-Riverbed); event correlation + AIOps; Dell acquired Apr 2023 (reportedly in the $200M range).
- BigPanda — 2012; AIOps event correlation; Series E 2021 at a $1.2B valuation.
- Datadog Watchdog — built-in anomaly detection across metrics + logs + RUM since 2018; Bits AI copilot since 2023.
- Dynatrace Davis AI — causal-AI graph-based root cause since 2019; differentiator.
- Splunk IT Service Intelligence (ITSI) — service-level AIOps.
- OpsRamp — HPE acquired 2023; AIOps + observability.
- ScienceLogic — IT operations + ML.
- LogicMonitor — hybrid observability + LogicMonitor Edwin AI 2024.
- Cisco AppDynamics Smart Agents — AI-assisted distributed system understanding.
- New Relic AI — 2023; LLM-based assistance.
- Honeycomb Query Assistant — natural-language to query.
- Grafana AI — 2024+; auto-summarization + assistants.
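The simplest form of metric anomaly detection is a rolling z-score: flag a point when it sits several standard deviations away from a trailing baseline. The sketch below is a generic textbook technique, not the algorithm behind Watchdog, Davis, or any other product listed above; the window size, threshold, and latency series are arbitrary illustrative choices.

```python
from statistics import mean, pstdev


def zscore_anomalies(values: list[float], window: int = 5,
                     threshold: float = 3.0) -> list[int]:
    """Return indexes whose value deviates > threshold sigmas from the trailing window."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu, sigma = mean(baseline), pstdev(baseline)
        if sigma == 0:
            continue  # flat baseline: z-score undefined, skip rather than divide by zero
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical p99 latency samples (ms) with one spike.
latencies = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
print(zscore_anomalies(latencies))  # [7]
```

Note that after the spike the baseline itself is inflated, so the following points look normal. Production systems typically add seasonality handling and robust statistics to cope with this.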
Cost observability and FinOps
The financial observability layer that grew alongside cloud-native observability in 2020-2025.
- Datadog Cloud Cost Management — 2023 GA.
- Vantage — 2020 (founders ex-DigitalOcean); multi-cloud cost; raised Series B 2024 $21M.
- CloudZero — 2016 (Boston); unit economics-focused cost; raised Series C 2023 $32M.
- Kubecost — 2019 → IBM acquired Stratos / Kubecost Aug 2024 $150M+; CNCF sandbox; Kubernetes cost allocation; OSS + commercial.
- Cast.ai — 2019 Lithuanian + US; Kubernetes auto-scaling + cost optimization; Series C 2024 at an $850M valuation.
- Komodor — 2020 Tel Aviv; Kubernetes troubleshooting + cost; Series B 2024 $42M.
- Holori — French cost observability.
- Spot.io — formerly Spotinst (2015 Tel Aviv); NetApp acquired 2020 $450M; spot-instance + auto-scaling savings.
- Densify — Toronto; right-sizing optimization.
- IBM Turbonomic — IBM acquired Turbonomic 2021 $1.5B; application resource management.
- Apptio — TBM (Technology Business Management); IBM acquired Aug 2023 $4.6B; financial / cost transparency.
- Yotascale — 2015; cost SaaS.
- Anodot — anomaly + cost AIOps; raised Series C 2022 $35M.
- FinOps Foundation — Linux Foundation 2020; standards body for cloud cost discipline; FOCUS specification 2024.
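Kubernetes cost allocation, at its core, splits a shared bill across workloads in proportion to what they reserve. The sketch below divides one node's hourly cost by CPU requests and reports the unrequested remainder as idle cost. It is a deliberately simplified model with hypothetical prices and pod names, not the allocation algorithm of Kubecost or any other tool in this list (which also weigh memory, GPU, storage, and network).

```python
def allocate_node_cost(node_hourly_cost: float, node_cpu_cores: float,
                       pod_cpu_requests: dict[str, float]) -> dict[str, float]:
    """Split a node's hourly cost across pods by CPU request; remainder is idle."""
    allocation = {
        pod: node_hourly_cost * cpu / node_cpu_cores
        for pod, cpu in pod_cpu_requests.items()
    }
    allocation["__idle__"] = node_hourly_cost - sum(allocation.values())
    return allocation


# Hypothetical 8-core node at $0.40/hour running two pods.
costs = allocate_node_cost(0.40, 8, {"api": 2, "worker": 4})
for name, cost in costs.items():
    print(f"{name}: ${cost:.3f}/hour")
```

Surfacing idle cost as its own line is a common design choice because it shows how much spend is attributable to over-provisioning rather than to any one team.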
OpenTelemetry stack composition
The OTel stack you’d assemble in 2026:
- Language SDKs — instrument application code; auto-instrumentation agents for Java/Python/.NET/Node; manual SDK for Go/Rust/C++/Ruby/PHP/Swift.
- OTel Collector — vendor-agnostic processor pipeline; runs as agent (per-host/per-pod) or gateway (centralized); receivers (OTLP/Prometheus/Jaeger/Zipkin/Fluent/Statsd/Kafka/AWS-CW/GCP-Monitoring/Splunk-HEC/etc.); processors (batch/memorylimit/attributes/transform/filter/tail-sampling/k8s-attributes); exporters (Datadog/NewRelic/Honeycomb/Splunk/Jaeger/Zipkin/Tempo/Loki/Prometheus/Kafka/file/etc.).
- OTLP wire format — HTTP/gRPC; protobuf payload.
- Backend — vendor (Datadog, Honeycomb, New Relic, Dynatrace, Lightstep) or OSS (Grafana stack, SigNoz, ClickHouse, OpenSearch, Jaeger).
- Semantic conventions — service.name, http.method, db.system, messaging.system, etc. — standardize attribute naming across the ecosystem.
OpenTelemetry’s two biggest 2024-2025 milestones: profiles signal added as fourth pillar, and logs SDK stabilized to GA in major languages.
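The receiver → processor → exporter shape of the Collector can be modelled in a few lines. This is a toy in-memory sketch of the data flow only; it does not use the real Collector or its configuration format, and the processor names, record fields, and batch size are hypothetical.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Processor = Callable[[list], list]


def drop_health_checks(records: list) -> list:
    """Filter-style processor: discard noisy health-check telemetry."""
    return [r for r in records if r.get("http.route") != "/healthz"]


def add_resource(attrs: dict) -> Processor:
    """Attributes-style processor: stamp resource attributes onto every record."""
    def processor(records: list) -> list:
        return [{**attrs, **r} for r in records]
    return processor


def batches(records: list, size: int) -> Iterator[list]:
    """Batch-style processor: group records so the exporter sends fewer requests."""
    for i in range(0, len(records), size):
        yield records[i:i + size]


def run_pipeline(received: Iterable[Record], processors: list,
                 export: Callable[[list], None], batch_size: int = 2) -> None:
    records = list(received)
    for process in processors:
        records = process(records)
    for batch in batches(records, batch_size):
        export(batch)


# Hypothetical records standing in for what a receiver would hand over.
received = [
    {"http.route": "/checkout", "duration_ms": 120},
    {"http.route": "/healthz", "duration_ms": 1},
    {"http.route": "/cart", "duration_ms": 45},
]
run_pipeline(received,
             [drop_health_checks, add_resource({"service.name": "shop"})],
             export=print)
```

Because each stage only sees and returns a list of records, swapping the exporter (or adding a second one) does not touch instrumentation code, which is the vendor-neutrality argument in miniature.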
Telemetry pipelines — the routing/shaping layer
The “ETL for observability” category that exploded 2021-2024:
- Cribl Stream + Edge + Search + Lake — category-defining; Series E 2024 at a $3.5B valuation; ~$200M+ ARR.
- Datadog Observability Pipelines — built on Vector; 2023 GA.
- Edge Delta — agent-side processing.
- Mezmo Telemetry Pipeline — formerly LogDNA; pivoted to pipeline play.
- OpenTelemetry Collector — neutral pipeline option; many use it as a free Cribl-alternative.
- Vector (Datadog OSS) — Rust; runs standalone outside of Cribl/Datadog SaaS.
- Fluent Bit / Fluentd — classic log pipeline; Datadog/Splunk/ELK destinations.
- Logstash — Elastic-native pipeline.
- AWS Firehose — managed shipping to S3/Redshift/OpenSearch/Splunk.
- Kafka + ksqlDB / Flink — DIY pipeline pattern.
Common pipeline shaping operations: filtering, sampling (tail-based + head-based), aggregation, rollup, PII redaction, format conversion (Splunk-HEC ↔ OTLP ↔ Datadog ↔ NewRelic), enrichment (k8s metadata, geo-IP), routing (logs → cheap S3 archive + expensive index hot tier).
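Two of the shaping operations above, head-based sampling and PII redaction, are small enough to sketch. Head-based sampling decides up front from the trace ID alone, so every service that hashes the same ID reaches the same keep/drop decision. The hash choice, the 10% rate, and the email-only regex are illustrative assumptions rather than how any specific pipeline product does it.

```python
import hashlib
import re


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic head-based sampling: the same trace_id always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < sample_rate * 2**64


EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(message: str) -> str:
    """Mask email addresses before a log line leaves the pipeline."""
    return EMAIL.sub("[REDACTED_EMAIL]", message)


kept = sum(keep_trace(f"trace-{i}", 0.10) for i in range(10_000))
print(f"kept {kept} of 10000 traces")
print(redact("login failed for alice@example.com from 10.0.0.1"))
```

Tail-based sampling differs in that it buffers the whole trace and decides afterwards (for example, keep everything with an error), which costs memory in the pipeline but preserves the interesting traces.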
Selection criteria summary
| Vendor / stack | High-cardinality friendly | Cheap retention | Vendor-neutral | Mature AI assist |
|---|---|---|---|---|
| Honeycomb | Strong | Med | OTel-native | Yes (Query Asst) |
| Datadog | Med (caps + indexed) | Cost-heavy | OTLP intake | Yes (Bits AI + Watchdog) |
| New Relic | Strong (NRDB) | Med | OTLP intake | Yes (NR AI) |
| Dynatrace | Med | Med | OneAgent + OTLP | Yes (Davis AI) |
| Splunk | Strong | Cost-heavy | OTLP via HEC | Yes (ITSI + SPL2) |
| Grafana stack | Strong (Mimir+Tempo+Loki) | Cheap (S3) | Native OSS | Growing (Grafana AI) |
| Elastic | Med | Cost-heavy | OTLP intake | Yes (ESRE) |
| ClickHouse-based (SigNoz/Axiom/Coralogix) | Strong | Cheap | OTel-native | Varies |
Adjacent
- database-engine-taxonomy — many observability tools are built on TSDB / OLAP DB technology (ClickHouse for SigNoz/Axiom; Prometheus TSDB; Elasticsearch).
- llm-landscape — LLM observability is now a category in itself (Datadog LLM Obs, Honeycomb, Arize, LangSmith, Helicone).
- ml-framework-comparison — ML training pipelines feed into observability (model performance monitoring overlap with APM).
- distributed-systems — distributed tracing concepts (Dapper paper, span context propagation) tie to the broader distributed systems literature.
- kubernetes — most observability tools deploy and instrument primarily on Kubernetes today.
- networking — eBPF observability and network monitoring (Cilium Hubble, ThousandEyes, NPM) overlap with networking layer.