eBPF & Kernel Observability
eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that lets sandboxed programs run in kernel space at well-defined attach points — network ingress, syscalls, function entry/exit, scheduler events, security hooks — without modifying kernel source or loading kernel modules. It has become the substrate for the modern Linux observability, security, and networking stack: Cilium for cluster networking, Pixie and Parca for auto-instrumented observability, Falco and Tetragon for runtime security, Hubble for network visibility, and a flood of cloud-native data-plane tools.
History
Classic BPF (1992)
The original BPF was introduced by Steven McCanne and Van Jacobson at Lawrence Berkeley Laboratory (LBL) in the paper “The BSD Packet Filter: A New Architecture for User-level Packet Capture”, written in 1992 and presented at the Winter 1993 USENIX conference. It defined a small register-based VM (32-bit, with an accumulator A, an index register X, and a small scratch memory) used by tcpdump and libpcap to push packet-matching predicates into the kernel so packets could be filtered before the costly copy to userspace. cBPF found its way into Linux for socket filters (SO_ATTACH_FILTER), seccomp-bpf (syscall filtering), and other niches.
Extended BPF (2014)
Alexei Starovoitov (then at PLUMgrid, later at Facebook/Meta) led the work to generalise BPF into a 64-bit, 11-register, JIT-compiled VM with a verifier, which landed in Linux 3.18 (December 2014). The key additions:
- 64-bit registers (R0-R10, R10 read-only frame pointer) mapping efficiently to x86_64/arm64 registers.
- Maps — kernel data structures shared between BPF programs and userspace.
- Helper functions — a curated set of kernel functions BPF can call.
- Verifier — proves termination, memory safety, and bounded execution before load.
- JIT — bytecode compiled to native machine code at load time.
- Attach points beyond sockets, added over subsequent releases — kprobes, tracepoints, uprobes, perf events, XDP, tc, cgroup, LSM, fentry/fexit.
Daniel Borkmann (then at Red Hat, later Isovalent/Cisco) co-led the work; the two remain BPF subsystem co-maintainers as of 2025.
Architecture
Toolchain
A BPF program is typically:
- Written in restricted C (no unbounded loops, no printk, a 512-byte stack limit, and array accesses whose bounds the verifier can check).
- Compiled by Clang/LLVM with -target bpf to produce ELF object files containing BPF bytecode in .text sections plus map definitions in .maps.
- Loaded into the kernel via the bpf() syscall, usually through libbpf (the canonical C loader library) or one of its wrappers.
- Verified, JITed, and attached to an attach point.
The Verifier
The verifier is BPF’s most distinctive feature. It performs static analysis on the bytecode to prove:
- Termination — no unbounded loops; a loop must have a bound the verifier can prove (supported since Linux 5.3), be unrolled at compile time, or use the bpf_loop() helper with an iteration cap.
- Memory safety — every pointer access is bounds-checked; the verifier tracks register types (PTR_TO_CTX, PTR_TO_MAP_VALUE, PTR_TO_STACK, etc.) and value ranges.
- No undefined behaviour — uninitialised reads are rejected, and operations such as division by zero are given defined results.
- Bounded instruction count — historically 4096 instructions per program; since Linux 5.2 the verifier may explore up to 1 million instructions for privileged programs.
The verifier reports rich error messages on rejection. A common eBPF developer rite of passage is reading verifier traces.
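The range-tracking idea behind the memory-safety check can be illustrated with a toy model. The sketch below is plain Python, not kernel code: the four-opcode instruction set (`input`, `const`, `exit_if_ge`, `load`) and the `verify` function are hypothetical names invented for illustration, and the model handles only straight-line programs. It shows why an array access is rejected until a preceding bounds check narrows the index register's possible range.

```python
U32_MAX = 2**32 - 1


def verify(program, array_len, max_insns=4096):
    """Toy static check: return None if accepted, else an error string.

    Hypothetical instruction set (straight-line code only):
      ("input", reg)             reg holds an untrusted unsigned 32-bit value
      ("const", reg, value)      reg holds a known constant
      ("exit_if_ge", reg, bound) program exits unless reg < bound
      ("load", reg)              read array[reg]; must be provably in bounds
    """
    if len(program) > max_insns:
        return f"program has {len(program)} insns, limit is {max_insns}"
    ranges = {}  # register name -> (lowest, highest) value it may hold
    for pc, insn in enumerate(program):
        op, reg = insn[0], insn[1]
        if op == "input":
            ranges[reg] = (0, U32_MAX)
        elif op == "const":
            ranges[reg] = (insn[2], insn[2])
        elif op in ("exit_if_ge", "load"):
            if reg not in ranges:
                return f"insn {pc}: read of uninitialised {reg}"
            lo, hi = ranges[reg]
            if op == "exit_if_ge":
                bound = insn[2]
                if lo >= bound:
                    return None  # always exits here; the rest is unreachable
                ranges[reg] = (lo, min(hi, bound - 1))  # narrowed on fall-through
            elif hi >= array_len:
                return f"insn {pc}: {reg} may be {hi}, array has {array_len} slots"
        else:
            return f"insn {pc}: unknown opcode {op!r}"
    return None


unchecked = [("input", "r1"), ("load", "r1")]
checked = [("input", "r1"), ("exit_if_ge", "r1", 16), ("load", "r1")]
print(verify(unchecked, array_len=16))  # an error string: index not proven in bounds
print(verify(checked, array_len=16))    # None: accepted
```

The real verifier tracks far more state (pointer types, signed and unsigned ranges, every branch), but the accept/reject decision follows the same pattern: an access is allowed only when the tracked range proves it safe.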
CO-RE and BTF
CO-RE (Compile Once — Run Everywhere) plus BTF (BPF Type Format) solve the portability problem. Kernels have different struct layouts across versions, so a BPF program compiled against one kernel’s headers historically wouldn’t load on another. CO-RE (introduced ~Linux 5.2, matured 5.4+) gives BPF programs the ability to:
- Compile against BTF type information instead of raw struct offsets.
- Relocate field accesses at load time based on the running kernel’s BTF.
- Ship a single binary that works across kernel versions.
BTF is a compact type-description format, comparable to the type information in DWARF but slimmed down for BPF use. Modern distributions expose the running kernel’s BTF at /sys/kernel/btf/vmlinux.
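Load-time relocation can be pictured as a lookup from (struct, field) names to byte offsets. The sketch below is a toy model in plain Python, not libbpf's implementation: the two "BTF" dictionaries and every offset in them are made-up values standing in for two kernel versions, and `relocate` is a hypothetical name.

```python
# Made-up layouts for two hypothetical kernels; the offsets are illustrative only.
KERNEL_A_BTF = {"task_struct": {"pid": 1256, "comm": 1696}}
KERNEL_B_BTF = {"task_struct": {"pid": 2464, "comm": 3008}}


def relocate(accesses, target_btf):
    """Resolve each recorded (struct, field) access against the target kernel."""
    resolved = {}
    for struct, field in accesses:
        try:
            resolved[(struct, field)] = target_btf[struct][field]
        except KeyError:
            raise LookupError(f"{struct}.{field} not found in target BTF") from None
    return resolved


# The compiled program records *which fields* it reads, not where they live.
program_accesses = [("task_struct", "pid"), ("task_struct", "comm")]

print(relocate(program_accesses, KERNEL_A_BTF))
print(relocate(program_accesses, KERNEL_B_BTF))
```

The same recorded accesses resolve to different offsets on each kernel, which is what allows one compiled object to load on both.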
Maps
BPF maps are kernel data structures shared between BPF programs and userspace. Map types:
- Hash, LRU hash, percpu hash — key-value lookup.
- Array, percpu array — indexed.
- Ringbuf, perf event array — bounded streaming from kernel to userspace.
- Stack trace — stack identifiers for profiling.
- Cgroup storage — per-cgroup local data.
- Sockmap, sockhash — socket redirection for accelerated proxying.
- Devmap, cpumap — XDP forwarding tables.
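As an illustration of how a fixed-size map such as the LRU hash behaves, here is a small userspace model in plain Python. It is a sketch of the eviction idea only: the class name and methods are hypothetical, and the kernel's LRU maps use an approximate LRU policy rather than the strict ordering shown here.

```python
from collections import OrderedDict


class LruHashMap:
    """Toy model of a bounded key-value map that evicts the least recently used key."""

    def __init__(self, max_entries):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self.max_entries = max_entries
        self._data = OrderedDict()  # oldest entry first, most recent last

    def update(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        elif len(self._data) >= self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry
        self._data[key] = value

    def lookup(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # a hit refreshes the entry
        return self._data[key]


flows = LruHashMap(max_entries=2)
flows.update("10.0.0.1:443", 1)
flows.update("10.0.0.2:443", 2)
flows.lookup("10.0.0.1:443")     # touch the first flow
flows.update("10.0.0.3:443", 3)  # map is full: the untouched second flow is evicted
```

The fixed capacity is the point: a BPF program can keep inserting per-flow state without risking unbounded kernel memory growth.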
Helpers
A curated set of kernel functions callable from BPF: bpf_map_lookup_elem, bpf_ktime_get_ns, bpf_probe_read_user, bpf_perf_event_output, bpf_get_current_pid_tgid, and many others (roughly two hundred in recent kernels). Helpers are the kernel’s API to BPF; they are carefully audited because BPF programs run in kernel context.
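Helper return values often pack several fields into one 64-bit integer. For example, bpf_get_current_pid_tgid returns the thread group ID (what userspace calls the PID) in the upper 32 bits and the kernel-level PID (the thread ID) in the lower 32 bits. The sketch below decodes such a value in plain Python; `split_pid_tgid` is a hypothetical name used only for illustration.

```python
def split_pid_tgid(value):
    """Split the 64-bit value returned by bpf_get_current_pid_tgid."""
    tgid = value >> 32         # process ID as seen from userspace
    tid = value & 0xFFFFFFFF   # thread ID (the kernel's notion of "pid")
    return tgid, tid


# A thread with ID 1240 inside process 1234 would be reported as:
packed = (1234 << 32) | 1240
print(split_pid_tgid(packed))  # (1234, 1240)
```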
kfuncs
Since ~Linux 5.13, kernel functions can be exposed directly to BPF via the kfunc mechanism — a more flexible alternative to the static helper list. Subsystems own their kfunc sets.
Attach Points
| Attach point | Where it runs | Typical use |
|---|---|---|
| kprobe / kretprobe | Entry/return of any kernel function | Dynamic tracing |
| uprobe / uretprobe | Entry/return of any userspace function | App-level tracing |
| tracepoint | Static kernel tracepoints | Stable tracing (preferred over kprobe) |
| fentry / fexit | Entry/return; uses BPF trampoline | Faster than kprobe; requires BTF |
| perf event | Sampled CPU/cache events | Profiling, flame graphs |
| XDP | NIC driver, before sk_buff allocation | Line-rate packet processing |
| tc (traffic control) | sk_buff ingress/egress on a network device | Container networking, policy enforcement, eBPF service mesh |
| socket filter | Per-socket recv path | Socket-level filtering |
| cgroup | Various per-cgroup hooks (skb, sockops, dev, sysctl) | Per-pod policy |
| LSM-BPF | Linux Security Module hooks | Security policy in BPF |
| sched-ext | Pluggable scheduler hooks | Custom CPU schedulers |
| HID-BPF | Human interface device events | Mouse/keyboard remap |
XDP — eXpress Data Path
XDP (Linux 4.8, October 2016; Jesper Dangaard Brouer at Red Hat plus the broader BPF team) runs BPF programs at the earliest possible point in the network stack: directly inside the NIC driver’s receive path, before any sk_buff allocation. Return codes:
- XDP_PASS — let the packet continue up the stack.
- XDP_DROP — discard with negligible cost.
- XDP_TX — bounce out the same NIC.
- XDP_REDIRECT — send to another NIC, AF_XDP socket, or CPU.
- XDP_ABORTED — error path.
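The decision an XDP program makes can be modelled in userspace. The sketch below is plain Python rather than BPF C, so it only mirrors the logic: parse the Ethernet and IPv4 headers with explicit length checks (as the verifier would demand) and return a verdict. The numeric verdict values match the kernel's enum xdp_action; `classify`, `make_ipv4_frame`, and the blocklist are hypothetical names for illustration.

```python
import ipaddress
import struct

# Values of the kernel's enum xdp_action.
XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT = range(5)

ETH_HLEN = 14        # destination MAC (6) + source MAC (6) + EtherType (2)
ETH_P_IP = 0x0800    # EtherType for IPv4
IPV4_MIN_HLEN = 20


def classify(frame, blocked_sources):
    """Return an XDP verdict for a raw Ethernet frame (toy model)."""
    if len(frame) < ETH_HLEN:
        return XDP_PASS  # too short to parse; leave it to the stack
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP or len(frame) < ETH_HLEN + IPV4_MIN_HLEN:
        return XDP_PASS
    src = frame[ETH_HLEN + 12:ETH_HLEN + 16]  # IPv4 source address field
    return XDP_DROP if src in blocked_sources else XDP_PASS


def make_ipv4_frame(src, dst):
    """Build a minimal Ethernet + IPv4 header pair for testing (no payload)."""
    eth = bytes(12) + struct.pack("!H", ETH_P_IP)
    ip = bytearray(IPV4_MIN_HLEN)
    ip[0] = 0x45  # version 4, header length 5 words
    ip[12:16] = ipaddress.IPv4Address(src).packed
    ip[16:20] = ipaddress.IPv4Address(dst).packed
    return eth + bytes(ip)


blocked = {ipaddress.IPv4Address("192.0.2.1").packed}
print(classify(make_ipv4_frame("192.0.2.1", "198.51.100.7"), blocked))  # 1 (XDP_DROP)
print(classify(make_ipv4_frame("192.0.2.2", "198.51.100.7"), blocked))  # 2 (XDP_PASS)
```

In a real XDP program the blocklist would typically live in a BPF hash map that userspace updates while the program runs.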
XDP modes:
- Native XDP — driver-level support; fastest. Drivers: i40e, ixgbe, mlx4, mlx5, virtio_net, veth, many more.
- Offloaded XDP — runs on the NIC itself (Netronome Agilio, some Marvell SmartNICs).
- Generic XDP — fallback in netif_receive_skb; slower, used when the driver lacks native support.
Performance: a single core running native XDP can drop packets at 10 GbE line rate (~14.88 Mpps for minimum-size frames), well past the kernel stack’s ~1-3 Mpps per core, and throughput scales across cores toward line rate on 100 Gb/s NICs. This is the technology that lets fleets of commodity Linux servers act as DDoS scrubbers for multi-Tbps attacks.
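The 14.88 Mpps figure is the Ethernet line rate for minimum-size frames on a 10 Gb/s link; the same arithmetic gives about 148.8 Mpps at 100 Gb/s. A short worked calculation, assuming standard Ethernet framing (64-byte minimum frame, 8 bytes of preamble and start delimiter, 12-byte inter-frame gap); the function name is illustrative:

```python
def line_rate_pps(link_bps, frame_bytes=64):
    """Packets per second at line rate for a given frame size."""
    wire_bytes = frame_bytes + 8 + 12  # frame + preamble/SFD + inter-frame gap
    return link_bps / (wire_bytes * 8)


ten_gig = line_rate_pps(10e9)
hundred_gig = line_rate_pps(100e9)

print(f"10 GbE:  {ten_gig / 1e6:.2f} Mpps")         # 14.88 Mpps
print(f"100 GbE: {hundred_gig / 1e6:.2f} Mpps")     # 148.81 Mpps
print(f"Budget at 10 GbE: {1e9 / ten_gig:.1f} ns")  # 67.2 ns per packet
```

The per-packet budget explains why skipping sk_buff allocation matters: at 10 GbE line rate a core has roughly 67 ns, a few hundred CPU cycles, to handle each packet.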
Major Use Cases
Networking and Service Mesh
- Cilium — the dominant cloud-native networking + security + observability project. Started in 2015 by Thomas Graf, then at Cisco, with Daniel Borkmann; the team later founded Isovalent, which Cisco announced it would acquire in December 2023 (the deal closed in 2024) for an undisclosed sum widely reported as ~$600M-1B. Cilium replaces iptables and kube-proxy with eBPF-based L2-L7 networking, including identity-aware policy, transparent encryption (WireGuard or IPsec), and L7 protocol parsing (HTTP, gRPC, Kafka, DNS, mTLS). CNCF graduated 2023. Adopted by Google GKE Dataplane V2, AWS EKS (as an option), Azure AKS, Bell Canada, Bloomberg, Datadog, the New York Times, Trip.com, and roughly two-thirds of large Kubernetes deployments per the 2024 CNCF survey.
- Hubble — Cilium’s observability layer; gives flow logs, service maps, network policy verdicts via eBPF without sidecars or agent overhead.
- Calico eBPF dataplane — Calico’s optional eBPF mode (alternative to its default iptables/Linux routing dataplane), shipped 2020.
- Project Antrea — VMware (now Broadcom) CNI with eBPF dataplane option.
Load Balancing
- Facebook/Meta Katran — open-sourced 2018; L4 load balancer using XDP. Replaced their in-kernel IPVS-based system and dramatically reduced per-request CPU. Handles all incoming Facebook traffic.
- Cloudflare — uses XDP for L3/L4 DDoS mitigation. Their public engineering posts describe absorbing 1+ Tbps attacks at line rate; the XDP-based unimog load balancer replaced IPVS.
- Google Maglev — predates XDP but conceptually similar; Google now uses eBPF-heavy dataplanes in GKE.
Observability
- Pixie — auto-instrumentation for Kubernetes; acquired by New Relic in 2020. Uses eBPF (specifically uprobes on TLS libraries, HTTP/gRPC stacks, MySQL/PostgreSQL/Redis protocols) to capture request-level telemetry with zero code changes. Open-sourced 2021.
- Parca — continuous profiler from Polar Signals; collects CPU profiles via perf_events + eBPF stack unwinding, stores as pprof.
- Pyroscope — continuous profiler; acquired by Grafana Labs in 2023 and integrated as Grafana Pyroscope.
- Beyla (Grafana) — zero-code instrumentation for HTTP/gRPC services via eBPF; introduced 2023.
- Coroot — open-source full-stack observability platform built around eBPF.
- Inspektor Gadget (Microsoft / Kinvolk) — Microsoft acquired Kinvolk in 2021; Inspektor Gadget is a collection of eBPF-based debugging tools packaged for Kubernetes.
- bpftrace — high-level DTrace-like one-liner language for ad-hoc kernel tracing; by Alastair Robertson and Brendan Gregg.
- BCC (BPF Compiler Collection) — set of ~150 tools (tcptop, biolatency, profile, opensnoop, execsnoop, tcpconnect, runqlat, many more) packaged as Python wrappers around BPF programs. The original toolkit by Brenden Blanco at PLUMgrid; Brendan Gregg at Netflix wrote many of the canonical tools.
- perf — kernel-bundled profiler; supports BPF programs as event handlers since ~Linux 4.7.
Security
- Falco — CNCF graduated runtime security project; eBPF-based syscall observation with a rule engine for threat detection. Originally from Sysdig. Falco rules detect things like “shell spawned in a container”, “writes to sensitive paths”, “unexpected network connections”.
- Tetragon — runtime security and policy enforcement from Isovalent (now Cisco). Combines observation with kernel-enforced policy via LSM-BPF; can kill processes inline. CNCF sandbox.
- KubeArmor — runtime security from AccuKnox; LSM + eBPF.
- Tracee (Aqua Security) — eBPF-based threat detection with built-in signature library.
- bpfilter — proposed iptables replacement using BPF; effort has been intermittent.
Performance
- Flame graphs — Brendan Gregg popularised the visualisation; eBPF-based stack sampling makes them low-overhead enough for always-on production profiling. CPU flame graphs, off-CPU flame graphs, memory leak flame graphs.
- Off-CPU analysis — finds time spent blocked (lock contention, I/O wait, scheduler) rather than time on CPU. Hard or impossible without kernel-resident instrumentation.
- tcprtt, runqlat, biolatency — BCC tools that produce histograms of network round-trips, scheduler queue latency, block I/O latency.
- node_exporter with eBPF additions for richer per-process and per-cgroup metrics.
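Two of the data reductions mentioned above are simple enough to sketch in userspace. The first collapses sampled stacks into the semicolon-separated "folded" text format that flame graph tooling consumes; the second buckets latencies by powers of two, in the spirit of the histograms the BCC latency tools print. Both functions are plain Python with hypothetical names; in the real tools the counting happens inside BPF maps in the kernel and only the summaries reach userspace.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples (root-to-leaf frame lists) into 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


def log2_histogram(values):
    """Count non-negative integers into power-of-two buckets.

    Bucket k covers [2**k, 2**(k+1)); zero is counted in bucket 0.
    """
    buckets = Counter()
    for v in values:
        buckets[max(v, 1).bit_length() - 1] += 1
    return dict(sorted(buckets.items()))


samples = [
    ["main", "handle_request", "read"],
    ["main", "handle_request", "read"],
    ["main", "handle_request", "parse"],
]
for line in fold_stacks(samples):
    print(line)
# main;handle_request;parse 1
# main;handle_request;read 2

latencies_us = [1, 3, 4, 5, 1000]
print(log2_histogram(latencies_us))  # {0: 1, 1: 1, 2: 2, 9: 1}
```

Aggregating in the kernel is what keeps the overhead low: one counter increment per event instead of one event copied to userspace.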
Major Tools
| Tool | Purpose | Origin |
|---|---|---|
| Cilium | Cluster networking, security, observability | Isovalent / Cisco |
| Hubble | Network flow observability | Isovalent / Cisco |
| Pixie | Auto-instrumentation for K8s | New Relic (acq. 2020) |
| Parca | Continuous profiling | Polar Signals |
| Pyroscope (Grafana) | Continuous profiling | Grafana Labs (acq. 2023) |
| Falco | Runtime security | Sysdig / CNCF |
| Tetragon | Security policy enforcement | Isovalent / Cisco |
| bpftrace | DTrace-like scripting | OSS (Robertson, Gregg) |
| BCC | ~150 BPF-backed CLI tools | PLUMgrid / IO Visor |
| libbpf | Canonical C loader library | Kernel community |
| libbpf-rs | Rust bindings to libbpf | Meta and OSS |
| Aya | Pure-Rust BPF loader (no libbpf dep) | OSS (aya-rs community) |
| BumbleBee (Solo.io) | OCI image distribution for BPF programs | Solo.io |
| Inspektor Gadget | K8s eBPF debugging suite | Microsoft (acq. Kinvolk 2021) |
| Coroot | Open-source observability platform | Coroot Labs |
| Spectro Cloud Palette | K8s management with eBPF observability | Spectro Cloud |
Famous Case Studies
- Cloudflare DDoS mitigation — public engineering posts describe absorbing >1 Tbps attacks at the edge via XDP-based packet classifiers; eBPF policies drop attack traffic before any allocation.
- Meta Katran — handles all inbound Facebook/Instagram/WhatsApp traffic via XDP L4LB; open-sourced 2018; published extensive perf data.
- Datadog Agent — uses eBPF for network-performance monitoring, USM (Universal Service Monitoring), CWS (Cloud Workload Security); shipped in their main agent.
- Microsoft Azure — uses eBPF for container networking (Azure CNI Powered by Cilium) and AKS observability, and ships Inspektor Gadget.
- Netflix — Brendan Gregg’s team has documented extensive use of BCC/bpftrace for production performance analysis.
- OpenAI — public posts mention Pixie + Parca for service observability across their inference fleet.
- Walmart, Capital One, Lyft, the New York Times, Trip.com — all public Cilium reference accounts.
Trade-offs vs Alternatives
vs DPDK / kernel bypass
DPDK and other kernel-bypass dataplanes (VPP, Snabb, AF_XDP itself in zero-copy mode) trade portability for raw speed by handing entire NICs over to userspace pinned cores. eBPF runs in the kernel under the verifier’s safety guarantees and shares NICs with normal traffic. Modern eBPF/XDP closes most of the perf gap (within ~10-20% of DPDK for many workloads) while keeping kernel facilities like routing, netfilter, and easy multi-tenancy. DPDK still wins for absolute peak throughput in single-purpose appliances.
vs SystemTap
SystemTap (a mid-2000s effort, mainly Red Hat) was the previous “DTrace for Linux”. It compiles scripts into kernel modules: powerful, but slow to compile, dependent on kernel-debuginfo packages, and able to crash the kernel when a script is buggy. eBPF has largely replaced it for dynamic kernel tracing, and SystemTap is rarely chosen for new work.
vs traditional iptables + sysctl
iptables rules are evaluated linearly per packet, so a policy with thousands of rules costs O(N) per packet. Cilium/eBPF policy is compiled into BPF programs that look up verdicts in BPF maps, typically in O(1) or O(log N). At Kubernetes scale (thousands of services, tens of thousands of pods, very large rule sets) the difference is decisive: eBPF dataplanes such as Cilium reached mainstream adoption largely because kube-proxy’s iptables mode scaled poorly.
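The scaling difference can be seen in a toy comparison. The sketch below is plain Python and models neither iptables nor Cilium; it contrasts a first-match scan over a rule list with a single keyed lookup in a dictionary (standing in for a BPF hash map). The rule contents and function names are made up for illustration.

```python
def linear_match(rules, dst):
    """First-match scan; returns (verdict, number of rules compared)."""
    for compared, (rule_dst, verdict) in enumerate(rules, start=1):
        if rule_dst == dst:
            return verdict, compared
    return "DEFAULT", len(rules)


def hashed_match(table, dst):
    """One keyed lookup, independent of how many entries the table holds."""
    return table.get(dst, "DEFAULT")


# 5000 hypothetical service entries keyed by (IP, port).
rules = [((f"10.0.{i // 256}.{i % 256}", 80), f"backend-{i}") for i in range(5000)]
table = dict(rules)

last_service = rules[-1][0]
print(linear_match(rules, last_service))  # ('backend-4999', 5000)
print(hashed_match(table, last_service))  # backend-4999
```

A packet for the last service walks all 5000 rules in the linear model but costs one lookup in the keyed one, and the gap widens as services are added.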
vs ptrace / strace
strace uses ptrace, which stops the target process at every syscall entry and exit; that is fine for debugging but usually unacceptable in production (slowdowns of 10-100× on syscall-heavy workloads). eBPF tracepoint-based syscall observation runs inline in the kernel with far lower overhead and never stops the target.
Future Directions
BPF for Windows
Microsoft announced eBPF for Windows in 2021, an open-source effort to bring the BPF programming model to Windows. It builds on existing open-source components, the uBPF interpreter/JIT and the PREVAIL static verifier (from VMware Research), to provide source-level compatibility with programs written against libbpf APIs. The initial hooks target networking (WFP integration) on recent Windows and Windows Server releases.
sched-ext
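Pluggable BPF schedulers, merged in Linux 6.12 (November 2024). sched_ext adds a scheduling class whose policy is implemented by a BPF program, so users can run a custom scheduler in place of the default CFS / EEVDF policy, with the kernel falling back to the default if the BPF scheduler misbehaves. Meta’s scx_layered and scx_rusty (Rust) are public examples, as is scx_lavd (developed at Igalia); Meta has spoken about deploying BPF schedulers in production with ~2-7% throughput wins on specific workloads. Other projects: scx_simple, scx_central, scx_pair.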
LSM-BPF
Linux Security Module hooks reachable from BPF (since 5.7). Enables custom mandatory access control policies written as BPF programs — used by Tetragon, KubeArmor, and increasingly cloud workload protection vendors.
Hardware offload
NIC vendors have moved BPF execution into hardware:
- Netronome Agilio — pioneered BPF offload to the NIC’s flow processors.
- NVIDIA/Mellanox ConnectX-7, BlueField-3 DPU — eBPF programs offloaded to the SmartNIC.
- Intel IPU E2000 — supports BPF programs.
- AMD Pensando DSC — DPU with eBPF programmable pipelines.
HID-BPF
Userspace control over keyboard/mouse/HID device firmware quirks via BPF — landed ~Linux 6.3 (2023). Replaces the historically chaotic mess of per-device kernel modules.
Confidential computing and verification
The verifier itself is the subject of research into alternatives and stronger guarantees: PREVAIL (VMware Research) is an alternative verifier built on abstract interpretation, and BPF bytecode is also reused outside the kernel, for example in the Solana blockchain VM. The trend is toward verifiers with firmer formal foundations as BPF takes on more security-critical roles.
When to Use eBPF
Good fit:
- Observability with low overhead in production
- Kubernetes networking and security at scale
- DDoS mitigation at line rate
- Custom load balancers and proxies
- Custom schedulers for narrow workloads
- Continuous profiling
Bad fit:
- Heavy CPU computation (verifier instruction limits, no floats)
- Long-running batch work (no unbounded loops)
- Platforms other than recent Linux (Windows port still limited; macOS/BSD have their own paths)