eBPF & Kernel Observability

eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that lets sandboxed programs run in kernel space at well-defined attach points — network ingress, syscalls, function entry/exit, scheduler events, security hooks — without modifying kernel source or loading kernel modules. It has become the substrate for the modern Linux observability, security, and networking stack: Cilium for cluster networking, Pixie and Parca for auto-instrumented observability, Falco and Tetragon for runtime security, Hubble for network visibility, and a flood of cloud-native data-plane tools.

History

Classic BPF (1992)

The original BPF was introduced by Steven McCanne and Van Jacobson of Lawrence Berkeley Laboratory in the paper “The BSD Packet Filter: A New Architecture for User-level Packet Capture”, dated December 1992 and presented at the Winter 1993 USENIX conference. It defined a small register-based VM (32-bit, with an accumulator A, an index register X, and a small scratch memory) used by tcpdump and libpcap to push packet-matching predicates into the kernel, so packets could be filtered before the costly copy to userspace. cBPF later found its way into Linux for socket filters (SO_ATTACH_FILTER), seccomp-bpf (syscall filtering), and other niches.
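To make the accumulator-machine idea concrete, here is a minimal sketch of an interpreter for a tiny subset of classic BPF, written as a userspace Python model rather than kernel code. The tuple-based instruction encoding and the names run_cbpf, IP_ONLY, and ethernet_frame are illustrative assumptions; the four-instruction filter mirrors the shape of what tcpdump generates for the expression "ip".

```python
import struct

def run_cbpf(program, packet):
    """Interpret a tiny classic-BPF subset; returns bytes to accept (0 means drop)."""
    a = 0   # accumulator register A
    x = 0   # index register X (not needed by this filter, kept to show the register set)
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "ldh":        # A = 16-bit big-endian load at an absolute offset
            off = args[0]
            if off + 2 > len(packet):
                return 0       # an out-of-bounds load rejects the packet
            (a,) = struct.unpack_from("!H", packet, off)
        elif op == "jeq":      # conditional jump; offsets are relative to the next insn
            k, jt, jf = args
            pc += jt if a == k else jf
        elif op == "ret":      # return the number of bytes to keep
            return args[0]
        else:
            raise ValueError(f"unsupported opcode {op!r}")
    return 0

# Filter equivalent to "ip": accept only frames whose EtherType is 0x0800.
IP_ONLY = [
    ("ldh", 12),             # load EtherType
    ("jeq", 0x0800, 0, 1),   # IPv4? fall through, else skip one instruction
    ("ret", 262144),         # accept (snapshot length)
    ("ret", 0),              # drop
]

def ethernet_frame(ethertype):
    """Build a minimal 60-byte frame with zeroed addresses and payload."""
    return bytes(12) + struct.pack("!H", ethertype) + bytes(46)

print(run_cbpf(IP_ONLY, ethernet_frame(0x0800)))  # IPv4 frame
print(run_cbpf(IP_ONLY, ethernet_frame(0x0806)))  # ARP frame
```

The first call accepts the frame (non-zero return), the second drops it. The real instruction set also includes arithmetic, indexed loads through X, and scratch-memory stores, all omitted here.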

Extended BPF (2014)

Alexei Starovoitov (then at PLUMgrid, later at Facebook/Meta) led the work to generalise BPF into a 64-bit, 11-register, JIT-compiled VM with a verifier; the bpf() syscall landed in Linux 3.18 (December 2014). The key additions, several of which arrived in later releases:

64-bit registers (R0-R10, R10 read-only frame pointer) mapping efficiently to x86_64/arm64 registers.
Maps — kernel data structures shared between BPF programs and userspace.
Helper functions — a curated set of kernel functions BPF can call.
Verifier — proves termination, memory safety, and bounded execution before load.
JIT — bytecode compiled to native machine code at load time, after verification.
Attach points beyond sockets — kprobes, tracepoints, uprobes, perf events, XDP, tc, cgroup, LSM, fentry/fexit (added incrementally over subsequent kernel releases).

Daniel Borkmann (Red Hat, then Cisco, later Isovalent) co-led the work; the two remain BPF subsystem co-maintainers as of 2025.

Architecture

Toolchain

A BPF program is typically:

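Written in restricted C (no unbounded loops, no arbitrary kernel calls such as printk, a 512-byte stack limit, bounds-checked memory accesses).
Compiled by Clang/LLVM with -target bpf into an ELF object file that holds BPF bytecode in per-program sections (named after the attach type, for example xdp or a tracepoint path) plus map definitions in .maps.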
Loaded into the kernel via the bpf() syscall, usually through libbpf (the canonical C loader library) or one of its wrappers.
Verified, JITed, and attached to an attach point.

The Verifier

The verifier is BPF’s most distinctive feature. It performs static analysis on the bytecode to prove:

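Termination — no unbounded loops; a loop must have a bound the verifier can prove (supported since Linux 5.3; earlier kernels required full unrolling) or use the bpf_loop() helper with an explicit iteration cap.
Memory safety — every pointer access is bounds-checked; the verifier tracks register types (PTR_TO_CTX, PTR_TO_MAP_VALUE, PTR_TO_STACK, etc.) and value ranges.
No undefined behaviour — reads of uninitialised registers or stack slots are rejected, and division by zero is given a defined result rather than trapping.
Bounded instruction count — historically 4096 instructions per program; since Linux 5.2, privileged programs may be larger, with the verifier exploring at most 1 million instructions.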

The verifier reports rich error messages on rejection. A common eBPF developer rite of passage is reading verifier traces.
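The range tracking behind the memory-safety check can be illustrated with a deliberately tiny model. The sketch below is a userspace Python toy, not the kernel's algorithm: it follows a single index variable through a straight-line program and rejects an array load unless the index is provably in bounds. The opcode names (and, exit_if_ge, load) and the function verify are made up for illustration.

```python
U32_MAX = 2**32 - 1

def verify(program, array_len):
    """Return None if every array load is provably in bounds, else a rejection message."""
    lo, hi = 0, U32_MAX              # the index starts as an unknown 32-bit scalar
    for pc, (op, arg) in enumerate(program):
        if op == "and":              # idx &= arg: the result can never exceed the mask
            lo, hi = 0, min(hi, arg)
        elif op == "exit_if_ge":     # if idx >= arg: exit; fall-through path has idx < arg
            hi = min(hi, arg - 1)
        elif op == "load":           # value = array[idx]
            if hi >= array_len:
                return f"insn {pc}: index in [{lo}, {hi}] not provably below {array_len}"
        else:
            return f"insn {pc}: unknown opcode {op!r}"
    return None

unchecked = [("load", None)]                       # no bounds check at all
checked = [("exit_if_ge", 16), ("load", None)]     # explicit comparison first
masked = [("and", 0x0F), ("load", None)]           # masking also bounds the index

for name, prog in [("unchecked", unchecked), ("checked", checked), ("masked", masked)]:
    print(name, "->", verify(prog, array_len=16) or "accepted")
```

Only the unchecked program is rejected. This is the pattern behind a common class of verifier errors: the fix is usually to add a comparison or mask that lets the verifier narrow the range before the access.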

CO-RE and BTF

CO-RE (Compile Once — Run Everywhere) plus BTF (BPF Type Format) solve the portability problem. Kernels have different struct layouts across versions, so a BPF program compiled against one kernel’s headers historically wouldn’t load on another. CO-RE (introduced ~Linux 5.2, matured 5.4+) gives BPF programs the ability to:

Compile against BTF type information instead of raw struct offsets.
Relocate field accesses at load time based on the running kernel’s BTF.
Ship a single binary that works across kernel versions.

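BTF is a compact type-description format, typically generated from DWARF debug info and slimmed down for BPF use. Kernels built with CONFIG_DEBUG_INFO_BTF, which most modern distributions enable, expose it at /sys/kernel/btf/vmlinux for the running kernel.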
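The load-time relocation idea can be sketched as a userspace Python model. The two layout dictionaries stand in for BTF from two different kernel builds; the offsets are invented for illustration, and relocate is a hypothetical name, not a libbpf function.

```python
# Hypothetical field offsets standing in for BTF from two kernel builds.
BTF_KERNEL_A = {"task_struct": {"pid": 1192, "comm": 1704}}
BTF_KERNEL_B = {"task_struct": {"pid": 2456, "comm": 3016}}

# The "compiled" program records accesses symbolically instead of as raw offsets.
PROGRAM = [
    ("read_u32", "task_struct", "pid"),
    ("read_bytes16", "task_struct", "comm"),
]

def relocate(program, target_btf):
    """Resolve each (struct, field) access to a concrete offset for the target kernel."""
    resolved = []
    for op, struct_name, field in program:
        try:
            offset = target_btf[struct_name][field]
        except KeyError:
            raise LookupError(f"{struct_name}.{field} not found in target BTF") from None
        resolved.append((op, offset))
    return resolved

print(relocate(PROGRAM, BTF_KERNEL_A))
print(relocate(PROGRAM, BTF_KERNEL_B))
```

The same program description yields different concrete offsets per kernel, which is the essence of shipping one binary. A real loader handles many more relocation kinds (field existence, sizes, enum values) than this sketch.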

Maps

BPF maps are kernel data structures shared between BPF programs and userspace. Map types:

Hash, LRU hash, percpu hash — key-value lookup.
Array, percpu array — indexed.
Ringbuf, perf event array — bounded streaming from kernel to userspace.
Stack trace — stack identifiers for profiling.
Cgroup storage — per-cgroup local data.
Sockmap, sockhash — socket redirection for accelerated proxying.
Devmap, cpumap — XDP forwarding tables.
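As one illustration of why the per-CPU variants exist, the sketch below models a per-CPU array in plain Python: each CPU updates only its own slot, so no locking or atomic operations are needed on the hot path, and userspace sums the slots when it reads. The class PerCpuArray is a hypothetical model, not a binding to the kernel map type.

```python
class PerCpuArray:
    """Toy model of a per-CPU array map: one value slot per CPU for every index."""

    def __init__(self, max_entries, num_cpus):
        self.values = [[0] * num_cpus for _ in range(max_entries)]

    def add(self, cpu, index, delta=1):
        # "Kernel side": a program running on `cpu` touches only its own slot.
        self.values[index][cpu] += delta

    def lookup(self, index):
        # "Userspace side": a lookup returns one value per CPU.
        return list(self.values[index])

packets = PerCpuArray(max_entries=1, num_cpus=4)
for cpu in (0, 0, 1, 3, 3, 3):   # CPUs on which six packets happened to arrive
    packets.add(cpu, index=0)

per_cpu = packets.lookup(0)
print(per_cpu, sum(per_cpu))     # per-CPU counts and the aggregated total
```

The trade-off is memory (one slot per CPU per entry) and aggregation work in userspace, in exchange for contention-free updates in the kernel.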

Helpers

A curated set of kernel functions callable from BPF: bpf_map_lookup_elem, bpf_ktime_get_ns, bpf_probe_read_user, bpf_perf_event_output, bpf_get_current_pid_tgid, hundreds of others. Helpers are the kernel’s API to BPF; they’re carefully audited because BPF runs with kernel privileges.
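Helper return values often pack several fields into one 64-bit integer. For example, bpf_get_current_pid_tgid returns the thread group ID in the upper 32 bits and the thread ID in the lower 32 bits. A small userspace sketch of the unpacking (split_pid_tgid is a hypothetical name, and the sample value is made up):

```python
def split_pid_tgid(value):
    """Split the u64 returned by bpf_get_current_pid_tgid into (tgid, tid)."""
    tgid = value >> 32           # thread group ID: the PID as userspace tools show it
    tid = value & 0xFFFFFFFF     # kernel-level pid: the individual thread ID
    return tgid, tid

sample = (4242 << 32) | 4250     # a thread 4250 inside process 4242
print(split_pid_tgid(sample))    # (4242, 4250)
```

Confusing the two halves is a frequent source of bugs when filtering by process, since the kernel's "pid" is what userspace calls a thread ID.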

kfuncs

Since ~Linux 5.13, kernel functions can be exposed directly to BPF via the kfunc mechanism, a more flexible alternative to the static helper list. Subsystems own their kfunc sets, and kfuncs do not carry the stable-interface expectation that helpers do.

Attach Points

Attach point	Where it runs	Typical use
kprobe / kretprobe	Entry/return of almost any kernel function (not inlined or blacklisted)	Dynamic tracing
uprobe / uretprobe	Entry/return of any userspace function	App-level tracing
tracepoint	Static kernel tracepoints	Stable tracing (preferred over kprobe)
fentry / fexit	Entry/return; uses BPF trampoline	Faster than kprobe; requires BTF
perf event	Sampled CPU/cache events	Profiling, flame graphs
XDP	NIC driver, before sk_buff allocation	Line-rate packet processing
tc (traffic control)	sk_buff ingress/egress on a network device	Container networking, policy, eBPF service mesh
socket filter	Per-socket recv path	Socket-level filtering
cgroup	Various per-cgroup hooks (skb, sockops, dev, sysctl)	Per-pod policy
LSM-BPF	Linux Security Module hooks	Security policy in BPF
sched-ext	Pluggable scheduler hooks	Custom CPU schedulers
HID-BPF	Human interface device events	Mouse/keyboard remap

XDP — eXpress Data Path

XDP (Linux 4.8, October 2016; Jesper Dangaard Brouer at Red Hat plus the broader BPF team) runs BPF programs at the earliest possible point in the network stack: directly inside the NIC driver’s receive path, before any sk_buff allocation. Each program returns one of these codes:

XDP_PASS — let the packet continue up the stack.
XDP_DROP — discard with negligible cost.
XDP_TX — bounce out the same NIC.
XDP_REDIRECT — send to another NIC, AF_XDP socket, or CPU.
XDP_ABORTED — error path.
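The verdict logic of a simple XDP filter can be sketched as a userspace Python model. The numeric action values match the kernel's enum xdp_action; everything else (the function xdp_prog, the blocked port, the frame builder) is a hypothetical illustration. The explicit length checks before every read mirror the data/data_end bounds checks the verifier demands in a real program.

```python
import struct

XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT = range(5)

ETH_HLEN = 14
ETH_P_IP = 0x0800
IPPROTO_UDP = 17
BLOCKED_UDP_PORT = 9999          # hypothetical policy: drop UDP to this port

def xdp_prog(frame):
    """Return an XDP verdict for a raw Ethernet frame."""
    if len(frame) < ETH_HLEN:
        return XDP_PASS                          # too short to parse; let the stack decide
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS                          # not IPv4: not our business
    if len(frame) < ETH_HLEN + 20:
        return XDP_DROP                          # truncated IPv4 header
    ihl = (frame[ETH_HLEN] & 0x0F) * 4           # IPv4 header length in bytes
    if ihl < 20:
        return XDP_DROP                          # malformed header length
    if frame[ETH_HLEN + 9] != IPPROTO_UDP:
        return XDP_PASS
    if len(frame) < ETH_HLEN + ihl + 8:
        return XDP_DROP                          # truncated UDP header
    (dport,) = struct.unpack_from("!H", frame, ETH_HLEN + ihl + 2)
    return XDP_DROP if dport == BLOCKED_UDP_PORT else XDP_PASS

def udp_frame(dport):
    """Build a minimal Ethernet + IPv4 + UDP frame with zeroed addresses."""
    eth = bytes(12) + struct.pack("!H", ETH_P_IP)
    ip = bytes([0x45, 0]) + bytes(7) + bytes([IPPROTO_UDP]) + bytes(10)
    udp = struct.pack("!HHHH", 12345, dport, 8, 0)
    return eth + ip + udp

print(xdp_prog(udp_frame(9999)))   # blocked port
print(xdp_prog(udp_frame(53)))     # any other port
```

The first frame yields XDP_DROP and the second XDP_PASS. A production filter would normally look the port up in a BPF map rather than hard-coding it, so policy can change without reloading the program.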

XDP modes:

Native XDP — driver-level support; fastest. Drivers: i40e, ixgbe, mlx4, mlx5, virtio_net, veth, many more.
Offloaded XDP — runs on the NIC itself (Netronome Agilio, some Marvell SmartNICs).
Generic XDP — fallback in netif_receive_skb; slower, used when driver lacks support.

Performance: native XDP can drop 64-byte packets at 10 GbE line rate (~14.88 Mpps) on a single core, well past the kernel stack’s ~1-3 Mpps per core, and scales across cores toward 100 Gb/s NICs. This is the technology that lets fleets of commodity Linux servers scrub multi-Tbps DDoS attacks at the edge.
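The 14.88 Mpps figure is the theoretical packet rate of 10 Gb/s Ethernet with minimum-size frames, and it follows from simple arithmetic: each 64-byte frame occupies 84 bytes on the wire once the 8-byte preamble and 12-byte inter-frame gap are counted. A short sketch (line_rate_pps is a hypothetical helper name):

```python
def line_rate_pps(link_bps, frame_bytes=64):
    """Maximum packets per second for a given Ethernet link speed and frame size."""
    wire_overhead = 8 + 12                   # preamble + SFD, inter-frame gap (bytes)
    wire_bits = (frame_bytes + wire_overhead) * 8
    return link_bps / wire_bits

for gbps in (10, 25, 100):
    mpps = line_rate_pps(gbps * 1e9) / 1e6
    print(f"{gbps} GbE: {mpps:.2f} Mpps")
# 10 GbE: 14.88 Mpps
# 25 GbE: 37.20 Mpps
# 100 GbE: 148.81 Mpps
```

At 100 Gb/s the minimum-size-frame rate is roughly ten times higher, which is why sustaining it requires spreading the work across many cores and receive queues.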

Major Use Cases

Networking and Service Mesh

Cilium — the dominant cloud-native networking + security + observability project. Started around 2015 by Thomas Graf with collaborators including Daniel Borkmann, and commercialised by Isovalent, which Cisco agreed to acquire in December 2023 for an undisclosed sum (press estimates ranged from roughly $600M to $1B). Cilium replaces iptables and kube-proxy with eBPF-based L2-L7 networking, including identity-aware policy, transparent encryption (WireGuard or IPsec), mutual authentication, and L7 protocol parsing (HTTP, gRPC, Kafka, DNS). CNCF graduated 2023. Adopted by Google GKE Dataplane V2, AWS EKS (as an option), Azure AKS, Bell Canada, Bloomberg, Datadog, the New York Times, Trip.com, and roughly two-thirds of large Kubernetes deployments per the 2024 CNCF survey.
Hubble — Cilium’s observability layer; gives flow logs, service maps, network policy verdicts via eBPF without sidecars or agent overhead.
Calico eBPF dataplane — Calico’s optional eBPF mode (alternative to its default iptables/Linux routing dataplane), shipped 2020.
Project Antrea — VMware (now Broadcom) CNI; its dataplane is built primarily on Open vSwitch rather than eBPF.

Load Balancing

Facebook/Meta Katran — open-sourced 2018; L4 load balancer using XDP. Replaced their in-kernel IPVS-based system and dramatically reduced per-request CPU. Handles all incoming Facebook traffic.
Cloudflare — uses XDP for L3/L4 DDoS mitigation. Their public engineering posts describe absorbing 1+ Tbps attacks at line rate, and Unimog, their XDP-based L4 load balancer, spreads connections across the servers in each data center.
Google Maglev — predates XDP but conceptually similar; Google now uses eBPF-heavy dataplanes in GKE.

Observability

Pixie — auto-instrumentation for Kubernetes; acquired by New Relic in 2020. Uses eBPF (specifically uprobes on TLS libraries, HTTP/gRPC stacks, MySQL/PostgreSQL/Redis protocols) to capture request-level telemetry with zero code changes. Open-sourced 2021.
Parca — continuous profiler from Polar Signals; collects CPU profiles via perf_events + eBPF stack unwinding, stores as pprof.
Pyroscope — continuous profiler; acquired by Grafana Labs in 2023 and integrated as Grafana Pyroscope.
Beyla (Grafana) — zero-code instrumentation for HTTP/gRPC services via eBPF; introduced 2023.
Coroot — open-source full-stack observability platform built around eBPF.
Inspektor Gadget (Microsoft / Kinvolk) — Microsoft acquired Kinvolk in 2021; Inspektor Gadget is a collection of eBPF-based debugging tools packaged for Kubernetes.
bpftrace — high-level DTrace-like one-liner language for ad-hoc kernel tracing; by Alastair Robertson and Brendan Gregg.
BCC (BPF Compiler Collection) — set of ~150 tools (tcptop, biolatency, profile, opensnoop, execsnoop, tcpconnect, runqlat, many more) packaged as Python wrappers around BPF programs. The original toolkit by Brenden Blanco at PLUMgrid; Brendan Gregg at Netflix wrote many of the canonical tools.
perf — kernel-bundled profiler; supports BPF programs as event handlers since ~Linux 4.7.

Security

Falco — CNCF graduated runtime security project; eBPF-based syscall observation with a rule engine for threat detection. Originally from Sysdig. Falco rules detect things like “shell spawned in a container”, “writes to sensitive paths”, “unexpected network connections”.
Tetragon — runtime security and policy enforcement from Isovalent (now Cisco), developed under the CNCF Cilium project. Combines observation with in-kernel enforcement through kprobe- and LSM-based hooks; can kill offending processes inline.
KubeArmor — runtime security from AccuKnox; LSM + eBPF.
Tracee (Aqua Security) — eBPF-based threat detection with built-in signature library.
bpfilter — proposed iptables replacement using BPF; effort has been intermittent.

Performance

Flame graphs — Brendan Gregg popularised the visualisation; eBPF-based stack sampling makes them low-overhead enough for always-on production profiling. CPU flame graphs, off-CPU flame graphs, memory leak flame graphs.
Off-CPU analysis — finds time spent blocked (lock contention, I/O wait, scheduler) rather than time on CPU. Hard or impossible without kernel-resident instrumentation.
tcprtt, runqlat, biolatency — BCC tools that produce histograms of network round-trips, scheduler queue latency, block I/O latency.
Prometheus exporters with eBPF additions (for example Cloudflare’s ebpf_exporter, run alongside node_exporter) — richer per-process and per-cgroup metrics.
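The latency tools above summarise samples as power-of-two histograms, which keeps the in-kernel state tiny: one counter per bucket, regardless of how many events occur. A minimal userspace sketch of the bucketing follows; the function names and sample latencies are made up, and the bucket ranges follow the "0 -> 1, 2 -> 3, 4 -> 7" layout those tools print.

```python
from collections import Counter

def log2_bucket(value):
    """Bucket index: bucket b covers [2**b, 2**(b+1) - 1]; 0 and 1 share bucket 0."""
    return max(value, 1).bit_length() - 1

def histogram(samples):
    """Return (low, high, count) rows from bucket 0 up to the highest non-empty bucket."""
    counts = Counter(log2_bucket(s) for s in samples)
    if not counts:
        return []
    rows = []
    for b in range(max(counts) + 1):
        low = 0 if b == 0 else 1 << b
        high = (1 << (b + 1)) - 1
        rows.append((low, high, counts.get(b, 0)))
    return rows

latencies_us = [3, 5, 6, 9, 12, 130, 140, 150, 4000]   # made-up I/O latencies
for low, high, count in histogram(latencies_us):
    print(f"{low:>6} -> {high:<6} : {count:<3} {'*' * count}")
```

In the real tools the bucket counters live in a BPF map that the kernel-side program increments, and userspace only reads and prints them. The outlier at 4000 µs stays visible in its own bucket instead of being averaged away.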

Major Tools

Tool	Purpose	Origin
Cilium	Cluster networking, security, observability	Isovalent / Cisco
Hubble	Network flow observability	Isovalent / Cisco
Pixie	Auto-instrumentation for K8s	New Relic (acq. 2020)
Parca	Continuous profiling	Polar Signals
Pyroscope (Grafana)	Continuous profiling	Grafana Labs (acq. 2023)
Falco	Runtime security	Sysdig / CNCF
Tetragon	Security policy enforcement	Isovalent / Cisco
bpftrace	DTrace-like scripting	OSS (Robertson, Gregg)
BCC	~150 BPF-backed CLI tools	PLUMgrid / IO Visor
libbpf	Canonical C loader library	Kernel community
libbpf-rs	Rust bindings to libbpf	Meta and OSS
Aya	Pure-Rust BPF loader (no libbpf dep)	OSS (Tide.org, Confluent)
BumbleBee (Solo.io)	OCI image distribution for BPF programs	Solo.io
Inspektor Gadget	K8s eBPF debugging suite	Microsoft (acq. Kinvolk 2021)
Coroot	Open-source observability platform	Coroot Labs
Spectro Cloud Palette	K8s management with eBPF observability	Spectro Cloud

Famous Case Studies

Cloudflare DDoS mitigation — public engineering posts describe absorbing >1 Tbps attacks at the edge via XDP-based packet classifiers; eBPF policies drop attack traffic before any allocation.
Meta Katran — handles all inbound Facebook/Instagram/WhatsApp traffic via XDP L4LB; open-sourced 2018; published extensive perf data.
Datadog Agent — uses eBPF for network-performance monitoring, USM (Universal Service Monitoring), CWS (Cloud Workload Security); shipped in their main agent.
Microsoft Azure — uses eBPF for AKS networking (Azure CNI Powered by Cilium) and observability, and ships Inspektor Gadget.
Netflix — Brendan Gregg’s team has documented extensive use of BCC/bpftrace for production performance analysis.
OpenAI — public posts mention Pixie + Parca for service observability across their inference fleet.
Walmart, Capital One, Lyft, the New York Times, Trip.com — all public Cilium reference accounts.

Trade-offs vs Alternatives

vs DPDK / kernel bypass

DPDK and other kernel-bypass dataplanes (VPP, Snabb, AF_XDP itself in zero-copy mode) trade portability for raw speed by handing entire NICs over to userspace pinned cores. eBPF runs in the kernel under the verifier’s safety guarantees and shares NICs with normal traffic. Modern eBPF/XDP closes most of the perf gap (within ~10-20% of DPDK for many workloads) while keeping kernel facilities like routing, netfilter, and easy multi-tenancy. DPDK still wins for absolute peak throughput in single-purpose appliances.

vs SystemTap

SystemTap (mid-2000s effort, mainly Red Hat) was the previous “DTrace for Linux”. It compiles scripts to kernel modules: powerful, but slow to compile, dependent on kernel-debuginfo, and able to crash the kernel when a script misbehaves. eBPF has largely replaced it for dynamic kernel tracing; SystemTap is rarely chosen for new work.

vs traditional iptables + sysctl

iptables rules are evaluated linearly per packet, so a policy with thousands of rules costs O(N) per packet. Cilium/eBPF policy is a compiled program that resolves verdicts through BPF map lookups, typically O(1) or O(log N) in the number of rules. At Kubernetes scale (thousands of services, tens of thousands of pods, and correspondingly large rule sets) the difference is decisive: eBPF dataplanes such as Cilium moved into the mainstream largely because kube-proxy’s iptables mode scaled poorly.
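The scaling difference can be seen in a small userspace model that counts how many rules each approach has to examine. The rule format, the addresses, and the function names are illustrative assumptions, not how either system stores policy.

```python
def match_linear(rules, packet):
    """iptables-style chain: walk the rules in order until the first match."""
    for checked, (key, verdict) in enumerate(rules, start=1):
        if key == packet:
            return verdict, checked
    return "DROP", len(rules)            # default policy after scanning everything

def match_hash(policy, packet):
    """Hash-map-style policy: one lookup regardless of the number of rules."""
    return policy.get(packet, "DROP"), 1

N = 10_000
rules = [((f"10.0.{i // 256}.{i % 256}", 443), "ACCEPT") for i in range(N)]
policy = dict(rules)                     # same rules, keyed by (address, port)

last = rules[-1][0]                      # worst case for the linear scan
print(match_linear(rules, last))         # ('ACCEPT', 10000)
print(match_hash(policy, last))          # ('ACCEPT', 1)
```

The second element of each result is the number of rule comparisons. The linear cost is paid per packet, so it grows with both the rule count and the traffic rate, while the hash lookup stays constant.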

vs ptrace / strace

strace uses ptrace, which stops the target process for every syscall — fine for debugging, unusable in production (10-100× slowdown). eBPF tracepoint-based syscall observation has near-zero overhead and never blocks the target.

Future Directions

BPF for Windows

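Microsoft announced eBPF-for-Windows in 2021: an open-source effort to bring the BPF programming model to Windows. It reuses existing open-source components, namely the PREVAIL verifier (originating at VMware Research) for static verification and uBPF for JIT compilation and interpretation, and aims for source-level compatibility with libbpf programs. It is distributed as a separately installed component for recent Windows and Windows Server releases, with hooks focused on networking (WFP integration).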

sched-ext

Pluggable BPF schedulers landed in Linux 6.12 (November 2024). sched-ext lets users replace the kernel’s CFS / EEVDF scheduling policy with a BPF program. Meta’s scx_layered and Igalia’s scx_lavd are public examples; Meta has spoken about deploying BPF schedulers in production with ~2-7% throughput wins on specific workloads. Other projects: scx_rusty (Rust), scx_simple, scx_central, scx_pair.

LSM-BPF

Linux Security Module hooks reachable from BPF (since 5.7). Enables custom mandatory access control policies written as BPF programs — used by Tetragon, KubeArmor, and increasingly cloud workload protection vendors.

Hardware offload

NIC and DPU vendors support running BPF on or near the NIC to varying degrees:

Netronome Agilio — pioneered BPF offload to the NIC’s flow processors.
NVIDIA/Mellanox ConnectX-7, BlueField-3 DPU — BlueField DPUs run Linux on embedded Arm cores, which can host eBPF programs next to the NIC datapath.
Intel IPU E2000 — supports BPF programs.
AMD Pensando DSC — DPU with eBPF programmable pipelines.

HID-BPF

BPF programs that fix up quirks in keyboards, mice, and other HID devices (report descriptors and events) without a custom kernel driver; landed in Linux 6.3 (2023). This reduces the need for one-off per-device quirk drivers in the kernel.

Confidential computing and verification

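Alternative verifiers and runtimes are emerging alongside the in-kernel one: PREVAIL (an abstract-interpretation-based BPF verifier originating at VMware Research, used by eBPF for Windows) and the BPF-derived userspace VM that the Solana blockchain uses to run on-chain programs. The trend is toward more rigorously analysed, and eventually formally proven, verifiers as BPF takes on more security-critical roles.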

When to Use eBPF

Good fit:

Observability with low overhead in production
Kubernetes networking and security at scale
DDoS mitigation at line rate
Custom load balancers and proxies
Custom schedulers for narrow workloads
Continuous profiling

Bad fit:

Heavy CPU computation (verifier instruction limits, no floats)
Long-running batch work (no unbounded loops)
Platforms other than recent Linux (Windows port still limited; macOS/BSD have their own paths)

Adjacent

linux kernel architecture
networking stack tcp quic
container runtimes kubernetes
observability and tracing
sre and production engineering
smartnics and dpus
