CPU Cache & Performance — Compute Reference

1. At a glance

Modern CPU performance is dominated by three forces, not by raw clock rate: the memory hierarchy (most stalls are cache and TLB misses, not arithmetic), branch prediction (a single mispredict costs more than dozens of correctly predicted branches), and parallelism at three levels — instruction-level parallelism (ILP) inside one core, data-level parallelism via SIMD, and thread-level parallelism across cores and sockets. A modern Intel Sapphire Rapids, AMD Zen 4, Apple M3, or ARM Neoverse V2 core can retire 6–8 µops per cycle, fuse loads with arithmetic, predict 99%+ of branches, and prefetch streams it has never seen — but only if the code lets it.

The canonical latency ladder every performance engineer keeps in their head (rough order-of-magnitude on a 3–4 GHz core):

LevelLatencyCapacity
Register~0.3 ns (~1 cycle)~16-32 GPRs + ~32 vector regs
L1 cache (hit)~1-2 ns (~4-5 cycles)32-64 KB per core
L2 cache (hit)~3-5 ns (~12-15 cycles)256 KB-1.25 MB per core
L3 / LLC (hit)~10-20 ns (~40-80 cycles)1-2 MB per core, shared
DRAM (LLC miss)~70-100 ns (~250-400 cycles)GBs
Local NVMe SSD~10-100 µsTBs
Network round-trip (datacenter)~50-500 µsunbounded
HDD seek~5-10 msTBs
Cross-region cloud~50-150 msPBs

LLC miss costs roughly 100× an L1 hit. This is the single most important number in performance engineering — every cache-aware design exists to keep working sets inside L1 or L2, and every cache-oblivious algorithm exists to make the asymptotic miss count optimal at every level simultaneously (Frigo et al. 1999).

Throughput is a separate story from latency: a modern core can issue ~4 loads/cycle to L1, and DRAM bandwidth on a server scales into the hundreds of GB/s, but bandwidth only matters if the access pattern lets the prefetcher keep ahead of demand. The two failure modes — latency-bound (pointer chasing) and bandwidth-bound (streaming) — require different fixes.

A useful mental model: a 3 GHz core can perform ~9 billion fp64 FLOPs/sec with AVX-512 FMA, but at 100 GB/s DRAM bandwidth it can only stream ~12.5 GB of fp64 data per second — a 720× gap. Anything bottlenecked on memory bandwidth is bandwidth-bound and FLOPS are irrelevant; anything bottlenecked on individual access latency (single-threaded pointer chase) sees neither FLOPS nor bandwidth, only round-trip time. Tools like the Roofline model (Williams, Waterman & Patterson 2009) plot operational intensity (FLOP/byte) against peak FLOPS to show which regime a kernel lives in and where the ceiling is.

Performance work proceeds in three phases that should never be skipped or reordered: measure (profile production-shaped workloads), understand (find the dominant stall — memory, branch, IO, lock), fix (apply the smallest change that moves the bottleneck). Optimizing without measurement is folklore engineering; the same kernel that benefits from -O3 -march=native on Sapphire Rapids may regress on Graviton4 if it hardcodes assumptions about AVX-512 mask registers.

2. Memory hierarchy

The hierarchy is engineered so each level is roughly 10× larger and 4–10× slower than the one above it. A modern x86 server core typically has:

  • L1 instruction cache (L1i) — 32 KB, 8-way set associative, ~4-cycle access. Holds decoded or pre-decoded instructions. Front-end stalls (L1i miss, decoder uop-cache miss) are easy to miss in profiles.
  • L1 data cache (L1d) — 32-48 KB, 8-12 way, 4-5 cycle. Split from L1i (Harvard architecture inside the core, even on von-Neumann ISAs).
  • L2 cache — 512 KB to 1.25 MB per core (Intel Sapphire Rapids 2 MB, AMD Zen 4 1 MB, Apple M3 Performance core 16 MB shared between two cores). Unified (instructions + data), ~12-cycle access, mostly inclusive of L1 on Intel, victim-cache-style on AMD.
  • L3 / LLC — 1.5-3 MB per core, shared across a cluster (Intel “non-inclusive” since Skylake-X; AMD CCX-shared on Zen). Sapphire Rapids: 1.875 MB/core × N cores. AMD EPYC Genoa: 32 MB per CCD shared by 8 Zen 4 cores.
  • DRAM — DDR4 / DDR5 main memory. Latency 70-100 ns local, bandwidth 50 GB/s per DDR5-6400 channel.
  • Persistent memory / CXL — Optane DCPMM (discontinued 2022), CXL.mem-attached DDR5 — sub-µs latency, byte-addressable.
  • NVMe SSD — 10-100 µs latency, ~7-14 GB/s per PCIe 5.0 ×4 drive.
  • HDD / object storage — milliseconds.

Cache line size = 64 bytes on virtually all x86 (since Pentium 4), all ARM Cortex-A and Neoverse, all RISC-V mainstream cores. (Some Apple M-series performance cores use 128-byte lines, which subtly changes false-sharing padding.) Every load brings 64 B; every store eventually writes back 64 B; alignment to 64 B unlocks aligned vector loads and avoids straddling two lines.

Inclusivity matters for invalidation: an inclusive L3 (older Intel) tracks every cached line in lower levels so cross-core invalidations can be filtered at L3. A non-inclusive L3 (Skylake-X+, Zen) uses a separate snoop filter directory; this is invisible to software except in profilers (LLC miss may still hit a neighbor’s L2 — Intel’s “LLC-load-miss” event undercounts the true miss-to-DRAM rate).

3. Cache organization

Caches are organized as N-way set-associative arrays: an address’s middle bits select a set, and any of the N “ways” in that set can hold the line. Direct-mapped (1-way) is fast and simple but suffers conflict misses; fully-associative is flexible but expensive. Real caches sit in between: L1 is typically 8-way, L2 is 8-16 way, L3 is 16-32 way.

The address is split into:

[ tag | set index | block offset ]

For a 32 KB, 8-way, 64 B-line L1: 64 sets × 8 ways × 64 B = 32 KB; the bottom 6 bits are offset, the next 6 bits are set index, the rest is tag.

Replacement policies:

  • LRU (Least Recently Used) — optimal for many workloads but expensive to track exactly beyond 4 ways. Caches use pseudo-LRU (tree-PLRU or bit-PLRU) approximations.
  • Random — surprisingly competitive; used in some ARM cores.
  • RRIP (Re-Reference Interval Prediction, Jaleel et al. 2010) — predicts which lines are likely scan-once vs reused; standard on modern Intel L3.
  • DRRIP / SHiP — adaptive variants used in production.

Write policies:

  • Write-back (universal in modern L1/L2/L3) — store updates the line in cache; written to memory only on eviction. Reduces memory bandwidth ~10× for typical write workloads.
  • Write-through — every store goes to next level. Simpler coherence, rarely used past L1 on embedded cores.
  • Write-allocate — on a write miss, fetch the line, then write. Standard for write-back caches; combined with store buffer + non-temporal stores (MOVNTPS, MOVNTI) for streaming writes that bypass the cache.
  • No-write-allocate — on miss, write directly to next level without allocating. Used for streaming workloads to avoid pollution.

Non-temporal stores (NT stores) bypass the cache for one-pass writes (e.g. memset of a buffer larger than L3). Pair with _mm_sfence() to order them against subsequent loads.

4. Coherence protocols

Multi-core caches must look like a single coherent memory to software. The dominant protocol family is MESI and its extensions:

  • MESI (Papamarcos & Patel 1984) — every line is in one of four states per core: Modified (dirty, exclusive owner), Exclusive (clean, exclusive owner — no copy elsewhere), Shared (clean, possibly other copies), Invalid (not present / stale).
  • MOESI (AMD; also used in some IBM POWER) — adds Owned state: dirty but shareable with other caches. Avoids a write-back when one core has the dirty copy and another reads it — the dirty cache forwards the line directly.
  • MESIF (Intel since Nehalem) — adds Forward state: at most one Shared copy is designated F and responds to read requests. Reduces broadcast traffic.

Coherence transactions are physically realized by either:

  • Snooping — every cache observes every memory transaction on a shared bus; scales poorly past ~8 cores.
  • Directory-based — a directory (typically co-located with L3 slice) tracks which cores hold each line; messages are point-to-point. All large multi-socket systems use directories.

Coherence is the dominant scaling bottleneck on multi-core: a cache line that bounces between cores (“ping-pong”) incurs hundreds of cycles per transition. Cross-socket coherence (UPI on Intel, Infinity Fabric on AMD, CCIX/CXL.cache cross-vendor) adds another order of magnitude.

5. False sharing

The pathological case: two threads write to different variables that happen to live in the same cache line. Each write invalidates the other core’s copy; the line ping-pongs between L1s at coherence speeds. Throughput collapses by 10-100×.

Classic example — per-thread counters in an array:

struct Counters {
    uint64_t per_thread[16];  // 8 B each, all share two cache lines
};

Fix by padding to a full cache line:

struct alignas(64) PaddedCounter {
    uint64_t value;
    char pad[56];
};
PaddedCounter counters[16];

In C++17+: alignas(std::hardware_destructive_interference_size). Linux kernel: ____cacheline_aligned. Java: @Contended (JEP 142, JDK 8+). Rust: #[repr(align(64))] or crossbeam::utils::CachePadded.

False sharing is invisible in source code — detection requires either a profiler (perf c2c on Linux, VTune’s Memory Access analysis) or careful layout review. perf c2c record followed by perf c2c report shows HITM (hit-modified) events on contended lines.

6. TLB (Translation Lookaside Buffer)

Every load/store goes through virtual→physical address translation. The TLB caches recent translations:

  • L1 dTLB — 64-128 entries on modern x86 (Sapphire Rapids dTLB: 96 entries 4 KB + 32 entries 2 MB). At 4 KB pages, 96 entries covers 384 KB — smaller than L2. The TLB, not the cache, often dictates the maximum working-set size you can scan without stalls.
  • L1 iTLB — separate for instructions, ~128 entries.
  • L2 TLB (STLB) — unified, 1.5-2K entries, handles both. Coverage at 4 KB: ~6-8 MB.

On a TLB miss, the page-walk unit (PMH) traverses the page tables — on x86-64 4-level paging, that’s 4 memory accesses (PML4 → PDPT → PD → PT); 5 levels on Ice Lake+ for >48-bit virtual addresses. Each access can itself miss in cache.

Huge pages are the primary mitigation:

  • 2 MB pages (x86 PSE, ARM 2 MB blocks): one TLB entry covers 512× more memory. STLB at 32 entries 2 MB = 64 MB coverage.
  • 1 GB pages (x86 PSE-1G, ARM): even larger; useful for DBMS buffer pools and JVM heaps.
  • Linux: Transparent Huge Pages (THP) auto-promotes; explicit hugetlbfs for guaranteed allocation. Tune via /sys/kernel/mm/transparent_hugepage/.
  • Windows: LargePage allocation with SeLockMemoryPrivilege.

TLB misses are visible as dTLB-load-misses and iTLB-load-misses in perf stat. Workloads that touch many random pages (hash tables, large heaps) often see TLB miss rates 10× the L1 cache miss rate.

7. NUMA (Non-Uniform Memory Access)

On multi-socket systems and modern monolithic chiplet designs (AMD EPYC, Intel Sapphire Rapids in SNC mode), each socket / die has its own DRAM controllers; cross-socket accesses traverse the interconnect (UPI / Infinity Fabric / CXL). Remote DRAM is 1.5-2× the latency of local DRAM and bandwidth-limited by the interconnect.

Allocation policies (Linux, via numactl or mbind):

  • --localalloc (default with --cpunodebind) — pages allocated on the CPU’s node. Best for thread-local data.
  • --interleave=all — round-robin across nodes. Best for shared read-mostly data; loses locality but evens bandwidth.
  • --membind=<node> — pin allocation to a specific node.
  • --preferred=<node> — try the preferred node, fall back if full.

Programmatic control: libnuma (numa_alloc_onnode, numa_set_membind); mbind(2) system call; set_mempolicy(2) for thread defaults.

NUMA-aware allocators (jemalloc, mimalloc, tcmalloc) maintain per-CPU arenas to keep allocations node-local. JVM: -XX:+UseNUMA for ParallelGC and G1.

Diagnostic: numastat -p <pid> shows per-node memory; perf stat -e node-loads,node-load-misses,node-stores,node-store-misses shows cross-node access counts.

8. Memory prefetching

Modern CPUs are aggressive prefetchers — most well-behaved sequential code never stalls on memory because hardware predicts the next line.

Hardware prefetchers (multiple per core):

  • L1 stream prefetcher — detects monotonic strides (forward or reverse), fetches ahead 1-2 lines.
  • L1 IP-based / instruction prefetcher — tracks per-instruction strides; handles structured but non-unit-stride loops.
  • L2 stream / spatial prefetcher — fetches adjacent cache line on access (“companion line”), and detects longer-range strides.
  • Next-page prefetcher — pre-walks the TLB for the next 4 KB page.

Prefetchers fail on: random access (hash tables, graph traversal), pointer chasing (linked structures), short loops (no time to detect a stride), and stride distances larger than ~512 B.

Software prefetch for cases hardware can’t predict:

for (int i = 0; i < n; ++i) {
    __builtin_prefetch(&data[indices[i + 16]]);  // GCC/Clang
    process(data[indices[i]]);
}

_mm_prefetch with hints _MM_HINT_T0 (all levels), _T1 (L2+), _T2 (L3+), _NTA (non-temporal — bypass cache hierarchy).

Optimal prefetch distance ≈ memory latency × instruction throughput per iteration. Too close → arrives after needed; too far → pollutes cache and gets evicted. Tune empirically; 8-32 iterations ahead is a common starting point.

9. Branch prediction

Modern pipelines are 14-20 stages deep; on a mispredicted branch the entire in-flight pipeline must be flushed, costing 15-25 cycles (≈ 5-8 ns at 3 GHz). At 99% accuracy and one branch every 5 instructions, mispredicts still cost ~5% of cycles; at 95% accuracy, ~25%.

Branch types and predictors:

  • Direct conditional branches (je, jne) — target known at decode; only direction predicted. Two-level adaptive predictor (Yeh & Patt 1991): per-branch history table + global pattern history. Modern variants use TAGE (Seznec 2006) with multiple geometric-history tables — published academic accuracy >99% on SPEC.
  • Indirect branches (call *rax, vtable dispatch, switch jump tables, computed gotos) — target itself must be predicted via the Branch Target Buffer (BTB). Megamorphic call sites (>4 unique targets) saturate the BTB and mispredict.
  • Return instructions — the Return Stack Buffer (RSB) is a small hardware stack (16-32 entries) tracking call returns. Deep recursion or setjmp can corrupt it.
  • Loop predictors — recognize repeating loop counts.

Modern Intel and AMD ship multiple parallel predictors (TAGE-L + perceptron + loop) and choose the most confident; Apple M-series is similar but undocumented.

Diagnostic: perf stat -e branches,branch-misses. Mispredict rate >2% is worth investigating; >5% is a hot bug.

10. Mitigations for unpredictable branches

When data is genuinely random (e.g. partitioning around a random pivot), prediction fails and the only fix is to eliminate the branch.

Conditional moves — compilers emit cmov for x = cond ? a : b patterns when the cost model favors it; both sides execute unconditionally and the right one is selected. No mispredict possible, but both expressions must be cheap and side-effect-free.

int max_branchless(int a, int b) {
    return a > b ? a : b;  // compiled to cmov
}

Bitwise tricks — sign-extension + mask:

int abs_branchless(int x) {
    int mask = x >> 31;
    return (x + mask) ^ mask;
}

Sort + batch — group similar branches together so the predictor sees a stable pattern. Classic example: sorted array iteration is 5-10× faster than unsorted due to branch predictability (the famous Stack Overflow question, 2012).

Lookup tables replace branch trees with array indexing — only helps when the table fits in L1.

Branchless partitioning — Lomuto / Hoare partitions can be rewritten branch-free for sort kernels (used in pdqsort and ips4o).

11. Speculative-execution attacks

Speculative execution — running ahead before knowing if a branch was correct — is essential for performance but in 2018 was shown to leak data through cache side-channels.

  • Spectre v1 (Bounds Check Bypass), Spectre v2 (Branch Target Injection) — Kocher et al. & Horn (Google Project Zero), 2018. Train the branch predictor to mis-speculate into a gadget that loads attacker-chosen data; the load’s cache footprint leaks via Flush+Reload.
  • Meltdown (CVE-2017-5754) — Intel-specific; speculative loads bypass user/kernel permission checks before the fault retires.
  • MDS family — Microarchitectural Data Sampling: RIDL, Fallout, ZombieLoad (2019) — leak data from line-fill buffers, store buffers, load ports.
  • L1TF / Foreshadow (2018) — leak from L1 across SMT siblings and VMs.
  • TAA — TSX Asynchronous Abort (2019); related to MDS.
  • CrossTalk (2020) — across cores via staging buffers.
  • Retbleed (2022) — return-instruction speculation on pre-Zen3 AMD and pre-Skylake Intel.
  • Downfall (Intel, 2023) — gather-instruction leak.
  • Inception / Phantom (AMD Zen 1-4, 2023).

Mitigations are a layered mess:

  • KPTI / KAISER — unmap most of kernel from user page tables; ~5-30% syscall slowdown.
  • IBRS / IBPB / STIBP — Indirect Branch Restricted Speculation; microcode controls.
  • Retpoline — replace indirect branches with a return-trampoline that resists BTI; significant cost on virtualization-heavy workloads.
  • mds_clear / VERW — flush microarchitectural buffers on context switch.
  • PCID / ASID — avoid TLB flush on KPTI switch.
  • SMT disable — for multi-tenant workloads where cross-sibling leakage is unacceptable (cloud providers; some HPC sites).

Cumulative overhead on syscall-heavy workloads (databases, network proxies): 5-30%. Compute-bound workloads see <1%.

12. SIMD / vectorization

A single instruction operates on a register containing multiple values. Modern ISAs:

  • x86 SSE (1999, 128-bit, 4× fp32) → SSE2/3/4AVX (2011, 256-bit, 8× fp32) → AVX2 (2013, 256-bit integer) → AVX-512 (Knights Landing 2016, Skylake-X 2017, Sapphire Rapids 2023; 512-bit, 16× fp32, mask registers, scatter/gather). AMD added AVX-512 in Zen 4 (2022) via double-pumped 256-bit datapaths. AVX10 (announced Intel 2023) unifies the AVX-512 instruction set onto 256-bit hardware for E-cores.
  • ARM NEON (Advanced SIMD) — 128-bit fixed-width, mandatory in ARMv8-A. SVE (Scalable Vector Extension) — variable 128-2048 bit (Fujitsu A64FX 512-bit, AWS Graviton3 256-bit). SVE2 extends SVE to general-purpose data types (ARMv9-A).
  • RISC-V V extension (ratified 2021) — variable-length, similar in spirit to SVE; implementations from SiFive, Andes, Tenstorrent.
  • IBM POWER VSX, z/Architecture vector facility — analogous.

Compiler auto-vectorization — GCC, Clang/LLVM, MSVC, Intel ICC/ICX all auto-vectorize “well-behaved” inner loops. Blockers:

  • Pointer aliasingrestrict keyword (__restrict__ in C++) tells the compiler pointers don’t overlap.
  • Reductions — fp reductions are not associative; enable with -ffast-math or #pragma omp simd reduction(+:sum).
  • Control flow — predicated execution via mask registers (AVX-512, SVE, NEON via select) handles if-inside-loop; pre-AVX-512 needs blend instructions.
  • Loop-carried dependencies — break with reduction trees or rewrites.
  • Non-unit strides / gathers — supported on AVX2 gather, AVX-512 gather/scatter, SVE gather; slower than contiguous loads.

Intrinsics for explicit vector code:

__m256 a = _mm256_load_ps(arr_a);
__m256 b = _mm256_load_ps(arr_b);
__m256 c = _mm256_fmadd_ps(a, b, c);  // c += a*b, fused
_mm256_store_ps(out, c);

Portable SIMD libraries:

  • std::experimental::simd (C++ TS, eventual C++26).
  • xsimd — header-only, used by xtensor + Apache Arrow.
  • Google Highway — portable to x86/ARM/RISC-V/POWER/WASM; used in libjxl, JPEG-XL.
  • Eigen, Blaze, VOLK (signal processing).
  • Rust: std::simd (portable), wide crate, packed_simd2.
  • Zig: built-in @Vector(N, T).

Width matters: AVX-512 doubles throughput vs AVX2 if the workload is compute-bound and the chip doesn’t downclock (Intel pre-Ice Lake aggressively downclocked on heavy AVX-512; mostly fixed on Sapphire Rapids+).

13. Instruction-level parallelism (ILP)

A single core can execute multiple instructions per cycle out of order:

  • Out-of-order execution (Tomasulo 1967; commercial since IBM POWER1 1990 / DEC Alpha 21264 / Intel Pentium Pro 1995). Instructions enter a reorder buffer (ROB) and reservation stations, execute on whichever functional unit is free, retire in program order.
  • ROB size — 224 entries on Intel Sapphire Rapids, 320 on Apple M2 Performance core (largest in industry as of 2024), ~200 on AMD Zen 4.
  • Issue width — 4-wide on most x86 (Ice Lake 5-wide, Sapphire Rapids 6-wide); AMD Zen 4 6-wide; Apple M-series 8-wide decode, ~10-wide execute.
  • Functional units — modern cores have 4+ integer ALUs, 2-4 load AGUs, 2 store AGUs, 2-4 FP/SIMD pipes, 1-2 branch units.
  • Register renaming — physical register files (PRF) much larger than architectural (Sapphire Rapids 280 int PRF, 332 vec PRF). Hides false dependencies (WAR, WAW).

Dependency chains limit ILP: a chain like x = f(g(h(y))) serializes through h→g→f and has parallelism 1, regardless of how wide the issue. Break chains by:

  • Loop unrolling with independent accumulators (4-8 partial sums in parallel, summed at the end).
  • Reduction trees instead of linear reductions.
  • Strength reduction — replace serial multiplies with adds where possible.

Diagnostic: IPC (instructions per cycle) via perf stat. Compute-bound kernels should hit 3-5+ IPC; <1.5 IPC usually means memory or branch-mispredict bound.

Memory-level parallelism (MLP) is the load-side analog of ILP: a single core can have ~10-12 outstanding cache misses in flight via the Line Fill Buffers (LFBs, 10 on Skylake, 12 on Sapphire Rapids) and ~30-50 outstanding loads in the Load Buffer. A latency-bound code with 100 ns DRAM access can still achieve 30 GB/s per core if 30 loads are in flight simultaneously. Pointer-chasing kernels achieve MLP=1 and saturate at ~640 MB/s per core. The Little’s-law identity holds: throughput = MLP / latency. Cache prefetchers and software prefetch exist specifically to raise MLP.

Macro-op fusion and micro-op cache add another dimension on x86: cmp + jcc fuse into one µop; mov + arithmetic may fuse; small loops are served from a 1.5-4 KB µop cache (Decoded Stream Buffer / DSB on Intel, Op Cache on AMD) bypassing the decoder entirely. The legacy decoder front-end is often the secret bottleneck on dense AVX-512 code where each instruction takes 8-12 bytes.

14. Memory ordering & atomics

Different ISAs offer different visibility guarantees between cores:

  • x86 TSO (Total Store Order) — strong model. Loads can reorder before earlier stores (store buffer forwarding), but stores stay in program order; loads stay in program order with respect to other loads; transitivity holds. The only fence most x86 code needs is MFENCE for store-load ordering (or LOCK prefix as a side-effect of atomic RMW).
  • ARM / Power weak memory model — almost any reordering allowed unless an explicit barrier (DMB / DSB / ISB on ARM; lwsync / sync on POWER) or an acquire/release pair is used.
  • RISC-V — weak by default (RVWMO model); optional Ztso extension for x86-like behavior.

Implications: code that “works” on x86 may have data races on ARM. Always use std::atomic with explicit memory orders (memory_order_acquire, release, acq_rel, seq_cst, relaxed) — see [[Compute/concurrency-primitives]] for the C++/Rust/Java/Go memory model details.

Barrier instructions:

  • x86: LFENCE (load fence), SFENCE (store fence), MFENCE (full); LOCK-prefixed RMW also acts as a full fence.
  • ARM: DMB ISH (data memory barrier, inner shareable), DSB, ISB.
  • POWER: lwsync (lightweight), sync, isync.

Compiler fences (std::atomic_signal_fence, asm volatile("" ::: "memory")) prevent reordering by the compiler but emit no machine instruction; useful between an atomic-relaxed load and a non-atomic computation that must observe it. volatile in C/C++ is a compiler-visibility hint for memory-mapped IO and is not sufficient for inter-thread synchronization — use std::atomic instead. Java’s volatile is stronger (full happens-before, JSR-133 2004) and roughly equivalent to C++ memory_order_seq_cst. Go’s sync/atomic package gives sequential consistency by default; the Go memory model was formally documented in 2022.

15. Profiling tools

The right tool depends on whether you need time (where cycles go), events (what hardware events fire), or off-CPU (where you’re blocked).

Linux — perf (built into kernel; the canonical Linux profiler):

  • perf stat — counts events (cycles, instructions, cache-misses, branch-misses, IPC) over a workload.
  • perf record + perf report — sampled stack profiles; events configurable (-e cycles, -e cache-misses, -e cpu/event=0xc7,umask=0x01/).
  • perf top — live system-wide.
  • perf c2c — cache-to-cache contention / false sharing.
  • perf mem — memory access sampling (PEBS on Intel, IBS on AMD).
  • LBR (Last Branch Records, 32 entries) — --call-graph lbr for cheap call-graph sampling.
  • Intel Processor Trace / ARM CoreSight — full instruction-level traces.

Intel VTune Profiler — best-in-class for Intel hardware; hotspot, microarchitecture exploration, memory access, threading, HPC characterization. Free for personal/educational use.

AMD uProf — equivalent on AMD; uses IBS (Instruction-Based Sampling).

Apple Instruments (Time Profiler, Allocations, System Trace, Counters) — for macOS/iOS; backed by kperf.

Windows: Windows Performance Analyzer (WPA) + Windows Performance Recorder (WPR); ETW (Event Tracing for Windows).

eBPF-based — modern Linux profiling without recompilation:

  • bcc / BPF Compiler Collection (Brendan Gregg et al., Netflix) — profile, offcputime, biolatency, tcpconnect, runqlat.
  • bpftrace — awk-like DSL for ad-hoc BPF tracing.
  • Parca (Polar Signals) — continuous profiling, eBPF-based, Prometheus-style.
  • Pixie (New Relic) — Kubernetes observability via eBPF.
  • Pyroscope (Grafana, merged with Phlare 2023) — continuous profiling.
  • Polar Signals Cloud — managed continuous profiling.

Statistical samplers for managed runtimes:

  • py-spy (Python) — sampling profiler, no GIL needed.
  • async-profiler (JVM) — AsyncGetCallTrace-based, low overhead, also reads CPU PMU.
  • dotnet-trace + PerfView (.NET).
  • rbspy (Ruby).
  • node —prof / 0x (Node.js).

Tracers (event recording, not sampling):

  • ftrace — Linux in-kernel tracer.
  • LTTng — userspace + kernel tracing, low overhead.
  • DTrace — Solaris-origin; available on macOS, FreeBSD; Windows port in 10/11.
  • strace / ltrace — syscall / library-call tracing (high overhead).

Microbenchmarking:

  • Google Benchmark (C++) — anti-DCE tricks, statistical reporting.
  • Catch2 micro-benchmarks.
  • JMH (Java Microbenchmark Harness) — handles JIT warmup, dead-code elimination, false sharing.
  • Criterion (Rust).

Flame graphs (Brendan Gregg) — interactive SVG/folded stacks for hot-path identification. Differential flame graphs compare two profiles to highlight regressions.

16. Performance methodologies

  • USE method (Brendan Gregg) — for every resource (CPU, memory, disk, network, etc.) check Utilization, Saturation (queue depth / wait time), Errors. Triage in this order before deep-diving.
  • RED method (Tom Wilkie, Weaveworks) — for services: Rate (req/s), Errors (% failed), Duration (latency percentiles).
  • Four Golden Signals (Google SRE) — latency, traffic, errors, saturation.
  • TSA — Thread State Analysis (Gregg) — categorize thread time into On-CPU vs Off-CPU (blocked on I/O, lock, sleep, scheduler runqueue). Off-CPU analysis via offcputime (bcc).
  • Top-down Microarchitecture Analysis (TMAM, Ahmad Yasin, Intel 2014) — hierarchical decomposition of cycles into Front-End-Bound, Back-End-Bound (Memory or Core), Bad Speculation, Retiring. Implemented in perf stat --topdown and VTune.
  • Flame graphs — interactive folded-stack visualization; canonical for hotpath identification.
  • Differential / offset flame graphs — visually compare two flame graphs (before vs after a change).
  • Heat maps — latency vs time; useful for tail-latency analysis.
  • Latency percentiles, not averages — tail latency (p99, p99.9, p99.99) characterizes user-perceived performance; average masks pathological cases. HDR-Histogram (Gil Tene) is the canonical lossless percentile data structure. The coordinated omission problem (Tene 2013) explains why naive load-generator latency measurements understate tail latency; tools like wrk2 and oha compensate.
  • Anti-patterns: ignoring warmup (cold caches, JIT compilation), measuring on a noisy host, ignoring frequency scaling, comparing single runs without confidence intervals, microbenchmark dead-code elimination (compiler removes the work being measured).

17. CPU performance counters (PMU)

Every modern CPU exposes hardware counters via the PMU (Performance Monitoring Unit). Key counters:

CounterIndicates
cycles, instructionsIPC; <1 indicates stall-bound
cache-misses, cache-referencesoverall miss rate
L1-dcache-load-missesL1 data miss rate
LLC-load-missesL3 miss → DRAM
dTLB-load-misses, iTLB-load-missesTLB pressure
branch-misses, branchesbranch predictor accuracy
page-faultsminor + major faults
context-switches, cpu-migrationsOS interference
cycle_activity.stalls_*reasons for backend stalls
mem_load_retired.l3_missprecise (PEBS) DRAM-served loads

Access methods:

  • perf stat -e <events> — easiest.
  • PAPI (Performance Application Programming Interface) — portable C API, since 1999.
  • rdpmc (x86) — user-mode counter reads after enabling.
  • likwid — Erlangen toolkit for HPC performance counters.
  • Linux perf_event_open(2) syscall — raw access.

PEBS (Intel Precise Event Based Sampling) and IBS (AMD Instruction Based Sampling) attribute counter events to exact instructions, eliminating the skid that plagues older interrupt-based sampling.

18. Memory bandwidth

The STREAM benchmark (McCalpin, 1995) is the canonical memory bandwidth measure: four kernels (Copy, Scale, Add, Triad) that exercise sustainable bandwidth.

Current numbers (per socket, 2024):

  • Intel Sapphire Rapids — 8 channels DDR5-4800: ~280 GB/s; DDR5-6400 upgrades on Emerald Rapids push toward ~410 GB/s. Xeon Max with HBM2e: ~1 TB/s on-package.
  • AMD EPYC Genoa — 12 channels DDR5-4800: ~430 GB/s.
  • AMD EPYC Bergamo / Turin — similar/higher.
  • NVIDIA Grace — LPDDR5X 480 GB/s per Grace chip.
  • Apple M3 Pro / Max — unified LPDDR5: ~150 GB/s (Pro) to 400 GB/s (Max).
  • AWS Graviton4 — 12 channels DDR5-5600: ~535 GB/s.
  • GPU HBM3 — H100 SXM 3.35 TB/s; B200 8 TB/s; MI300X 5.3 TB/s.

Memory-bandwidth-bound workloads (large matrix transpose, streaming aggregates, deep-learning inference at low batch) scale with bandwidth not FLOPS. The arithmetic intensity (FLOPs/byte) determines which side of the roofline you sit on (Williams, Waterman & Patterson, 2009).

19. Cache-aware algorithms

  • Loop tiling / blocking — for a matrix-multiply C = A·B, naively iterating in N×N×N order thrashes L1. Tile into B×B blocks where 3·B² fits in L1 (one tile each of A, B, C). Optimal B ≈ √(L1 / 3 / sizeof(elt)). Then re-tile for L2 and L3.
  • Cache-oblivious algorithms (Frigo, Leiserson, Prokop, Ramachandran 1999 FOCS) — recursive divide-and-conquer (matrix multiply, FFT, sort) automatically adapts to every level of the hierarchy without tuning per cache size. Performance within constants of cache-aware tiled code, with portability.
  • Hash table layout — open-addressing with linear or quadratic probing (Google’s Swiss Tables, Abseil; Robin Hood hashing) trounces separate chaining for cache locality; chained hash tables suffer one cache miss per chain link.
  • SoA vs AoS — Struct-of-Arrays packs same-field values contiguously, allowing vector loads; Array-of-Structs scatters fields. Game engines / physics / ML data pipelines use SoA. ECS architectures (Entity-Component-System) are SoA-by-construction.
  • Z-order / Morton order — interleave bits of x,y coordinates for 2D spatial locality (used in image tiles, octrees, BVH for ray tracing).
  • Cache-oblivious B-trees / fractal trees — TokuDB / PerconaFT used these.
  • Packed structures__attribute__((packed)) saves space but kills alignment; trade carefully.
  • Bit-packing for bool arrays, set membership (Bloom filters fit in cache where Roaring bitmaps might not).
  • Working-set sizing — design the hot data structures to fit in the level they’re accessed from: per-request scratch in L1 (~32 KB), per-thread state in L2 (~1 MB), per-core caches and lookup tables in L3 (a few MB). Anything beyond hits DRAM at full latency on every miss.
  • Hot/cold splitting — separate frequently-touched fields from rarely-touched ones; e.g. a User struct with hot (id, last_seen) separated from cold (profile, settings) lets the hot path keep more entries per cache line.
  • Bloom + cuckoo filters — probabilistic membership in a few bits per element; far smaller than the data they index and fit easily in L1/L2 for billion-element sets.

For an empirical benchmark of cache effects, see Igor Ostrovsky’s classic Gallery of Processor Cache Effects (2010) which walks through 7 surprising microbenchmarks demonstrating each cache level boundary.

20. Common performance pitfalls

  • Pointer chasing — linked lists, naive trees of small nodes; every node is a TLB miss + cache miss. Replace with arenas + arrays, B-trees, or hash maps.
  • False sharing in concurrent counters / queues — pad to 64 B (Section 5).
  • Allocation in the hot pathmalloc + free cost 20-100 ns each, plus cache pollution. Use object pools, arena allocators (Boost.Pool, talloc), or stack allocation for short-lived data.
  • Boxing of primitives in JavaList<Integer> autoboxes to heap-allocated boxes. Use int[], IntStream, Eclipse Collections, fastutil, or HPPC.
  • Indirect calls without devirtualization — virtual methods / function pointers cost a BTB lookup and prevent inlining. JITs do speculative inlining; AOT compilers need link-time optimization (LTO). C++ final keyword enables devirtualization.
  • Excessive polymorphism / deep inheritance — vtables fragment instruction cache and inhibit inlining.
  • Object headers — Java objects have 12-16 B header; small objects are mostly overhead. Project Valhalla (value types, JEPs 401/402) addresses this.
  • Forgetting compiler optimization flags-O3 -march=native -flto on GCC/Clang; /O2 /arch:AVX2 /GL on MSVC. Profile-guided optimization (PGO) + LTO often deliver another 10-20%.
  • Synchronous I/O on hot path — block on disk/network instead of using io_uring, epoll, or async runtimes.
  • Logging in hot loops — formatting (printf, std::format, String.format) is surprisingly expensive; gate with sampling or lazy evaluation.
  • Lock contention on shared atomics — a single std::atomic<int> counter incremented by N threads serializes through cache coherence; throughput collapses past ~4 threads. Replace with per-thread counters summed periodically (sharded counters) or __rdtsc-based statistical sampling.
  • Cold-path code in hot paths — error-handling, debug-logging branches inlined into the fast path inflate the I-cache footprint and dilute the branch predictor. Mark with __builtin_expect (likely/unlikely) or [[likely]] / [[unlikely]] (C++20) to push them out-of-line via PGO.
  • Denormal floating point — IEEE 754 denormals (subnormals) trap to microcode on most x86, costing 100-1000 cycles per operation. Set MXCSR.FTZ and DAZ (Flush-To-Zero, Denormals-Are-Zero) in audio/DSP/ML code via _MM_SET_FLUSH_ZERO_MODE / _MM_SET_DENORMALS_ZERO_MODE.
  • Page faults at startup — first touch of mmap-backed memory triggers a minor fault per page; pre-touch large allocations (MAP_POPULATE, or a memset loop) before the latency-critical phase begins. JVM -XX:+AlwaysPreTouch does this for the heap.
  • CPU frequency scaling / C-statescpufreq governors (powersave, ondemand, performance) and deep C-states (C6, C7) add wakeup latency; benchmarks should pin to performance governor and disable C-states deeper than C1 for repeatable numbers.

21. AI/ML kernels

ML workloads are dominated by a few primitives, each with known cache/bandwidth behavior:

  • GEMM (General Matrix Multiply) — compute-bound at large sizes thanks to O(N³) work for O(N²) data. Highly tiled (BLIS / OpenBLAS / MKL / cuBLAS); tensor-core GEMMs on NVIDIA Hopper hit >90% of peak FLOPS.
  • GEMV (matrix-vector) — bandwidth-bound; arithmetic intensity O(1).
  • Attention — Q·Kᵀ → softmax → ·V. Naive implementation materializes the N×N attention matrix; FlashAttention (Dao et al. 2022, FlashAttention-2 2023, FlashAttention-3 2024) reshapes the loops so the softmax never materializes, dramatically reducing HBM traffic. See [[Compute/inference-optimization]].
  • Convolutionim2col + GEMM (memory-heavy) or Winograd transforms (Lavin & Gray 2015, reduces multiplies at the cost of additions; standard for 3×3 convs).
  • Tensor cores (NVIDIA Volta+, AMD MI200+ matrix cores, Apple AMX, Google TPU MXU, Intel AMX) — fused mixed-precision multiply-accumulate units (fp16 × fp16 → fp32, bf16, fp8, int8). 4-16× the throughput of regular SIMD on supported types.
  • Quantized inference — int8 / int4 / fp8 / fp4; saturate memory bandwidth with smaller data while keeping enough precision for LLM inference. AVX-512 VNNI on x86, ARM SDOT/UDOT, NVIDIA Sparse Tensor Cores.
  • KV-cache for autoregressive LLM decoding — every new token requires reading the entire prior key/value tensor for each layer; at LLama-3-70B scale and 8K context, KV-cache is 10s of GB and decode is bandwidth-bound on HBM. Paged-Attention (vLLM, Kwon et al. 2023) treats the KV-cache as paged memory to reduce fragmentation.
  • Mixed precision training (Micikevicius et al. 2017) — fp16/bf16 forward + backward with fp32 master weights; doubles effective bandwidth and tensor-core throughput vs fp32. Loss scaling prevents fp16 underflow on small gradients.
  • Sparsity — N:M structured sparsity (NVIDIA Ampere 2:4) doubles throughput on tensor cores when 2 of every 4 weights are zero; pruning algorithms train models into this pattern.

CPU inference (llama.cpp, ggml, ONNX Runtime) lives or dies by L2/L3 reuse: a 7B model in q4_K_M quantization is ~4 GB and exceeds any current L3, so per-token latency is set by DRAM bandwidth. Q4_0 int4 weight format, bf16 activations, AVX-512 VNNI for int8 dot product, and ARM bfdot/bfmlalt are the bandwidth-saving primitives that make 70B-parameter LLMs run on consumer M-series laptops at conversational speeds.

22. Cross-references

23. Citations

  • Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach, 6th ed. Morgan Kaufmann. — canonical reference; Chapters 2 (Memory Hierarchy), 3 (ILP), 5 (Multiprocessors).
  • Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd ed. Pearson. — USE method, flame graphs, BPF performance tools.
  • Bryant, R. E., & O’Hallaron, D. R. (2015). Computer Systems: A Programmer’s Perspective, 3rd ed. Pearson. — cache memory chapter; assembly-level performance.
  • Drepper, U. (2007). What Every Programmer Should Know About Memory. Red Hat technical report. — definitive deep dive on DRAM, cache, and software implications.
  • Frigo, M., Leiserson, C. E., Prokop, H., & Ramachandran, S. (1999). Cache-Oblivious Algorithms. FOCS ‘99.
  • Kocher, P., et al. (2018). Spectre Attacks: Exploiting Speculative Execution. USENIX Security 2019 / IEEE S&P 2019.
  • Lipp, M., et al. (2018). Meltdown: Reading Kernel Memory from User Space. USENIX Security 2018.
  • Papamarcos, M. S., & Patel, J. H. (1984). A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories. ISCA ‘84. — original MESI.
  • Yeh, T.-Y., & Patt, Y. N. (1991). Two-Level Adaptive Training Branch Prediction. MICRO ‘91.
  • Seznec, A. (2006). A case for (partially) tagged geometric history length branch prediction. JILP. — TAGE predictor.
  • Jaleel, A., et al. (2010). High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP). ISCA ‘10.
  • Yasin, A. (2014). A Top-Down Method for Performance Analysis and Counters Architecture. ISPASS ‘14. — TMAM.
  • McCalpin, J. D. (1995). Memory Bandwidth and Machine Balance in Current High Performance Computers. — STREAM benchmark.
  • Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. CACM 52(4).
  • Lavin, A., & Gray, S. (2015). Fast Algorithms for Convolutional Neural Networks. — Winograd convolution.
  • Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
  • Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM), Volumes 1-4. Intel, current edition.
  • ARM Architecture Reference Manual for A-profile architecture (ARMv8 / ARMv9). ARM Ltd., current edition.
  • RISC-V “V” Vector Extension Specification, v1.0 ratified 2021.
  • Tomasulo, R. M. (1967). An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of R&D 11(1). — out-of-order execution.
  • Micikevicius, P., et al. (2017). Mixed Precision Training. ICLR 2018. — fp16 training with master weights and loss scaling.
  • Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. — vLLM KV-cache paging.
  • Tene, G. (2013). How NOT to Measure Latency. Strange Loop talk. — coordinated omission.
  • Ostrovsky, I. (2010). Gallery of Processor Cache Effects. blog post / canonical microbenchmark reference.
  • Brendan Gregg’s The Flame Graph (2016, CACM 59(6)) — folded stack visualization formalism.