Garbage Collection — Compute Reference
1. At a glance
Garbage collection (GC) is automatic memory management: the runtime reclaims storage occupied by objects that are no longer reachable from the program’s roots (stacks, globals, registers, JNI handles). The programmer is freed from malloc/free discipline; the cost is some combination of CPU overhead, memory overhead, and pause time.
The fundamental trade-offs every GC design negotiates:
- Throughput — fraction of CPU spent in the mutator (application) vs the collector. A throughput collector maximizes work per second at the cost of long pauses.
- Latency / pause time — how long mutator threads are stopped. Critical for serving workloads (web, RPC, trading). A latency collector keeps pauses short at the cost of throughput overhead.
- Memory overhead / footprint — semispace copying needs 2× the live heap; reference counting needs a counter per object; concurrent collectors need write/read barriers and remembered sets.
- Developer ergonomics — a tracing GC lets you write idiomatic graph code; refcount + manual cycle breaking forces explicit
weak/unowned; arenas force lifetime discipline.
Two large families:
- Tracing GC — start from roots, traverse the live object graph, mark everything reachable, then sweep or relocate the unmarked. Originator: John McCarthy, Recursive Functions of Symbolic Expressions and Their Computation by Machine (CACM, 1960), for LISP.
- Reference counting (RC) — each object carries a count of incoming references; when the count hits zero, the object is freed. Originator: George Collins, A method for overlapping and erasure of lists (CACM, 1960). Used in CPython, Swift ARC, Objective-C ARC, Perl, PHP.
Most production runtimes combine both ideas and layer on generational organization, region-based collection, concurrent marking, and write/read barriers. The Jones–Hosking–Moss Garbage Collection Handbook (2nd ed., 2023) is the standard reference.
2. Tracing GC fundamentals
Mark-sweep (McCarthy, 1960)
Two phases:
- Mark. Traverse the object graph from roots; set a mark bit on every reachable object.
- Sweep. Linearly scan the heap; objects without the mark bit are added to a free list (or the entire region is reset).
Simple and complete (handles cycles for free, unlike RC). Drawbacks: leaves fragmentation — free space is scattered across the heap as a free list, hurting allocation locality and forcing size-class allocators. Sweep cost is proportional to heap size, not live size.
Mark-compact
After marking, slide live objects toward one end of the heap to eliminate fragmentation. Variants: Lisp 2 (two-pointer slide), threaded compaction (Jonkers 1979), and break-table. Better cache behavior and bump-pointer allocation afterwards; costlier than plain sweep.
Copying / Cheney (Cheney, A nonrecursive list compacting algorithm, CACM, 1970)
Split the heap into two semispaces (from-space, to-space). Allocate in from-space. At collection time, traverse roots; copy every reachable object into to-space; update references; swap. Sweep is free — the entire from-space is reclaimed. Cheney’s algorithm uses the to-space itself as the BFS queue (no auxiliary stack).
Pros: bump-pointer allocation (very fast — just increment a pointer), instant compaction, cost proportional to live data only (not heap size). Cons: 2× memory cost (only half the heap usable at any time).
Tri-color marking (Dijkstra, Lamport, Martin, Scholten, Steffens, On-the-fly garbage collection: an exercise in cooperation, CACM, 1978)
Abstract framework that enables concurrent marking. Each object is one of three colors:
- White — not yet visited (presumed garbage).
- Gray — visited but children not yet scanned (on the worklist).
- Black — visited and all children gray-or-black (fully scanned).
The collector starts with roots gray, repeatedly takes a gray object, scans its references (graying any white children it points to), then turns it black. At end of mark, white = garbage.
Concurrency invariants the mutator must preserve while collection runs:
- Strong tri-color invariant (Dijkstra): no black object points to a white object.
- Weak tri-color invariant (Yuasa): every white object pointed to by a black object is reachable from some gray object.
Maintaining these invariants is what write barriers are for (section 4).
Stack maps and precise vs conservative GC
A precise (a.k.a. “type-accurate” or “exact”) collector knows for every word on the stack and in every object whether it’s a pointer or a primitive. This is the prerequisite for relocating GC (you have to know which slots to update). Compilers emit stack maps at every safe-point describing the live pointer set; HotSpot, V8, .NET, Go, and OCaml are all precise.
A conservative collector treats every word that could plausibly be a pointer (lies within the heap range, is properly aligned) as if it were one. It can be added to runtimes that didn’t plan for GC — the Boehm-Demers-Weiser GC (Boehm + Weiser 1988) is the canonical example, used by Mono historically, GNU libgc clients, and various Scheme implementations. Drawbacks: cannot relocate (a “pointer” might actually be an integer that mustn’t be rewritten); occasional false retention because a stale stack word holds an address-shaped integer. Production VMs almost all moved to precise GC by the mid-2000s.
3. Generational hypothesis (Lieberman + Hewitt, 1983)
Henry Lieberman and Carl Hewitt, A real-time garbage collector based on the lifetimes of objects (CACM, 1983), formalized what LISP implementers had observed: most objects die young. The empirical “weak generational hypothesis” says the age distribution of objects is heavily skewed toward short lifetimes; the strong version adds that older objects rarely point to younger ones.
Generational design:
- Partition the heap into young generation (a.k.a. nursery, eden) and old generation (a.k.a. tenured).
- Allocate in young gen — typically a copying semispace, so allocation is bump-pointer fast.
- Minor GC collects only young gen — cheap because young gen is small and most of it is dead. Survivors are copied (often through one or more survivor spaces) and eventually promoted to old gen after surviving N collections (tenuring threshold).
- Major / full GC collects the whole heap — rare, more expensive.
This dramatically improves throughput because the common case (a recently-allocated temporary) is handled by a fast copying pass over a small region. HotSpot, .NET CLR, V8, OCaml, GHC, MRI Ruby, PyPy, and ZGC (since Java 21) all implement generational schemes.
4. Write barriers
For generational GC: if you only scan young gen during a minor collection, you must still consider old-to-young pointers as roots. Otherwise an old object that points to a young live object would let you incorrectly collect the young object. The runtime needs to know about every such pointer.
Write barrier = a small piece of code emitted at every heap-pointer-store, run by the mutator. It records the location of the store so the GC can find old-to-young references without scanning all of old gen.
Implementations:
- Card table — divide old gen into fixed-size cards (typically 512 B). When an old-gen reference is written, mark its card “dirty” in a byte array. At minor GC, scan dirty cards for old-to-young pointers. Used by HotSpot, .NET CLR.
- Remembered set (RSet) — per-region (G1) or per-object set of incoming references. More precise than cards; higher overhead. G1 uses a hybrid.
- Snapshot-at-the-beginning (SATB) logging — for concurrent collectors. Record the old value of each overwritten reference into a buffer; the collector treats those as conceptually still live for this cycle.
Concurrent collectors layer barriers for tri-color invariants too: a Dijkstra (insertion) barrier turns a white child gray on store; a Yuasa (deletion / SATB) barrier preserves the original target so it isn’t lost.
5. Read barriers
A read barrier intercepts every heap-pointer-load. They are heavier than write barriers (loads outnumber stores) and historically avoided, but concurrent relocating collectors (ZGC, Shenandoah, Azul C4) require them.
Why: if the collector is moving objects concurrently with the mutator, a stale reference might point to the old location. The read barrier checks a marker (in the pointer itself, or in a forwarding slot in the object header) and forwards the read to the new location, optionally healing the stale reference for next time.
- Brooks pointers (Brooks, 1984; used by Shenandoah) — every object has a forwarding word; reads dereference through it.
- Colored / tagged pointers (Azul C4, ZGC) — use unused high bits of the 64-bit pointer to encode GC state (marked, remapped, finalizable); the barrier is a load + bit-test + branch, with hardware support (multi-mapped views of memory) so the same physical page maps at multiple virtual addresses.
Modern x86_64 and AArch64 give ZGC the headroom to use 4 metadata bits in pointers because virtual addresses are only 48 bits wide in practice.
6. Concurrent GC
The collector runs concurrently with the mutator instead of pausing it. The hard parts:
-
The object graph is changing while the collector is walking it. Tri-color invariants (section 2) must be preserved by barriers.
-
Two main strategies:
- Incremental update (Dijkstra-style): on a heap-pointer store, if the target is white and the source is black, gray the target. Keeps the strong invariant.
- Snapshot at the beginning (SATB) (Yuasa, 1990): on a store that overwrites an old value, record the old value into a marking buffer. The collector treats everything reachable at the start of the cycle as live, even if subsequently disconnected — slight floating garbage, but simpler proof.
-
Final mark / remark is the dangerous moment: you must process the remaining gray-set including everything the barriers buffered. Most collectors do this in a short stop-the-world (STW) pause.
Concurrent sweep is straightforward (sweep regions the mutator isn’t allocating in). Concurrent relocation is the hardest — requires read barriers (section 5).
7. Region-based GC
Partition the heap into many fixed-size regions (typically 1–32 MB). Each cycle, the collector chooses a collection set — a subset of regions to evacuate, usually those with the most garbage (“garbage first”). It then copies survivors out of those regions, freeing the regions wholesale.
Benefits:
- Pause-time predictability — collect more or fewer regions per cycle to hit a pause goal.
- Incrementality — never need to touch the entire heap at once.
- Fragmentation control — evacuation compacts naturally.
Cost: per-region remembered sets to track inter-region references (the write barrier maintains them). Region-based designs: G1, ZGC, Shenandoah, Azul C4.
8. Reference counting
Each object stores a counter; assignment increments the new target and decrements the old; on zero, the object is freed (recursively decrementing its children’s counts).
Pros: deterministic destruction (good for RAII-like cleanup); incremental cost spread through execution; no STW pauses inherent to the algorithm; locality (freed object’s children are often still in cache).
Cons:
- Cycles are never collected. A → B → A keeps both counts at 1 forever. Solutions: programmer-managed
weak/unownedrefs (Swift, Objective-C) or a cycle detector that periodically does a partial trace (CPython, PHP, Perl). - Atomic counter updates are needed in multithreaded code. Atomic increments/decrements are expensive (~20 ns) and contended counters serialize cores. Swift uses optimizations (biased reference counting, deferred releases) to amortize.
- Mutator overhead at every assignment, even for short-lived locals (compilers elide some).
Real-world RC runtimes:
- CPython — RC + a generational cycle detector (three generations) that periodically runs a partial trace to break cycles. The GIL serializes counter updates so atomic ops are unnecessary; PEP 703 (no-GIL CPython 3.13+ experimental) reintroduces atomics + biased RC.
- Swift / Objective-C ARC — compiler-inserted retain/release calls; programmer breaks cycles with
weakandunowned. ARC = automatic reference counting. - Perl — RC; cycle leaks are a known footgun; weaken via
Scalar::Util::weaken. - PHP (Zend) — RC + cycle collector (synchronous trial-deletion since PHP 5.3).
- C++
shared_ptr— RC built into the standard library;weak_ptrfor cycles.
9. HotSpot JVM GCs (1996 →)
The OpenJDK HotSpot VM has shipped many collectors. Choose with -XX:+UseSerialGC, -XX:+UseParallelGC, -XX:+UseG1GC, -XX:+UseZGC, -XX:+UseShenandoahGC, -XX:+UseEpsilonGC.
Serial GC
Single-threaded stop-the-world; copying young gen + mark-compact old gen. For small heaps (< ~100 MB) and single-core machines. Default for client-class VMs.
Parallel GC (a.k.a. Throughput)
Multi-threaded stop-the-world; same algorithms as Serial but parallelized across cores. Default in Java 8. Maximizes throughput at the cost of long pauses on large heaps.
CMS (Concurrent Mark-Sweep)
Concurrent mark + concurrent sweep of old gen; STW young gen copy. Shipped Java 4; deprecated Java 9 (JEP 291); removed Java 14 (JEP 363). Suffered from heap fragmentation (no compaction) and “concurrent mode failure” fallback to a long STW full GC.
G1 (Garbage First)
Designed by Detlefs, Flood, Heller, Printezis, Garbage-first garbage collection (ISMM 2004). Region-based generational; concurrent marking + STW evacuation. Default in Java 9+. Pause-time target via -XX:MaxGCPauseMillis (default 200 ms); G1 chooses a collection set sized to fit. Handles heaps from a few hundred MB up to tens of GB. Concurrent marking is SATB.
ZGC (Z Garbage Collector)
Per Tene, The Z Garbage Collector — An Introduction (JFokus 2018); shipped Java 11 as experimental, GA in Java 15 (JEP 377). Region-based, fully concurrent marking + relocation; colored pointers (metadata bits in pointer) + load barriers for in-flight relocation. STW pauses are typically < 1 ms regardless of heap size; tested on heaps up to 16 TB. Generational ZGC shipped in Java 21 (2023, JEP 439) — adds a young/old split for throughput.
Shenandoah
Red Hat project (Kennke et al., started 2014); shipped Java 12 (JEP 189). Concurrent compaction using Brooks forwarding pointers (per-object header word); concurrent marking SATB. Pause times consistently < 10 ms even on multi-TB heaps. Read + write barriers; ~10–15% throughput cost typical.
Phase pipeline:
- Init Mark (short STW) — scan roots, start concurrent mark.
- Concurrent Marking — tri-color SATB mark of the live graph.
- Final Mark (short STW) — drain SATB buffers, choose collection set.
- Concurrent Cleanup — free regions that are 100% garbage.
- Concurrent Evacuation — copy live objects out of collection-set regions to free regions (Brooks pointer forwarding installed).
- Init Update Refs (short STW) — install root updates.
- Concurrent Update Refs — sweep heap rewriting stale pointers.
- Final Update Refs (short STW) — finish root rewriting + reclaim collection-set regions.
The hot path overhead is in the read barrier (one extra indirection through the Brooks word) and the SATB write barrier (a compare + conditional buffer push).
Epsilon
JEP 318, Java 11. A no-op GC: allocates, never collects, dies of OOM. Useful for short-lived programs, latency benchmarking, performance regression detection (factor out GC noise).
10. .NET CLR GC
Microsoft .NET’s GC is generational, region-aware, and concurrent. Three generations:
- Gen 0 — newest. Collected most often. Bump-pointer allocation in a contiguous region.
- Gen 1 — buffer between Gen 0 and Gen 2; objects that survive a Gen 0 collection.
- Gen 2 — long-lived; collected rarely.
- Large Object Heap (LOH) — objects ≥ 85 KB. Allocated separately (avoids copy overhead on large arrays); originally never compacted, made compactable on demand in .NET 4.5.1.
- Pinned Object Heap (POH) — .NET 5+; for objects that interop pins (e.g., GC handles for native code).
Modes:
- Workstation GC — lower latency, single GC thread; for client/UI apps.
- Server GC — one heap and one GC thread per logical core; high throughput, larger memory; default for ASP.NET.
- Background GC — concurrent old-gen marking + sweeping while mutator runs; foreground collections for Gen 0/1.
.NET 7 introduced DATAS (Dynamically Adapting To Application Sizes) — heap-size autotuning that reduces RSS on idle workloads. Tuning surfaces: <ServerGarbageCollection>, <ConcurrentGarbageCollection>, DOTNET_gcServer, DOTNET_GCHeapCount, GC.TryStartNoGCRegion for latency-critical sections.
11. Go runtime GC
Go’s collector is a concurrent tri-color non-generational mark-sweep, shipped in Go 1.5 (2015) and progressively improved. Design priorities (per Hudson + the Go team): low pause time, simple tuning surface, no generational hypothesis assumed.
Properties:
- Concurrent marking + concurrent sweep; STW pauses ~ tens to hundreds of microseconds.
- Write barrier always on (hybrid Dijkstra+Yuasa since Go 1.8). No special mark/sweep phase toggle.
- Non-generational. Compensates with very fast bump-pointer allocation in per-P (per-processor) caches.
- Single tuning knob:
GOGC(default 100 — trigger GC when heap doubles relative to live size after last cycle).GOMEMLIMIT(Go 1.19+) sets a soft memory ceiling. - No compaction. Frees are returned to size-classed allocator (Plan 9 tcmalloc-derived).
- Stack scanning: goroutine stacks are scanned at safe-points; stack growth is movable so scanning is cheap.
Trade-offs: simpler, more predictable than HotSpot/CLR. Higher CPU overhead than a generational design for typical “lots of small short-lived objects” workloads, which the Go team has historically accepted in exchange for predictable pauses.
History: pre-1.5 Go had a stop-the-world parallel mark-sweep (pauses of seconds at multi-GB heaps). Go 1.5 (August 2015, Hudson + Cheng) introduced the concurrent collector; Go 1.8 (Feb 2017) eliminated the stop-the-world stack rescan via the hybrid write barrier; Go 1.12 (Feb 2019) brought concurrent sweep termination; Go 1.19 (Aug 2022) added GOMEMLIMIT to give operators a memory-pressure dial separate from GOGC.
The Go team explicitly considered and rejected generational GC for years on the grounds that the write barrier cost (always on) was already paid, escape analysis stack-allocates many short-lived objects so the nursery’s value is reduced, and the simplicity gain was worth the throughput trade. As of 2024 the team has begun publishing experiments with a generational design (“Green Tea GC”), suggesting that calculus may eventually change.
12. V8 (JavaScript) — Orinoco
V8 (Chrome, Node.js, Deno) implements Orinoco — a generational, mostly concurrent collector.
- Young generation — semispace Scavenger (Cheney copying); parallel + incremental; collects “Nursery” + “Intermediate” sub-spaces. Bump-pointer allocation. Survivors promoted to old gen after surviving one or two minor cycles.
- Old generation — Mark-Compact; concurrent marking (since 2018) on background threads; STW for atomic mark-end and compaction (incremental in chunks).
- Read/write barriers maintain remembered sets between old and new gen.
- Idle-time GC — V8 receives idle hints from Chrome and Node and opportunistically runs collection.
V8 also has Code Space, Map Space, and Large Object Space for specialized objects.
13. Erlang / BEAM
Erlang’s process model gives it an unusual GC story: each process has its own private heap. Inter-process messages are copied (for small immutable data, by reference into a refcounted binary heap).
Consequences:
- Per-process GC is local — pauses are tiny because per-process heaps are tiny (kilobytes to a few MB).
- A long-running process can do a major collection without blocking other processes.
- No shared-heap STW pause across the VM.
- Algorithm per process: generational copying (Cheney-style) for normal terms; reference-counted off-heap “binary” heap shared across processes for large binaries (≥ 64 bytes).
This is the same idea region-based collectors approximate, taken to a per-actor extreme.
14. Python (CPython)
CPython is reference counted + a generational cycle detector (in gc module).
- Every
PyObjecthas anob_refcntfield;Py_INCREF/Py_DECREFmacros. - Three cycle-detector generations (0, 1, 2). Periodically a generation is traced to find unreachable cycles; the triggering threshold is allocation count.
- No concurrent GC because the GIL serializes interpreter threads — there is no concurrent mutator to worry about, and refcount updates are non-atomic plain integer ops.
- PEP 703 (Sam Gross, 2023; experimental in CPython 3.13) — “no-GIL” CPython. Switches to biased reference counting + immortal objects for hot constants; atomic ops only for cross-thread references. Significantly changes the GC contract for C extensions.
- Finalizers via
__del__interact poorly with cycles; the cycle collector punts on finalizable cycles unless they’re acyclic-after-finalize (since Python 3.4, PEP 442).
15. Ruby (MRI / CRuby)
MRI is mark-sweep, generational (since Ruby 2.1, 2013), and incremental (since Ruby 2.2). No compaction historically; GC.compact added in Ruby 2.7 (2019), still triggered manually.
- Two generations; tenured objects survive minor GC.
- Write barriers maintain old-to-young pointers; objects without write barrier support are kept in a “remembered” old-gen set.
- Incremental marking spreads the mark phase across many small slices to bound pauses.
GC::ProfilerandGC.statfor tuning.
16. PyPy
PyPy is a Python implementation in RPython with a tracing JIT and an incremental generational mark-sweep GC (“incminimark”):
- Young objects allocated in a nursery (bump-pointer); minor collection copies survivors to a non-moving old gen.
- Old gen is mark-sweep (non-moving) so that JIT-compiled code holding old pointers stays valid.
- Incremental marking; write barriers maintain remembered sets.
- JIT-friendly because allocation in the nursery is one machine instruction (bump a register).
17. GHC Haskell
GHC’s RTS uses a block-structured copying generational collector with an optional non-moving old-gen collector for low-latency workloads (since GHC 8.10.1, 2020).
- Default: two generations, copying in both — old gen copy is the expensive case.
+RTS -xn(since 8.10) — non-moving concurrent mark-sweep old gen: pauses become bounded (sub-millisecond at large heaps), at the cost of fragmentation and somewhat more memory.- Parallel GC across capabilities (cores); per-capability nursery; remembered sets via write barriers (
MUT_VARetc.). - Region-friendly design due to immutability — most allocations are write-once.
18. OCaml
OCaml’s two-generation collector is famously fast (Damien Doligez et al.):
- Minor heap — small (default ~256 KB), copying; allocation is a single bump-pointer; minor GC is sub-millisecond.
- Major heap — incremental mark-and-sweep with optional compaction; runs interleaved with mutator (a “slice” of major work per minor collection).
- Write barriers (
caml_modify) maintain a remembered set for major-to-minor references. - Multicore OCaml (OCaml 5, 2022) — per-domain minor heap, shared major heap; new concurrent major collector by KC Sivaramakrishnan + team.
OCaml’s minor collection is the canonical example of “allocation is cheaper than reuse” — short-lived closures cost essentially nothing.
19. WebAssembly GC proposal
Standardized as WasmGC (W3C Wasm Working Group, Phase 4 in 2023; shipped in V8/Chrome 119, Firefox 120). Until WasmGC, GC’d languages compiling to Wasm had to ship their own GC inside the linear-memory sandbox — costly in code size and unable to share an object graph with the JS engine’s collector.
WasmGC adds:
- Reference types beyond
funcref/externref. - Struct, array, and i31ref types with the host runtime’s allocator + GC.
- Subtyping for OOP.
Result: Java (CheerpJ, J2CL), Kotlin/Wasm, OCaml (wasm_of_ocaml), Dart, Scala.js, and various Scheme/Scheme-likes can compile to Wasm and share the host’s GC. Smaller binaries, integration with browser tooling.
20. Region-based / arena allocation (no GC)
Some languages skip tracing GC entirely by giving the programmer ownership tools:
- Rust — ownership + borrowing checked at compile time;
Dropruns deterministic destructors. No GC, no refcounting by default;Rc<T>/Arc<T>available for shared ownership. Cycles inRcleak unless broken withWeak<T>. - Zig — explicit
Allocatorparameter passed around;ArenaAllocator,FixedBufferAllocator,GeneralPurposeAllocator.deferruns cleanup at scope exit. Memory bugs surface in tests becauseGeneralPurposeAllocatorchecks for leaks. - C / C++ — manual + RAII; smart pointers (
unique_ptr,shared_ptr); arenas (std::pmr::monotonic_buffer_resource). - Erlang process arena — per-process heap discarded on process exit; one of the cleanest arena patterns in practice.
- Apache Arrow / DuckDB / TigerBeetle — arena-allocated columnar data; lifetimes follow query boundaries.
- Pony — reference capabilities (
iso,ref,val,box,tag) statically prove freedom from data races; per-actor heap with concurrent GC inspired by Orca.
Region/arena patterns trade developer effort (lifetime discipline, sometimes verbose annotations) for predictable, low-overhead memory behavior. The Rust trade-off has become especially influential in systems written since ~2018.
21. MMTk — Memory Management Toolkit
Blackburn + Cheng, Myths and realities: the performance impact of garbage collection (SIGMETRICS 2004) introduced MMTk as a research framework in JikesRVM. The modern incarnation (mmtk.io, 2020+) is a Rust crate providing pluggable, swappable GC algorithms as a library that runtimes can bind to.
Production bindings:
- OpenJDK — third-party MMTk binding (used to evaluate ZGC alternatives in research).
- V8 — research binding.
- Julia — official MMTk integration shipping; lets Julia swap among Immix, MarkSweep, GenCopy.
- Ruby (CRuby) — work-in-progress binding (Shopify + MMTk team).
- Lua / Lean / others — various academic and language-team experiments.
MMTk’s value is allowing language teams to focus on the runtime/compiler interface and outsource GC algorithm work to specialists. Algorithms supported include Immix (Blackburn + McKinley 2008), GenCopy, GenImmix, MarkCompact, MarkSweep, NoGC, SemiSpace, and StickyImmix.
22. Pauses and latency
Stop-the-world (STW) pause = the time mutator threads are halted at a GC safe-point. Pause time is the metric serving workloads care about — long-tail p99/p999 GC pauses become user-visible latency.
Approximate pause targets (modern, well-tuned):
- Serial / Parallel: 100s of ms to multiple seconds on large heaps.
- G1: 200 ms target by default; achievable to ~50 ms with tuning on multi-GB heaps.
- Shenandoah: sub-10 ms typical, even on multi-TB heaps.
- ZGC: sub-1 ms typical, regardless of heap size; tested to 16 TB.
- Go: tens to hundreds of microseconds (short STW phases sandwiching concurrent work).
- BEAM (Erlang): per-process; tens of microseconds typical.
Pathological cases produce 10+ second pauses. The Cassandra community spent years documenting this: heavily loaded clusters on legacy Parallel/CMS hit GC pauses that exceeded the request timeout, cascading into gossip-detected node failures. The fix was tuning to G1 with smaller heaps + larger young gen, or moving to ZGC.
Real-time GC — bounded work per unit time, so allocation never blocks longer than a programmer-set limit. The IBM Metronome collector (Bacon, Cheng, Rajan, A real-time garbage collector with low overhead and consistent utilization, POPL 2003) is the academic touchstone: time-based scheduling, bounded mutator utilization (e.g., 50% of every 10 ms). Used in IBM’s Java real-time products. Production hard-real-time systems usually skip GC entirely (use C/C++/Ada/Rust).
Time-to-safepoint (TTSP) is a frequently overlooked pause contributor. Even with a sub-millisecond collector, the JVM must wait for every mutator thread to reach a safe-point before STW work can start. A long-running loop without safe-point polls (or a counted loop the JIT compiled past safepoint polling) can stall the entire STW for tens or hundreds of milliseconds. Symptoms: GC log shows tiny “user” and “sys” times but enormous “real” times. Fixes: -XX:+SafepointTimeout, -XX:+UseCountedLoopSafepoints, or refactor the offending loop.
Allocation stalls are another GC-adjacent latency source: when the collector can’t keep up with allocation, the mutator is throttled at allocation sites (ZGC), or blocks waiting for a full STW collection (older collectors). Symptoms: latency spikes correlate with high allocation rate, not with marked-pause events. Profile allocation rate (jcmd <pid> VM.native_memory, JFR) before raising pause-time targets.
23. Throughput vs latency
The fundamental tension. Quick guide for picking a JVM GC:
| Workload | Recommended GC |
|---|---|
| Batch analytics, ETL, heavy compute, no SLA on tail | Parallel GC |
| Web serving, microservices, moderate heap (< 16 GB) | G1 (default) |
| Latency-sensitive serving, large heap, p99 matters | ZGC or Shenandoah |
| Microbenchmarks, AOT-compiled short-lived processes | Epsilon |
Approximate throughput cost of low-latency collectors (relative to Parallel on the same heap):
- G1: ~5% slower than Parallel.
- Shenandoah: ~10–15% slower than Parallel.
- ZGC: ~10–15% slower than Parallel.
The throughput cost buys you orders of magnitude lower pause time. For analytics, that trade is bad; for serving, it is almost always good.
24. Sizing and tuning
Heap sizing
- Too small → frequent GC, allocation pressure, eventual
OutOfMemoryError. - Too large → long pauses (with non-concurrent collectors), wasted RSS, poor cache locality, slow startup (touching pages).
- Rule of thumb: live-set + 2–3× headroom. Profile to find the live set.
Young generation ratio
- Workloads dominated by short-lived allocations (parsers, JSON, request handlers) benefit from a large young gen — survivors get more chances to die before promotion.
- Long-lived caches benefit from a smaller young gen relative to old.
- HotSpot:
-XX:NewRatio,-XX:SurvivorRatio,-XX:MaxTenuringThreshold. G1 sizes regions automatically.
Allocation rate
Profile before tuning:
jmap -histo:live— class-level live-object histogram.- JFR (Java Flight Recorder) — built into HotSpot; very low overhead; emits allocation, GC, and pause events.
- async-profiler — sample-based allocation profiler (HotSpot-specific, low overhead).
-Xlog:gc*(Java 9+) — GC log output for GCViewer, GCEasy, JClarity Censum.
Flags cheat-sheet
-Xmx4g -Xms4g— set max and initial heap to 4 GB (equal values avoid resize pauses).-XX:+UseG1GC/-XX:+UseZGC/-XX:+UseShenandoahGC— select collector.-XX:MaxGCPauseMillis=100— G1 pause goal.-XX:+UnlockExperimentalVMOptions— needed for some experimental flags in older JDKs.-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=...— heap dump for postmortem.
Container awareness
Since Java 10 (JEP 351 + UseContainerSupport, on by default), the JVM reads cgroup v1/v2 limits for memory and CPU. Before Java 10, the JVM saw the host’s memory and overcommitted heaps that the OOMkiller then reaped. Always set -Xmx explicitly in production containers even on modern JDKs — auto-sizing is reasonable but not always optimal.
The .NET runtime is similarly cgroup-aware since .NET Core 3.0.
Tuning workflow
- Baseline. Capture a GC log (
-Xlog:gc*,gc+heap=info,gc+ergo=info:file=gc.log) under realistic load. Run for at least one full GC cycle for each generation. - Visualize. Pipe through GCViewer, GCEasy, JClarity Censum, or read raw with the eye trained for
Pause Young,Pause Full,Concurrent Mark, allocation rate, promotion rate. - Identify the bottleneck. Long young pauses → survivor space too small. Long old/full pauses → heap too large for current collector, or generational hypothesis violated by application (long-lived caches). High concurrent CPU → collector saturated; raise heap or change collector. Frequent young GC → allocation rate high, young gen too small.
- Change one knob. Re-run. Compare. Document.
- Bake at scale. Many GC pathologies only surface at production scale (sustained allocation pressure over hours, varied object size distribution, GC-coincident traffic spikes).
25. Pitfalls
- GC pauses misdiagnosed as “the database is slow.” A multi-second STW pause looks like network or DB latency to upstream services. Always correlate GC logs with latency spikes before blaming the DB.
- String interning and global caches keeping references alive. A
static Map<String, X>grows without bound; the GC can’t collect what is reachable. Symptom: gradually rising old-gen occupancy until OOM. - Finalizers (
Object.finalize) are a footgun: non-deterministic timing, can resurrect objects, single finalizer thread is a bottleneck. Deprecated for removal (JEP 421, Java 18). Usetry-with-resources+AutoCloseableandjava.lang.ref.Cleanerfor native-resource cleanup. - Direct ByteBuffers / native memory are not tracked by the heap GC.
ByteBuffer.allocateDirectallocates outside the heap; reclaimed only when the buffer’sCleanerruns (which is GC-dependent). Production leaks here are common; tune-XX:MaxDirectMemorySizeand monitor RSS, not just heap. - Off-heap caches (Ehcache, Caffeine with
Weigher, Chronicle Map) deliberately bypass the heap to avoid GC pressure on large data sets — useful, but you’ve moved the problem to native memory management. - Tuning blindly without a profile. The first reaction “let’s raise
-Xmx” often makes pauses worse. Measure allocation rate and live set first. - Pinning for JNI / direct I/O fragments the heap and can defeat compaction. Use the
PinnedObjectHeapin .NET 5+; minimize in Java. System.gc()in production code — almost always wrong. Use-XX:+DisableExplicitGCdefensively.- Allocation in hot paths — even with bump-pointer allocation, gigabytes of allocation per second saturates the GC. Reuse buffers; consider escape analysis (HotSpot’s
-XX:+DoEscapeAnalysis, on by default, can stack-allocate non-escaping objects). WeakHashMapretention surprises — keys are weak but values referencing keys keep them alive transitively. UseCaffeinewith weak values + weak keys, orIdentityHashMapwith explicit cleanup.ThreadLocalleaks in thread pools — pooled threads outlive the request, so aThreadLocalset during one request and neverremove()-d will be retained for the lifetime of the pool. Standard cause of “the application works for an hour then OOMs” in long-running app servers. Alwaystry { ... } finally { local.remove(); }in pooled contexts.- Class loader leaks — a single retained reference into an old class loader pins every class it loaded, every static, every interned string. Hot-reload frameworks (Spring DevTools, OSGi, JRebel) hit this regularly. Heap dump and look for
ClassLoaderinstances; trace the GC root path. - JIT compilation pressure during steady state can be confused for GC —
-XX:+PrintCompilationdistinguishes JIT pauses from GC pauses. - Humongous objects in G1 — any allocation larger than half a region (default 1–32 MB) becomes a “humongous” allocation taking ≥ 1 region. Excessive humongous allocations fragment the heap and trigger early concurrent cycles. Mitigate: increase
-XX:G1HeapRegionSize, or restructure to avoid large arrays.
26. Manual memory management alternatives
When GC overhead or unpredictability is unacceptable:
- Rust — ownership transfers at move; borrows are statically checked;
Dropdeterministic. Standard for new systems software (Linux kernel modules, Firefox components, AWS Firecracker, TiKV, Polars).Box<T>for owned heap,&T/&mut Tfor borrows,Rc<T>/Arc<T>for shared ownership. - C++ (modern) — RAII (constructors acquire, destructors release);
std::unique_ptr,std::shared_ptr,std::weak_ptr; allocator-aware containers (std::pmr); arena/monotonic allocators for short-lived bursts. - Zig — explicit allocator parameter on every function that allocates;
deferfor cleanup;ArenaAllocatorfor scoped batch frees; testing allocator detects leaks. - Pony — reference capabilities (
iso,ref,val,box,tag,trn) prove statically that no two actors mutate the same object concurrently; per-actor heap; Orca distributed cycle collector. - Erlang — per-process arena (already covered) — a hybrid model where small data is GC’d inside a tiny private heap, but inter-process sharing relies on copying, eliminating cross-process GC entirely.
- D — has a GC but lets you opt out with
@nogcfunctions, manualmalloc, RAII-style structs. - Nim — default refcount + cycle collector; alternative
--mm:orc(Optimized Reference Counting with cycle collector) and--mm:arc(no cycle collector, programmer-managed). - Carbon / Mojo / Hylo — experimental successor-language efforts exploring linear types, move semantics, and effect systems as alternatives to both GC and Rust-style borrowing.
The Rust trade-off has shifted the industry’s default for new low-latency systems: pay the borrow-checker cost up front and skip the GC tax entirely. Even GC-language runtimes have absorbed the lesson — Java’s Project Valhalla (value types, primitive classes) reduces allocation pressure by letting more data live on the stack or inline in containers, narrowing the gap with manual-management languages.
When to choose GC vs manual
A pragmatic decision tree:
- Short-lived process, lots of allocation, no tail-latency SLA → GC. Python, Ruby, Node for scripting and batch.
- Long-running service, moderate heap, p99 in tens of ms → GC with a modern collector (G1 default, ZGC/Shenandoah if heap is large). Java, .NET, Go for the majority of backend services.
- Long-running service, sub-millisecond p99, large heap → ZGC, Shenandoah, or rewrite hot paths in Rust/C++.
- Embedded, real-time control, no MMU, sub-microsecond deadlines → no GC. Rust, C, Ada.
- Numeric / HPC → mostly manual (Fortran, C++, Rust); occasionally GC’d (Julia, with MMTk integration making the trade-off tunable).
- UI / client app → GC is fine for almost all cases; the human-perceptible threshold is ~16 ms (60 fps frame budget), which any modern collector meets at typical client heap sizes.
Allocator design as a first-class concern
GC and the allocator are coupled. Bump-pointer allocation (semispace / Immix / Eden) is essentially free — one register increment. Free-list allocation (mark-sweep without compaction) is slower (~10–30 ns) and worse for cache locality. Most modern collectors use thread-local allocation buffers (TLABs): each thread bump-allocates inside a private slab carved out of the shared young gen, refilling under a fast slow-path lock when exhausted. This is what makes Java/Go allocation rates of 1–10 GB/s/core feasible.
Allocators also influence GC indirectly:
- Size classes (jemalloc, tcmalloc, mimalloc, Go’s mcache/mcentral/mheap) reduce fragmentation but require a free-list per class.
- Slab / object pools at the application layer (Netty
ByteBufAllocator, HikariCP connection pool, RocksDB block cache) reduce GC pressure but introduce ownership bugs (use-after-release). - Off-heap memory is allocator territory, not GC territory — see direct buffers and Unsafe in section 25.
27. Cross-references
- _index — Compute reference index.
- concurrency-primitives — locks, atomics, the memory model GC barriers live in.
- os-scheduling-memory — virtual memory, paging, RSS, what “memory” means below the runtime.
- _index — per-language runtime references (JVM, .NET, Go, Python, JS, Rust, OCaml, Haskell).
28. Citations
- McCarthy, John. Recursive functions of symbolic expressions and their computation by machine, Part I. Communications of the ACM, 3(4), April 1960. (First mark-sweep GC, for LISP.)
- Collins, George E. A method for overlapping and erasure of lists. Communications of the ACM, 3(12), December 1960. (Reference counting.)
- Cheney, C. J. A nonrecursive list compacting algorithm. Communications of the ACM, 13(11), November 1970. (Semispace copying.)
- Dijkstra, Edsger W.; Lamport, Leslie; Martin, A. J.; Scholten, C. S.; Steffens, E. F. M. On-the-fly garbage collection: an exercise in cooperation. Communications of the ACM, 21(11), November 1978. (Tri-color marking.)
- Yuasa, Taiichi. Real-time garbage collection on general-purpose machines. Journal of Systems and Software, 11(3), 1990. (SATB barrier.)
- Lieberman, Henry; Hewitt, Carl. A real-time garbage collector based on the lifetimes of objects. Communications of the ACM, 26(6), June 1983. (Generational hypothesis.)
- Brooks, Rodney A. Trading data space for reduced time and code space in real-time garbage collection on stock hardware. LISP and Functional Programming, 1984. (Brooks forwarding pointers, used in Shenandoah.)
- Bacon, David F.; Cheng, Perry; Rajan, V. T. A real-time garbage collector with low overhead and consistent utilization. POPL 2003. (Metronome.)
- Detlefs, David; Flood, Christine; Heller, Steven; Printezis, Tony. Garbage-first garbage collection. ISMM 2004. (G1.)
- Blackburn, Stephen M.; Cheng, Perry; McKinley, Kathryn S. Myths and realities: the performance impact of garbage collection. SIGMETRICS 2004. (MMTk introduction.)
- Blackburn, Stephen M.; McKinley, Kathryn S. Immix: a mark-region garbage collector with space efficiency, fast collection, and mutator performance. PLDI 2008. (Immix.)
- Tene, Gil; Iyengar, Balaji; Wolf, Michael. C4: the continuously concurrent compacting collector. ISMM 2011. (Azul C4, precursor to ZGC.)
- Tene, Gil. The Z Garbage Collector — An Introduction. JFokus 2018. (ZGC overview.)
- Jones, Richard; Hosking, Antony; Moss, Eliot. The Garbage Collection Handbook: The Art of Automatic Memory Management, 2nd edition. CRC Press, 2023. (The standard reference.)
- Gross, Sam. PEP 703 — Making the Global Interpreter Lock Optional in CPython. python.org, 2023.
- OpenJDK documentation: HotSpot Virtual Machine Garbage Collection Tuning Guide; JEPs 189 (Shenandoah), 248 (G1 default), 291 (CMS deprecation), 318 (Epsilon), 333 (ZGC experimental), 363 (CMS removal), 377 (ZGC GA), 421 (Finalizer deprecation), 439 (Generational ZGC).
- Microsoft .NET documentation: Fundamentals of Garbage Collection; Background GC; DATAS.
- The Go Programming Language Specification + Runtime; Hudson, Rick. Getting to Go: The Journey of Go’s Garbage Collector. GopherCon 2018.
- V8 blog: Concurrent marking in V8 (2018), Orinoco: young generation garbage collection (2017).
- MMTk project: mmtk.io documentation.
- Erlang/OTP documentation: efficiency guide, garbage collection.
- GHC Users Guide: RTS, the non-moving garbage collector (since GHC 8.10).
- OCaml manual: garbage collection chapter; OCaml 5 multicore design notes (Sivaramakrishnan et al.).
- Boehm, Hans-J.; Weiser, Mark. Garbage collection in an uncooperative environment. Software: Practice and Experience, 18(9), 1988. (Conservative GC.)
- Wilson, Paul R. Uniprocessor garbage collection techniques. IWMM 1992. (Survey, still useful framing.)
- Bacon, David F.; Rajan, V. T. Concurrent cycle collection in reference counted systems. ECOOP 2001. (Trial deletion cycle detection, basis for CPython’s collector improvements.)
- Click, Cliff; Tene, Gil; Wolf, Michael. The pauseless GC algorithm. VEE 2005. (Azul predecessor to ZGC.)
- Hudson, Rick. Getting to Go: The Journey of Go’s Garbage Collector. Go blog + ISMM keynote, 2018.
- Lin, Yi; Blackburn, Stephen M. et al. Rust as a Language for High Performance GC Implementation. ISMM 2016. (Foundation of modern MMTk in Rust.)
Appendix A: Glossary
- Allocation rate — bytes allocated per second by the mutator.
- Card table — bitmap or byte array indexing fixed-size regions of old gen, marked dirty on cross-generational stores.
- Collection set — the subset of regions a region-based collector evacuates in the current cycle.
- Eden — HotSpot name for the bump-pointer young-gen allocation region.
- Floating garbage — objects that became unreachable after the start of a concurrent cycle but won’t be collected until the next cycle.
- Mutator — the application code, from the collector’s perspective (it mutates the heap).
- Promotion — moving a survivor object from young gen to old gen after it survives N collections.
- Remembered set (RSet) — per-region or per-object record of incoming pointers from other regions, used to limit scanning during partial collection.
- Root set — the starting set of references for tracing: stacks, globals, registers, JNI handles, monitor-acquired objects.
- Safe-point — a program location at which the runtime’s metadata (stack maps, register maps) is consistent and the thread may be paused.
- SATB (Snapshot-At-The-Beginning) — Yuasa’s concurrent-marking technique: treat the heap snapshot at cycle start as the live set, recording overwritten references.
- STW (Stop-the-world) — a pause in which all mutator threads are halted.
- Survivor space — intermediate region in a generational collector between Eden and old gen.
- TLAB (Thread-local allocation buffer) — a per-thread slab of young-gen memory enabling lock-free bump allocation.
- Tenuring — promotion of a young-gen survivor to old gen after surviving a threshold number of collections.
- Tri-color — the white/gray/black abstraction for concurrent marking.
- Write barrier — a small piece of mutator code emitted at heap-pointer stores, used to maintain remembered sets or tri-color invariants.