Compilers and Program Analysis
A compiler translates programs in a source language to an equivalent program in a target language — typically machine code, bytecode, or another high-level language — while preserving semantics and ideally improving performance, size, or energy use. Program analysis is the static or dynamic inspection of programs to derive properties (types, dataflow facts, reaching definitions, alias relationships, complexity bounds, security properties) that drive optimization, bug-finding, or formal verification. The two disciplines share a foundation in operational and denotational semantics, lattice theory, and graph algorithms; modern toolchains (LLVM, GCC, MLIR, GraalVM, V8 TurboFan) blur the line by integrating analysis and transformation throughout the compilation pipeline. This note covers the canonical pipeline, the intermediate representations and analyses that dominate practice, the JIT and polyhedral specializations, the analysis frameworks for bug-finding and security, and the formally verified compilers that bridge to proof assistants.
See also
- compiler-design — Tier-1 companion on lexing, parsing, codegen.
- formal-verification-and-fuzzing — companion on SMT solvers and verification methods.
- concurrency-primitives — memory models that compilers must respect.
- garbage-collection — runtime memory management produced by compiler analyses.
- cpu-cache-performance — target microarchitecture features compilers exploit.
- cuda-triton-gpu-programming — GPU compiler stacks (NVCC, Triton).
- fpga-and-hardware-acceleration — HLS toolchains compile to RTL.
- databases-internals-deep — query compilers as a domain-specific application.
1. The phases of compilation
A compiler is conventionally decomposed into a front end (source-language-specific), middle end (language-agnostic optimization), and back end (target-machine-specific). The decomposition allows reuse: a single front-end-per-source-language and a single back-end-per-target combine via a shared intermediate representation (IR).
1.1 Lexical analysis (scanning)
A finite-state automaton classifies the source character stream into a token stream. Tokens carry a kind tag (identifier, integer literal, keyword, operator) and a lexeme. The lex/flex tools generate scanners from regular-expression specifications. Hand-written scanners dominate in production compilers because they handle context-sensitive cases (raw string literals in C++/Python, template >> disambiguation in C++, layout-sensitive Haskell/Python) more naturally than generated code.
1.2 Syntax analysis (parsing)
A context-free grammar describes the syntactic structure. Parsing algorithms in current production use:
- Recursive descent / LL(k) / Pratt. Hand-written, used by Clang, Roslyn (C#), Rust’s rustc, Go’s
go/parser. Pratt parsers (Pratt 1973) handle precedence elegantly. - LALR(1) generated. Yacc/Bison still common in scientific code and many DSLs.
- GLR / Earley. Tolerate ambiguous grammars; used in some natural-language toolchains, DMS Software Reengineering Toolkit, Tree-sitter (incremental GLR).
- PEG (Parsing Expression Grammar). Ford 2004. Packrat-parsing variants. Used by Python’s PEG parser since 3.9 (PEP 617) replacing the prior LL(1) generator.
- Combinator-style. Haskell
parsec,megaparsec; not for compilers proper but ubiquitous in DSL implementation.
The output is an abstract syntax tree (AST). Modern compilers preserve source locations and may produce a concrete syntax tree (CST) for tooling that requires whitespace fidelity (formatters, refactoring).
1.3 Semantic analysis
Name resolution binds identifiers to declarations; type checking verifies type compatibility and elaborates implicit conversions; static check passes (definite assignment, unreachable code, exhaustiveness of pattern matches) emit warnings or errors. The type checker is the largest semantic component for statically typed languages; for Hindley-Milner-style inference (ML, Haskell, OCaml), Algorithm W (Damas-Milner 1982) or constraint-based variants (e.g., Pottier-Rémy 2005, OutsideIn(X) in GHC) generate and solve equality and class constraints. Bidirectional type checking (Pierce-Turner 2000) is now standard in dependently typed and gradually typed systems and in many high-quality elaborators (Idris, Lean 4, Rust trait coherence).
1.4 Intermediate representation generation
The AST is lowered to a graph or block-based IR. The dominant in-memory IRs:
- LLVM IR. Three-address SSA bitcode plus textual form. Per-function CFG of basic blocks of typed instructions; ~70 opcodes (load, store, getelementptr, icmp, br, phi, call, invoke, …). Single static-assignment via explicit phi nodes; metadata for debug info, profile info, type-based alias analysis. Standard for LLVM-using compilers (Clang, Rust, Swift, Julia, OpenCL via SPIR-V→LLVM, NVIDIA NVPTX, AMD ROCm).
- GCC GIMPLE. Three-address tuple form on a per-function CFG. Predates LLVM IR (Novillo 2003 GCC summit paper).
- MLIR (Multi-Level Intermediate Representation). Lattner-Amini-Bondhugula-Cohen-Davis-Pienaar-Riddle-Shpeisman-Vasilache-Zinenko 2021 CGO extension to LLVM that permits user-defined dialects (LinAlg, Affine, GPU, TOSA, ONNX, TensorFlow Toco, Triton, CIRCT for hardware) sharing a common infrastructure. Used in TensorFlow XLA, Triton, IREE, ONNX-MLIR, Flang (LLVM Fortran).
- Java bytecode and CIL. Stack-machine virtual-machine IRs used by JVM (HotSpot, OpenJ9, GraalVM) and .NET CLR.
- WebAssembly (Wasm). Structured-control-flow stack-machine IR designed for sandboxed execution in browsers and increasingly server-side runtimes (Wasmtime, Wasmer, WasmEdge). Wasm 2.0 added SIMD, threads, exception handling, reference types; WASI provides system-interface conventions.
1.5 Optimization passes
Mid-end passes operate on IR. The LLVM opt tool maintains a configurable pass pipeline with three default levels (-O1, -O2, -O3) plus -Os (size) and -Oz (aggressive size). Major analyses and transformations:
- Constant propagation, constant folding, sparse conditional constant propagation (SCCP). Wegman-Zadeck 1991 TOPLAS 13. Discover that a variable holds a constant and substitute.
- Common subexpression elimination (CSE), global value numbering (GVN). Identify equivalent computations and eliminate redundancy.
- Dead code elimination (DCE), aggressive dead code elimination (ADCE). Remove instructions whose results are unused.
- Inlining. Replace function calls with bodies. Profile-guided heuristics, function summaries, and call-graph SCC traversal.
- Loop optimizations. Loop-invariant code motion (LICM), induction variable simplification, strength reduction, loop unrolling, loop fusion, loop interchange. Modern frameworks: LLVM’s LoopPass infrastructure; GCC’s tree-loop and graphite (polyhedral) layer.
- Scalar replacement of aggregates (SROA, “mem2reg + register splitting”). Convert pointer-spilled local variables to SSA registers.
- Vectorization. Autovectorize loops or basic blocks to SIMD instructions (SLP — Superword-Level Parallelism, Larsen-Amarasinghe 2000; LoopVectorize). Critical for x86 AVX/AVX-512, ARM NEON/SVE/SVE2, RISC-V V extension.
- Devirtualization. Resolve indirect calls to direct ones via type analysis.
- Tail call optimization.
- Link-time optimization (LTO). Whole-program analysis at link time. ThinLTO (Tobias Grosser-Reames 2016 LLVM Developer Meeting) provides scalable summary-based variant.
- Profile-guided optimization (PGO). Instrument or sample to gather runtime profile, then re-compile with profile information. AutoFDO (Chen-Li-Hundt 2016 CGO) uses sampling-based profiles. PGO + LTO combined yield 5–25% performance improvement on large server workloads (Google reports).
1.6 Code generation
The back end lowers IR to a target-specific machine IR (LLVM’s MachineInstr, GCC’s RTL), performs register allocation, instruction scheduling, and emits assembly or object code.
- Instruction selection. Match IR patterns to target instructions. LLVM uses SelectionDAG (DAG-based) for many targets, GlobalISel (block-based, faster) for newer targets. Tablegen DSL defines patterns.
- Register allocation. Map unlimited virtual registers to limited physical registers. Approaches: graph-coloring (Chaitin et al. 1981 PLDI original; Briggs 1992 thesis; Chaitin-Briggs); linear-scan (Poletto-Sarkar 1999 TOPLAS 21; used in HotSpot client compiler, .NET JIT); PBQP (Hames-Scholz 2006); SSA-based (Hack-Goos 2006). LLVM uses greedy register allocation by default.
- Instruction scheduling. Reorder instructions to expose ILP and minimize stalls. List scheduling, software pipelining (Lam 1988 PLDI).
- Object/assembly emission. ELF, Mach-O, PE/COFF formats; DWARF debug info; relocations.
2. Static single assignment
SSA form requires that every variable has exactly one definition. At control-flow join points, φ functions select the value depending on the predecessor edge. Originally Rosen-Wegman-Zadeck 1988 POPL and Cytron-Ferrante-Rosen-Wegman-Zadeck 1991 TOPLAS 13 (the “CFRWZ” paper). SSA properties:
- Use-def chains are explicit and unique: each variable use refers to exactly one definition.
- Many analyses simplify under SSA: SCCP, GVN, LICM, register allocation.
- Minimal SSA construction via dominance frontiers (Cytron et al.); pruned SSA elides φ’s for variables not live at the join; semi-pruned SSA is a compromise. Briggs-Cooper-Harvey-Simpson 1998 Software—Practice and Experience 28.
- SSA destruction (out-of-SSA): φ functions cannot run on hardware. Lower φ’s to copies. Sreedhar-Ju-Gillies-Santhanam 1999 CC parallel copy semantics. Boissinot-Hack-Grund-Boucheron-Rastello 2009 CC introduced the canonical algorithm.
- Memory SSA (LLVM MemorySSA): virtual SSA over memory accesses for alias-aware optimization.
- Region SSA, extended SSA, gated SSA: variants used in particular pipelines.
Modern compilers (LLVM, GCC, V8 TurboFan, HotSpot C2, JikesRVM, .NET RyuJIT) work in SSA throughout the mid-end.
3. Dataflow analysis and lattice theory
Static analyses framed as computing fixed points on lattices of program-state abstractions. Kildall 1973 POPL unified the framework.
3.1 The monotone framework
A dataflow problem is a tuple (L, F, init) where L is a complete lattice of facts, F is a set of monotone transfer functions (one per program point), and init is an entry fact. Iterate the equations out[s] = f_s(in[s]) and in[s] = ⊔_p out[p] until fixed point. Convergence is guaranteed if L has finite height; otherwise use widening.
Canonical problems:
- Reaching definitions (forward, may). Which definitions might still be active at a point.
- Available expressions (forward, must). Which expressions are guaranteed already computed.
- Live variables (backward, may). Which variables might be used later.
- Very busy expressions (backward, must).
- Constant propagation (forward, must). Variables holding compile-time-known values.
- Reaching constants (interval domain). Track integer intervals.
- Pointer analysis. Track possible pointer targets; Andersen 1994 thesis (inclusion-based, O(n^3) worst case, ~quadratic in practice with cycle elimination) and Steensgaard 1996 POPL (unification-based, near-linear, less precise). Hardekopf-Lin 2007 PLDI offline cycle detection. Whaley-Lam 2004 PLDI binary decision diagram (BDD) representation.
3.2 Interprocedural analysis
Function calls and returns require either context-sensitive (different facts per call site) or context-insensitive analysis. Approaches:
- Call strings. Sharir-Pnueli 1981. Index facts by k-bounded call sequences (k-CFA).
- Functional summaries. Reps-Horwitz-Sagiv 1995 POPL IFDS — Interprocedural, Finite, Distributive, Subset — and its IDE extension. Tabulation algorithm achieves O(ED^3) where E is interprocedural edges and D is fact domain. Implemented in WALA, Soot, Heros, Phasar.
- Type-based and points-to-based context sensitivity.
3.3 Abstract interpretation
Cousot-Cousot 1977 POPL and subsequent papers framed program analysis as approximating concrete semantics by abstract semantics in a Galois connection. The lattice (L, ⊑) of abstract elements relates to the concrete domain via abstraction α and concretization γ.
Domains used in practice:
- Sign, parity, constant. Trivial.
- Interval [Cousot 1976]. Track [min, max] per variable.
- Octagon [Miné 2006 HOSC]. Conjunctions of
x ± y ≤ c. Cubic time per operation. - Polyhedra [Cousot-Halbwachs 1978]. Linear inequalities. Exponential worst-case, used by Astrée.
- Numerical relational domains. Apron library bundles octagon, polka, taylor, polyhedra.
- Shape analysis. Separation logic (O’Hearn-Reynolds-Yang 2001 CSL) and TVLA (Sagiv-Reps-Wilhelm 2002 TOPLAS 24) reason about heap structure.
Astrée (Cousot-Cousot-Feret-Mauborgne-Miné-Monniaux-Rival 2005) is the canonical production abstract interpreter; verified absence of runtime errors in Airbus A380 fly-by-wire C code. Infer (Calcagno-Distefano-O’Hearn-Yang 2009 NFM 5503 PLDI 2015; Facebook/Meta open-source) uses bi-abduction inference of separation-logic preconditions.
3.4 Partial evaluation
Specialize a program with respect to a known subset of its inputs to produce a residual program. Jones-Sestoft-Søndergaard 1985, Futamura projections 1971: 1st projection = compilation; 2nd projection = compiler generator from interpreter; 3rd projection = compiler generator generator. Practical instances: Common Lisp compiler-macros, partial evaluation in PyPy’s RPython toolchain to generate tracing JITs, Truffle/Graal partial evaluator that compiles AST interpreters to optimized machine code.
4. JIT compilation
Just-in-time compilation delays code generation until execution, allowing optimization based on runtime profiles and on data layouts unknowable statically.
4.1 Tracing JITs
Record the sequence of instructions executed on a hot path; compile the trace as a single optimized block with guards at branch points. If a guard fails, fall back to interpretation. Bala-Duesterwald-Banerjia 2000 (Dynamo native-to-native), Gal-Eich-Shaver-Anderson-Mandelin-Haghighat-Kaplan-Hoare-Zbarsky-Orendorff-Ruderman-Smith-Reitmaier-Bebenita-Chang-Franz 2009 PLDI (TraceMonkey, Mozilla), PyPy (Bolz-Cuni-Fijalkowski-Tratt-Yamaki 2009 ICOOOLPS). LuaJIT (Mike Pall, 2005–) was a tracing JIT design landmark for dynamic language performance.
4.2 Method-based JITs
Compile entire methods/functions on first execution or after a hotness threshold. Multi-tier strategy: interpreter → baseline JIT (fast compile, slow code) → optimizing JIT (slow compile, fast code) with on-stack replacement (OSR) for tier transitions.
- HotSpot (Oracle Java). C1 (client compiler, tier 1) → C2 (server compiler, tier 4); Graal as optional replacement for C2. ~80 optimization phases in C2. Escape analysis, scalar replacement, intrinsics for Math, String, crypto.
- V8 (Chromium JavaScript/WebAssembly). Ignition interpreter → Sparkplug baseline JIT → Maglev mid-tier (added 2023) → TurboFan optimizing. Inline caches (ICs) for property accesses with hidden classes (Chambers-Ungar-Lee 1989 OOPSLA SELF heritage). Liftoff (Wasm baseline) → TurboFan (Wasm optimizing).
- V8 TurboFan. Speculative optimization on type feedback; bailout deoptimization on guard failures. Sea-of-nodes IR with floating instructions and explicit control dependencies (Click-Paleczny 1995). Recently displacing toward Maglev for many workloads.
- SpiderMonkey (Firefox). Interpreter → Baseline → IonMonkey → WarpMonkey (2021+, replaced IonMonkey’s Type Inference with Cache-IR-based snapshots).
- JavaScriptCore (Safari). LLInt (low-level interpreter) → Baseline JIT → DFG → FTL (Faster Than Light) JIT. FTL initially used LLVM, replaced with custom B3 backend (Bytecode-Backend-B3) since 2016.
- .NET RyuJIT (and Tiered Compilation). Two-tier: Tier 0 (quick) → Tier 1 (optimized); R2R (ReadyToRun) ahead-of-time compilation precompiles popular code.
- GraalVM Compiler. Substrate VM provides AOT (
native-image) or runs as JIT. Truffle framework defines language interpreters that are partially evaluated into optimized code; Polyglot interop between Java, JS, Python, Ruby, R, LLVM-IR, WASM.
4.3 Inline caches and shape analysis
Dynamic-language objects vary in field layout per instance. Inline caches monomorphize call sites by recording the observed receiver shape (“hidden class” in V8, “structure ID” in JSC). Polymorphic IC chains; megamorphic fallback. Chambers-Ungar 1990 PLDI maps for SELF.
5. Polyhedral compilation
The polyhedral model (Feautrier 1991, 1992 International Journal of Parallel Programming; Pugh 1991 Omega test; Kelly-Pugh 1995) represents nested loop iterations with affine bounds as integer points in a parametric polyhedron and statement instances as integer points in the iteration domain. Dependences are expressed as integer affine relations; transformations (loop fusion, fission, interchange, tiling, skewing) become applications of affine schedules to the iteration domain.
Tools:
- isl (Integer Set Library, Verdoolaege 2010). Standalone library; powers most other polyhedral compilers.
- Polly (Grosser-Größlinger-Lengauer 2012, LLVM project). Polyhedral optimizer integrated into LLVM.
- PoCC, PluTo (Bondhugula-Hartono-Ramanujam-Sadayappan 2008 PLDI). Automatic tiling and parallelization framework. PLuTo’s affine scheduling algorithm minimizes communication and exposes parallelism.
- GCC Graphite. Originally based on PLuTo and CLooG; integrated polyhedral framework in GCC.
- MLIR Affine dialect. Polyhedral abstraction in MLIR enabling polyhedral transformations on TF/Torch/Triton workloads.
- Tiramisu (Baghdadi-Ray-Romdhane-DelSozzo-Akkas-Zhang-Suriana-Kamil-Amarasinghe 2019 CGO). Polyhedral DSL for HPC.
- Halide (Ragan-Kelley-Adams-Paris-Levoy-Amarasinghe-Durand 2013 PLDI). Separation of algorithm and schedule for image processing; partial polyhedral integration via auto-scheduler.
- TVM (Chen-Moreau-Jiang-Zheng-Yan-Cowan-Shen-Wang-Hu-Ceze-Guestrin-Krishnamurthy 2018 OSDI). Tensor compiler; later auto-scheduler (Ansor) and MetaSchedule.
- Tensor Comprehensions (Vasilache-Zinenko-Theodoridis-Goyal-DeVito-Moses-Verdoolaege-Adams-Cohen 2018). Discontinued but influential.
The polyhedral model excels at affine, dense, regular code (BLAS, stencil computations, image processing); extensions to non-affine or sparse code via overapproximation (Polly’s “delicate model”) and Polyhedral+ approaches remain research-active.
6. Datalog and modern static analysis frameworks
Many program analyses can be expressed as Datalog programs over relations representing the program. Edge cases: subtyping, virtual dispatch, exceptions, reflection. The Doop framework (Bravenboer-Smaragdakis 2009 OOPSLA; Smaragdakis-Bravenboer 2010; Allen-Bravenboer-Hubert-Kheradmand-Krishnamoorthy-Smaragdakis 2015) encoded Java points-to analysis in Datalog with millions of context-sensitive variants outperforming hand-written analyses.
6.1 Soufflé
Jordan-Scholz-Subotic 2016 CC compiles Datalog to native code via C++ specialization and parallel evaluation; scales to billion-tuple analyses. Used by Doop, Vandal (Ethereum decompilation), Gigahorse, the Securify smart-contract analyzer, and Amazon CodeGuru.
6.2 LogiQL, ddlog, Formulog
Domain-specific Datalog extensions. Differential Datalog (ddlog, Ryzhyk-Budiu 2019 Datalog 2.0) provides incremental computation.
6.3 CodeQL
Semmle/GitHub query language for code analysis. Used by GitHub’s CodeQL platform for vulnerability research; supports C/C++, Java, JavaScript/TypeScript, Python, C#, Ruby, Swift, Go. Underpins many CVE discoveries.
6.4 Semgrep, SonarQube, Coverity, Klocwork
Pattern-matching style (Semgrep, Coccinelle) and full-program-analysis (Coverity, Klocwork, Synopsys) commercial tools. Coverity’s “Many Sources, Many Sinks” style of tainted-flow analysis remains canonical.
6.5 LLVM analysis pipeline
Built-in: TBAA, ScalarEvolution, MemorySSA, BasicAA, CFL-AA, DependenceAnalysis, BranchProbabilityInfo, BlockFrequencyInfo, MemoryDependenceAnalysis. External: SVF (Sui-Xue 2016 CC) for value-flow and points-to; Phasar (Schubert-Hermann-Bodden 2019 CC).
7. Compiler testing and fuzzing
Compilers are themselves software and contain bugs that produce miscompiled programs — silently wrong code is the worst class of bug.
7.1 Csmith
Yang-Chen-Eide-Regehr 2011 PLDI generates random C programs that exercise compiler behavior, comparing outputs across GCC, Clang, ICC, and others via differential testing. Found 325+ bugs in production compilers in the year following publication.
7.2 YARPGen
Livinskii-Babokin-Regehr 2020 OOPSLA generates programs designed to defeat optimizing compilers; uses target-aware generation to stress particular optimization passes. Continued to produce bug reports against GCC, Clang, ICC, MSVC.
7.3 Equivalence Modulo Inputs (EMI)
Le-Afshari-Su 2014 PLDI mutates a program by inserting “dead” code (unreachable from the inputs of a test run) and checks that the mutated program produces the same output; differences indicate a miscompilation. Found many GCC and LLVM bugs.
7.4 Differential testing and bisecting
The standard workflow: compile a corpus of programs with multiple compilers/optimization levels, diff outputs, bisect bug to a particular pass. LLVM bugpoint, llvm-reduce, creduce (Regehr-Chen-Cuoq-Eide-Ellison-Yang 2012 PLDI) for reducing failing inputs.
7.5 SMT-based superoptimization
Souper (Sasnauskas-Regehr et al. 2017) and Alive2 (Lee-Lopes-Regehr et al. 2021 PLDI) verify or synthesize peephole optimizations using SMT solvers; have found bugs in LLVM InstCombine.
8. Formal verification of compilers
End-to-end verification establishes the central correctness claim — observable behavior is preserved across translation — in a proof assistant.
8.1 CompCert
Leroy 2009 Communications of the ACM 52 — CompCert is a moderately optimizing C compiler verified in Coq. Eight intermediate languages with semantic-preservation proofs between each pair. Targets PowerPC, ARM, x86, RISC-V. Used in safety-critical avionics and rail (Airbus, MTU Aero Engines). Subject of the Yang-Chen-Eide-Regehr Csmith study: zero miscompilations found in CompCert’s verified middle end (bugs found in unverified parsers and runtime).
8.2 CakeML
Kumar-Myreen-Norrish-Owens 2014 POPL — CakeML is a verified compiler from a subset of Standard ML to ARMv8, x86-64, MIPS, RISC-V. Bootstrapping: CakeML compiler is itself written in CakeML and verified. Built in HOL4. Underpins the verified theorem prover CertiCoq for Coq’s Gallina.
8.3 Vellvm / Verified-LLVM
Zhao-Nagarakatte-Martin-Zdancewic 2012 POPL formalizes LLVM IR semantics in Coq. Subsequent work proves correctness of selected LLVM optimization passes.
8.4 Bedrock and Fiat
Chlipala 2015 POPL and follow-ups develop verified low-level systems via embedded-DSL approach in Coq.
8.5 SeL4
Klein-Elphinstone-Heiser-Andronick-Cock-Derrin-Elkaduwe-Engelhardt-Kolanski-Norrish-Sewell-Tuch-Winwood 2009 SOSP verified microkernel in Isabelle/HOL. Compilation correctness via translation validation rather than verified compiler.
8.6 RustBelt and Iris
Jung-Jourdan-Krebbers-Dreyer 2018 POPL — RustBelt formalizes Rust’s safety guarantees in the Iris separation logic framework. Has uncovered (and led to fixes of) soundness bugs in real Rust libraries.
8.7 Translation validation
Verify that a particular compilation run is correct, rather than proving the compiler always correct. Pnueli-Siegel-Singerman 1998 TACAS foundational paper. Necula 2000 PLDI for GCC. CompCertO uses translation validation for some passes. Practical for industrial compilers where full verification is prohibitive.
9. Domain-specific compilers and modern stacks
9.1 ML and tensor compilers
- XLA (Accelerated Linear Algebra, Google). Compiles TensorFlow/JAX/PyTorch (via OpenXLA) graphs to TPU, GPU, CPU. HLO IR; recently re-architected on MLIR (StableHLO dialect).
- TVM. Open-source tensor compiler; multi-target codegen, autotuning, RPC-based remote benchmarking.
- Triton (Tillet-Kung-Cox 2019 MAPL). Python-embedded GPU kernel DSL; compiles to PTX/AMDGPU/MTL via LLVM/MLIR. Now backbone of much PyTorch 2.0
torch.compileintegration. - Mojo (Modular). Python-superset language with MLIR-based compiler targeting CPUs, GPUs, accelerators.
- OpenAI Triton, IREE, Tachy, NVFuser, Hidet, Tenstorrent’s tt-metalium, AWS Neuron compiler. Active ecosystem.
- TorchInductor (PyTorch). Triton-based backend for
torch.compile. - JAX MLIR / StableHLO. Standard interchange format for ML model graphs.
9.2 Hardware-description compilers
- High-Level Synthesis (HLS). Vitis HLS (Xilinx/AMD), Intel HLS, LegUp, Catapult. Compile C/C++/SystemC subsets to RTL (Verilog/VHDL). Allocate, bind, and schedule operations to FPGA resources with II/latency constraints.
- Chisel (Bachrach-Vo-Richards-Lee-Waterman-Avizienis-Wawrzynek-Asanovic 2012 DAC). Scala-embedded HDL; compiles to Verilog via FIRRTL IR. Used in SiFive, Western Digital SweRV, much of academic RISC-V.
- CIRCT (Circuit IR Compilers and Tools). LLVM-style hardware compiler infrastructure; FIRRTL, ESI, Calyx dialects in MLIR.
- Calyx (Nigam-Atapattu-Thomas-Pinckney-Yoo-Sampson-Wijesinghe-Maliszewski-Hutson-Karandikar-Sampson 2021 PLDI). IL for accelerator design.
9.3 Cryptographic compilers
- Cryptol, SAW. Galois suite for cryptographic specification and verification.
- Fiat-Cryptography (Erbsen-Philipoom-Gross-Sloan-Chlipala 2020 S&P). Generates verified field-arithmetic implementations used by BoringSSL/curve25519, Mozilla NSS.
- CompCert-style for cryptography: CertiCrypt, EasyCrypt.
9.4 Database query compilers
DuckDB, MonetDB, ClickHouse all use vectorized execution rather than codegen, but HyPer (Neumann 2011 VLDB), Spark Tungsten, Snowflake, and Umbra (Neumann-Freitag 2020 CIDR) generate machine code per query — connections to compilers proper. LLVM is the usual backend.
9.5 Probabilistic and differentiable compilers
Stan, Pyro, Edward2 compile probabilistic models to gradient-providing IR. Enzyme (Moses-Churavy 2020 NeurIPS) performs reverse-mode automatic differentiation at the LLVM IR level, often producing better-performing derivatives than source-level AD frameworks.
10. Memory models and the C++ revolution
The compiler must respect the source-language memory model, which permits or forbids certain reorderings of memory operations. C++11 introduced an explicit memory model (Boehm-Adve 2008 PLDI) with acquire/release/relaxed/seq_cst orderings, mirrored in C11, Java (JSR-133, Manson-Pugh-Adve 2005 POPL), and Rust. The interplay between source memory models, IR optimizations, and target memory models is subtle:
- DRF-SC (Data-Race-Free Sequential Consistency). Programs without data races execute as if sequentially consistent. Compiler must preserve race-freedom.
- TSO (Total Store Ordering). x86, SPARC: store buffer reordering, no other reorderings. Compilers can emit weakened atomics.
- Weakly-ordered architectures. ARM, POWER, RISC-V: require explicit fences for most orderings.
- Out-of-thin-air (OOTA) values. A subtle hole in early C++/Java memory models permitting paradoxical executions; remains incompletely resolved (Batty-Memarian-Owens-Sarkar-Sewell 2011 POPL; Boehm-Demsky 2014).
- Compiler memory-model bugs. Several documented bugs in LLVM/GCC around fence elimination; PSarkar-Owens-Sewell community formal models continue to find issues.
Modern fence emission is co-designed with hardware: ARMv8.1 LSE, ARMv8.4 LDAPR, x86 PAUSE/MFENCE, RISC-V Ztso etc.
11. Pointer analysis and alias analysis
Many optimizations require knowing that two memory accesses cannot interfere.
- Type-based alias analysis (TBAA, Diwan-McKinley-Moss 1998 PLDI). Two accesses through statically incompatible types do not alias. C strict aliasing rule. LLVM uses TBAA metadata.
- Andersen-style (inclusion-based). Subset-constraint propagation; cubic worst case. Hardekopf-Lin’s “Lazy Cycle Detection” achieves practical near-linear performance on million-line programs.
- Steensgaard (unification-based). Almost-linear time; coarser. Used in GCC’s basic AA.
- CFL-AA (Zhang-Yang-Wang-Lhotak-Wang 2014 POPL). Context-free-language reachability formulation; LLVM has a CFL-AA pass.
- Flow-sensitive points-to. Tracks per-program-point points-to sets; expensive but precise.
- Demand-driven, on-the-fly, sparse SSA-based. SVF (Sui-Xue 2016 CC) provides multiple analyses.
Industry practice: combine TBAA + lightweight flow-insensitive AA + scalar/loop-local must-not-alias from SCEV; defer harder cases to summary-based interprocedural analyses in LTO.
12. Modern challenges
- Sound concurrency optimization. Out-of-thin-air, sequencing across releases/acquires.
- Speculative execution and Spectre. Compilers must reason about microarchitectural side channels. Compiler-inserted Spectre mitigations: LFENCE insertion (msvc /Qspectre), retpoline (Google 2018), x86 LVI mitigations. Performance cost can be substantial.
- Heterogeneous compilation. Single-source SYCL, OpenMP target offload, Kokkos, RAJA: dispatch to CPU, GPU, FPGA from one program. Verification of partition correctness is open.
- Optimization for ML / autotuning. Search-based scheduling (Halide auto-scheduler, Ansor, Bolt, MetaSchedule) competes with hand-written schedules.
- Quantum compilation. Qiskit Terra, Cirq, t|ket⟩, Quipper compile high-level quantum algorithms to gate-level circuits with topology-aware mapping and noise-aware optimization.
- Probabilistic programming. Compilers for probabilistic programs (e.g., Stan, Pyro, Birch, Gen) produce gradient code; intersection with traditional optimization is research-active.
- Resource analysis. Worst-case execution time (WCET) tools (aiT, Bound-T) and resource-bound analyses (RaML, Hoffmann-Aehlig-Hofmann 2012) provide quantitative guarantees.
13. Major compiler infrastructure projects
- LLVM (Lattner-Adve 2004 CGO). Industry-dominant infrastructure. Active sub-projects: Clang, Clang-Tidy, Clang-Format, libc++, libc++abi, compiler-rt, lld, LLD, Polly, MLIR, Flang, MLGO.
- GCC (Stallman 1985+). Ahead of LLVM in language support breadth (Ada, D, Modula-2, GNAT). Still dominant in Linux distros’ default toolchain.
- Roslyn (.NET Compiler Platform). Microsoft’s open-source C#/VB compiler infrastructure; APIs for IDE tooling.
- Java compilers (javac, ecj). Plus the JVM JIT stack discussed above.
- GraalVM (Würthinger et al.). Java + Truffle + native-image AOT + Native-Image-bundled libraries.
- Rust (rustc). Frontend in Rust, MIR (mid-level IR) for borrow checking and basic optimizations, then translates to LLVM. Cranelift back-end is an alternative aimed at compile-speed.
- Swift Compiler (Lattner 2014+). Source → AST → Sema → SIL (Swift IL) → LLVM. SIL adds high-level optimizations and ownership tracking.
- Cranelift. Pure-Rust code generator, embedded in Wasmtime, used in some compile-speed-sensitive workloads.
- MLton. Whole-program optimizing Standard ML compiler; aggressive monomorphization.
- Stalin. Whole-program optimizing Scheme compiler; aggressive flow analysis.
14. Performance practice and microbenchmarking
Modern compiler engineering relies heavily on disciplined benchmarking.
- LLVM Test Suite, SPEC CPU 2017, MLPerf, DaCapo, Geekbench. Standard suites.
- Benchmarking tools. Google Benchmark, Hyperfine, perf, VTune. Hyperfine and Criterion (Rust) handle warmup, statistics.
- Performance regression infrastructure. LNT (LLVM Nightly Test), Clang’s
cmake -DLLVM_BUILD_BENCHMARK_TOOLS, Apple’s Compiler Performance group, Google’s MLGO research using reinforcement learning to drive heuristic decisions (inlining, register allocation). - Compiler explorers. godbolt.org (Compiler Explorer) is the de-facto standard for inspecting codegen across compilers and versions.
- Reproducible compilation. Determinism is required for caching (sccache, ccache, distcc, Bazel) and reproducible-build initiatives.
15. Garbage collection and runtime systems
Memory-safe language runtimes interleave compiler-emitted code with collector code. Standard collectors:
15.1 Tracing collectors
- Mark-sweep (McCarthy 1960 LISP). Reach all live objects from roots, mark them, sweep unmarked.
- Copying / Cheney’s algorithm (Cheney 1970 Communications of the ACM 13). Two semispaces; copy live objects from From-space to To-space, swap; allocation by bump-pointer in To-space.
- Generational (Lieberman-Hewitt 1983; Ungar 1984). Most objects die young; segregate young from old, collect young frequently and cheaply. Standard in Java HotSpot G1, ZGC, Shenandoah; .NET CLR; V8 Scavenger + Major GC.
- Concurrent / incremental. Run collector threads alongside mutator. Read or write barriers maintain invariants. ZGC and Shenandoah achieve <10 ms pause times on multi-TB heaps. Go’s tricolor concurrent collector.
- Region-based. Tofte-Talpin 1997 Information and Computation 132; SML/NJ regions, Cyclone, Rust’s lifetime system (a static-analysis analogue).
- Reference counting. Increments/decrements on pointer assignment. Used by CPython, Swift (ARC), Objective-C (ARC), COM. Cycles require auxiliary tracing pass (CPython gc module).
15.2 Compiler integration
GC requires that the compiler emit accurate stack maps identifying live pointers at safe points, and read/write barriers maintaining invariants (Yuasa barrier, Dijkstra barrier, snapshot-at-the-beginning, incremental update). Coordinated co-design with code generation; getting it right is a major subsystem in any managed runtime.
15.3 Modern collectors
- HotSpot G1 (since JDK 7). Region-based, predictable pause time goal; default since JDK 9.
- ZGC (since JDK 11). Concurrent, sub-millisecond pause goal even on TB heaps. Coloured pointer scheme.
- Shenandoah (Red Hat, since JDK 12). Brooks-pointer concurrent collector.
- Go’s collector. Tricolor concurrent mark-sweep; <500µs typical pauses.
- .NET Server GC. Multi-heap concurrent generational.
- Epsilon (JDK). No-op collector for short-lived processes.
See garbage-collection for detailed treatment.
16. Whole-program optimization and link-time
LTO performs cross-translation-unit optimization. Implementations:
- GCC LTO (since 4.5). Inter-procedural inlining, devirtualization, DCE.
- LLVM LTO (since 2009). Default LTO is monolithic — merge all IR at link time, optimize. Memory-prohibitive for very large programs.
- ThinLTO (Tobias Grosser-Reames 2016). Per-module function summaries indexed at link time; parallel cross-module inlining. Used by Chromium, Firefox, Android, iOS.
- BOLT (Panchenko-Auler-Nell-Schultz-Strunkov-Yapeev 2019 CGO). Post-link binary optimizer using profile data to reorder basic blocks, hot/cold partitioning, inline cache speculation. Adopted by Facebook, Google for server binaries.
- Propeller (Tallam-Mahesh-Hubicka 2020 LLVM Developer Meeting). Compiler-driven BB-layout alternative to BOLT.
PGO instrumentation: GCC -fprofile-generate/-fprofile-use; LLVM -fprofile-instr-generate/-fprofile-instr-use; AutoFDO from sample profiles. Production reports (Google, Microsoft, Facebook) show 5–15% performance improvement from combined PGO + LTO + BOLT, sometimes more on call-heavy code.
17. Software supply chain and reproducible builds
Compilers are critical infrastructure; the trusted computing base for any binary includes the compiler used to produce it. Ken Thompson’s 1984 Turing lecture “Reflections on Trusting Trust” demonstrated the principle.
17.1 Reproducible builds
The Reproducible Builds project (reproducible-builds.org) aims for bit-identical outputs from the same source. Requires deterministic compilation (stable hash maps, ordered symbol emission, no timestamp embedding) and toolchain pinning. Adopted by Debian, NixOS, Bazel-built workflows. Recent attacks (xz utils 2024 backdoor, CVE-2024-3094) emphasize the importance of reproducible verification.
17.2 Diverse double-compilation
Wheeler 2009 “Countering Trusting Trust through Diverse Double-Compiling” — compile the same source with two unrelated compilers; if both produce equivalent output (after a self-bootstrap), no compiler-level Trojan is present. Practical for small/medium compilers.
17.3 Bootstrapping
Self-hosting compilers (GCC, Clang, rustc) require a previous-stage compiler. The bootstrap-chain reaches back through previous compiler versions; the “Stage0” effort and Bootstrappable Builds (Janneke Nieuwenhuizen et al.) reduces the chain to a hand-auditable seed.
17.4 SBOM and supply-chain transparency
CycloneDX, SPDX standards for Software Bill of Materials. NTIA/CISA federal SBOM requirements (Executive Order 14028, May 2021). Compilers integrate into supply-chain attestation tools (SLSA framework, Sigstore signing).
18. Open problems
- Phase ordering. No known optimal ordering of optimization passes; ML-driven phase ordering (Sandya-Lin-Vasilakis-Petrov-Sutton-Lopes-Berger 2021; Google’s MLGO) is an active research front.
- Cross-language LTO. Effective optimization across language boundaries (C/Rust/C++/Swift) within LLVM-based stacks remains uneven.
- Sound optimization of safe vs unsafe code. Rust’s safety model vs LLVM IR’s undefined-behavior semantics generates ongoing friction; Alive2-style verification helps.
- Effective vectorization of irregular code. Sparse and graph workloads remain poorly vectorized by automatic compilers; explicit SIMD intrinsics or DSLs dominate.
- End-to-end verified compilation of full languages. CompCert/CakeML cover C and SML respectively; verified compilation of larger languages (full C++, Rust, Java) remains a grand challenge.
- Compiler bugs persist. Production compilers still average ~1 confirmed miscompilation bug per week (LLVM issue tracker). Differential testing and SMT-driven verification narrow the gap.
Further reading
- Aho, A. V., M. S. Lam, R. Sethi, and J. D. Ullman 2006. Compilers: Principles, Techniques, and Tools (2nd ed.) — the “Dragon Book.”
- Appel, A. W. 2002. Modern Compiler Implementation in ML / Java / C — the “Tiger Book.”
- Cooper, K. D. and L. Torczon 2011. Engineering a Compiler (2nd ed.).
- Nielson, F., H. R. Nielson, and C. Hankin 1999. Principles of Program Analysis.
- Møller, A. and M. I. Schwartzbach 2018. Static Program Analysis. Aarhus University lecture notes.
- Muchnick, S. S. 1997. Advanced Compiler Design and Implementation.
- Pierce, B. C. 2002. Types and Programming Languages.
- Pierce, B. C. 2005. Advanced Topics in Types and Programming Languages.
- Allen, R. and K. Kennedy 2001. Optimizing Compilers for Modern Architectures.
- Bondhugula, U. 2008. “A Practical Automatic Polyhedral Parallelizer and Locality Optimizer.” Ph.D. dissertation, Ohio State.
- Cousot, P. and R. Cousot 1977. “Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints.” POPL.
- Leroy, X. 2009. “Formal verification of a realistic compiler.” Communications of the ACM 52.
- Yang, X., Y. Chen, E. Eide, and J. Regehr 2011. “Finding and understanding bugs in C compilers.” PLDI.
- Lattner, C. et al. 2021. “MLIR: A compiler infrastructure for the end of Moore’s law.” CGO.
- Sarkar, V., S. Pande 2020. Compiler Construction. Springer-Verlag Lecture Notes in Computer Science.