CUDA + Triton + GPU Programming — Compute Reference
1. At a glance
GPUs dominate modern compute for machine learning, high-performance computing (HPC), graphics, cryptography, and large-scale simulation. The arithmetic-to-bandwidth ratio of contemporary accelerators — tens of TFLOPS dense FP16/BF16 against terabytes per second of high-bandwidth memory (HBM) — makes them the only practical substrate for training and serving transformer models above the billion-parameter scale.
NVIDIA CUDA (introduced 2007 with the G80 architecture) is the de-facto industry standard, with a roughly two-decade head start on a closed proprietary ecosystem (driver, compiler, libraries, profiler, debugger). Major alternatives:
- AMD ROCm + HIP (2016+): open-source, CUDA-syntax-compatible (
hipifytranslates source). - Apple Metal + MPS + MLX (Metal 2014, MPS 2015, MLX 2023): unified-memory M-series silicon.
- Intel oneAPI + SYCL (DPC++) (2020+): portable across CPU/GPU/FPGA.
- Vulkan compute (Khronos 2016): graphics-API compute path used by
llama.cppandmlc-llmfor cross-vendor inference. - OpenCL (Khronos 2009): legacy portable compute; largely superseded by SYCL/Vulkan.
- WebGPU (W3C, Chrome 113 in 2023; Safari 18 in 2024; Firefox behind flag): browser-side compute.
Triton (Tillet, OpenAI, 2019; open-source 2021) is a Python-embedded domain-specific language (DSL) that compiles to PTX/AMDGCN, drastically lowering the cost of writing efficient GPU kernels.
2024-2026 frontier hardware:
- NVIDIA Hopper: H100 (2022), H200 (2024, 141 GB HBM3e).
- NVIDIA Blackwell: B100/B200 (2024), GB200 NVL72 rack (72× B200 + 36× Grace), GB300 (announced 2025).
- AMD CDNA3/CDNA4: MI300X (192 GB HBM3, 2023), MI325X (288 GB HBM3e, 2024), MI355X (288 GB HBM3e + FP4, 2025).
- Google TPU: v5p (2023), Trillium / v6 (2024), Ironwood (2025).
- Cerebras CS-3 (wafer-scale, 900,000 cores, 2024).
- Groq LPU (deterministic SRAM inference, 2024).
This note focuses on programmable GPUs (NVIDIA + AMD primarily); fixed-function accelerators (TPU, LPU, wafer-scale) are covered tangentially.
2. GPU architecture — SMs, warps, memory hierarchy
A NVIDIA GPU is structured as an array of Streaming Multiprocessors (SMs). Each SM contains:
- A pool of CUDA cores (FP32 ALUs), typically organised in sub-partitions (4 per SM on Hopper/Blackwell).
- Tensor Cores for matrix-multiply-accumulate (MMA) at reduced precision.
- RT cores (ray-tracing intersection) on consumer parts (Ada Lovelace, Blackwell consumer); absent on data-center compute parts.
- A register file (256 KB on Hopper, per-SM).
- Shared memory / L1 (combined, partitionable on Hopper at 228 KB).
- Warp schedulers + dispatch units.
The fundamental execution unit is the warp: 32 threads (NVIDIA) or wavefront: 64 threads (AMD), executed in lockstep under SIMT (Single Instruction, Multiple Threads). When threads in a warp diverge on a branch, the hardware serialises the paths (predication or stack-based reconvergence on Volta+).
Memory hierarchy (NVIDIA H100 representative):
| Level | Size | Bandwidth | Latency |
|---|---|---|---|
| Register file (per SM) | 256 KB | ~peak | 1 cycle |
| Shared mem / L1 (per SM) | 228 KB | ~19 TB/s | ~30 cycles |
| L2 cache (chip-wide) | 50 MB | ~5 TB/s | ~200 cycles |
| HBM3 (global) | 80 GB | 3.35 TB/s | ~400-600 cycles |
B200 (Blackwell) headline numbers:
- 192 SMs, 2 dies + NV-HBI interconnect (10 TB/s die-to-die).
- 192 GB HBM3e @ 8 TB/s (8 stacks × 24 GB).
- 4 TB/s NVLink 5 per GPU (1.8 TB/s each direction × 2 dirs, 2× H100).
- 20 PFLOPS dense FP4, 10 PFLOPS dense FP8 (per chip, with sparsity 2×).
AMD MI300X (CDNA3, 2023):
- 304 compute units (CUs) × 64-thread wavefronts (~19k threads).
- 192 GB HBM3 @ 5.3 TB/s (8 stacks).
- 2.6 PFLOPS dense FP8, 1.3 PFLOPS dense BF16.
The trend across generations: register file and shared memory grow modestly; HBM capacity and bandwidth grow roughly 2× per generation; Tensor Core throughput grows 4-8× per generation as precision drops.
3. Tensor Cores — the matmul accelerator
Introduced with Volta (V100, 2017), Tensor Cores execute small fixed-shape matrix-multiply-accumulate operations in a single instruction. Each generation has widened the supported shapes and lowered the supported precision:
| Generation | Year | Shape (M×N×K) | Precisions |
|---|---|---|---|
| Volta (V100) | 2017 | 4×4×4 | FP16 → FP32 accum |
| Turing (T4) | 2018 | 8×8×4 | + INT8, INT4, INT1 |
| Ampere (A100) | 2020 | 16×16×16 | + BF16, TF32, structured sparsity 2:4 |
| Hopper (H100) | 2022 | up to 64×256×K (WGMMA) | + FP8 (E4M3/E5M2), TMA |
| Blackwell (B100/200) | 2024 | up to 256×K (TCGen5) | + FP4, FP6, microscaling MX formats |
Throughput (dense, no sparsity), per chip:
- A100: 312 TFLOPS BF16/FP16
- H100 SXM: 989 TFLOPS BF16, 1979 TFLOPS FP8
- H200 SXM: same compute, more memory
- B100: ~7 PFLOPS FP4, ~3.5 PFLOPS FP8
- B200: ~14 PFLOPS FP4, ~10 PFLOPS FP8 (dual-die)
FP8 formats (Hopper): E4M3 (4 exponent, 3 mantissa) for activations/weights; E5M2 (5 exponent, 2 mantissa) for gradients. Defined jointly by NVIDIA + Arm + Intel in 2022.
FP4 formats (Blackwell): E2M1 (2 exp, 1 mantissa); microscaling MX FP4 uses per-block (32 element) FP8 scale to retain dynamic range; standardised by Open Compute Project (OCP) in 2023.
Saturating Tensor Cores requires the workload to be (a) compute-bound, not memory-bound, (b) tiled to match the supported shapes, (c) properly aligned in memory. CUTLASS and cuBLAS provide tuned templates; Triton emits tensor-core instructions automatically when block shapes are compatible.
4. CUDA programming model
CUDA exposes a hierarchical SIMT execution model:
- Grid of blocks of threads.
- Each thread has built-in variables:
threadIdx.{x,y,z},blockIdx.{x,y,z},blockDim.{x,y,z},gridDim.{x,y,z}. - A block runs entirely on one SM; threads in a block can synchronise via
__syncthreads()and share data via__shared__memory. - A warp (32 threads, contiguous within a block) executes in SIMT lockstep.
Kernel launch syntax (C++):
my_kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args...);sharedMemBytes is the dynamic shared-memory allocation; stream selects the CUDA stream (default 0 is synchronous with respect to other default-stream work).
Cooperative groups (CUDA 9+): typed handles for thread groups (warp, block, multi-block, grid, multi-grid). Enables warp-level shuffles, block-wide reductions, grid-wide sync (requires cooperative launch).
Streams + events: CUDA streams are FIFO queues of work; operations in different streams may overlap. cudaEvent_t marks points for cross-stream synchronisation and timing. Multi-stream + multi-engine (copy + compute) is the basis of pipelining HBM↔compute.
Unified Memory (UM): introduced CUDA 6 (Kepler); page-migration based managed memory. cudaMallocManaged allocates pointers usable from both host and device; the driver migrates pages on demand. Useful for prototyping; explicit cudaMemcpy + pinned host memory remains higher performance for steady-state workloads.
CUDA Graphs (CUDA 10+): record a DAG of kernels + memcpys once, replay many times. Eliminates per-launch driver overhead (~5-10 μs per launch), critical for inference workloads with many small kernels.
Hopper additions:
- TMA (Tensor Memory Accelerator): asynchronous bulk HBM↔shared-memory transfers programmed by descriptor; replaces complex coalesced-load loops.
- WGMMA (Warpgroup Matrix-Multiply-Accumulate): 4-warp cooperative matmul instruction.
- DPX: dynamic programming instructions (Smith-Waterman, Needleman-Wunsch).
- Thread Block Cluster: 2-16 thread blocks scheduled together with shared distributed shared memory.
5. Memory + performance — what makes kernels fast
Coalesced global memory access. When a warp accesses 32 consecutive 4-byte words from a 128-byte-aligned base, the hardware issues a single 128-byte transaction. Misaligned or strided access serialises into multiple transactions. Restructuring data (struct-of-arrays vs array-of-structs) is the single most common optimisation.
Bank conflicts in shared memory. Shared memory is divided into 32 banks (4-byte stride); concurrent accesses to the same bank serialise. Padding arrays (array[N][33] instead of array[N][32]) or permuted addressing avoids conflicts.
Occupancy is the number of resident warps per SM relative to the hardware maximum (64 on most parts). Limited by registers per thread, shared memory per block, and warps per block. Higher occupancy hides memory latency through warp-level parallelism, but extreme tuning sometimes prefers lower occupancy with more registers (e.g. FlashAttention).
Roofline model (Williams, Waterman, Patterson 2009): operational intensity (FLOPS/byte) on x-axis, achievable performance on y-axis. The roofline is min(peak_compute, peak_bandwidth × intensity). Kernels left of the ridge point are memory-bound; right are compute-bound. Triton + Nsight Compute report intensity directly.
Tensor-core saturation requires:
- Matmul shape divisible by tensor-core tile (e.g. 16×16×16 or 64×N×K on Hopper WGMMA).
- Operands aligned to 16 bytes or wider.
- Loads through TMA (Hopper) or
cp.async(Ampere) for async double-buffering. - Accumulator type sized to avoid spilling (FP32 accum for FP16 inputs).
6. CUDA libraries — the canonical stack
| Library | Purpose | Key APIs |
|---|---|---|
| cuBLAS | Dense BLAS levels 1-3 + ex. gemmEx | cublasGemmEx, cublasLtMatmul (cuBLASLt) |
| cuSPARSE | Sparse linear algebra (CSR/COO/BSR) | cusparseSpMM, cusparseSDDMM |
| cuSOLVER | Dense + sparse direct solvers (LU/QR/Cholesky/SVD/eigen) | cusolverDnSgeqrf etc. |
| cuFFT | FFT 1D/2D/3D, batched, multi-GPU (cuFFTMp) | cufftExecC2C etc. |
| cuDNN | DL primitives (conv, pool, RNN, attention since 9.0) | Graph API replaces legacy API |
| NCCL | Multi-GPU collectives (all-reduce, broadcast, all-gather, reduce-scatter) | ncclAllReduce etc. |
| CUTLASS | C++ template library for high-perf GEMM kernels | cutlass::gemm::device::Gemm |
| CUB | Block/warp/device primitives (scan, reduce, sort) | cub::DeviceScan::ExclusiveSum |
| Thrust | STL-like algorithms on device vectors | thrust::sort, thrust::transform_reduce |
| TensorRT | Inference engine: layer fusion + quantization + kernel selection | IBuilder, ICudaEngine |
| TensorRT-LLM | LLM-specific inference (paged KV, IFB, in-flight batching) | (covered in [[Compute/inference-optimization]]) |
| nvJPEG / NPP | Image / signal processing |
CUTLASS (Kerr, NVIDIA, 2017+) deserves special mention: a header-only C++ template library implementing tiled GEMM with explicit data movement (copy(), gemm() algorithms in CuTe layout DSL, CUTLASS 3.x). It is the basis for FlashAttention, vLLM’s paged attention, and many production custom kernels. Provides templates for FP8/FP4 mixed-precision matmul on Hopper/Blackwell.
cuDNN 9 (2024) introduced the graph API as the primary interface, deprecating the legacy per-op API. The graph API expresses fused subgraphs (conv + bias + activation, attention, layer-norm) that the engine can fuse + tile + autotune.
NCCL 2.20+ supports SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) on NDR InfiniBand for in-network reduction.
7. Triton — Python-embedded GPU DSL
Triton (Tillet, Cortes, Cox, 2019; OpenAI open-sourced 2021) is a Python-embedded DSL for writing GPU kernels at the block level: the programmer reasons about blocks of data, and the compiler handles thread-level scheduling, vectorisation, memory coalescing, and tensor-core dispatch.
import triton
import triton.language as tl
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
pid = tl.program_id(0)
offs = pid * BLOCK + tl.arange(0, BLOCK)
mask = offs < n
x = tl.load(x_ptr + offs, mask=mask)
y = tl.load(y_ptr + offs, mask=mask)
tl.store(out_ptr + offs, x + y, mask=mask)The compiler stack: Python AST → Triton IR → MLIR (since 2022) → LLVM IR → PTX (NVIDIA) or AMDGCN (AMD). Autotuning is built in via triton.autotune with configuration grids over block sizes, num_warps, num_stages.
Used by:
- PyTorch 2.0+ Inductor backend generates Triton kernels for pointwise, reduction, and matmul fusions.
- FlashAttention 2 (Dao 2023) and FlashAttention 3 (Shah, Dao, et al. 2024) — Hopper-tuned versions in Triton.
- vLLM paged-attention kernels.
- Unsloth fine-tuning kernels.
- Many published research kernels (Mamba selective scan, Liger kernels for LLM training).
Triton trades raw control for productivity: complex kernels (warp specialisation, intricate async pipelining) often still require CUDA C++ + CUTLASS. The gap is narrowing — Triton 3.0 (2024) added TMA + warp specialisation support.
8. PyTorch 2.0 — torch.compile
PyTorch 2.0 (March 2023) introduced torch.compile, the largest framework change since the framework’s release. The pipeline:
- TorchDynamo (Python-level frame evaluation; PEP 523) traces Python bytecode → FX graph, falling back to eager for unsupported constructs.
- AOTAutograd generates forward + backward graphs ahead of time.
- PrimTorch decomposes ATen operations to ~250 primitive ops.
- TorchInductor is the default backend: lowers FX → Triton (GPU) or C++/OpenMP (CPU) kernels with fusion + scheduling.
Typical speedups: 1.3-2× for training, 1.5-3× for inference, with no model code changes for most pure-PyTorch models. Modes: default, reduce-overhead (CUDA graphs), max-autotune (autotune kernel configs).
Limitations: Python side-effects, data-dependent control flow, and dynamic shapes still cause graph breaks. PyTorch 2.2+ improved dynamic shape handling significantly. PyTorch 2.4+ added torch.export for AOT compilation suitable for serving.
9. JAX + XLA
JAX (Bradbury et al., Google, 2018+) is a NumPy-compatible array library with composable function transformations:
jit(f)— JIT compile via XLA.grad(f)— reverse-mode autodiff.vmap(f)— auto-batch over an axis.pmap(f)— parallel map across devices.shard_map(f)(jax 0.4.6+, 2023) — explicit SPMD with manual sharding annotations.
XLA (Accelerated Linear Algebra, Google 2017+) is the compiler backend. JAX → traced StableHLO IR (post-2023; replaces HLO) → backend-specific code generation:
- TPU via PJRT (Pluggable Runtime, 2023+).
- GPU via CUDA codegen + cuDNN + cuBLAS.
- CPU via LLVM.
GSPMD (General-purpose SPMD, Xu et al. 2021) — XLA’s auto-parallelisation system; given sharding annotations, partitions ops across devices. JAX exposes via jax.sharding.NamedSharding + jax.experimental.mesh_utils.
JAX is dominant in Google research (Brain merged into DeepMind 2023), with extensive use at Anthropic, Cohere, and academic labs (Flax, Haiku, Equinox neural network libraries; Optax optimisers; Numpyro probabilistic programming).
10. AMD ROCm + HIP
ROCm (Radeon Open Compute, AMD 2016+) is the open-source equivalent of CUDA: kernel-mode driver, runtime, compiler (LLVM-based), libraries, profiler.
HIP (Heterogeneous Interface for Portability) is the programming model — syntactically a near-superset of CUDA C++:
hipLaunchKernelGGL(my_kernel, dim3(grid), dim3(block), 0, 0, args...);
// or hipMalloc / hipMemcpy / hipDeviceSynchronizeThe hipify-perl and hipify-clang tools translate CUDA source to HIP source mechanically (most kernels work with no semantic change). Compiled by hipcc to AMD GPU code, or by NVCC to NVIDIA code (HIP is portable in both directions).
ROCm library stack (one-to-one with CUDA):
rocBLAS↔ cuBLASrocFFT↔ cuFFTrocSPARSE↔ cuSPARSErocSOLVER↔ cuSOLVERMIOpen↔ cuDNN (deep learning primitives)rccl↔ NCCL (multi-GPU collectives)Composable Kernel (CK)↔ CUTLASS (template GEMM)hipRAND↔ cuRAND
Frontier AMD compute parts:
- MI300X (CDNA3, 2023): 192 GB HBM3, 5.3 TB/s, 1307 TFLOPS BF16, FP8 supported.
- MI325X (CDNA3+, 2024): 288 GB HBM3e, 6 TB/s.
- MI355X (CDNA4, 2025): 288 GB HBM3e + FP4/FP6 support, ~9.2 PFLOPS FP4.
ROCm 6.x (2024) closed much of the gap on LLM workloads — vLLM, PyTorch, transformers, FlashAttention all run with comparable per-FLOP utilisation. NCCL replacement RCCL with cross-vendor support remains a known weak spot.
11. Other accelerator software stacks
Apple Metal + MPS + MLX
- Metal (Apple 2014) is the low-level graphics + compute API for Apple GPUs.
- MPS (Metal Performance Shaders, 2015) provides high-perf primitives (matmul, conv); PyTorch’s MPS backend dispatches here.
- MLX (Apple, December 2023) is a Python-first ML framework designed for M-series unified memory — arrays live in shared physical memory accessible from CPU and GPU without copies. Supports lazy evaluation, distributed (since 2024), and quantisation.
Intel oneAPI + SYCL (DPC++)
- SYCL (Khronos 2014+) is C++-based single-source heterogeneous programming; DPC++ is Intel’s implementation with extensions.
- oneMKL, oneDNN, oneCCL are the library equivalents.
- Targets Intel Data Center GPU Max (Ponte Vecchio, 2022) + future Falcon Shores; also runs on NVIDIA via Codeplay’s plugin (oneAPI for CUDA).
Vulkan compute
- Compute pipelines in the Vulkan graphics API; SPIR-V is the kernel IR.
- Used by
llama.cpp(ggml-vulkanbackend),mlc-llm(cross-vendor LLM inference), MoltenVK on macOS. - Trades performance for portability — no tensor-core access on NVIDIA; basic compute only.
OpenCL
- Khronos 2009; portable kernel-language compute model.
- Largely superseded by SYCL (higher-level, single-source) and Vulkan compute (more modern API).
- Still used in some embedded + signal-processing contexts.
WebGPU
- W3C standard (2023); WGSL shading language + JavaScript API.
- Chrome 113+ (May 2023), Safari 18 (2024), Firefox behind flag.
- Used by
transformers.js(Xenova),web-llm(MLC), and ONNX Runtime Web for in-browser inference. - No tensor-core / FP16 matmul intrinsics in standard yet; compute capability is roughly 2018-era discrete GPU.
12. Multi-GPU + distributed training
NCCL (NVIDIA Collective Communications Library) is the canonical communication layer:
- All-reduce (sum gradients across all GPUs).
- All-gather (collect shards into full tensor).
- Reduce-scatter (sum then shard).
- Broadcast, send/recv, gather, scatter.
Algorithms: ring all-reduce (Patarasuk + Yuan 2009) is optimal for bandwidth; tree all-reduce is optimal for latency. NCCL picks based on message size + topology.
Interconnects:
- NVLink 4 (Hopper): 900 GB/s per GPU aggregate (18 links × 50 GB/s).
- NVLink 5 (Blackwell): 1.8 TB/s per GPU.
- NVSwitch: non-blocking GPU-to-GPU fabric; H100 8-GPU node has 4 NVSwitches; GB200 NVL72 has 18 NVSwitches across 9 trays for full 72-GPU NVLink domain.
- InfiniBand HDR (200 Gbit/s), NDR (400 Gbit/s), XDR (800 Gbit/s, 2025); Mellanox/NVIDIA ConnectX-7/8 NICs.
- Ethernet RoCEv2 (RDMA over Converged Ethernet) is the open alternative; major hyperscalers use 400-800 GbE Ethernet (Spectrum-X, Tomahawk 5).
GPUDirect:
- GPUDirect P2P: GPU-GPU DMA over PCIe or NVLink.
- GPUDirect RDMA: NIC→GPU memory DMA without CPU bounce.
- GPUDirect Storage: NVMe→GPU memory DMA, bypassing the OS page cache.
Parallelism strategies:
| Strategy | What’s split | Comm cost | When |
|---|---|---|---|
| Data Parallel (DP) | Batch | All-reduce per step | Default; small models |
| DDP (PyTorch) | Batch, sync grads | Ring all-reduce | Standard multi-GPU training |
| ZeRO Stage 1 | Optimizer state | Reduce-scatter + all-gather | Reduces mem ~4× |
| ZeRO Stage 2 | + Gradients | Same | Reduces mem ~8× |
| ZeRO Stage 3 | + Parameters | Per-layer all-gather | Reduces mem N× |
| FSDP (PyTorch) | All three (ZeRO-3 native impl) | All-gather + reduce-scatter | Recommended in PyTorch 2 |
| Tensor Parallel (TP) | Matmuls split across GPUs | All-reduce per layer | Within-node, NVLink |
| Pipeline Parallel (PP) | Layers split across GPUs | P2P send/recv | Across-node, IB |
| Expert Parallel (EP) | MoE experts | All-to-all | MoE models |
| Sequence Parallel | Sequence dim of activations | Small | Long-context LLMs |
DeepSpeed ZeRO (Rajbhandari, Microsoft, 2020) introduced the three-stage memory partitioning. The key insight: optimizer states (Adam: 2× params in FP32) + gradients (1× params) + params (1× FP16) can each be sharded across N GPUs without changing math.
PyTorch FSDP (Fully Sharded Data Parallel, 2022+) is the in-tree ZeRO-3 implementation; FSDP2 (PyTorch 2.4, 2024) reworked the API to use DTensor (distributed tensor).
Megatron-LM (Shoeybi et al., NVIDIA, 2019) introduced tensor parallelism for transformers (column-parallel + row-parallel linear layers, reducing attention head-wise). Combined with PP and DP, this is the classical “3D parallelism” used to train models above 100B parameters (GPT-3, Megatron-Turing NLG, Llama 3 405B, etc.).
Pipeline scheduling: GPipe (Huang 2019) introduced micro-batches; PipeDream (Narayanan 2019) added 1F1B (one-forward-one-backward) for steady-state throughput; Megatron interleaved 1F1B further reduces bubble.
13. Profiling + debugging
NVIDIA tools:
- Nsight Systems (
nsys): system-level timeline (CPU + GPU + NIC + storage). Captures CUDA API, OS scheduling, NVTX ranges; outputs.nsys-repviewable in GUI. Use for top-down “where is time going” analysis. - Nsight Compute (
ncu): kernel-level metrics (occupancy, achieved memory throughput, warp stall reasons, tensor core utilisation, source/SASS correlation). Use for deep-dive optimisation of a specific kernel. - CUDA-GDB: kernel-level debugger; supports breakpoints in device code, single-stepping, memory inspection.
- Compute Sanitizer (replaces
cuda-memcheck): toolsmemcheck,racecheck,initcheck,synccheckfor race conditions, uninitialised reads, out-of-bounds. nvidia-smi: live device state (utilisation, memory, temperature, power, ECC).- DCGM (Data Center GPU Manager): cluster-scale telemetry, exporters for Prometheus.
Framework profilers:
- PyTorch profiler (
torch.profiler): kernel traces with stack frames; exports to TensorBoard Plugin (PyTorch Profiler) or Chrome trace format. Memory profiler (torch.profiler._memory_profiler) tracks allocation events. - JAX profiler (
jax.profiler.start_trace): integrates with TensorBoard XPlane; works on TPU + GPU. - HuggingFace
accelerate+transformers.Trainerwrap PyTorch profiler.
AMD tools:
rocprof/rocprofv2/rocprofv3: kernel profiling.- Omnitrace (AMD Research): timeline profiler.
- Omniperf: kernel-level metrics (occupancy, mem throughput).
- ROCm Compute Profiler (formerly Omniperf): GUI for kernel analysis.
14. Common kernels + patterns
GEMM (general matrix multiply) — the foundational kernel. Modern implementations: tile-based (e.g. 128×128×64 K-stage), tensor-core MMA inner loop, async TMA loads on Hopper, software pipelining with cp.async. CUTLASS provides templates; cuBLAS provides tuned binaries.
Convolution — three main algorithms:
- im2col + GEMM: classic; reduces conv to matmul at cost of memory blowup.
- Implicit GEMM: cuDNN approach; no materialised im2col buffer.
- Winograd (Lavin + Gray 2016): reduces multiplications for small filters (3×3) at cost of more additions.
- FFT convolution: for large filters; uses cuFFT.
Reduction — sum/max/min over an array. Tree reduction within a warp using __shfl_xor_sync; cross-warp via shared memory; cross-block via atomic or second-stage kernel. CUB provides BlockReduce and DeviceReduce.
Scan (prefix sum) — output[i] = sum(input[0..i]). Two classical algorithms: Kogge-Stone (log-depth, more work) and Blelloch (more steps, fewer ops). CUB’s DeviceScan::ExclusiveSum is the production version.
Sort — Radix sort (Merrill + Grimshaw 2010) dominates for fixed-key-size; CUB’s DeviceRadixSort is the reference implementation. Segmented sort for sort-by-key with multiple segments.
SpMV (sparse matrix-vector) — CSR, ELL, HYB formats; cuSPARSE provides tuned versions. For deep learning, structured 2:4 sparsity (Ampere+) lets Tensor Cores skip zeros.
Attention — scaled dot-product softmax(QK^T / √d) V. FlashAttention (Dao, Fu, Ermon, Ré 2022) restructured to be IO-aware: tiles in shared memory, online softmax, no materialisation of N×N attention matrix. FlashAttention-2 (Dao 2023) improved warp partitioning; FlashAttention-3 (Shah, Bikshandi, Zhang, Fegade, Thakkar, Dao 2024) added Hopper TMA + WGMMA + async pipelining + FP8.
Softmax — online numerically stable softmax (Milakov + Gimelshein 2018): two-pass becomes one-pass via running max + running denominator.
LayerNorm + RMSNorm — Welford’s online algorithm for mean + variance; RMSNorm (Zhang + Sennrich 2019) skips the mean for ~10% speedup.
Top-K — bitonic / radix-based; used in beam search, sparse-MoE routing.
15. 2024-2026 trends
Microscaling (MX) formats — OCP MX standard (2023): per-block (32 elements) FP8 scale enables FP4/FP6/FP8 representations with high dynamic range. Blackwell (B200) supports MX FP4 (E2M1) and MX FP6 natively. Llama-quant, AWQ, MX quantisation tooling all moving to this.
Tensor Memory Accelerator (TMA) — Hopper async bulk memory engine; effectively a DMA engine for tensor descriptors. Frees warps from address arithmetic; central to FlashAttention-3 + CUTLASS 3.x.
GB200 NVL72 — 72× B200 + 36× Grace ARM CPUs in a single rack with 5th-gen NVLink + NVSwitch fabric. ~1.4 EFLOPS FP4 per rack. Targeted at trillion-parameter inference + frontier training.
CXL (Compute Express Link) — open coherent interconnect over PCIe physical layer; memory pooling and disaggregation expected to relieve HBM capacity pressure. CXL 3.0 (2022) added switching; products shipping 2024-2025.
Grace + LPDDR5X unified memory — NVIDIA Grace ARM CPU (Neoverse V2) ships with LPDDR5X-based unified memory pool shared with Hopper/Blackwell GPU via NVLink-C2C (900 GB/s). Reduces effective host-device boundary.
ROCm closing the gap — ROCm 6.0+ ships first-class PyTorch + vLLM + FlashAttention support; MI300X production deployments at Microsoft, Meta, Oracle.
Triton matures into the default kernel layer — PyTorch Inductor + JAX Pallas (2024 JAX extension that compiles to Triton + TPU Mosaic) suggest Triton becomes the lingua franca of custom GPU kernels.
Custom silicon proliferation — Cerebras CS-3, Groq LPU, SambaNova SN40L, Tenstorrent Grayskull / Wormhole all in production deployments by 2025; each with bespoke software stack.
16. Selection heuristics
- Standard PyTorch training, single node:
torch.compile(model, mode="reduce-overhead")+bf16+ Triton-backed Inductor; defaults work well. - PyTorch training, multi-node: FSDP2 +
bf16mixed precision +torch.compileper-layer; Megatron-LM if model >70B. - Custom kernel needed for performance: Triton first; CUDA C++ + CUTLASS only if Triton can’t express (warp-specialised producer-consumer patterns, sub-warp synchronisation).
- Inference serving: TensorRT-LLM (NVIDIA-only, max performance) or vLLM (cross-vendor, paged attention, continuous batching). Covered in
[[Compute/inference-optimization]]. - Multi-vendor target (AMD + NVIDIA): JAX (XLA generates code for both) or PyTorch + ROCm + portable Triton. Avoid hand-written CUDA C++.
- Apple Silicon development: MLX (best perf, native unified memory) or PyTorch MPS backend (more familiar API).
- Embedded / edge: ONNX Runtime + TensorRT (NVIDIA Jetson) or TFLite + GPU delegate (mobile) or LiteRT (TFLite successor, 2024). For Apple, Core ML.
- Browser inference:
transformers.js(Xenova, ONNX Runtime Web) for small models;web-llm(MLC + WebGPU) for LLMs.
17. Pitfalls
- Small kernels with high launch overhead: each kernel launch is ~5-10 μs; many small kernels (e.g. unfused pointwise ops) burn more time on launches than compute. Fix: fuse via
torch.compile, write a Triton kernel, or use CUDA Graphs. - Non-coalesced memory access: array-of-structs vs struct-of-arrays; misaligned bases; strided access. Profile with Nsight Compute
dram__sectors_read_per_sector_op. - Bank conflicts in shared memory: pad inner dimension or use permuted addressing (
xorof column with row). - Over-tuning for one GPU: hand-tuned block sizes / register counts for A100 may be suboptimal on H100/B200; rely on autotuning where possible.
- FP16 dynamic range overflow: loss/gradient scaling needed; modern frameworks handle automatically via GradScaler. BF16 has FP32 exponent range, usually no scaling required.
- FP8 calibration: per-tensor or per-channel scaling factors required; calibration step needed before serving.
- CUDA / driver version mismatch: PyTorch wheels are built against specific CUDA versions; mixing versions causes obscure failures. Use
nvidia-smito check driver,nvcc --versionfor CUDA toolkit. - NCCL hangs: usually a topology / firewall / NIC issue; set
NCCL_DEBUG=INFOand inspect logs; checkNCCL_IB_HCA,NCCL_SOCKET_IFNAME. - OOM during ZeRO-3 / FSDP: gather-and-shard causes peak memory spikes; consider activation checkpointing, smaller bucket size, CPU offload.
- Atomics contention: many threads atomically updating the same address serialise. Prefer block-level reduction then single atomic.
__syncthreadsdivergence: calling__syncthreadsinside a divergent branch (where some threads in the warp skipped it) is undefined behavior (pre-Volta) or hangs (Volta+ with independent thread scheduling).- Hot-loop allocator pressure: per-step
cudaMallocis slow; use caching allocator (PyTorch default) or explicit memory pools. - Pinned-memory exhaustion:
cudaMallocHostallocates page-locked memory; over-allocating starves the OS.
18. Cross-references
[[Compute/transformer-architecture]]— the model class that drives most modern GPU demand.[[Compute/inference-optimization]]— TensorRT-LLM, vLLM, SGLang, quantisation, speculative decoding.[[Compute/fine-tuning-rlhf]]— LoRA / QLoRA / DPO on GPUs.[[Compute/rag-embeddings-vector-search]]— vector index workloads on GPU.[[Math/numerical-linear-algebra]]— BLAS, LAPACK, conditioning.[[Math/linear-algebra-essentials]]— foundations.[[Engineering/Tier3/semiconductor-materials]]— HBM stacks (TSV, silicon interposer).[[Engineering/Tier3/semiconductor-packages]]— CoWoS, TSMC SoIC, NVLink physical layer.[[Compute/concurrency-primitives]]— host-side concurrency primitives complementing CUDA streams.[[Compute/cpu-cache-performance]]— analogous memory hierarchy reasoning on CPU side.[[Compute/observability-stack]]— DCGM + Prometheus + NVTX integration.
19. Citations
Books + reference:
- Cook, S. — CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs (Morgan Kaufmann, 2013).
- Wilt, N. — The CUDA Handbook: A Comprehensive Guide to GPU Programming, 2nd ed. (Addison-Wesley, 2024).
- Kirk, D. + Hwu, W.-m. — Programming Massively Parallel Processors, 4th ed. (Morgan Kaufmann, 2022).
- Sanders, J. + Kandrot, E. — CUDA by Example (Addison-Wesley, 2010) — historical reference.
Foundational papers:
- Lindholm, E., Nickolls, J., Oberman, S., Montrym, J. — “NVIDIA Tesla: A Unified Graphics and Computing Architecture” (IEEE Micro 2008).
- Williams, S., Waterman, A., Patterson, D. — “Roofline: An Insightful Visual Performance Model for Multicore Architectures” (CACM 2009).
- Patarasuk, P. + Yuan, X. — “Bandwidth optimal all-reduce algorithms for clusters of workstations” (JPDC 2009).
- Lavin, A. + Gray, S. — “Fast Algorithms for Convolutional Neural Networks” (Winograd, CVPR 2016).
Triton + JAX + frameworks:
- Tillet, P., Kung, H.T., Cox, D. — “Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations” (MAPL 2019).
- Bradbury, J. et al. — JAX: Composable transformations of Python+NumPy programs (2018+, jax.readthedocs.io).
- Lattner, C. et al. — “MLIR: A Compiler Infrastructure for the End of Moore’s Law” (2020).
- Sabne, A. — “XLA: Compiling Machine Learning for Peak Performance” (Google whitepaper).
Attention + LLM kernels:
- Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C. — “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” (NeurIPS 2022).
- Dao, T. — “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning” (2023).
- Shah, J., Bikshandi, G., Zhang, Y., Fegade, V., Thakkar, V., Dao, T. — “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision” (2024).
- Milakov, M. + Gimelshein, N. — “Online normalizer calculation for softmax” (2018).
Distributed:
- Rajbhandari, S., Rasley, J., Ruwase, O., He, Y. — “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” (SC 2020).
- Shoeybi, M. et al. — “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism” (2019).
- Narayanan, D. et al. — “PipeDream: Generalized Pipeline Parallelism for DNN Training” (SOSP 2019).
- Huang, Y. et al. — “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism” (NeurIPS 2019).
- Xu, Y. et al. — “GSPMD: General and Scalable Parallelization for ML Computation Graphs” (2021).
Architecture whitepapers:
- NVIDIA “H100 Tensor Core GPU Architecture” (Hopper whitepaper, 2022).
- NVIDIA “Blackwell Architecture” (B100/B200 whitepaper, 2024).
- NVIDIA “GB200 NVL72” (2024 reference architecture).
- AMD “CDNA3 Architecture” (MI300 whitepaper, 2023).
- AMD “CDNA4 Architecture” (MI355 whitepaper, 2025).
- Google “TPU v5p” (Norrie et al. 2024); Google “TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning” (ISCA 2023).
Vendor references:
- NVIDIA — CUDA C++ Programming Guide + CUDA C++ Best Practices Guide (docs.nvidia.com/cuda).
- NVIDIA — PTX ISA Reference + Parallel Thread Execution Reference.
- NVIDIA — cuDNN Developer Guide + cuBLAS Documentation + NCCL Documentation + TensorRT Developer Guide.
- OpenAI — Triton Language Reference (triton-lang.org).
- AMD — ROCm Documentation + HIP Programming Guide (rocm.docs.amd.com).
- Khronos — Vulkan 1.3 Specification, SYCL 2020 Specification, OpenCL 3.0 Specification.
- W3C — WebGPU Specification (gpuweb.github.io/gpuweb).
- Apple — Metal Shading Language Specification + MLX Documentation (ml-explore.github.io/mlx).