LLM Inference Optimization — Compute Reference

1. At a glance

Training a frontier model is a one-shot capital cost: a few months on tens of thousands of accelerators, then the artifact is frozen. Inference is the opposite — it is a recurring operational cost that runs for every user request for the lifetime of the model. By 2025 most frontier labs reported that aggregate inference compute had surpassed training compute, and by 2026 the gap is roughly an order of magnitude for any model with non-trivial deployment. The economics of LLM systems are therefore the economics of inference, and the engineering of LLM systems is increasingly the engineering of inference servers.

The three quantities every inference engineer optimizes are:

  • Throughput — tokens generated per second per dollar (or per GPU). This is what determines unit economics for a serving fleet.
  • Latency — time-to-first-token (TTFT) for the user’s first response character, and inter-token latency (ITL) for the streaming experience. Frontier chat targets are TTFT < 500 ms and ITL < 50 ms.
  • Memory footprint — weights plus KV cache plus activations must fit in HBM (high-bandwidth memory) on the accelerator. Memory is the binding constraint for most serving workloads, not raw FLOPs.

A single request to a decoder-only Transformer has two distinct phases with different bottlenecks, and optimizing them as if they were the same workload is the most common architectural mistake:

  • Prefill — the entire input prompt is processed in one parallel forward pass. The KV cache is filled, the first output token is produced. Prefill is compute-bound because the matrix-multiplies have high arithmetic intensity (many FLOPs per byte loaded).
  • Decode — output tokens are produced one at a time, autoregressively. Each step is a small matrix-vector multiply, plus an attention over the growing KV cache. Decode is memory-bandwidth-bound because each new token requires loading the full model weights from HBM but performs only a tiny amount of compute per byte loaded.

Different phases need different optimizations. Prefill rewards FlashAttention, tensor parallelism, and FP8/FP4 math. Decode rewards KV cache compression, GQA/MLA, speculative decoding, and continuous batching. State-of-the-art serving systems in 2026 (DistServe, Splitwise, vLLM v1) increasingly disaggregate the two phases onto different GPU pools to optimize each independently.

2. Memory + compute bound regimes

The fundamental quantity is arithmetic intensity — the ratio of compute operations to bytes of data movement, measured in FLOPs/byte. Every accelerator has a characteristic ridge point on its roofline plot: below the ridge, kernels are memory-bound and the FLOPs unit is idle; above the ridge, kernels are compute-bound and the memory subsystem has spare bandwidth.

For prefill, the cost of one token of input through a Transformer with N parameters is approximately 2N FLOPs: one multiply and one add per weight in the forward matrix-multiplies, ignoring attention-score FLOPs and other non-matmul ops. (The familiar 6N figure is the training cost, which adds the backward pass.) For decode, generating one new token requires loading all weights once, approximately 2N bytes in fp16, but doing only 2N FLOPs of compute on them. The arithmetic intensity of decode is therefore 1 FLOP/byte, which on an H100 (3.35 TB/s HBM bandwidth, ~989 TFLOPs FP16) sits far below the ridge at ~295 FLOPs/byte. Decode is leaving 99.7% of the compute idle waiting on memory.

Concretely, using H100 peak figures (and setting aside that 140 GB of fp16 weights would in practice be sharded across several GPUs):

  • Prefill 1k tokens through a 70B model: ~140 TFLOPs of work, ~140 GB of weight traffic. Bandwidth-time ~42 ms, compute-time ~142 ms. Compute-bound.
  • Decode 1 token through the same 70B model: 140 GB of weight traffic, 140 GFLOPs of compute. Bandwidth-time ~42 ms, compute-time ~0.14 ms. Memory-bound by 300×.
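The roofline arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, not a performance model: it assumes a forward pass costs about 2 FLOPs per parameter per token, that weights are read from HBM exactly once per pass, and that KV-cache traffic and attention FLOPs can be ignored. The `Accelerator` class and `phase_times` function are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    name: str
    hbm_bytes_per_s: float  # peak HBM bandwidth
    flops_per_s: float      # peak dense FP16 throughput

    @property
    def ridge_flops_per_byte(self) -> float:
        # Arithmetic intensity at which compute and bandwidth balance.
        return self.flops_per_s / self.hbm_bytes_per_s


def phase_times(acc, n_params, n_tokens, bytes_per_param=2.0):
    """Return (bandwidth_time_s, compute_time_s) for one forward pass.

    Assumption: weights are streamed once per pass and each token costs
    roughly 2 * n_params FLOPs; everything else is ignored.
    """
    weight_bytes = n_params * bytes_per_param
    flops = 2.0 * n_params * n_tokens
    return weight_bytes / acc.hbm_bytes_per_s, flops / acc.flops_per_s


H100 = Accelerator("H100", hbm_bytes_per_s=3.35e12, flops_per_s=989e12)

prefill_bw, prefill_compute = phase_times(H100, 70e9, n_tokens=1000)
decode_bw, decode_compute = phase_times(H100, 70e9, n_tokens=1)

print(f"ridge: {H100.ridge_flops_per_byte:.0f} FLOPs/byte")
print(f"prefill: bandwidth {prefill_bw * 1e3:.0f} ms, compute {prefill_compute * 1e3:.0f} ms")
print(f"decode:  bandwidth {decode_bw * 1e3:.0f} ms, compute {decode_compute * 1e3:.2f} ms")
```

Whichever of the two times is larger names the bottleneck for that phase.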

This asymmetry is why batching is the single most important inference optimization. If you batch eight decode requests together, the weights are loaded once and amortized across eight tokens, lifting arithmetic intensity from 1 to 8 FLOPs/byte. At a batch of 64 or 128 the compute units on an H100 are still less than half utilized; in this simplified weights-only model it takes a batch of roughly 300 to reach the ridge point. The headline throughput numbers in serving-engine benchmarks (e.g., “vLLM serves 70B at 5000 tok/s on 8×H100”) are accordingly achieved at batch sizes of hundreds of concurrent requests.
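The effect of batch size on decode arithmetic intensity can be sketched as follows. The model is deliberately simplified and is an assumption of this example: only weight traffic is counted, each weight contributes one multiply-add (2 FLOPs) per sequence in the batch, and KV-cache reads are ignored, so real break-even batch sizes will differ.

```python
import math


def decode_intensity(batch_size, bytes_per_param=2.0):
    # Weights are loaded once per step and reused for every sequence in the batch.
    return 2.0 * batch_size / bytes_per_param


def batch_to_reach_ridge(flops_per_s, hbm_bytes_per_s, bytes_per_param=2.0):
    # Smallest batch whose intensity meets the accelerator's ridge point.
    ridge = flops_per_s / hbm_bytes_per_s
    return math.ceil(ridge * bytes_per_param / 2.0)


for b in (1, 8, 64, 128):
    print(b, decode_intensity(b))
print("H100 break-even batch:", batch_to_reach_ridge(989e12, 3.35e12))
```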

The corollary is that low-latency single-stream inference is fundamentally inefficient — you are paying for a GPU that runs at 0.3% utilization. This is why managed inference providers can be cheaper than self-hosting unless you have continuously high traffic.

3. KV cache

The KV cache is the per-request state that makes autoregressive decoding tractable. Without it, every new token would require recomputing attention over the entire history; with it, each step only needs to compute K and V for the new token and append them to the cache.

The size of the KV cache for a single request is:

KV_bytes = 2 (K and V) · n_layers · n_kv_heads · d_head · seq_len · bytes_per_elem

For LLaMA 2 70B (80 layers, 8 KV heads via GQA, 128 d_head) at 4096 context in fp16:

2 · 80 · 8 · 128 · 4096 · 2 = 1,342,177,280 bytes = 1.25 GiB per request

At 32k context this is 10 GiB per request. At a batch size of 32, the KV cache alone consumes 320 GiB, more than four times what a single 80 GB H100 can hold. At long contexts and high concurrency the KV cache, not the weights, is the dominant memory consumer in modern serving.
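The sizing formula translates directly into code. The sketch below uses the LLaMA 2 70B shape quoted above; the function name `kv_cache_bytes` is illustrative, and sizes are reported in binary GiB.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    # Leading factor 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem


per_request_4k = kv_cache_bytes(80, 8, 128, 4096)
per_request_32k = kv_cache_bytes(80, 8, 128, 32768)

print(per_request_4k, per_request_4k / GIB)   # bytes and GiB at 4k context
print(per_request_32k / GIB)                  # GiB at 32k context
print(32 * per_request_32k / GIB)             # GiB for a batch of 32 at 32k
```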

Before grouped-query attention, the problem was even worse. LLaMA 1 65B used 64 full KV heads per layer, making its KV cache 8× larger than LLaMA 2 70B’s. The shift to GQA was the single biggest serving-cost reduction of the 2023-2024 era.

KV cache lifetime spans the entire request: it grows during prefill, then grows by one entry per decode step. When the request finishes, the cache is freed. Long contexts that the user holds open (chat sessions, agent loops with persistent memory) pin KV cache indefinitely and dominate serving cost.

A useful rule of thumb: at any moment a serving fleet’s HBM budget is divided into three buckets — weights (constant, shared across all requests on a replica), KV cache (per-request, grows with sequence length × batch size), and activations (per-step working memory, freed after each forward). Maximizing concurrency means maximizing the KV-cache budget, which means minimizing per-request KV size — every byte saved by GQA, MLA, or KV quantization translates linearly into more concurrent users on the same hardware.

For long-context workloads (100k+ tokens), KV cache becomes the binding constraint on every dimension: it caps batch size, caps replica utilization, and forces eviction policies. Production serving stacks in 2026 routinely implement multi-tier KV storage — hot tier in HBM, warm tier in CPU DRAM via NVLink-C2C or PCIe, cold tier on local NVMe or disaggregated memory pools.

4. KV cache optimizations

Grouped-Query Attention (GQA) and Multi-Query Attention (MQA)

Shazeer 2019 (MQA) and Ainslie et al. 2023 (GQA) reduce the KV cache by sharing K and V tensors across multiple query heads. In standard MHA, each query head has its own K and V; in MQA, all query heads share a single K and V; in GQA, query heads are partitioned into G groups, each sharing one K and V.

LLaMA 2 70B uses GQA with 8 KV heads against 64 query heads — an 8× reduction in KV cache with no measurable quality loss. Mistral 7B, Mixtral, LLaMA 3, Qwen2/3, and essentially every production model since 2023 use GQA. MQA (the extreme G=1 case) is used in PaLM and Falcon but loses some quality.
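The head-sharing scheme can be made concrete with a small mapping function. This is a sketch under one assumption: query heads are grouped contiguously (heads 0..7 share KV head 0, and so on), which is a common convention but not the only possible layout.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_reduction(n_q_heads, n_kv_heads):
    # KV cache shrinks by the number of query heads per shared KV head.
    return n_q_heads / n_kv_heads


# GQA as in LLaMA 2 70B: 64 query heads, 8 KV heads.
print([kv_head_for_query_head(h, 64, 8) for h in (0, 7, 8, 63)])
print(kv_cache_reduction(64, 8))
```

Setting `n_kv_heads=1` gives MQA (every query head maps to KV head 0), and `n_kv_heads=n_q_heads` gives standard MHA.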

Multi-head Latent Attention (MLA)

DeepSeek-V2 (2024) introduced MLA, and DeepSeek-V3 (2024) kept it. MLA compresses the per-token KV state into a low-rank latent representation. Instead of storing K and V directly, it stores a low-dimensional latent vector (typically 512 dimensions) from which K and V are reconstructed via a per-head projection at attention time. This achieves a further 4-5× compression on top of GQA-equivalent baselines.

MLA was the architectural ingredient that let DeepSeek-V3 serve a 671B-parameter MoE model with KV cache comparable to a dense 30B with GQA. It is more expensive at inference time (the projection has to be done) but the trade-off is favorable for long contexts.

PagedAttention (vLLM)

Kwon et al. 2023 SOSP introduced PagedAttention as the core of the vLLM serving engine. Standard KV cache allocation pre-reserves a contiguous block for each request’s maximum expected length, leading to massive internal fragmentation when most requests are much shorter than the maximum. PagedAttention applies the operating-system virtual-memory page-table idea: KV cache memory is split into fixed-size blocks (typically 16 tokens), and a per-request block table maps logical token positions to physical blocks. Blocks are allocated lazily and freed on request completion.

The result is near-zero memory fragmentation, the ability to swap blocks to host memory under pressure, and — critically — the foundation for continuous batching at high concurrency. PagedAttention is now standard in vLLM, SGLang, TensorRT-LLM, and TGI.
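The block-table idea can be illustrated with a toy allocator. This is a simplified sketch of the bookkeeping only, not vLLM's implementation: the class and method names are hypothetical, there is no GPU memory involved, and swapping and copy-on-write are omitted.

```python
class BlockAllocator:
    """Toy paged KV allocator: logical token positions map to physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request_id -> list of physical block ids
        self.num_tokens = {}    # request_id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n % self.block_size == 0:
            # No block yet, or the last block is full: allocate lazily.
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; caller must preempt or swap")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def lookup(self, request_id, position):
        """Translate a logical token position into (physical_block, offset)."""
        table = self.block_tables[request_id]
        return table[position // self.block_size], position % self.block_size

    def free(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")
print(alloc.block_tables["a"], alloc.lookup("a", 16))
```

A request holding 17 tokens occupies two blocks, so at most 15 token slots are wasted per request regardless of the configured maximum length.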

KV cache quantization

Quantizing the KV cache to int8 or int4 cuts its memory to a half or a quarter of the fp16 size. The granularity of the scale factors (per token or per channel) largely determines how much accuracy survives. Notable implementations:

  • KIVI (Liu 2024) — 2-bit KV with per-channel K and per-token V quantization.
  • ZipCache (He 2024) — adaptive bit-width by attention saliency.
  • KVQuant (Hooper 2024) — non-uniform quantization aware of attention sparsity.
  • vLLM, TGI, and TRT-LLM all expose an 8-bit KV cache (FP8 or int8, depending on the engine) as a configuration option.
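As a baseline for what these methods improve on, here is a minimal sketch of plain symmetric int8 quantization with one scale per token vector. It is not KIVI, ZipCache, or KVQuant; the function names are illustrative, and a real kernel would pack codes into integer tensors on the GPU.

```python
def quantize_int8(vector):
    """Symmetric int8 quantization of one token's K or V vector."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    return [c * scale for c in codes]


vec = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(vec)
restored = dequantize_int8(codes, scale)
print(codes, [round(x, 4) for x in restored])
```

The round-trip error of each element is bounded by half a quantization step (`scale / 2`), which is why a single outlier that inflates `scale` hurts every other element in the same vector.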

Sliding window attention

Mistral 7B uses a 4096-token sliding window: each token attends only to the previous 4096 tokens, so the KV cache is capped at that size regardless of total context. Equivalent to a local-attention transformer. Works well for tasks where long-range dependencies are weak, breaks for tasks requiring retrieval from distant context.

StreamingLLM

Xiao et al. 2023 noticed that a few initial “attention sink” tokens absorb disproportionate attention mass. StreamingLLM keeps the first 4-8 tokens plus a sliding window of recent tokens, evicting middle tokens. Enables effectively unbounded streaming without quality collapse.
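Both eviction policies reduce to a rule for which token positions stay in the cache. The sketch below is an illustration with hypothetical names: `n_sink=0` gives a plain sliding window, and `n_sink=4` gives the sink-plus-window policy described for StreamingLLM.

```python
def retained_positions(seq_len, n_sink=4, window=4096):
    """Token positions whose KV entries remain cached after seq_len tokens."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    # Keep the initial sink tokens plus the most recent `window` tokens.
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))


print(retained_positions(10, n_sink=2, window=4))   # sinks + recent window
print(retained_positions(10, n_sink=0, window=4))   # pure sliding window
print(len(retained_positions(100_000)))             # cache size stays bounded
```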

Cross-request prefix sharing

When many requests share a prefix (a long system prompt, a few-shot template, a shared document in a chat), the KV cache for that prefix can be computed once and reused. RadixAttention in SGLang (Zheng 2023) implements this with a radix tree over token prefixes, sharing KV blocks across all requests with a common prefix. Anthropic’s prompt caching, OpenAI’s prompt caching, and Google Gemini’s context caching are all productized versions of this idea, offering discounted input-token pricing on cached prefixes (up to ~90% at some providers).
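Prefix reuse can be sketched with a trie keyed on fixed-size token blocks. This is a simplification of the radix-tree idea, not SGLang's data structure: the `PrefixCache` class is hypothetical, it tracks only which prefixes are cached (no KV tensors), and it has no eviction.

```python
class PrefixCache:
    """Toy block-level prefix index for cross-request KV reuse."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.root = {}  # tuple of block_size token ids -> child node (dict)

    def _blocks(self, tokens):
        # Only complete blocks are shareable; a trailing partial block is skipped.
        full = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[i:i + self.block_size])
                for i in range(0, full, self.block_size)]

    def match(self, tokens):
        """Number of leading tokens whose KV blocks are already cached."""
        node, matched = self.root, 0
        for block in self._blocks(tokens):
            if block not in node:
                break
            node = node[block]
            matched += self.block_size
        return matched

    def insert(self, tokens):
        node = self.root
        for block in self._blocks(tokens):
            node = node.setdefault(block, {})


cache = PrefixCache(block_size=4)
system_prompt = list(range(8))
cache.insert(system_prompt + [100, 101, 102, 103])
print(cache.match(system_prompt + [200, 201, 202, 203, 204]))
```

A new request that shares the 8-token system prompt only needs prefill for the tokens after the matched prefix.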

5. Continuous batching / in-flight batching

Static batching — collect N requests, process them together as one batch, return all N when the slowest finishes — is the wrong scheduling model for autoregressive generation. Output lengths vary by 10× or more between requests, so the entire batch waits for the longest generation while shorter ones idle.

The mismatch is structural: with static batching, the GPU executes at the throughput of the slowest request in the batch; with autoregressive output of unbounded length and unpredictable EOS, that worst case is essentially always near the user-set max_tokens. Effective throughput drops to a small fraction of the GPU’s capacity, and admission control becomes fragile — a single long-generation request can hold a batch slot for tens of seconds, blocking dozens of short requests behind it.

Continuous batching (also called in-flight batching or iteration-level scheduling) operates at the granularity of one decode step rather than one request. The scheduler at each step:

  1. Selects the active requests that have a token to decode.
  2. Runs one forward pass for all of them simultaneously.
  3. Returns completed requests immediately to the user.
  4. Admits new requests, running their prefill alongside other requests’ decode if there is spare capacity.

The technique originated in Orca (Yu et al. 2022 OSDI) and was popularized at scale by vLLM (Kwon 2023). It typically delivers 5-20× throughput improvement over static batching for chat workloads.
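The difference between the two scheduling models shows up even in a toy simulation that counts decode steps. The sketch assumes every step costs the same regardless of batch occupancy, ignores prefill, and requires every output length to be at least 1; function names are illustrative.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Freed slots are refilled from the queue before every decode step."""
    waiting = deque(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # One token generated per request; finished requests leave at once.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lens = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lens, batch_size=4))
print(continuous_batching_steps(lens, batch_size=4))
```

With two long requests mixed among short ones, static batching pays for the long request twice, while continuous batching overlaps the second long request with the tail of the first.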

Production implementations in 2026:

  • vLLM — continuous batching is the default scheduler.
  • TGI (HuggingFace Text Generation Inference) — production-grade since 1.0, used at scale at HuggingFace, Mistral, and most managed providers.
  • NVIDIA TensorRT-LLM — in-flight batching as the primary scheduling mode, tightly integrated with TensorRT kernels.
  • SGLang — continuous batching plus RadixAttention prefix caching.
  • MAX (Modular) — Mojo-implemented engine with continuous batching.
  • DeepSpeed-MII / Inference — Microsoft’s continuous-batching server.

Scheduling trade-offs

Continuous batching exposes a few tunable policies that materially affect production behavior:

  • Max in-flight batch size — caps total concurrency on a replica. Higher means more throughput but more KV cache pressure and longer per-step latency. Tuned per workload.
  • Prefill / decode interleaving — should prefill be done in its own dedicated steps (blocking decode briefly but finishing prefill faster) or chunked into decode steps (smoothing latency but lengthening prefill)? Modern stacks default to chunked prefill with a tunable prefill token budget per step.
  • Eviction policy — when the KV cache is full and a new request arrives, which existing request gets preempted? LRU, longest-context, lowest-priority, or no-preempt-queue are all valid choices with different tail-latency profiles.
  • Admission control — under load, reject new requests rather than degrading service for all. Often integrated with a queue depth check.
  • Priority lanes — production deployments often want to isolate latency-sensitive interactive traffic from throughput-oriented batch jobs. vLLM and SGLang support priority hints in 2026.

6. FlashAttention

The standard attention computation builds a full N×N attention matrix in HBM, applies softmax, then multiplies by V. For N=4096 in fp16 this is a 32 MB intermediate matrix per head per layer per sample. For N=128k it is 32 GB per head, so materializing it for just three heads at once would exceed an H100’s 80 GB. The HBM traffic and quadratic memory dominate cost.

Dao et al. 2022 (FlashAttention) reframed attention as a tiled, IO-aware computation. The Q, K, V matrices are split into blocks; for each query block, the algorithm streams through K and V blocks, computing partial softmax statistics with the online softmax trick, accumulating output in SRAM (the GPU’s on-chip memory), and only writing the final output to HBM. The N×N attention matrix is never materialized in HBM. Extra memory drops from O(N²) to O(N), and HBM traffic falls by a factor that grows with the SRAM size relative to the head dimension, giving 2-4× wall-clock speedup. The result is exact attention, not an approximation, so there is no quality loss.
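The online softmax trick at the heart of the algorithm can be shown for a single query in plain Python. This sketch illustrates only the numerics: it assumes the query is already scaled by 1/sqrt(d), processes one query at a time, and says nothing about GPU tiling, SRAM, or kernel structure.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weight V."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size=2):
    """Stream over K/V blocks keeping only a running max, sum, and output."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale previous partial results to the new maximum (0.0 on first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, 0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values))
```

The two functions agree up to floating-point rounding, while the tiled version never holds more than one block of scores at a time.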

The version history:

  • FlashAttention (Dao 2022) — original tiled algorithm; A100-optimized.
  • FlashAttention-2 (Dao 2023) — better work partitioning across thread blocks, fewer non-matmul ops; 2× faster than v1 on A100, similar on H100.
  • FlashAttention-3 (Shah, Dao et al. 2024) — Hopper-specific: uses Tensor Cores asynchronously via WGMMA, exploits TMA (Tensor Memory Accelerator), supports FP8. ~1.5-2× faster than v2 on H100, ~75% of H100 peak FLOPs.
  • FlashAttention-4 (2025) — Blackwell-specific, integrates with Blackwell’s fifth-generation Tensor Cores and FP4 paths.

FlashAttention is now the default attention backend in PyTorch (torch.nn.functional.scaled_dot_product_attention selects it automatically when shapes and dtypes are compatible), in xformers, in every serving engine, and in every training framework. Not using FlashAttention in 2026 is malpractice.

7. Quantization

Quantization replaces fp16 or bf16 weights and/or activations with lower-precision integer or float formats, trading numerical fidelity for memory and bandwidth savings. Decode is memory-bound, so cutting weight precision is a direct latency win.

Weight-only quantization

Weights are stored quantized but de-quantized on the fly to fp16/bf16 for matrix multiplication with fp16 activations. The de-quantize cost is small compared to the bandwidth saved.

  • GPTQ (Frantar et al. 2022) — second-order Hessian-aware quantization, calibrated on a small dataset. 4-bit weights with minimal quality loss. Foundational; still widely used.
  • AWQ (Lin et al. 2023) — Activation-aware Weight Quantization. Identifies the ~1% of “salient” weight channels by activation magnitude and protects them, scaling others. Slightly better quality than GPTQ at 4-bit; faster inference kernels.
  • GGUF — the file format used by llama.cpp; supports k-quant variants (Q4_K_M, Q5_K_M, Q6_K, Q8_0) optimized for CPU and Apple Silicon.
  • bitsandbytes NF4 / FP4 (Dettmers 2023) — used in QLoRA; “Normal Float 4” format with a quantile-based bin distribution matched to weight distributions.
  • EXL2 (turboderp 2023) — mixed-bitwidth (2-8 bit per channel) calibration; fastest CUDA kernels for consumer GPUs.
  • AutoRound (Intel 2024) — gradient-based rounding optimization, beats GPTQ/AWQ at extreme quantization (3-bit and below).
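All of the methods above refine the same storage scheme: small integer codes plus a scale (and often a zero point) per group of weights. The sketch below shows that scheme with plain round-to-nearest, which is the baseline these methods improve on and not GPTQ or AWQ themselves. The group size of 128 and 16-bit scale and zero point are assumptions chosen for illustration.

```python
def quantize_group_4bit(weights):
    """Asymmetric 4-bit round-to-nearest for one group of weights."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15.0 if hi > lo else 1.0
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group_4bit(codes, scale, lo):
    # Performed on the fly before the fp16 matmul in weight-only schemes.
    return [lo + c * scale for c in codes]


def quantize_row(weights, group_size=128):
    return [quantize_group_4bit(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]


def bits_per_weight(group_size, scale_bits=16, zero_bits=16):
    # 4-bit code plus the amortized cost of per-group metadata.
    return 4 + (scale_bits + zero_bits) / group_size


weights = [((i * 37) % 101) / 50.0 - 1.0 for i in range(256)]
groups = quantize_row(weights, group_size=128)
print(len(groups), bits_per_weight(128))
```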

Weight + activation quantization (W4A4 / W8A8 / FP8)

Quantizing activations as well removes the de-quant overhead in the matmul itself and allows native low-precision Tensor Core paths.

  • W8A8 INT8 — int8 weights and activations. Production-grade since A100; SmoothQuant (Xiao 2023) calibrates activation outliers to keep accuracy.
  • W4A8 / W4A4 — research-grade in 2024-2025; production-grade in 2026 on Blackwell.
  • FP8 — IEEE-style 8-bit float, two variants: e4m3 (4 exponent bits, more precision) and e5m2 (5 exponent bits, more range). Native on NVIDIA Hopper Transformer Engine and AMD MI300X. Standard for both training and inference of frontier models since 2024.
  • FP4 / MX-FP4 — 4-bit float (e2m1) with per-block scale factors (MX = micro-scaled). Native on NVIDIA Blackwell B100/B200 and AMD MI355X (2025). Halves memory vs FP8 with carefully calibrated accuracy. Frontier inference moved to FP4 in 2025-2026.

Quantization techniques

  • SmoothQuant (Xiao 2023) — pre-scales weights and activations so that activation outliers (the main source of accuracy loss in W8A8) are migrated into weights, where they are easier to quantize.
  • SpinQuant (Liu 2024) — applies learned rotation matrices to weight and activation channels before quantization, smoothing the distribution.
  • QuaRot (Ashkboos 2024) — Hadamard rotations to suppress outliers; enables W4A4 with minimal loss.
  • OmniQuant (Shao 2023) — jointly learns quantization parameters for weights and activations.
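The outlier-migration idea behind SmoothQuant can be checked numerically. The sketch uses the per-channel scale s_j = max|X_j|^α / max|W_j|^(1−α) with α = 0.5; matrices are plain nested lists, every channel is assumed to have a nonzero maximum, and no actual quantization is performed.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smooth_scales(X, W, alpha=0.5):
    """One scale per input channel j; X is [tokens][in], W is [in][out]."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    # Divide activations and multiply weights by the same per-channel scale.
    Xs = [[x / s for x, s in zip(row, scales)] for row in X]
    Ws = [[w * s for w in row] for row, s in zip(W, scales)]
    return Xs, Ws


X = [[1.0, 100.0], [-2.0, 80.0]]   # channel 1 carries an activation outlier
W = [[0.5, -0.25], [0.1, 0.2]]
scales = smooth_scales(X, W)
Xs, Ws = apply_smoothing(X, W, scales)
print(matmul(X, W))
print(matmul(Xs, Ws))
```

The product is unchanged, but the largest activation magnitude drops sharply, so an int8 grid over the smoothed activations wastes far less of its range on the outlier channel.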

Practical recommendation (2026)

For weight-only deployment on consumer hardware: AWQ 4-bit or EXL2 5-bit. For server GPU deployment: FP8 if on H100/H200/MI300X, FP4 if on B200/MI355X. For CPU + Apple Silicon: GGUF Q4_K_M or Q5_K_M.

8. Speculative decoding

Speculative decoding (Leviathan et al. 2023 ICML, Chen et al. 2023) is the canonical algorithmic trick for accelerating decode without quality loss. The idea:

  1. A small, fast draft model autoregressively proposes K candidate tokens.
  2. The large target model scores all K candidates in parallel — one forward pass, since the candidates form a known sequence.
  3. The candidates are accepted token-by-token using a rejection-sampling rule that provably preserves the target model’s distribution.
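  4. On the first rejection, sample one corrective token from the target’s residual distribution (the normalized positive part of target minus draft probabilities) and discard the remaining draft tokens. If all K candidates are accepted, the same target pass yields one extra token.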
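The acceptance rule in steps 3 and 4 can be sketched as follows. Distributions are plain dicts from token id to probability, the function names are hypothetical, and the draft and target models are replaced by precomputed distributions, so this shows only the accept/reject logic.

```python
import random


def sample(weights, rng):
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def verify_draft(draft_tokens, q_dists, p_dists, rng):
    """Return the tokens emitted after verifying K draft tokens.

    q_dists[i] is the draft distribution that produced draft_tokens[i];
    p_dists has K + 1 target distributions from the single verification pass.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the normalized positive part of (p - q).
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        out.append(sample(residual, rng))
        return out
    # Every draft token accepted: take a bonus token from the target.
    out.append(sample(p_dists[len(draft_tokens)], rng))
    return out


rng = random.Random(0)
p = {0: 0.6, 1: 0.4}   # target distribution
q = {0: 0.2, 1: 0.8}   # draft distribution
first = [verify_draft([sample(q, rng)], [q], [p, p], rng)[0] for _ in range(20000)]
print(first.count(0) / len(first))
```

Even though the draft strongly prefers token 1, the emitted token follows the target distribution, which is the property that makes the method lossless.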

The expected speedup is determined by the acceptance rate α: roughly (1−α^(K+1)) / ((1−α)(K·c+1)) where c is the draft/target cost ratio. With α ≈ 0.7 and a draft 20× cheaper than the target, speedups of 2-3× are typical at no quality loss.
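The speedup expression is easy to evaluate for candidate settings. The sketch below implements it as written, with `cost_ratio` standing for c; it assumes a constant per-token acceptance rate and ignores batching effects.

```python
def expected_speedup(alpha, k, cost_ratio):
    """Expected speedup for acceptance rate alpha, K draft tokens, cost ratio c."""
    if alpha >= 1.0:
        tokens_per_round = k + 1.0
    else:
        tokens_per_round = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # One round costs K draft passes plus one target pass, in target-pass units.
    return tokens_per_round / (k * cost_ratio + 1.0)


for k in (2, 4, 8):
    print(k, round(expected_speedup(0.7, k, 0.05), 3))
```

With α = 0.7 and a draft 20× cheaper than the target (c = 0.05), K = 4 gives roughly 2.3×, consistent with the 2-3× range quoted above. When α is low, the value drops below 1 and speculation becomes a net loss.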

Variants:

  • Medusa (Cai et al. 2024) — instead of a separate draft model, fine-tune K extra LM heads on top of the target model that predict tokens at offsets 1..K simultaneously. Avoids the separate-model deployment problem. ~2× speedup typical.
  • EAGLE / EAGLE-2 (Li et al. 2024) — autoregressive draft head operating on the target’s feature space rather than the output space; higher acceptance rate. EAGLE-2 adds dynamic draft tree search. ~3-4× speedups reported.
  • Lookahead Decoding (Fu et al. 2024) — uses n-gram statistics from previous decoded tokens to propose drafts; no separate draft model needed.
  • Self-Speculative Decoding — use early-exit predictions of the target model itself as drafts; cheap because no extra model.
  • Speculative Streaming (Bhendawade 2024) — fuses draft and target in a single model with multi-stream attention.

Speculative decoding is supported in vLLM, TensorRT-LLM, and SGLang, generally as an opt-in configuration. Acceptance rates are workload-dependent: high (0.8+) for predictable text like code completion, lower (0.5-0.7) for creative or out-of-distribution generation.

Practical deployment considerations

Speculative decoding helps latency-bounded single-stream inference the most. At very high batch sizes the target model is already close to compute-bound, so the additional parallel verification work competes for compute that is no longer spare, and speedups shrink. Production rules of thumb in 2026:

  • For batch size 1-4 (chat, code completion, agents): speculative decoding is a 2-4× win and should be on by default.
  • For batch size 8-32 (medium-load serving): speculative gains shrink to 1.3-1.8×.
  • For batch size 64+ (saturated serving): speculative decoding rarely helps and can hurt if the draft is expensive.

The draft model is itself a tuning surface. Common choices: a same-family small model (e.g., Llama 3.2 1B as draft for Llama 3.1 70B), a small draft distilled on target outputs, extra prediction heads trained on top of the target (Medusa, EAGLE), or even a quantized version of the target itself (self-draft). Each has a different acceptance-rate / cost profile.

Speculation interacts strangely with constrained decoding (JSON-mode, grammar-constrained generation, regex masks). The draft model does not know the grammar, so it proposes tokens that the target would reject not for distribution reasons but for grammar reasons. Acceptance rates collapse. Modern stacks (SGLang, vLLM 0.6+) integrate the grammar mask into the draft proposal step to recover acceptance, but it remains a sharp edge.

9. Parallelism for inference

A 70B model in fp16 needs 140 GB of weights; an H100 has 80 GB. A 405B or 671B model needs hundreds of GB. Multi-GPU serving is unavoidable for frontier models. The four parallelism dimensions:

Tensor parallelism (TP)

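Shard each matrix-multiply across GPUs. For a [M, K] @ [K, N] → [M, N] matmul, one option is to partition the K dimension across G GPUs: each GPU multiplies its slice of the left matrix’s columns by the matching rows of the right matrix, producing a full-size partial product, and an all-reduce sums the partials. The other option is to partition N, which needs no reduction but leaves the output sharded by column. Megatron-LM (Shoeybi 2019), the canonical implementation, pairs a column-split matmul with a K-split one so that each attention or MLP block needs only one all-reduce.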
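Partitioning the K dimension can be demonstrated without any GPUs. In this sketch each "GPU" is just a loop iteration and the all-reduce is an element-wise sum over the partial products; it assumes K divides evenly by the number of shards, and the function names are illustrative.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def split_k_matmul(A, B, n_gpus):
    """Compute A @ B by sharding the inner (K) dimension across n_gpus."""
    k = len(B)
    if k % n_gpus != 0:
        raise ValueError("K must be divisible by n_gpus")
    shard = k // n_gpus
    partials = []
    for g in range(n_gpus):
        A_g = [row[g * shard:(g + 1) * shard] for row in A]  # column slice of A
        B_g = B[g * shard:(g + 1) * shard]                   # matching rows of B
        partials.append(matmul(A_g, B_g))                    # full [M, N] partial
    # Stand-in for the all-reduce: sum the partial products element-wise.
    return [[sum(p[i][j] for p in partials) for j in range(len(B[0]))]
            for i in range(len(A))]


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [2, 3], [4, 5]]
print(split_k_matmul(A, B, n_gpus=2))
```

Each shard holds only 1/G of both operands, but every shard produces a full-size output, which is why the reduction step is unavoidable in this layout.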

TP communicates every layer (twice per Transformer block — once after attention, once after MLP), so the all-reduce latency is amplified. Practical only within a single server where NVLink provides 900 GB/s (H100) or 1800 GB/s (B200) interconnect. Cross-server TP over InfiniBand is too slow.

Typical TP degree: 4 on a 4×H100 box, 8 on an 8×H100 box, up to 72 on a GB200 NVL72 rack (effective intra-domain NVLink).

Pipeline parallelism (PP)

Split the model layer-wise across GPUs, with each GPU holding a contiguous slice (e.g., layers 0-19 on GPU 0, 20-39 on GPU 1). Activations flow forward through the pipeline; one micro-batch at a time.

Communication is once per micro-batch boundary, much lighter than TP. Cross-server PP over InfiniBand or RoCE is practical. The downside is pipeline bubble: at the start and end of a batch, some stages are idle. Best for large batch sizes that amortize the bubble.
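The size of the bubble can be estimated with a standard formula for a simple fill-and-drain schedule: with p stages and m micro-batches, the idle fraction is (p − 1) / (m + p − 1). This formula is not stated in the text above and assumes all stages take equal time per micro-batch.

```python
def pipeline_bubble_fraction(n_stages, n_microbatches):
    """Idle fraction of a fill-and-drain pipeline with equal-cost stages."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)


for m in (1, 8, 56):
    print(m, round(pipeline_bubble_fraction(8, m), 3))
```

With 8 stages, a single micro-batch leaves the pipeline idle 87.5% of the time, while 56 micro-batches bring the bubble down to about 11%.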

PP is used to span servers in frontier deployment; combined with TP within each server. For DeepSeek-V3 671B on 64 H100s, a typical config is TP=8 within a node, PP=8 across nodes.

Expert parallelism (EP)

For Mixture-of-Experts models, place different experts on different GPUs and route each token to the GPUs that hold its selected experts. The all-to-all routing communication is the main cost, so a high-bandwidth interconnect is mandatory. DeepEP (DeepSeek’s expert-parallel communication library, 2025) and NVIDIA Megatron-Core support EP.

Modern MoE models (Mixtral 8x22B, DeepSeek-V3, Qwen MoE, Llama 4 Maverick) all use EP at serving time. Combined with TP and PP, EP is the third dimension of a 3D parallelism grid.

Sequence parallelism (SP)

Split the sequence dimension across GPUs for long-context cases where activation memory dominates. Reduces activation memory by the SP degree. Often combined with TP — TP across heads, SP across sequence positions. Ring Attention (Liu et al. 2023) and DeepSpeed Ulysses are the main implementations.

Disaggregated prefill + decode

DistServe (Zhong et al. 2024) and Splitwise (Patel et al. 2024, Microsoft) observed that prefill (compute-bound) and decode (memory-bound) have opposing optimal hardware configurations. They split the serving fleet into two pools:

  • Prefill pool — large batches, optimized for FLOPs; can use older or compute-strong accelerators.
  • Decode pool — many concurrent streams with continuous batching, optimized for memory bandwidth; benefits from HBM-rich accelerators like MI300X or H200.

After prefill completes, the KV cache is transferred from the prefill pool to the decode pool over a fast interconnect. Reported throughput gains: 2-4× over a unified serving stack at the same hardware budget.

By 2026, DistServe-style disaggregation is standard at hyperscale (OpenAI, Anthropic, DeepInfra, Together) and becoming available in open-source servers (vLLM v1, SGLang).

10. Serving stacks (2026)

vLLM

Berkeley’s vLLM (Kwon et al. 2023 SOSP) is the most widely used open-source inference engine. Native PagedAttention, continuous batching, multi-LoRA serving (LoRAX-style), prefix caching, speculative decoding, FP8 / FP4 quantization, multi-modal (vision-language) support, and a growing distributed deployment story (vLLM v1, 2025). The OpenAI-compatible HTTP API made it the default backend for almost every self-hosted LLM stack.

TGI (Text Generation Inference)

HuggingFace’s production server, in-house since 2022 and powering Hugging Chat, IBM watsonx, and many enterprise deployments. Continuous batching, FlashAttention, GPTQ/AWQ/EETQ quantization, tensor parallelism, custom Rust networking stack. Less feature-rich than vLLM in 2026 but very robust.

TensorRT-LLM

NVIDIA’s optimized engine using TensorRT compiler under the hood. Fastest absolute performance on NVIDIA hardware (H100/H200/B200), particularly for FP8/FP4 paths and large-batch decode. Complex setup — requires model-specific engine builds and tuning. Includes in-flight batching, paged KV, FlashAttention-3, speculative decoding.

SGLang

Zheng et al. 2023; primarily a Berkeley/Stanford project. Optimized for structured generation, agent workflows, and chained LLM calls. RadixAttention prefix-sharing cache, continuous batching, speculative decoding, DeepSeek-V3 first-class support. Frequently the fastest open server for chat workloads with shared system prompts.

MAX / Mojo

Modular Inc.’s engine (Chris Lattner et al., 2023+). Mojo-implemented kernel stack; first serving engine targeting CPUs and GPUs from a single high-level codebase. As of 2026 still smaller market share than vLLM but technically interesting for its Python-superset programming model.

llama.cpp / ollama / MLX

llama.cpp (Georgi Gerganov 2023+) — pure C/C++ inference with GGUF model files, CPU-first, supports CUDA, Metal, Vulkan, ROCm backends. The hobbyist and edge-deployment standard.

ollama — user-friendly wrapper around llama.cpp; the default way most developers run local LLMs.

MLX (Apple 2023+) — Apple Silicon native, lazy evaluation, unified-memory-aware. The fastest local-inference stack on M-series Macs.

LMDeploy / Xinference / DeepSpeed-MII

LMDeploy (Shanghai AI Lab 2023+) — TurboMind kernels, China-popular, strong on Qwen and InternLM.

Xinference — multi-engine orchestrator (wraps vLLM, llama.cpp, transformers).

DeepSpeed-MII — Microsoft’s inference serving library built on DeepSpeed; continuous batching, ZeRO-Inference. Used internally at Azure.

Managed API providers (2026)

  • Fireworks AI — fast and cheap open-model serving, custom speculative decoding stack.
  • Together AI — broad model catalog, OpenAI-compatible API, dedicated endpoints.
  • Anyscale — Ray Serve based, enterprise focus.
  • Replicate / Modal / Banana — serverless GPU inference, cold-start optimized.
  • Cerebras Inference — wafer-scale chip, fastest tokens/sec for medium models.
  • Groq Cloud — LPU-based, sub-100-ms latency for chat-size models.
  • SambaNova Cloud — reconfigurable RDU, large-model focus.
  • OpenRouter — aggregator/router across many providers.
  • AWS Bedrock / Azure AI / Google Vertex — hyperscaler offerings; multi-model brokered access.

11. Hardware

NVIDIA

  • A100 (Ampere, 2020) — 40/80 GB HBM2e, 2.0 TB/s bandwidth, 312 TFLOPs FP16. Still ubiquitous in training and inference fleets in 2026 for non-frontier workloads.
  • H100 (Hopper, 2022) — 80 GB HBM3, 3.35 TB/s, 989 TFLOPs FP16 and 1979 TFLOPs FP8 (both dense; structured sparsity doubles each). Transformer Engine introduces native FP8. The 2023-2025 workhorse.
  • H200 (Hopper refresh, 2024) — 141 GB HBM3e, 4.8 TB/s. Same compute as H100 but ~1.7× memory and 1.4× bandwidth; huge win for memory-bound decode.
  • B100 / B200 (Blackwell, 2025) — dual-die package, 192 GB HBM3e, 8.0 TB/s aggregate, 4.5 PFLOPs FP8, 9 PFLOPs FP4. Native FP4 Tensor Cores; fifth-generation NVLink at 1.8 TB/s.
  • GB200 NVL72 — rack-scale unit: 72 B200 GPUs + 36 Grace CPUs, all connected by NVLink Switch into a single coherent 13.8 TB HBM domain. Designed for trillion-parameter inference.
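  • GB300 (Blackwell Ultra, 2025) / Rubin (2026) — successive refreshes: Blackwell Ultra raises capacity to 288 GB of HBM3e; Rubin moves to HBM4 at a roadmap figure of roughly 12 TB/s per GPU, with FP4/FP6 mixed precision.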

AMD

  • MI250X (CDNA2, 2021) — 128 GB HBM2e, 3.2 TB/s. Frontier supercomputer base.
  • MI300X (CDNA3, 2023) — 192 GB HBM3, 5.3 TB/s, 1.3 PFLOPs FP16. Native FP8. The first AMD chip widely deployed for LLM inference (Microsoft Azure, Meta, Oracle).
  • MI325X (2024) — 256 GB HBM3e, 6.0 TB/s. Memory-record for a single accelerator at launch.
  • MI355X (CDNA4, 2025) — 288 GB HBM3e, 8 TB/s, native FP4 + FP6. Direct B200 competitor.
  • MI400 / MI500 (2026 roadmap) — rack-scale with Infinity Fabric over Ethernet for cross-server coherence.

Google TPU

  • TPU v4 (2020-2022) — 32 GB HBM, 1.2 TB/s, 275 TFLOPs bf16. PaLM training chip.
  • TPU v5e (2023) — cost-optimized inference variant; 16 GB HBM.
  • TPU v5p (2023) — flagship training chip; 95 GB HBM, 2.7 TB/s.
  • TPU v6 Trillium (2024) — 4.7× v5e per-chip performance, 32 GB HBM3.
  • TPU v7 Ironwood (2025) — Google’s first inference-optimized TPU; 192 GB HBM3e per chip, 7 TB/s, designed for “age of inference” workloads.

Specialty accelerators

  • Cerebras CS-3 (2024) — wafer-scale; 900,000 cores, 44 GB on-die SRAM, 21 PB/s on-chip memory bandwidth. Cerebras Inference offers 70B serving at thousands of tok/s — the SRAM-resident architecture eliminates HBM bottleneck entirely. Limited by total memory: model must fit in on-chip SRAM (with MemoryX off-chip weight streaming for very large models).
  • Groq LPU (Language Processing Unit) — fully SRAM-resident, deterministic execution, no batching needed. Hundreds of LPUs chained for a 70B model. Lowest single-stream latency in the industry (500+ tok/s on 70B).
  • SambaNova SN40L (2024) — Reconfigurable Dataflow Unit with three-tier memory: 520 MB on-chip SRAM, 64 GB HBM, 1.5 TB DDR. Targets very large models with weight streaming.
  • Tenstorrent Wormhole / Blackhole — RISC-V + Tensix cores, scalable mesh, open-source software stack.
  • Apple Neural Engine + MLX — M3 Max, M4 Max, and M3 Ultra Mac systems run Llama 3 70B in 4-bit at 5-12 tok/s; unified memory architecture means CPU + GPU + NE share the same RAM.

Hardware selection by workload

  • Frontier-model serving at scale — B200 / GB200, H200, MI355X.
  • Memory-bound long-context decode — H200, MI325X, MI355X, Ironwood.
  • Cost-optimized open-model serving — H100, MI300X, A100 (still cheap on spot).
  • Lowest latency single-stream chat — Groq, Cerebras, SambaNova.
  • Edge / on-device — Apple Silicon + MLX, Qualcomm AI Hub (Hexagon NPU), Intel Lunar Lake NPU.

Interconnect matters as much as the chip

A common procurement mistake is to compare GPUs on per-chip TFLOPs or HBM bandwidth alone, ignoring the interconnect topology that determines whether a multi-chip deployment can sustain tensor-parallel inference. Concretely:

  • NVLink 4 (H100): 900 GB/s per GPU, all-to-all within an 8-GPU server.
  • NVLink 5 (B200): 1800 GB/s per GPU; NVLink Switch extends coherent NVLink across a 72-GPU rack.
  • Infinity Fabric (MI300X): 896 GB/s per GPU intra-server; cross-server is PCIe 5 or Ethernet/InfiniBand at the network speed.
  • InfiniBand NDR (400 Gb/s = 50 GB/s): typical cross-server interconnect; roughly 18× slower per link than intra-server NVLink 4 (900 GB/s), with higher per-message latency.
  • TPU ICI (Inter-Chip Interconnect): 4.8 Tb/s aggregate per chip on v5p; 3D-torus topology scales to TPU pods of thousands.

The interconnect determines maximum tensor-parallel degree, which determines maximum model size that can be served with reasonable latency. An 8×H100 server can TP=8 a 70B model comfortably; trying to TP across two servers over InfiniBand collapses latency.
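The latency gap can be estimated with a back-of-the-envelope model. The sketch below assumes a ring all-reduce, two all-reduces per Transformer layer, a 70B-like shape (80 layers, hidden size 8192), and illustrative per-hop latencies of 2 µs for NVLink and 5 µs for InfiniBand. All of these are assumptions for illustration, not measured values, and real collective libraries use more sophisticated algorithms.

```python
def ring_allreduce_seconds(message_bytes, n_gpus, link_gb_per_s, hop_latency_us):
    """Estimate one ring all-reduce: 2*(n-1) steps, each moving message/n bytes per GPU."""
    steps = 2 * (n_gpus - 1)
    bandwidth_term = steps * (message_bytes / n_gpus) / (link_gb_per_s * 1e9)
    latency_term = steps * hop_latency_us * 1e-6
    return bandwidth_term + latency_term


def tp_comm_ms_per_step(batch_tokens, hidden, layers, n_gpus,
                        link_gb_per_s, hop_latency_us, bytes_per_elem=2):
    """Communication time per decode step, assuming two all-reduces per layer."""
    message_bytes = batch_tokens * hidden * bytes_per_elem
    per_layer = 2 * ring_allreduce_seconds(
        message_bytes, n_gpus, link_gb_per_s, hop_latency_us)
    return layers * per_layer * 1e3


# Assumed 70B-like shape: 80 layers, hidden size 8192, fp16 activations, batch of 64.
nvlink_ms = tp_comm_ms_per_step(64, 8192, 80, n_gpus=8,
                                link_gb_per_s=900, hop_latency_us=2)
# TP=16 across two servers: the ring is bounded by its slowest link (50 GB/s InfiniBand).
ib_ms = tp_comm_ms_per_step(64, 8192, 80, n_gpus=16,
                            link_gb_per_s=50, hop_latency_us=5)
print(f"TP=8 over NVLink: {nvlink_ms:.1f} ms/step")
print(f"TP=16 over InfiniBand: {ib_ms:.1f} ms/step")
```

Under these assumptions the cross-server configuration spends several times longer on communication per decode step, and that time is added to every generated token.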

12. Optimizations not covered above

State-space models and hybrids

  • Mamba / Mamba-2 (Gu + Dao 2023, Dao + Gu 2024) — selective state-space models with linear-time inference; constant KV-like memory.
  • Jamba (AI21 2024) — Transformer + Mamba hybrid; long context with low memory.
  • Mixture-of-Recursions (2025) — recursive blocks with state reuse for inference efficiency.
  • RWKV — RNN-style architecture with Transformer-quality results; trivially constant memory at inference.

Mixture of Depths

Raposo et al. 2024 — each layer’s router decides which tokens get the expensive full computation and which take a cheap skip path. Conditional compute that targets inference cost, not just parameter count like MoE.
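A toy sketch of the routing rule may help. It treats each token's hidden state as a single float and the layer as an arbitrary callable, both simplifications for illustration. The function name `mixture_of_depths_layer` and the `capacity` parameter are hypothetical, and the sketch shows only the top-k selection, not how routing decisions are made causally during autoregressive decoding.

```python
def mixture_of_depths_layer(hidden, router_scores, capacity, block):
    """Send the `capacity` highest-scoring tokens through `block`; others skip via the residual."""
    ranked = sorted(range(len(hidden)), key=lambda i: router_scores[i], reverse=True)
    chosen = set(ranked[:capacity])
    out = []
    for i, x in enumerate(hidden):
        if i in chosen:
            # Expensive path: residual plus the block output, scaled by the router score.
            out.append(x + router_scores[i] * block(x))
        else:
            # Cheap path: the token passes through unchanged.
            out.append(x)
    return out, sorted(chosen)


hidden = [1.0, 2.0, 3.0, 4.0]
scores = [0.9, 0.1, 0.5, 0.2]
out, routed = mixture_of_depths_layer(hidden, scores, capacity=2, block=lambda x: 10 * x)
print(routed, out)
```

With a capacity of 2 out of 4 tokens, the block runs on half the sequence, which is where the compute saving comes from.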

Prompt caching at the application layer

In addition to per-server RadixAttention, providers offer prompt caching as a billing-level feature: cache the prefix server-side for a few minutes, charge ~10% of input cost on cache hits. Anthropic’s prompt caching (2024), OpenAI’s prompt caching (2024), Google Gemini’s context caching, and DeepInfra are all live. Critical for agent loops where the same system prompt and tool definitions repeat thousands of times per session.
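The effect on an agent loop's bill can be estimated with a short calculation. The sketch below assumes cache hits are billed at 10% of the normal input price, as described above, and ignores any surcharge a provider may apply to cache writes. The function name and the example price of $3 per million input tokens are illustrative assumptions.

```python
def effective_input_cost(price_per_mtok, cached_fraction, hit_rate, cache_discount=0.10):
    """Blended $ per million input tokens.

    cached_fraction: share of each prompt that is a reusable prefix
    hit_rate:        share of requests whose prefix is still in the cache
    """
    cached = cached_fraction * hit_rate
    return price_per_mtok * (cached * cache_discount + (1.0 - cached))


# Hypothetical agent loop: 90% of each prompt is system prompt + tool definitions,
# and 95% of requests arrive while the prefix is still cached.
blended = effective_input_cost(3.0, cached_fraction=0.9, hit_rate=0.95)
print(f"${blended:.4f} per million input tokens instead of $3.00")
```

The saving depends on both factors at once: a long shared prefix is worth little if requests arrive after the cache entry has expired.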

Compilation and code generation

  • torch.compile (PyTorch 2.0+) — TorchDynamo + TorchInductor; produces fused kernels.
  • XLA (Google) — TensorFlow/JAX compiler, default for TPU.
  • Triton (OpenAI 2021+) — Python-DSL GPU kernel programming; Triton implementations of FlashAttention exist, although the reference implementation is written in CUDA.
  • MLIR — multi-level IR underlying MAX, Mojo, and several other stacks.
  • CUTLASS / cuBLASLt (NVIDIA) — template kernels for matmul; fp8/fp4 paths.

Multi-LoRA serving

Serve many fine-tuned LoRA adapters from a single base-model deployment, swapping adapter weights per request. LoRAX, vLLM multi-LoRA, and S-LoRA (Sheng 2023) enable this with negligible overhead — critical for personalization-at-scale.
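The per-request computation is small, as a pure-Python sketch of one linear layer shows. The helper names (`matvec`, `lora_linear`, `serve_batch`) and the adapter id `tenant-a` are hypothetical, and production servers fuse this logic into batched GPU kernels rather than looping over requests.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, base_weight, adapter=None, scale=1.0):
    """y = W x + scale * B (A x); the base weight W is shared by every request."""
    y = matvec(base_weight, x)
    if adapter is None:
        return y
    lora_a, lora_b = adapter  # A: (r x d_in), B: (d_out x r), with small rank r
    delta = matvec(lora_b, matvec(lora_a, x))
    return [yi + scale * di for yi, di in zip(y, delta)]


def serve_batch(requests, base_weight, adapters):
    """Each request is (input_vector, adapter_id); adapter_id None means base model only."""
    return [lora_linear(x, base_weight, adapters.get(adapter_id))
            for x, adapter_id in requests]


base = [[1.0, 0.0], [0.0, 1.0]]                         # identity, for readability
adapters = {"tenant-a": ([[1.0, 1.0]], [[1.0], [0.0]])}  # rank-1 adapter
outputs = serve_batch([([2.0, 3.0], "tenant-a"), ([2.0, 3.0], None)], base, adapters)
print(outputs)
```

Because the low-rank product costs far less than the base matmul, many adapters can share one copy of the base weights in HBM.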

Chunked prefill

Chunked prefill splits a long prefill into chunks of a fixed token budget (e.g., 512 tokens at a time) and interleaves those chunks with the decode steps of other requests in the same batch. This smooths the latency profile so long-prefill requests do not starve short-decode requests of the GPU. Supported in vLLM since 0.4 (enabled by default in more recent releases) and in SGLang.
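A minimal scheduling sketch makes the interleaving concrete. It assumes a fixed set of active decode requests that each consume one token of budget per step, with the remaining budget handed to waiting prompts in FIFO order. The function name `plan_chunked_prefill` is hypothetical and this is not how any particular server implements its scheduler.

```python
from collections import deque


def plan_chunked_prefill(prompt_lengths, active_decodes, token_budget=512):
    """Return per-step plans: decodes first, leftover budget filled with prefill chunks."""
    if active_decodes >= token_budget:
        raise ValueError("token budget must exceed the number of active decode requests")
    pending = deque(prompt_lengths)  # remaining prompt tokens per waiting request
    steps = []
    while pending:
        budget = token_budget - active_decodes  # decodes always get their one token
        chunks = []
        while budget > 0 and pending:
            take = min(budget, pending[0])
            chunks.append(take)
            budget -= take
            if take == pending[0]:
                pending.popleft()       # this prompt is fully prefilled
            else:
                pending[0] -= take      # the rest waits for the next step
        steps.append({"decode_tokens": active_decodes, "prefill_chunks": chunks})
    return steps


plan = plan_chunked_prefill([1200, 300], active_decodes=32)
for i, step in enumerate(plan):
    print(i, step)
```

In this simplified model a finished prefill does not join the decode set; a real scheduler would add it and shrink the prefill budget accordingly.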

Disaggregated KV cache transfer

When prefill and decode pools are physically separated, the KV cache must move from prefill GPUs to decode GPUs. Naive transfer over PCIe is slow; production systems use NVLink-C2C, NVSHMEM, RDMA over InfiniBand, or direct GPUDirect Storage paths. Mooncake (Moonshot AI 2024) and TetriInfer (Huawei Cloud 2024) are notable disaggregated designs that transfer KV cache between pools.
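The size of the transfer is easy to estimate. The sketch below assumes a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and an fp16 cache; these shape parameters are assumptions for illustration, and the link speeds are the per-link figures quoted in the interconnect list earlier.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values for every layer and KV head, for one sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len


def transfer_ms(num_bytes, link_gb_per_s):
    """Ideal transfer time at full link bandwidth, ignoring protocol overhead."""
    return num_bytes / (link_gb_per_s * 1e9) * 1e3


size = kv_cache_bytes(seq_len=32_768, layers=80, kv_heads=8, head_dim=128)
print(f"KV cache for a 32k-token prompt: {size / 2**30:.1f} GiB")
for name, gb_per_s in [("InfiniBand NDR", 50), ("NVLink 4", 900)]:
    print(f"{name}: {transfer_ms(size, gb_per_s):.0f} ms")
```

At long context the transfer moves gigabytes per request, so the link choice adds directly to time-to-first-token on the decode side.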

Attention quantization (not just weights)

QServe (Lin 2024) and Atom (Zhao 2024) extend INT4/INT8 quantization to the attention computation itself, not just weights and KV cache. Particularly valuable on B200’s FP4 paths where attention can natively execute at FP4.

Tree-based speculative decoding

Instead of a linear K-token draft, propose a tree of candidate continuations and verify the entire tree in one batched forward. Medusa, EAGLE-2, and Sequoia (Chen 2024) use tree drafts. Higher acceptance per verification but more compute per step; net win at small batch sizes.
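Two pieces make tree verification work: an attention mask that lets each draft token see only its own ancestors, and a rule for picking the accepted branch. The sketch below uses a simple greedy acceptance rule (a draft token is accepted if it equals the target model's top prediction after its parent), which is an illustrative assumption; the cited systems define their own acceptance criteria. Function and argument names are hypothetical.

```python
def tree_attention_mask(parents):
    """mask[i][j] is True when draft node i may attend to node j (itself or an ancestor).

    parents[i] is the index of node i's parent, or -1 for a node that follows the prompt.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def longest_accepted_path(parents, draft_tokens, target_next, root_target):
    """Greedy verification; assumes every parent index precedes its children.

    target_next[i] is the target model's top token after node i;
    root_target is its top token right after the prompt.
    """
    accepted = {}  # node index -> list of node indices from the root to that node
    best = []
    for i, p in enumerate(parents):
        expected = root_target if p == -1 else target_next[p]
        parent_ok = p == -1 or p in accepted
        if parent_ok and draft_tokens[i] == expected:
            accepted[i] = (accepted[p] if p != -1 else []) + [i]
            if len(accepted[i]) > len(best):
                best = accepted[i]
    return best


parents = [-1, 0, 0, 1, 2]                    # node 0 has children 1 and 2, and so on
draft = ["the", "cat", "dog", "sat", "ran"]
target_next = ["dog", "on", "ran", "down", "fast"]
mask = tree_attention_mask(parents)
path = longest_accepted_path(parents, draft, target_next, root_target="the")
print(path, [draft[i] for i in path])
```

All five draft nodes are scored in one forward pass under the mask, and the branch through "dog" and "ran" is kept while the "cat" branch is discarded.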

13. Cost reality (2026)

Per-million-token pricing benchmarks

  • Open-weight 70B class (LLaMA 3.3 70B, Qwen 2.5 72B): $0.50-3 per million tokens via Together, Fireworks, DeepInfra, Replicate.
  • Open-weight large MoE (DeepSeek-V3 671B-A37B, Llama 4 Maverick): $1-3 per million output tokens.
  • Frontier closed models: Claude Opus 4 / GPT-5 class — $10-75 per million output tokens.
  • Frontier with caching: input cost drops to $0.10-1 on cache hits (10-15× cheaper than uncached input).
  • Specialty providers: Groq and Cerebras run 70B at ~$0.20-0.60 per million tokens with sub-second TTFT.

Self-hosted economics

Operating an 8×H100 server at hyperscaler list price works out to roughly $0.20-0.40 per million tokens fully loaded, plus engineering overhead. Self-hosting beats managed APIs only at sustained high traffic (millions of tokens/hour 24/7).

Edge inference

  • MacBook M3 Max — Llama 3 70B 4-bit at ~5 tok/s; usable for personal tools.
  • MacBook M4 Max — Llama 3 70B 4-bit at ~10 tok/s; 8B class models at 60+ tok/s.
  • iPhone 15 Pro / 16 Pro — 3B class at 15-25 tok/s via MLX or Core ML.
  • NVIDIA Jetson AGX Orin — 7B class at 30 tok/s; robotics inference.

The cost of intelligence has dropped ~10× per year from 2020 through 2025. The 2026 trajectory is slowing as Moore’s law in HBM bandwidth saturates and as frontier models grow, but algorithmic gains (MoE, FP4, speculative decoding) continue to compensate.

Operational cost breakdown for a serving fleet

For an organization running an in-house inference fleet, the all-in unit cost decomposes into several layers:

  • Hardware amortization — list price of GPU servers divided by useful life (typically 3-4 years for production deployment). For an 8×H100 server this comes to about $11.40/hr.
  • Power — H100 draws 700 W; B200 draws 1200 W. An 8×B200 server pulls ~10 kW continuously, ~7,000 kWh/month, $700-1,400/month at industrial rates.
  • Cooling — typical PUE of 1.3-1.5 means 30-50% additional energy for cooling. Direct liquid cooling (mandatory for GB200 NVL72) recovers some of this but raises capex.
  • Datacenter footprint — racks, network gear, redundant power, on-call staff. Typically 20-30% of hardware cost.
  • Engineering — model deployment, monitoring, incident response. Often the largest line item for small fleets.

The total above is roughly 1.5-2.5× raw hardware list price for self-managed fleets. Managed inference providers can offer near-hardware-cost pricing because they amortize the engineering and operations layer across many tenants.
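The arithmetic behind these line items can be sketched in a few lines. The $300,000 server price and the 10,000 tok/s aggregate throughput below are hypothetical inputs chosen for illustration (the price is picked because, with a 3-year life, it lands near the $11.40/hr figure above); the 10 kW draw and the electricity rates follow the power bullet.

```python
HOURS_PER_YEAR = 8760


def amortized_per_hour(server_price_usd, useful_life_years):
    """Straight-line hardware amortization in $ per wall-clock hour."""
    return server_price_usd / (useful_life_years * HOURS_PER_YEAR)


def monthly_power_cost(it_kw, usd_per_kwh, pue=1.0, hours_per_month=730):
    """Electricity bill for one server; PUE > 1 adds cooling and facility overhead."""
    return it_kw * pue * hours_per_month * usd_per_kwh


def cost_per_million_tokens(usd_per_hour, tokens_per_second):
    """Unit cost when the server sustains `tokens_per_second` for the whole hour."""
    return usd_per_hour / (tokens_per_second * 3600) * 1e6


hourly = amortized_per_hour(300_000, useful_life_years=3)      # hypothetical price
power_low = monthly_power_cost(10, usd_per_kwh=0.10)
power_high = monthly_power_cost(10, usd_per_kwh=0.20, pue=1.4)
unit = cost_per_million_tokens(hourly, tokens_per_second=10_000)  # hypothetical throughput
print(f"hardware: ${hourly:.2f}/hr")
print(f"power: ${power_low:.0f}-{power_high:.0f}/month")
print(f"hardware-only unit cost: ${unit:.2f} per million tokens")
```

The unit cost scales inversely with sustained throughput, which is why idle capacity during off-peak hours dominates self-hosted economics.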

Why managed APIs persist despite open weights

Even with cheap open weights (Llama, Qwen, DeepSeek, Mistral) and mature serving stacks (vLLM, SGLang, TGI), managed APIs continue to dominate consumption in 2026 because:

  1. Bursty traffic dominates real-world LLM use; self-hosting requires capacity for peak, which sits idle most of the time.
  2. Frontier closed models (Claude, GPT-5, Gemini) are not available open-weight and remain the quality leader for many tasks.
  3. Engineering cost of running an inference fleet is substantial: monitoring, autoscaling, model updates, GPU failure handling, quantization tuning.
  4. Prompt caching, batching, and routing are easier to centralize at provider scale than implement per-customer.

Self-hosting wins when traffic is high and steady (millions of tokens per hour sustained), when data residency or compliance demands on-premises, or when the open model is good enough that brand-name model quality does not justify the price gap.

14. Pitfalls

  1. Optimizing FLOPs when the bottleneck is HBM bandwidth. Decode is memory-bound; FLOPs optimization (e.g., better matmul tilings) buys little on the critical path, while anything that reduces bytes moved (quantization, fusion that avoids round-trips to HBM, larger batches) helps. Profile arithmetic intensity before optimizing.

  2. Forgetting to use FlashAttention. Hand-rolled attention loses 2-4× wall-clock and runs out of memory on long contexts. Always set the SDPA backend to flash or use xformers/FlashAttention directly.

  3. Batch size too small. A batch of one uses roughly 0.3% of an H100’s peak compute on decode. Continuous batching with a batch of at least 32-64 is needed to make economic sense; managed providers run at batch 100+.

  4. Long-tail latency from KV cache fragmentation. Static cache allocation causes high tail latency as memory fragments. PagedAttention eliminates this — use a server that has it (vLLM, SGLang, TRT-LLM, TGI).

  5. Lossy quantization without calibration. Naive round-to-int8 weights or activations destroys quality. Always calibrate (GPTQ, AWQ, SmoothQuant) on representative data; evaluate on a held-out benchmark before deploying.

  6. Tokenizer or chat-template mismatch. Loading a fine-tuned model with the wrong tokenizer or applying the wrong chat template (Llama-style versus ChatML versus Mistral instruct) silently degrades quality without erroring. Verify by encoding a known prompt and comparing token IDs to the model card reference.

  7. Prompt cache miss when context order changes. Cache hits require the exact same prefix bytes. Sorting tools, reordering examples, or even changing whitespace causes a miss and full re-prefill. Stabilize prefixes and put dynamic content at the end of the prompt.

  8. Over-provisioning for peak load. Inference traffic is bursty and rarely sustained at peak. Use autoscaling, dedicated + spot mixes, or burst to a managed API rather than holding idle capacity.

  9. Treating multi-GPU as a single GPU. TP/PP/EP each have communication overhead; naive multi-GPU deployment can be slower than fitting the model on one GPU with quantization.

  10. Speculative decoding without measuring acceptance rate. Draft acceptance rate varies hugely by workload. A speculative pair tuned for code completion can be slower than non-speculative decoding on creative writing. Measure on your actual traffic.

  11. Forgetting that prefill attention cost grows quadratically with context length. FlashAttention reduces attention memory from quadratic to linear in sequence length, but the FLOP count stays quadratic. A 100k-token prefill can take many seconds even on H100; consider chunked prefill, sequence parallelism, or context caching.

  12. Mixing precisions per-layer without testing. Mixed fp16 / fp8 / fp4 across attention vs MLP can produce subtle accuracy regressions. Always run an end-to-end eval after a quantization config change.

  13. Ignoring tokenizer-level cost asymmetries. Some languages (Chinese, Japanese, Korean, Arabic) get far fewer characters per token than English in tokenizers trained on English-heavy corpora. A “1000 token” budget can mean roughly 700 words (about 4,000 characters) in English but only about 250 characters in Chinese. Use tokenizer-aware budgeting in multilingual deployments, and prefer multilingual tokenizers (Qwen, Aya, Gemma) for non-English-primary workloads.

  14. Cold-start latency in serverless inference. Loading a 70B model into GPU memory takes 30-120 seconds; serverless platforms (Modal, Replicate, Banana) pay this cost on every cold start. For latency-sensitive APIs, pre-warm replicas or accept the trade-off explicitly. Pinning a managed-endpoint replica permanently has lower amortized cost above a few hundred requests per hour.

  15. Ignoring eviction-policy interaction with retries. When the KV cache is full, vLLM and similar servers preempt an in-flight request to make room (by default the most recently arrived one) and later recompute or swap its KV cache back in. If a client retries on timeout, the retry triggers a full prefill, doubling work. Set per-request TTLs, configure preemption-friendly clients, and monitor preemption rates as a first-class SLO.

  16. Conflating throughput benchmarks with production behavior. Vendor-reported tokens/sec almost always refer to peak throughput at maximum batch with short prompts. Production has long-tail prompts, mixed batch sizes, and partial cache utilization. Benchmark on your own traffic distribution before sizing capacity.
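Pitfalls 1 and 3 both come down to arithmetic intensity, which a small roofline estimate makes concrete. The sketch below assumes H100 figures of 989 dense bf16 TFLOPs and 3.35 TB/s of HBM bandwidth (treat both as assumptions to be replaced with the numbers for the hardware at hand), fp16 weights, and a decode step dominated by weight reads, ignoring KV-cache traffic.

```python
def decode_compute_utilization(batch_size, peak_tflops, hbm_tb_per_s, bytes_per_param=2):
    """Fraction of peak FLOPs reachable when decode is bound by reading the weights."""
    # Each parameter read from HBM does one multiply-add (2 FLOPs) per sequence in the batch.
    flops_per_byte = 2 * batch_size / bytes_per_param
    # Ridge point: the FLOPs-per-byte at which compute and bandwidth limits meet.
    ridge = peak_tflops / hbm_tb_per_s
    return min(1.0, flops_per_byte / ridge)


for batch in (1, 8, 64, 256, 512):
    util = decode_compute_utilization(batch, peak_tflops=989, hbm_tb_per_s=3.35)
    print(f"batch {batch:>3}: {util:6.1%} of peak compute")
```

Under these assumptions utilization grows linearly with batch size until the ridge point near batch 300, which is why faster matmuls help little at small batch while batching and weight quantization help a lot.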

15. Cross-references

16. Citations

  • Dao, Fu, Ermon, Rudra, Re. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
  • Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. 2023.
  • Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision. 2024.
  • Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
  • Yu, Jeong, Kim, Kim, Chun. Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022.
  • Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. 2019.
  • Ainslie, Lee-Thorp, de Jong, Zemlyanskiy, Lebron, Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023.
  • DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. 2024.
  • DeepSeek-AI. DeepSeek-V3 Technical Report. 2024.
  • Leviathan, Kalman, Matias. Fast Inference from Transformers via Speculative Decoding. ICML 2023.
  • Chen, Borgeaud, Irving, Lespiau, Sifre, Jumper. Accelerating Large Language Model Decoding with Speculative Sampling. 2023.
  • Cai, Li, Geng, Peng, Lee, Chen, Dao. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. 2024.
  • Li, Wei, Zhang, Zhang. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. ICML 2024.
  • Frantar, Ashkboos, Hoefler, Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023.
  • Lin, Tang, Tang, Yang, Chen, Wang, Xiao, Dang, Gan, Han. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024.
  • Xiao, Lin, Seznec, Wu, Demouth, Han. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. ICML 2023.
  • Xiao, Tian, Chen, Han, Lewis. Efficient Streaming Language Models with Attention Sinks. ICLR 2024.
  • Liu, Yuan, Jin, Zhong, Xu, Braverman, Chen, Hu. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. ICML 2024.
  • Hooper, Kim, Mohammadzadeh, Mahoney, Shao, Keutzer, Gholami. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. NeurIPS 2024.
  • Zheng, Yin, Xie, Sun, Huang, Yu, Cao, Kozyrakis, Stoica, Gonzalez, Barrett, Sheng. SGLang: Efficient Execution of Structured Language Model Programs. NeurIPS 2024.
  • Patel, Choukse, Zhang, Shah, Goiri, Maleki, Bianchini. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. ISCA 2024.
  • Zhong, Liu, Chen, Hu, Zhu, Liu, Jin, Zhang. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI 2024.
  • Shoeybi, Patwary, Puri, LeGresley, Casper, Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. 2019.
  • Dettmers, Pagnoni, Holtzman, Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.
  • Gu, Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. 2023.
  • Dao, Gu. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024.
  • Raposo, Ritter, Richards, Lillicrap, Humphreys, Santoro. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. 2024.
  • Sheng, Cao, Li, Hooper, Lee, Yang, Chou, Zhu, Zheng, Keutzer, Gonzalez, Stoica. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. MLSys 2024.
  • NVIDIA. Hopper Architecture Whitepaper. 2022.
  • NVIDIA. Blackwell Architecture Whitepaper. 2024.
  • AMD. CDNA3 Architecture Whitepaper (MI300 series). 2023.
  • AMD. CDNA4 Architecture Whitepaper (MI355X). 2025.
  • Google. TPU v5p / Ironwood System Architecture. 2023 / 2025.
  • Cerebras. Cerebras CS-3 and Inference Architecture Whitepaper. 2024.
  • Groq. LPU Architecture and Deterministic Execution Model. 2024.