Transformer Architecture — Compute Reference
1. At a glance
The Transformer is the neural-network architecture introduced by Vaswani et al. 2017 in the paper Attention Is All You Need (NeurIPS 2017, arXiv:1706.03762). It replaced recurrent neural networks (RNNs) and long short-term memory networks (LSTMs, Hochreiter + Schmidhuber 1997) as the dominant sequence-modelling architecture in essentially every domain where sequences or sets of tokens are processed.
The core insight of Vaswani et al. 2017: dispense with recurrence and convolution entirely; rely solely on self-attention to model dependencies between input and output positions. The architecture parallelizes across the sequence dimension during training (unlike RNNs, which are sequential by construction), which made large-scale pre-training tractable on accelerators (8 NVIDIA P100 GPUs in the original paper, then NVIDIA V100/A100/H100/B200, Google TPU v2/v3/v4/v5/v6, AMD MI300/MI325 in subsequent years).
Three canonical variants — each defined by which halves of the original encoder-decoder are kept — now dominate distinct task families:
- Encoder-only — BERT (Devlin et al. 2018), RoBERTa (Liu et al. 2019), DeBERTa (He et al. 2020). Bidirectional attention; pre-trained with masked language modelling (MLM); classification, named-entity recognition (NER), extractive question answering, sentence embedding.
- Decoder-only — GPT-1/2/3/4 (Radford et al. 2018/2019, Brown et al. 2020, OpenAI 2023), LLaMA 1/2/3/4 (Touvron et al. 2023, Meta AI 2024-26), Mistral 7B / 8x7B / Large (Mistral AI 2023-26), Claude Opus/Sonnet/Haiku 4.x (Anthropic 2024-26), Gemini 1.x/2.x/3.x (Google DeepMind 2023-26), Qwen 2.5/3 (Alibaba 2024-26), DeepSeek-V3 / R1 (DeepSeek 2024-25). Causal attention; autoregressive next-token prediction; the mainstream paradigm for generative LLMs since 2020.
- Encoder-decoder — T5 (Raffel et al. 2019), BART (Lewis et al. 2019), mT5 (Xue et al. 2020), Whisper (Radford et al. 2022, OpenAI), NLLB (Meta 2022). Encoder reads the input; decoder generates output and cross-attends back to encoder hidden states. Strong for translation, summarization, speech-to-text.
Transformers now dominate far beyond text:
- Vision — Vision Transformer (ViT, Dosovitskiy et al. 2020), Swin Transformer (Liu et al. 2021), DINOv2 (Oquab et al. 2023), SAM (Kirillov et al. 2023).
- Audio — Whisper (Radford et al. 2022), AudioLM (Borsos et al. 2022), MusicLM (Agostinelli et al. 2023), Wav2Vec 2.0 (Baevski et al. 2020).
- Protein structure — AlphaFold-2 (Jumper et al. 2021, DeepMind) and AlphaFold-3 (Abramson et al. 2024, DeepMind), which extended to nucleic acids and ligands.
- Multimodal — CLIP (Radford et al. 2021), Flamingo (Alayrac et al. 2022), GPT-4o (OpenAI 2024), Gemini (Google 2023-26), Claude 4.x vision (Anthropic 2024-26), LLaVA (Liu et al. 2023).
- Robotics / control — RT-1 / RT-2 / RT-X (Google DeepMind 2022-23), Octo (Octo Model Team 2024), Open-X-Embodiment (2023).
By 2026 the Transformer (with modifications like RoPE, GQA, FlashAttention, and Mixture-of-Experts) is the substrate of essentially every frontier model — closed (GPT-5, Claude Opus 4.x, Gemini 3.x) and open-weight (LLaMA 4, Qwen 3, DeepSeek-V3/R1, Mistral Large 3).
2. First principles — attention
The scaled dot-product attention operation is the atomic unit of the Transformer. Given three matrices Q (queries), K (keys), V (values), each obtained by a learned linear projection of the input X:
Q = X W_Q
K = X W_K
V = X W_V
attention is computed as:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
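where d_k is the dimensionality of each key vector. If the components of q and k are independent with zero mean and unit variance, the dot product q · k has variance d_k, so unscaled logits grow in magnitude with d_k. The scaling factor 1 / sqrt(d_k) (Vaswani et al. 2017) keeps the logit variance near 1, which stops the softmax from saturating into regions of vanishing gradients.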
Geometric interpretation
For a single query vector q and N key vectors k_1, …, k_N: the dot product q · k_i measures similarity. The softmax over (q · k_1, …, q · k_N) / sqrt(d_k) produces a probability distribution over keys. The output is the corresponding convex combination of value vectors v_i. Intuitively, the model retrieves a weighted summary of V, weighted by how well each K matches Q.
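The single-query view above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation; the helper names `softmax` and `attend` and the toy vectors are assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector (hypothetical helper)."""
    d_k = len(q)
    # Similarity of the query to every key, scaled by 1 / sqrt(d_k).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    # Output is the convex combination of value vectors under those weights.
    d_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return weights, out

# Toy data: the first key is aligned with q, the third points the opposite way.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, out = attend(q, keys, values)
```

The weights sum to 1 and are largest for the key that best matches the query, so the output leans toward the first value vector.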
Self-attention vs cross-attention
- Self-attention — Q, K, V are all projections of the same input X. Each token in a sequence attends to every other token (or, in the causal case, every earlier token). This is the workhorse inside both encoder and decoder layers.
- Cross-attention — Q comes from the decoder hidden state; K and V come from the encoder output. Found in encoder-decoder architectures (T5, BART, Whisper) and in some multimodal decoders that cross-attend over image/audio features.
Masking
In a decoder, causal masking sets the (i, j) entry of the pre-softmax attention matrix to -infinity (or a large negative number, e.g. -1e9) for j > i, so the softmax assigns zero probability to future positions. This preserves the autoregressive property: token t’s representation depends only on tokens 1..t. In an encoder there is no causal mask; bidirectional context is allowed.
A separate padding mask zeroes out attention to padding tokens regardless of position; this is universal in batched training.
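A small sketch of causal masking, assuming the pre-softmax scores are held as a nested list; `causal_mask` and `softmax_rows` are hypothetical helper names. It uses a true negative infinity, which `math.exp` maps to exactly 0.0.

```python
import math

NEG_INF = float("-inf")

def causal_mask(scores):
    """Set entry (i, j) to -inf whenever j > i (a future position)."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
            for i in range(n)]

def softmax_rows(scores):
    """Row-wise stable softmax; masked entries come out as exactly 0.0."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]  # exp(-inf) == 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -1.0, 0.0]]
probs = softmax_rows(causal_mask(scores))
```

Row 0 can only attend to itself, so its distribution collapses to a single 1.0; every row still sums to 1.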
3. Multi-head attention
A single attention head learns one projection of Q, K, V. Multi-head attention (MHA) runs h heads in parallel, each with its own learned W_Q^(i), W_K^(i), W_V^(i):
head_i = Attention(X W_Q^(i), X W_K^(i), X W_V^(i))
MultiHead(X) = Concat(head_1, ..., head_h) W_O
The model dimension d_model is split across heads: each head operates on d_model / h dimensions. Standard configurations (Vaswani et al. 2017, scaled up by Brown et al. 2020 / Touvron et al. 2023):
| Model | d_model | heads | d_head |
|---|---|---|---|
| Transformer base (Vaswani 2017) | 512 | 8 | 64 |
| BERT-base (Devlin 2018) | 768 | 12 | 64 |
| GPT-2 small (Radford 2019) | 768 | 12 | 64 |
| GPT-3 175B (Brown 2020) | 12288 | 96 | 128 |
| LLaMA 2 70B (Touvron 2023) | 8192 | 64 | 128 |
| LLaMA 3 70B (Meta 2024) | 8192 | 64 | 128 |
Why multiple heads
Empirically, different heads specialize on distinct linguistic / structural relations: some heads track syntactic dependencies (subject-verb agreement, coreference), some track positional patterns (e.g. attending to the previous token), some track semantic similarity (Clark et al. 2019 “What Does BERT Look At?”, Voita et al. 2019). Multi-head is a cheap form of ensembling within a single layer.
MQA and GQA
A practical concern: at inference time the KV cache (Section 10) grows with the number of key/value heads. Multi-Query Attention (MQA) (Shazeer 2019) shares a single K and V across all heads; only Q is per-head. This dramatically reduces KV-cache memory but can degrade quality. Grouped-Query Attention (GQA) (Ainslie et al. 2023) groups heads (e.g. 8 KV heads shared by 64 query heads), recovering most of MHA’s quality at near-MQA cost. LLaMA 2 70B, LLaMA 3, Mistral, and Gemma 2 all use GQA. Discussed further in Section 7.
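To make the grouping concrete, here is a sketch of how query heads could map onto shared KV heads. The contiguous grouping and the function name `kv_head_for_query_head` are assumptions for illustration; a given implementation may lay heads out differently.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the KV head that a query head reads from.

    Assumes contiguous groups: query heads [0, g) share KV head 0,
    [g, 2g) share KV head 1, and so on, with g = num_q_heads / num_kv_heads.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# MHA, GQA and MQA are the same scheme with different KV-head counts.
mha = [kv_head_for_query_head(h, 64, 64) for h in range(64)]  # one KV head each
gqa = [kv_head_for_query_head(h, 64, 8) for h in range(64)]   # 8 query heads per KV head
mqa = [kv_head_for_query_head(h, 64, 1) for h in range(64)]   # everything shares KV head 0
```

Since the cache stores one K and one V per KV head, going from 64 to 8 KV heads shrinks it by a factor of 8.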
4. Positional encoding
Self-attention is permutation-equivariant: shuffle the input tokens, and the output shuffles correspondingly. Position information must be injected separately.
Sinusoidal (Vaswani et al. 2017)
The original paper used fixed sinusoidal encodings at exponentially-spaced frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
added directly to the input embedding. The geometric idea: for any fixed offset k, PE(pos + k) is a linear function of PE(pos), so the model can in principle learn to attend by relative position. Not learned; extrapolates to longer sequences in theory but in practice degrades quickly past the training length.
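The linearity claim can be checked numerically. The sketch below builds the encoding from the two formulas above and then obtains PE(pos + k) from PE(pos) using only a fixed per-frequency rotation (the angle-addition identities). `sinusoidal_pe` and `shift_pe` are hypothetical names, and an even d_model is assumed.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal encoding for one position (d_model assumed even)."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)
        pe[2 * i + 1] = math.cos(angle)
    return pe

def shift_pe(pe, k, d_model):
    """Compute PE(pos + k) from PE(pos) alone.

    Each (sin, cos) pair is rotated by k * omega_i, which depends on the
    offset k but not on pos, so the map is linear in PE(pos).
    """
    out = [0.0] * d_model
    for i in range(d_model // 2):
        w = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out[2 * i] = s * math.cos(w) + c * math.sin(w)      # sin(a + w)
        out[2 * i + 1] = c * math.cos(w) - s * math.sin(w)  # cos(a + w)
    return out

shifted = shift_pe(sinusoidal_pe(5, 8), 3, 8)
direct = sinusoidal_pe(8, 8)
```

`shifted` and `direct` agree to floating-point precision.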
Learned absolute (GPT-2, BERT)
A trainable embedding table indexed by position 0, 1, …, L_max - 1. Simple, works well within the training range, but cannot extrapolate at all past L_max. Used in BERT (Devlin et al. 2018) and GPT-2 (Radford et al. 2019).
Relative position (Shaw et al. 2018, T5)
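Rather than adding a position vector to the input, relative-position methods make attention depend on the offset (i - j). Shaw et al. 2018 add learned relative-position embeddings to the keys (and optionally the values); T5 (Raffel et al. 2019) uses a simpler learned scalar bias on the attention logits, shared across bucketed offsets. Better extrapolation than absolute; standard in T5 and some BERT successors.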
RoPE — Rotary Position Embedding (Su et al. 2021)
The dominant choice for modern LLMs (2022+). Encodes position by rotating Q and K vectors by an angle proportional to position, in pairs of dimensions. Mathematically:
RoPE(x, pos)[2i:2i+2] = R(pos * theta_i) @ x[2i:2i+2]
where R is a 2D rotation matrix and theta_i are the same exponentially-spaced frequencies as Vaswani 2017. Key property: the dot product q · k after applying RoPE depends only on the relative offset between the two positions, recovering the relative-position property naturally.
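The relative-offset property can be verified with a short sketch. It follows the interleaved pairing (2i, 2i+1) of the formula above; many production implementations pair dimension i with i + d/2 instead, which is an equivalent layout up to a permutation. The function names and the toy vectors are assumptions for the example.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by pos * theta_i (len(x) assumed even)."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at two very different absolute positions.
score_near = dot(rope(q, 7), rope(k, 4))
score_far = dot(rope(q, 103), rope(k, 100))
# A different offset gives a different score in general.
score_other = dot(rope(q, 7), rope(k, 2))
```

`score_near` and `score_far` match to floating-point precision because only the difference of the two rotation angles survives in the dot product.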
Adopted by GPT-NeoX (Black et al. 2022), PaLM (Chowdhery et al. 2022), LLaMA / LLaMA 2 / LLaMA 3 (Touvron et al. 2023, Meta 2024), Mistral, Mixtral, Qwen, Gemma, DeepSeek-V3. By 2024 RoPE is the de facto standard in decoder-only LLMs.
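Extensions for longer context: Position Interpolation (PI) (Chen et al. 2023, Meta), NTK-aware scaling (community 2023), YaRN (Peng et al. 2023). PI compresses position indices into the trained range, NTK-aware scaling enlarges the RoPE base, and YaRN interpolates differently per frequency band; each extends usable context well beyond the training length, usually with a short fine-tune rather than retraining from scratch.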
ALiBi — Attention with Linear Biases (Press et al. 2021)
Adds a fixed (non-learned) linear bias to attention scores that penalizes attention to distant positions, with a different slope per head. No input-side encoding at all. Strong extrapolation: Press et al. report a model trained at length 1024 that, evaluated at 2048, matches a sinusoidal model trained at 2048. Used in BLOOM (BigScience 2022) and MPT (MosaicML 2023). Less common in 2024+ than RoPE but still selected for its extrapolation behavior.
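A sketch of the ALiBi bias for a causal decoder. The slope schedule used here, a geometric sequence starting at 2^(-8/n) for n heads, is the one described by Press et al. for power-of-two head counts; folding the causal mask into the same matrix and the function names are choices made for this example.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8) (n a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(slope, seq_len):
    """Bias added to attention logits: -slope * distance, -inf for the future."""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)          # 0.5, 0.25, ..., 1/256
bias = alibi_bias(slopes[0], 4)   # steepest head: strongly local attention
```

Heads with large slopes focus on nearby tokens; heads with small slopes see almost uniformly far back. Nothing here depends on a maximum length, which is why the scheme can be evaluated on sequences longer than those seen in training.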
5. Building block — one Transformer layer
A canonical Transformer layer (decoder-style, pre-LN, 2024-era) consists of two sub-blocks, each wrapped in a residual connection:
x = x + MHA(LN(x))
x = x + FFN(LN(x))
Sub-block 1 — Multi-head attention
As described in Section 3. In the encoder, full bidirectional attention. In the decoder, causal mask. In encoder-decoder cross-attention (the second sub-block of an encoder-decoder decoder layer), Q from the decoder, K + V from the encoder output.
Sub-block 2 — Feed-forward network (FFN / MLP)
A two-layer MLP applied independently to each position:
FFN(x) = activation(x W_1) W_2
with the inner dimension typically 4x d_model (Vaswani 2017). For d_model = 768 that’s 3072; for d_model = 8192 a 4x ratio would give 32768, but LLaMA 2 70B uses 28672 (3.5x) with its SwiGLU FFN.
Activation choices:
- ReLU (Nair + Hinton 2010) — original Vaswani 2017.
- GELU (Hendrycks + Gimpel 2016) — BERT, GPT-2/3.
- SwiGLU (Shazeer 2020) — gated linear unit with swish/SiLU gating. Standard in LLaMA 2/3, Mistral, Mixtral, Gemma, Qwen. Adds a third weight matrix (gate, up, down) for a small parameter cost but consistently better loss.
- GeGLU (Shazeer 2020) — GELU-gated variant; used in T5 v1.1 and Gemma. PaLM (Chowdhery et al. 2022) uses SwiGLU.
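The gated variants replace activation(x W_1) with an elementwise product of an activated "gate" projection and a linear "up" projection. Below is a minimal per-token sketch of a SwiGLU FFN without biases; the weight names `W_gate`, `W_up`, `W_down` and the tiny dimensions are assumptions for illustration.

```python
import math

def silu(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    """Row vector times matrix: x has d_in entries, W is d_in rows of d_out."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """FFN(x) = (SiLU(x W_gate) * (x W_up)) W_down for a single token."""
    gate = [silu(g) for g in matvec(x, W_gate)]
    up = matvec(x, W_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, W_down)

# Toy sizes: d_model = 2, inner dimension = 3.
W_gate = [[0.5, -1.0, 0.2], [0.1, 0.3, -0.4]]
W_up = [[1.0, 0.0, -0.5], [0.2, 0.7, 0.3]]
W_down = [[0.3, -0.2], [0.1, 0.6], [-0.5, 0.4]]
y = swiglu_ffn([1.0, 2.0], W_gate, W_up, W_down)
```

With three matrices of size d_model x d_ff the parameter count is 3 * d_model * d_ff, versus 2 * d_model * d_ff for the ungated form, which is one reason gated models often pick an inner dimension below 4x d_model.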
Normalization — LayerNorm placement
LayerNorm (Ba et al. 2016) normalizes activations across the feature dimension per token (unlike BatchNorm which normalizes across the batch dimension). Two placements:
- Post-LN (original Vaswani 2017): x = LN(x + Sublayer(x)). Slightly better final quality at small scale; unstable at large scale, where it requires careful learning-rate warmup.
- Pre-LN (Xiong et al. 2020, Baevski + Auli 2018): x = x + Sublayer(LN(x)). More stable training, easier to scale up, slightly worse final loss at small scale but dominates at large scale. Standard in GPT-2 onward, LLaMA, and virtually all decoder-only models 2020+ (the original BERT models use post-LN).
A common refinement is RMSNorm (Zhang + Sennrich 2019), which omits the mean-subtraction step of LayerNorm; cheaper and comparable quality. Used in LLaMA 1/2/3, Mistral, T5.
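The difference between the two normalizations is easiest to see side by side. This sketch omits the learned gain (and LayerNorm's learned bias) to isolate the statistics; the function names and the `eps` value are assumptions for the example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; no mean subtraction."""
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]

x = [1.0, 2.0, 3.0, 6.0]
ln = layer_norm(x)   # zero mean, unit variance
rn = rms_norm(x)     # unit RMS, mean left wherever it was

z = [-1.0, 0.0, 1.0]  # already zero-mean: both norms coincide
```

RMSNorm saves one reduction over the feature dimension per call, and the two agree exactly whenever the input already has zero mean.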
Residual connections
Every sub-block is wrapped: x = x + Sublayer(LN(x)). Residuals (He et al. 2016 for ResNet) are critical for trainability at depth — without them gradients vanish quickly through dozens of layers. They also give the model a default identity behavior that learning can refine.
Embeddings + output projection
- Token embedding — lookup table of size (vocab_size, d_model). Common vocab sizes: 32k (LLaMA 1/2, T5 SentencePiece), 50k (GPT-2 BPE), 100k (GPT-4 cl100k_base), 128k (LLaMA 3 BPE), 200k (GPT-4o o200k_base).
- Output projection — linear layer (d_model, vocab_size). Some models share weights between the input embedding and the output projection (tied embeddings, Press + Wolf 2017), saving vocab_size * d_model parameters; GPT-2 and Gemma tie them, while LLaMA 1/2 and LLaMA 3 8B/70B keep them separate.
6. Three architectures in detail
Encoder-only
- Architecture — stack of N encoder layers, each = bidirectional MHA + FFN, both with pre-LN (or post-LN in original BERT).
- Training objective — masked language modelling (MLM): randomly mask 15% of tokens, predict them from bidirectional context. BERT also trained next-sentence prediction (NSP); RoBERTa (Liu et al. 2019) dropped NSP and improved.
- Models — BERT (Devlin et al. 2018), RoBERTa (Liu et al. 2019), DeBERTa / DeBERTaV3 (He et al. 2020/2021), ELECTRA (Clark et al. 2020), XLM-R (Conneau et al. 2019).
- Uses — classification, NER, extractive QA, sentence/passage embeddings (Sentence-BERT, Reimers + Gurevych 2019; modern alternatives: BGE, GTE, E5, Cohere embed-v3).
- Why not generative — bidirectional attention means the model has seen the future; it cannot generate autoregressively in a single forward pass.
Decoder-only — the 2024+ mainstream
- Architecture — stack of N decoder layers, each = causal MHA + FFN. No cross-attention. Pre-LN + RMSNorm + SwiGLU + RoPE + GQA is the modern recipe (LLaMA 3, Mistral, Qwen 2.5).
- Training objective — next-token prediction: at each position, predict the next token from prior context. Equivalent to maximum-likelihood on an autoregressive factorization of P(x_1, …, x_n).
- Models — GPT-1 (Radford et al. 2018), GPT-2 (Radford et al. 2019), GPT-3 (Brown et al. 2020), GPT-4 (OpenAI 2023), GPT-4o / 4.1 / 5 (OpenAI 2024-26); Claude 1/2/3/4.x (Anthropic 2023-26); LLaMA / 2 / 3 / 4 (Meta 2023-26); Mistral 7B / 8x7B Mixtral / Large 1-3 (Mistral 2023-26); Gemini 1/2/3 (Google 2023-26); Qwen 1.5 / 2.5 / 3 (Alibaba 2023-26); DeepSeek-V2/V3/R1 (DeepSeek 2024-25); Phi-2/3/4 (Microsoft 2023-26); Gemma 1/2/3 (Google 2024-26).
- Why it dominates — generation is a first-class operation, and scaling decoder-only on raw text produces models that handle understanding tasks competitively via in-context learning (Brown et al. 2020) or instruction tuning (Wei et al. 2021, FLAN).
Encoder-decoder
- Architecture — encoder stack (bidirectional) + decoder stack (causal self-attention + cross-attention to encoder outputs).
- Training objectives — varied: T5 uses span-corruption (Raffel et al. 2019); BART uses denoising (Lewis et al. 2019); Whisper uses sequence-to-sequence supervised audio-to-text (Radford et al. 2022).
- Models — T5 / Flan-T5 / mT5 (Raffel et al. 2019, Xue et al. 2020, Chung et al. 2022); BART (Lewis et al. 2019); MarianMT; NLLB-200 (Meta 2022); Whisper (Radford et al. 2022); Donut (Kim et al. 2021, document understanding).
- When preferred — when input and output are clearly distinct sequences (machine translation, summarization, speech-to-text). The encoder gets bidirectional context over the input; the decoder generates and cross-attends as needed.
7. Modern variants and improvements
FlashAttention (Dao et al. 2022, 2023; Shah et al. 2024)
The standard softmax(QK^T / sqrt(d)) V computation, naively implemented, materializes the N x N attention matrix in HBM (high-bandwidth memory), making it both memory-bound and O(N^2) in memory. FlashAttention is an exact, IO-aware reformulation: tile Q, K, V into blocks that fit in SRAM, run softmax online with running max + sum, and never materialize the full attention matrix.
- FlashAttention v1 (Dao et al. 2022) — 2-4x wall-clock speedup on A100, O(N) memory.
- FlashAttention-2 (Dao 2023) — better work partitioning, additional 2x.
- FlashAttention-3 (Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao 2024) — Hopper-specific (H100), uses TMA (Tensor Memory Accelerator) + WGMMA + FP8, near 75% of theoretical FLOPs.
Standard in PyTorch 2.x via torch.nn.functional.scaled_dot_product_attention, which dispatches automatically to a FlashAttention, memory-efficient, or plain math backend depending on inputs and hardware. In practice a fused kernel of this kind is a prerequisite for long-context training.
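The "online softmax with running max + sum" idea at the heart of FlashAttention can be sketched for a single query row. This is a scalar-valued toy under stated assumptions (values are floats rather than vectors, blocks are Python slices, the input is non-empty); the real kernels apply the same recurrence to tiles held in SRAM.

```python
import math

def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum_j softmax(scores)_j * values_j one block at a time.

    Only three running quantities are kept, so the full score row never
    has to exist in memory at once.
    """
    m = float("-inf")  # running max of the scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        # Rescale the old accumulators to the new max (exp(-inf) == 0.0
        # makes the first block a clean start).
        scale = math.exp(m - m_new)
        l = l * scale + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l

scores = [0.1, 2.5, -1.0, 4.0, 0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
streamed = online_softmax_weighted_sum(scores, values, block_size=2)

# Reference: ordinary softmax over the whole row.
mx = max(scores)
exps = [math.exp(s - mx) for s in scores]
reference = sum(e * v for e, v in zip(exps, values)) / sum(exps)
```

The streamed result equals the reference up to rounding, which is what "exact" means in this context: the reformulation changes the memory access pattern, not the mathematics.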
MQA / GQA (Shazeer 2019; Ainslie et al. 2023)
See Section 3. GQA is the modern default — LLaMA 2 70B / LLaMA 3 / Mistral / Gemma / Qwen all use it. Quality nearly indistinguishable from full MHA; KV-cache memory typically reduced 4-8x.
Mixture of Experts (MoE)
A subset of FFN parameters is activated per token via a learned router; the rest stay dormant. This lets total parameter count grow far beyond what dense computation could afford.
- Switch Transformer (Fedus, Zoph, Shazeer 2021, Google) — one expert per token (k=1).
- GShard (Lepikhin et al. 2020).
- GLaM (Du et al. 2021, Google) — sparse 1.2T params.
- Mixtral 8x7B (Mistral AI 2023) — open-weight sparse MoE, k=2 (top-2 of 8 experts per token), ~13B active params from ~47B total.
- DeepSeek-V2 / V3 (DeepSeek 2024-25) — fine-grained MoE with routed + shared experts; V3 uses 256 routed experts with top-8 routing plus 1 shared expert per MoE layer.
- GPT-4 — widely reported (not confirmed by OpenAI) to be MoE.
- Mixtral 8x22B, Grok-1 (xAI 2024), DBRX (Databricks / MosaicML 2024), Snowflake Arctic (2024).
Trade-offs: more parameters at fixed inference FLOPs; harder to train (load-balancing losses, all-to-all communication during distributed training); higher memory footprint at inference.
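A toy sketch of top-k routing for one token. It normalizes with a softmax over the selected logits only, which is one common choice (Mixtral describes its gate this way); other systems apply the softmax before selection. Experts are stand-in scalar functions and all names are hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)
    exps = {e: math.exp(router_logits[e] - m) for e in top}
    total = sum(exps.values())
    return {e: exps[e] / total for e in top}

def moe_ffn(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts; the others are never evaluated."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[e](x) for e, g in gates.items())

# Eight stand-in experts: expert e multiplies its input by e.
experts = [lambda x, s=s: s * x for s in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.3, 1.0]

gates = top_k_route(router_logits, k=2)   # experts 4 and 1
y = moe_ffn(1.0, router_logits, experts, k=2)
```

Only 2 of the 8 experts run for this token, which is the source of the "more parameters at fixed inference FLOPs" trade-off; all 8 still have to be resident in memory.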
SSMs and Mamba (Gu + Dao 2023)
State Space Models (SSMs) parameterize a continuous-time linear dynamical system and apply it to sequences. Mamba (Gu + Dao 2023) introduced selective SSMs where state-update parameters depend on the input, making the model input-conditional like attention but in O(N) compute and memory.
- Mamba (Gu + Dao 2023) — competitive with Transformer at small/medium scale; sub-quadratic in sequence.
- Mamba-2 (Dao + Gu 2024) — re-derives the SSM as a structured attention variant, unifying the two paradigms.
- Jamba (AI21 Labs 2024) — Transformer + Mamba hybrid, 256k context, MoE.
- Zamba (Zyphra 2024) and other hybrids.
As of 2026 Mamba-family models are competitive but have not displaced Transformers at the frontier; the practical wins are in long-context efficiency.
Linear and sparse attention
Earlier attempts to break the O(N^2) attention complexity:
- Linformer (Wang et al. 2020) — low-rank approximation of attention matrix.
- Performer (Choromanski et al. 2020) — random-feature kernel approximation of softmax attention.
- Longformer (Beltagy, Peters, Cohan 2020) — sparse attention pattern (local + global tokens).
- BigBird (Zaheer et al. 2020) — random + window + global sparse pattern.
- Reformer (Kitaev, Kaiser, Levskaya 2020) — locality-sensitive hashing for attention.
Mostly superseded by FlashAttention + GQA + RoPE-scaled context for moderate context lengths, and by Mamba / hybrid architectures for very long context.
Mixture of Depths (Raposo et al. 2024, DeepMind)
Per-token learned routing decides which tokens skip a Transformer block. Reduces FLOPs per layer; complementary to MoE (depth + width sparsity).
Speculative decoding
A small “draft” model proposes K tokens; the large model verifies them in a single forward pass, accepting the longest verified prefix.
- Speculative decoding (Leviathan, Kalman, Matias 2023, Google; Chen et al. 2023, DeepMind).
- Medusa (Cai et al. 2024) — multiple decoding heads on the same model predict next-K tokens.
- Lookahead decoding (Fu, Bailis, Stoica, Zhang 2024).
- EAGLE (Li et al. 2024) — trained draft head leveraging hidden states.
2-3x decode speedup typical for chat workloads; integrated in vLLM, TGI, TensorRT-LLM.
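The draft-then-verify loop can be sketched for the greedy case. This is a simplification: the cited papers use a rejection-sampling rule so that sampled outputs match the target distribution, and a real system scores all K proposed positions in one batched forward pass of the large model, which is emulated here with a plain loop. The toy "models" are hypothetical functions from a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative step; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. The target model checks each proposed position. A loop stands in
    #    for what would be a single batched forward pass.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        expected = target_next(ctx)
        if t != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))  # all accepted: one bonus token
    return accepted

# Toy models over digits: the target counts upward mod 10; the draft
# agrees except that it wrongly predicts 0 after a 3.
target_next = lambda ctx: (ctx[-1] + 1) % 10
draft_next = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

partial = speculative_step([0], draft_next, target_next, k=4)
full = speculative_step([5], draft_next, target_next, k=4)
```

Either way the output is exactly what the target model alone would have produced greedily; the draft only changes how many target passes are needed.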
Quantization
- GPTQ (Frantar, Ashkboos, Hoefler, Alistarh 2022) — post-training 4-bit weight quantization with second-order error correction.
- AWQ (Lin et al. 2023, MIT) — activation-aware weight quantization.
- GGUF / llama.cpp (Gerganov 2023+) — wide range of quantization formats (Q4_K_M, Q5_K_M, Q8_0) widely used on CPU + Apple Silicon.
- FP8 — natively supported on NVIDIA H100 (Hopper) and B100/B200 (Blackwell); both training (with stochastic rounding / per-tensor scaling) and inference.
- FP4 / NVFP4 — Blackwell B200 / GB200; inference-only at frontier scale.
- bitsandbytes (Dettmers et al. 2022) — 8-bit and 4-bit (NF4) quantization library; standard in HuggingFace.
Long context
Original Transformer training context was 512 (Vaswani 2017) or 1024 (GPT-2). 2024-26 frontier context windows:
- Mistral 7B / Mixtral — 32k native.
- LLaMA 3 — 8k native, 128k via continued training.
- LLaMA 3.1/3.2/3.3 — 128k.
- Claude 3 / 3.5 / 4.x — 200k+.
- Gemini 1.5 Pro — 1M; Gemini 1.5 / 2 / 3 up to 2M tokens. Google has not published the long-context mechanism in detail; the Gemini 1.5 report describes a sparse Mixture-of-Experts Transformer.
- GPT-4 Turbo / 4o / 4.1 — 128k+ (some variants 1M).
Enabling tech: RoPE scaling (PI, YaRN, NTK-aware), Ring Attention (Liu, Zaharia, Abbeel 2023), continued pre-training on long documents, careful eval (needle-in-a-haystack, RULER).
8. Scaling laws
Kaplan et al. 2020 (OpenAI)
Scaling Laws for Neural Language Models established that cross-entropy loss is a smooth power law in model parameters (N), dataset size (D), and compute (C):
L(N) ~ (N_c / N)^alpha_N
L(D) ~ (D_c / D)^alpha_D
L(C) ~ (C_c / C)^alpha_C
with exponents around 0.05-0.10 and crisp curves over many orders of magnitude. This justified the 175B-parameter GPT-3 (Brown et al. 2020) — Kaplan’s analysis suggested most of the available compute budget should go to bigger models with comparatively less data.
Chinchilla — Hoffmann et al. 2022 (DeepMind)
Training Compute-Optimal Large Language Models re-estimated the scaling laws and found that for a fixed compute budget, model size and dataset size should scale roughly equally — both ~ C^0.5. Specifically, the compute-optimal token-to-parameter ratio is about 20 tokens per parameter. Gopher (280B, 300B tokens) was severely under-trained; Chinchilla (70B, 1.4T tokens) achieves better loss at the same compute, validating the new law.
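The Gopher and Chinchilla figures can be checked with a few lines of arithmetic. The sketch below assumes the widely used approximation C ≈ 6 N D for training FLOPs, which is not stated in this document, together with the 20 tokens-per-parameter rule of thumb quoted above; the function names are hypothetical.

```python
import math

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

def train_flops(n_params, n_tokens):
    """Assumed approximation: C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens

def compute_optimal(c_flops, ratio=20.0):
    """Solve C = 6 * N * D with D = ratio * N for (N, D)."""
    n = math.sqrt(c_flops / (6.0 * ratio))
    return n, ratio * n

gopher_ratio = tokens_per_param(280e9, 300e9)      # about 1.07 tokens per parameter
chinchilla_ratio = tokens_per_param(70e9, 1.4e12)  # about 20 tokens per parameter

gopher_c = train_flops(280e9, 300e9)       # about 5.0e23 FLOPs
chinchilla_c = train_flops(70e9, 1.4e12)   # about 5.9e23 FLOPs

n_opt, d_opt = compute_optimal(chinchilla_c)  # recovers about 70e9 params, 1.4e12 tokens
```

Under these assumptions the two models sit at roughly the same compute budget but at very different token-to-parameter ratios, which is the comparison the paragraph describes.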
Practical impact: 2022-2024 frontier models trained on far more tokens per parameter than GPT-3:
- LLaMA 1 7B / 65B — 1T / 1.4T tokens.
- LLaMA 2 7B / 70B — 2T tokens.
- LLaMA 3 8B / 70B — 15T tokens.
- LLaMA 3.1 405B — 15T+ tokens.
- DeepSeek-V3 671B (37B active) — 14.8T tokens.
LLaMA’s findings (Touvron et al. 2023) further argued that continued training past the Chinchilla-optimal point improves inference economics: a smaller model trained longer is cheaper to serve than a larger Chinchilla-optimal one with similar quality. This pushed open models toward smaller, longer-trained recipes.
Emergent abilities — Wei et al. 2022 (vs Schaeffer et al. 2023)
Wei, Tay, Bommasani, et al. 2022 Emergent Abilities of Large Language Models documented capabilities (arithmetic, multi-step reasoning, certain instructions) that are near-random at small scale and improve sharply past a threshold. Schaeffer, Miranda, Koyejo 2023 Are Emergent Abilities of LLMs a Mirage? argued these “phase transitions” are largely artifacts of discontinuous metrics (e.g. exact-match accuracy); smoother metrics show continuous improvement.
Both views are partially correct: continuous loss improvement is real, but downstream task curves do show inflection points that matter operationally.
9. Training mechanics
Pre-training
- Objective — next-token prediction (decoder-only) or MLM (encoder-only) or span corruption (T5-style enc-dec).
- Token budget — 1T (LLaMA 1, 2023) → 15T (LLaMA 3, 2024) → similar scale for DeepSeek-V3, Qwen 2.5, Gemma 2/3.
- Data — mix of web crawl (Common Crawl, RefinedWeb, FineWeb), code (StackOverflow, GitHub), books, academic (arXiv), curated dialog. Frontier closed labs additionally use licensed data and proprietary collections.
- Tokenizer — BPE (Sennrich et al. 2016) or SentencePiece (Kudo + Richardson 2018) trained on a representative sample. Vocab 32k-128k.
Compute
- Frontier 2024-26: 10^25 - 10^26 FLOPs.
- GPT-4 estimated ~2 × 10^25 FLOPs.
- GPT-5 / Gemini Ultra / Claude Opus 4.x not officially disclosed; likely 10^26.
- Hardware budgets: 16k - 100k+ accelerators (H100, B200, TPU v5p, Trillium v6).
- Wall-clock: 3-6 months for frontier pre-training; continuous post-training pipelines layered on top.
Optimizer
- AdamW (Loshchilov + Hutter 2019) — decoupled weight decay; the universal default.
- Lion (Chen, Hsieh, et al. 2023, Google) — sign-based; cheaper and competitive on some workloads.
- Sophia (Liu et al. 2023, Stanford) — second-order; faster convergence claimed.
- AdaFactor (Shazeer + Stern 2018) — memory-efficient, used in T5.
Learning-rate schedule
- Linear warmup + cosine decay (Loshchilov + Hutter 2017) — standard.
- WSD (warmup-stable-decay, MiniCPM 2024) — flat plateau then short decay; allows continued training from intermediate checkpoints.
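Both schedules are simple functions of the step index. The sketch below is one plausible parameterization: the linear shape of the WSD decay phase, the `(step + 1)` warmup convention, and all argument names are assumptions, since implementations differ on these details.

```python
import math

def cosine_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def wsd_lr(step, max_lr, warmup_steps, total_steps, decay_steps, min_lr=0.0):
    """Warmup, flat plateau at max_lr, then a short (here linear) decay."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return max_lr
    frac = min(1.0, (step - decay_start) / max(1, decay_steps))
    return max_lr + frac * (min_lr - max_lr)

peak = 3e-4
cos_mid = cosine_lr(550, peak, 100, 1000)    # halfway through decay: peak / 2
wsd_mid = wsd_lr(550, peak, 100, 1000, 100)  # still on the plateau: peak
```

On the plateau the WSD learning rate does not depend on `total_steps`, so a checkpoint taken there can be continued to a longer run and decayed later; the cosine value at the same step is already tied to the planned run length.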
Numerical stability
- Mixed precision — bf16 (Google 2019, ubiquitous since A100) or fp16 with loss scaling.
- FP8 training — H100, B200; per-tensor scales tracked, stochastic rounding optional.
- Gradient clipping — typically L2 norm clipped at 1.0.
- z-loss — small auxiliary term penalizing softmax-logit magnitudes (PaLM, Chowdhery et al. 2022) to prevent numeric blowup.
Distributed training
- Data parallelism (DP) — replicate model, split batch. AllReduce on gradients each step.
- DDP / FSDP / ZeRO-3 — Fully Sharded Data Parallel (PyTorch FSDP / DeepSpeed ZeRO, Rajbhandari et al. 2020) shards optimizer states, gradients, and parameters across DP ranks. ZeRO-1 shards optimizer states only; ZeRO-2 adds gradients; ZeRO-3 adds parameters (full sharding, parameters gathered on-demand per layer via AllGather).
- Tensor parallelism (TP) — shard individual matmuls across GPUs (Megatron-LM, Shoeybi et al. 2019). Column-parallel for the up-projection of an MLP / Q/K/V projection; row-parallel for the down-projection. Bandwidth-hungry — practical only over NVLink / NVSwitch (within a node).
- Pipeline parallelism (PP) — assign different layers to different GPUs; micro-batches pipelined (GPipe, Huang et al. 2019; PipeDream, Narayanan et al. 2019; 1F1B / interleaved 1F1B in Megatron). Bubble overhead grows with the number of stages relative to micro-batches.
- Sequence parallelism / context parallelism — shard the sequence dimension; required for very long context. Ring Attention (Liu, Zaharia, Abbeel 2023), Striped Attention.
- Expert parallelism (EP) — shard MoE experts across devices; tokens are routed via all-to-all communication. Used in GShard and Switch Transformer training.
- 3D parallelism — combine TP × PP × DP, plus EP for MoE and CP for long context: “5D parallelism” in 2024+ frontier training.
Frameworks: Megatron-LM (NVIDIA, Shoeybi et al. 2019), DeepSpeed (Microsoft, Rajbhandari et al. 2020), PyTorch FSDP (Zhao et al. 2023), MaxText + MaxDiffusion (Google, JAX), MosaicML composer, Levanter (Stanford, JAX), Megatron-Core + NeMo (NVIDIA, production reference).
Post-training
After pre-training, frontier models go through one or more alignment stages:
- Supervised fine-tuning (SFT) — train on high-quality (prompt, response) pairs. InstructGPT (Ouyang et al. 2022), FLAN (Wei et al. 2021, Chung et al. 2022).
- RLHF — Reinforcement Learning from Human Feedback — Christiano, Leike, Brown, Martic, Legg, Amodei 2017 Deep RL from Human Preferences; Stiennon, Ouyang, et al. 2020 Learning to Summarize from Human Feedback; Ouyang et al. 2022 (InstructGPT). Train a reward model on human preference comparisons, then optimize the LLM with PPO (Schulman et al. 2017) against the reward + a KL penalty to the SFT model.
- DPO — Direct Preference Optimization (Rafailov, Sharma, Mitchell, Ermon, Manning, Finn 2023) — closed-form reformulation that bypasses the explicit reward model and PPO; treats the LLM itself as an implicit reward model and optimizes a classification-like loss on preference pairs.
- Constitutional AI / RLAIF (Bai et al. 2022, Anthropic) — replace human feedback with model feedback guided by a written constitution. Used by Anthropic for Claude.
- Newer variants — IPO (Identity Preference Optimization, Azar et al. 2023), KTO (Kahneman-Tversky Optimization, Ethayarajh et al. 2024), ORPO (Hong et al. 2024), SimPO (Meng, Xia, Chen 2024).
- Reasoning RL — DeepSeek-R1 (DeepSeek 2025) and OpenAI o1/o3/o4 use large-scale RL on verifiable rewards (math, code) to elicit chain-of-thought reasoning.
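The "classification-like loss on preference pairs" used by DPO can be written down for a single pair. This sketch follows the loss form in Rafailov et al. 2023; the inputs are summed log-probabilities of each response under the policy and under the frozen reference model, and the argument names and `beta` value are assumptions for the example.

```python
import math

def softplus(z):
    """Stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return softplus(-margin)  # equals -log(sigmoid(margin))

# Policy identical to the reference: no preference signal yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)

# Policy moved the wrong way.
worse = dpo_loss(-11.0, -11.0, -10.0, -12.0)
```

The loss starts at log 2 and falls as the policy raises the chosen response relative to the reference faster than the rejected one; no separate reward model or PPO rollout appears anywhere.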
10. Inference optimizations
KV cache (essential)
During autoregressive generation, K and V for each new token are appended to a running cache; previously-computed K, V never change. Without a KV cache, step t recomputes attention for all t tokens at O(t^2) cost, so generating N tokens costs O(N^3) in attention; with the cache, step t computes one query against t cached keys at O(t) cost, for O(N^2) total. The cache size is roughly:
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes
For LLaMA 3 70B (80 layers, 8 KV heads, 128 head dim, bf16, 8k context per batch element): 2 × 80 × 8 × 128 × 8192 × 2 ≈ 2.7 GB per sequence (2.5 GiB). The total scales linearly with batch size and context length, so at large batch or long context the KV cache can rival or exceed the model weights in memory. Mitigations: GQA / MQA, KV cache quantization (FP8, INT8, INT4), paged attention.
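The formula translates directly into a small calculator. The `batch` argument and the function name are additions for this sketch; the LLaMA 3 70B figures are the ones quoted above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   dtype_bytes, batch=1):
    """Bytes for K and V (hence the factor 2) across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes * batch

# LLaMA 3 70B with GQA: 80 layers, 8 KV heads, head_dim 128, bf16, 8k context.
gqa_bytes = kv_cache_bytes(80, 8, 128, 8192, 2)

# The same model if every one of the 64 query heads kept its own K and V.
mha_bytes = kv_cache_bytes(80, 64, 128, 8192, 2)

# A batch of 32 sequences at the same length.
batch_bytes = kv_cache_bytes(80, 8, 128, 8192, 2, batch=32)

print(gqa_bytes / 1e9, "GB per sequence with GQA")
print(mha_bytes / 1e9, "GB per sequence with full MHA")
print(batch_bytes / 1e9, "GB for a batch of 32")
```

The GQA figure is 2,684,354,560 bytes, and the hypothetical MHA variant is exactly 8 times larger, which matches the 4-8x reduction range quoted for GQA in Section 7.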
Continuous batching + PagedAttention (vLLM)
vLLM (Kwon, Li, Zhuang, et al. 2023, UC Berkeley) introduced PagedAttention, which manages the KV cache like virtual memory pages instead of a flat per-sequence buffer. This pairs well with continuous batching: requests join the batch as soon as a slot frees up rather than waiting for the slowest sequence in a static batch. The paper reports 2-4x higher throughput than FasterTransformer and Orca at comparable latency. vLLM is now among the most widely used open-source inference servers.
Speculative decoding
See Section 7. 2-3x decode latency reduction typical.
Tensor / pipeline parallelism (multi-GPU inference)
Same techniques as training, applied to inference. TP shards each matmul; PP shards layers. Latency-optimized serving usually picks TP up to the NVLink limit (typically 8 GPUs on a node) and avoids PP at small batch.
Disaggregated prefill + decode
Prefill (processing the prompt) is compute-bound; decode (generating one token at a time) is memory-bandwidth-bound. Disaggregated inference runs prefill on one GPU pool and decode on another, optimized differently. Mooncake (Moonshot AI 2024), DistServe (Zhong et al. 2024), Splitwise (Patel et al. 2023 Microsoft).
Stacks (2026)
- vLLM (Kwon et al. 2023) — most popular open-source server; PagedAttention; supports nearly every modern LLM.
- TGI — Text Generation Inference (HuggingFace) — production-grade open server, similar feature set.
- SGLang (Zheng, Yin, et al. 2024) — RadixAttention for prefix caching, structured generation, agent workflows.
- TensorRT-LLM (NVIDIA) — NVIDIA’s heavily optimized stack, an open-source library built on the proprietary TensorRT runtime; FP8 + speculative + paged.
- llama.cpp (Gerganov 2023+) — CPU + Apple Silicon (Metal) + ROCm + CUDA; GGUF format.
- MLX (Apple 2023+) — Apple Silicon inference + training.
- JetStream (Google) — TPU inference.
11. Common pitfalls
- Forgetting the causal mask in a decoder. Trivial bug, catastrophic effect: the model can “see the answer” at training time and won’t learn to generate. Verify with a known input: the logits at position t must not change when tokens after t are altered, and must match a forward pass over the prefix 1..t alone.
- Mixing up Q, K, V dimensions. Q and K must have the same last dimension (d_k) so the dot product is defined; V’s last dim (d_v) can differ but in practice equals d_k. With multi-head, d_k = d_v = d_model / h.
- LayerNorm placement. Post-LN at large scale leads to divergent training without aggressive warmup. Use pre-LN (with LayerNorm or RMSNorm) for any model deeper than ~12 layers.
- Catastrophic forgetting during fine-tuning. Aggressive LR or too many epochs on a narrow distribution erases pre-training. Mitigate with low LR (1e-5 to 5e-5), LoRA (Hu et al. 2021) or QLoRA (Dettmers et al. 2023) instead of full fine-tune, replay buffer of pre-training data, or careful early stopping.
- Hallucination and ungroundedness. Pure LLM output is parametric — there is no inherent connection between generated text and external facts. RAG (Retrieval-Augmented Generation) (Lewis et al. 2020) injects retrieved documents into context; tool use / function calling enables verified actions; both reduce but do not eliminate hallucination.
- Confusing context length with position-embedding range. A model trained with vanilla RoPE at 4k typically degrades rapidly beyond 4k tokens. Always check whether the deployed context window matches the position-encoding training range, or whether RoPE scaling (PI, YaRN, NTK-aware) has been applied.
- KV-cache memory blow-up at long context. At inference, use GQA, an FP8 or INT4 KV cache, paged attention, and prefix caching. At training time, gradient checkpointing, sequence parallelism, and FlashAttention are generally needed past about 8k tokens.
- Tokenization edge cases. Numbers, code, non-Latin scripts, and rare Unicode can produce surprising token sequences; arithmetic errors often trace back to tokenizer-induced digit fragmentation. The effect is tokenizer-specific: older BPE vocabularies such as GPT-2's merge digits irregularly, whereas cl100k_base / o200k_base split numbers into groups of at most three digits.
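- Training/inference dtype mismatch. A model trained in bf16 and run in fp16 can silently overflow or lose precision, because fp16 has a narrower exponent range, and then produce different outputs; pick a deployment dtype and verify that evaluation perplexity matches the value measured in the training dtype.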
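- Wrong attention dropout for the deployment context. Dropout must be disabled at inference; most inference-stack reimplementations omit it entirely (equivalent to p=0). Custom training code that reimplements a reference model should match the reference dropout rate exactly.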
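The causal-mask check described in the first pitfall can be sketched as follows. This is a minimal pure-Python toy, assuming single-head attention with Q = K = V = x and no learned weights; attention and leaks_future are hypothetical helper names, and a real test would call the model's forward pass and compare logits instead.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    total = sum(es)
    return [e / total for e in es]


def attention(x, causal=True):
    """Toy single-head self-attention with Q = K = V = x.

    x is a list of T vectors of dimension d; returns T output vectors.
    With causal=True, position i attends only to positions 0..i.
    """
    T, d = len(x), len(x[0])
    out = []
    for i in range(T):
        limit = i + 1 if causal else T
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
                  for j in range(limit)]
        weights = softmax(scores)
        out.append([sum(weights[j] * x[j][k] for j in range(limit))
                    for k in range(d)])
    return out


def leaks_future(attn_fn, x, tol=1e-9):
    """Return True if any position's output changes when later tokens are removed."""
    full = attn_fn(x)
    for t in range(1, len(x) + 1):
        prefix = attn_fn(x[:t])  # forward pass on the prefix ending at position t-1
        if any(abs(a - b) > tol for a, b in zip(full[t - 1], prefix[t - 1])):
            return True
    return False


random.seed(0)
x = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]

print(leaks_future(attention, x))                              # False
print(leaks_future(lambda s: attention(s, causal=False), x))   # True
```

The same prefix-versus-full comparison applies to a real decoder: if the output at position t changes when tokens after t are dropped, the mask is missing or misapplied.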
12. Vendors and ecosystem (2026)
Frontier closed-weight models
- OpenAI — GPT-4 / 4 Turbo / 4o / 4.1 / 5; o1 / o3 / o4-mini reasoning models. API + Microsoft Azure OpenAI.
- Anthropic — Claude 3 / 3.5 / 4 / 4.x (Opus, Sonnet, Haiku tiers). Constitutional-AI training; long context (200k+); strong tool use and computer use.
- Google DeepMind — Gemini 1.0 / 1.5 / 2.0 / 2.5 / 3 (Ultra, Pro, Flash, Nano). 1M-2M token context; native multimodal.
- Mistral AI — Mistral Large 1 / 2 / 3, Mistral Medium / Small, Mixtral 8x7B / 8x22B (open weights), Codestral.
- xAI — Grok-1 (open weights, MoE), Grok-2 / 3 / 4.
- Cohere — Command R / R+ / R7B; Embed v3 / v4; Rerank.
Open-weight models
- Meta — LLaMA 1 / 2 / 3 / 3.1 / 3.2 / 3.3 / 4 (8B / 70B / 405B tiers in 3.1; vision variants in 3.2; natively multimodal MoE in 4).
- Mistral — Mistral 7B, Mixtral 8x7B / 8x22B.
- Alibaba — Qwen 1.5 / 2 / 2.5 / 3 (0.5B - 110B; coder, math, VL variants).
- DeepSeek — DeepSeek V2 / V3 (671B sparse MoE), DeepSeek R1 (reasoning).
- Microsoft — Phi-2 / 3 / 4 (small, high-quality).
- Google — Gemma 1 / 2 / 3 (open Gemini-architecture models).
- 01.AI — Yi 1.5 / 2.
- Stability AI / EleutherAI / BigScience — historical: StableLM (Stability AI); GPT-J, GPT-NeoX, Pythia (EleutherAI); BLOOM (BigScience).
Training + inference frameworks
- PyTorch (Meta) — dominant; PyTorch 2.x with torch.compile, FSDP, scaled_dot_product_attention.
- HuggingFace Transformers — ubiquitous model hub + reference implementations.
- JAX + Flax (Google) — TPU-native; used by DeepMind, Google Research, MaxText, Levanter.
- PyTorch Lightning — training boilerplate.
- DeepSpeed (Microsoft) — ZeRO, MoE, inference engine.
- Megatron-LM + NeMo (NVIDIA) — large-scale TP/PP training.
- MaxText (Google) — JAX large-scale training reference on TPU.
- Composer (MosaicML / Databricks).
- TRL (HuggingFace) — RLHF / DPO / SFT.
- Axolotl, Unsloth — community fine-tuning stacks with LoRA / QLoRA.
Hardware
- NVIDIA — A100 (Ampere, 2020), H100 / H200 (Hopper, 2022), B100 / B200 / GB200 (Blackwell, 2024-25), with Rubin (R100/R200) on the 2026 roadmap. FP8 + FP4 + Transformer Engine.
- Google TPU — v4 (2021), v5e / v5p (2023), Trillium v6e (2024), Ironwood v7 (2025).
- AMD — MI300X / MI325X (CDNA 3, 2023-24), MI350X / MI355X (CDNA 4, 2025), MI400 series on the 2026 roadmap. ROCm software stack.
- Cerebras — CS-2, CS-3 wafer-scale.
- Groq — LPU (Language Processing Unit), very-low-latency inference.
- Tenstorrent — Wormhole, Blackhole.
- AWS — Trainium / Inferentia (Neuron SDK).
- Apple — M-series + Neural Engine; MLX framework.
- HBM3 / HBM3e / HBM4 — high-bandwidth memory stacked beside the accelerator die; supplied by SK Hynix, Samsung, Micron. Memory capacity and bandwidth, more than logic, are the frontier bottleneck.
13. Cross-references
- _index
- rag-embeddings (TBD)
- fine-tuning-rlhf (TBD)
- linear-algebra (TBD)
- control-algorithms — foundation-models section (RT-2, Octo, Open-X-Embodiment).
- semiconductor-materials — HBM stacks, GPU substrates, silicon-photonics interconnect.
14. Citations
Foundational
- Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin 2017 — Attention Is All You Need. NeurIPS. arXiv:1706.03762.
- Ba, Kiros, Hinton 2016 — Layer Normalization. arXiv:1607.06450.
- He, Zhang, Ren, Sun 2016 — Deep Residual Learning for Image Recognition. CVPR.
- Hendrycks, Gimpel 2016 — Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
- Zhang, Sennrich 2019 — Root Mean Square Layer Normalization (RMSNorm). NeurIPS.
Encoder / decoder / enc-dec families
- Devlin, Chang, Lee, Toutanova 2018 — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
- Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, Stoyanov 2019 — RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
- He, Liu, Gao, Chen 2020 — DeBERTa. ICLR 2021.
- Radford, Narasimhan, Salimans, Sutskever 2018 — Improving Language Understanding by Generative Pre-Training (GPT-1).
- Radford, Wu, Child, Luan, Amodei, Sutskever 2019 — Language Models are Unsupervised Multitask Learners (GPT-2).
- Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, et al. 2020 — Language Models are Few-Shot Learners (GPT-3). NeurIPS.
- OpenAI 2023 — GPT-4 Technical Report.
- Touvron, Lavril, Izacard, Martinet, Lachaux, Lacroix, Rozière, Goyal, Hambro, et al. 2023 — LLaMA: Open and Efficient Foundation Language Models.
- Touvron, Martin, Stone, Albert, Almahairi, Babaei, Bashlykov, et al. 2023 — Llama 2: Open Foundation and Fine-Tuned Chat Models.
- Meta AI 2024 — The Llama 3 Herd of Models.
- Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu 2019 — Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5). JMLR.
- Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov, Zettlemoyer 2019 — BART.
- Radford, Kim, Xu, Brockman, McLeavey, Sutskever 2022 — Robust Speech Recognition via Large-Scale Weak Supervision (Whisper).
Position encodings
- Shaw, Uszkoreit, Vaswani 2018 — Self-Attention with Relative Position Representations. NAACL.
- Su, Lu, Pan, Murtadha, Wen, Liu 2021 — RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE). arXiv:2104.09864.
- Press, Smith, Lewis 2021 — Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi). ICLR 2022.
- Chen, Wong, Chen, Tian 2023 — Extending Context Window of Large Language Models via Positional Interpolation. arXiv:2306.15595.
- Peng, Quesnelle, Fan, Shippole 2023 — YaRN: Efficient Context Window Extension of Large Language Models. arXiv:2309.00071.
Attention and architecture variants
- Dao, Fu, Ermon, Rudra, Ré 2022 — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS.
- Dao 2023 — FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
- Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao 2024 — FlashAttention-3.
- Shazeer 2019 — Fast Transformer Decoding: One Write-Head is All You Need (MQA).
- Ainslie, Lee-Thorp, de Jong, Zemlyanskiy, Lebrón, Sanghai 2023 — GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
- Shazeer 2020 — GLU Variants Improve Transformer (SwiGLU, GeGLU).
- Fedus, Zoph, Shazeer 2021 — Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
- Jiang, Sablayrolles, Roux, et al. (Mistral AI) 2024 — Mixtral of Experts.
- DeepSeek-AI 2024 — DeepSeek-V3 Technical Report.
- Gu, Dao 2023 — Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
- Dao, Gu 2024 — Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2).
- Raposo, Ritter, Richards, Lillicrap, Humphreys, Santoro 2024 — Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models.
Scaling laws + emergence
- Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, Amodei 2020 — Scaling Laws for Neural Language Models. arXiv:2001.08361.
- Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, Rutherford, Casas, Hendricks, et al. 2022 — Training Compute-Optimal Large Language Models (Chinchilla). NeurIPS.
- Wei, Tay, Bommasani, et al. 2022 — Emergent Abilities of Large Language Models. TMLR.
- Schaeffer, Miranda, Koyejo 2023 — Are Emergent Abilities of Large Language Models a Mirage? NeurIPS.
Optimization + post-training
- Loshchilov, Hutter 2017 — SGDR: Stochastic Gradient Descent with Warm Restarts (cosine LR).
- Loshchilov, Hutter 2019 — Decoupled Weight Decay Regularization (AdamW). ICLR.
- Chen, Liang, Huang, Real, Wang, Liu, Pham, et al. 2023 — Symbolic Discovery of Optimization Algorithms (Lion).
- Rajbhandari, Rasley, Ruwase, He 2020 — ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.
- Shoeybi, Patwary, Puri, LeGresley, Casper, Catanzaro 2019 — Megatron-LM.
- Christiano, Leike, Brown, Martic, Legg, Amodei 2017 — Deep Reinforcement Learning from Human Preferences. NeurIPS.
- Stiennon, Ouyang, Wu, Ziegler, Lowe, Voss, Radford, Amodei, Christiano 2020 — Learning to Summarize from Human Feedback. NeurIPS.
- Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, et al. 2022 — Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS.
- Bai, Kadavath, Kundu, Askell, Kernion, Jones, Chen, Goldie, et al. 2022 — Constitutional AI: Harmlessness from AI Feedback. Anthropic.
- Schulman, Wolski, Dhariwal, Radford, Klimov 2017 — Proximal Policy Optimization Algorithms (PPO).
- Rafailov, Sharma, Mitchell, Ermon, Manning, Finn 2023 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO). NeurIPS.
- Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen 2021 — LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
- Dettmers, Pagnoni, Holtzman, Zettlemoyer 2023 — QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS.
Inference + serving
- Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica 2023 — Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). SOSP.
- Leviathan, Kalman, Matias 2023 — Fast Inference from Transformers via Speculative Decoding. ICML.
- Cai, Li, Geng, Peng, Lee, Chen, Dao 2024 — Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.
- Frantar, Ashkboos, Hoefler, Alistarh 2022 — GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.
- Lin, Tang, Tang, Yang, Chen, Wang, Xiao, Dang, Gan, Han 2023 — AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.
- Zheng, Yin, Xie, Sun, Huang, Yu, Cao, Kozyrakis, Stoica, Gonzalez, Barrett, Sheng 2024 — SGLang: Efficient Execution of Structured Language Model Programs.
- Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, Rocktäschel, Riedel, Kiela 2020 — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG). NeurIPS.
Vision / multimodal / domain
- Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, Houlsby 2020 — An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). ICLR 2021.
- Liu, Lin, Cao, Hu, Wei, Zhang, Lin, Guo 2021 — Swin Transformer.
- Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, Sutskever 2021 — Learning Transferable Visual Models From Natural Language Supervision (CLIP).
- Jumper, Evans, Pritzel, Green, et al. 2021 — Highly accurate protein structure prediction with AlphaFold. Nature.
- Abramson, Adler, Dunger, et al. 2024 — Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature.
Last updated 2026-05-16. Maintained as a Tier 1 deep reference; cross-references in Section 13 should be filled out as sibling notes land.