Fine-tuning & RLHF — Compute Reference

1. At a glance

Modern large language model (LLM) training is a three- or four-stage pipeline. A base transformer is first pre-trained on trillions of tokens of raw web / book / code text with a next-token cross-entropy objective. That base model has broad linguistic and world-knowledge competence but is poor at following instructions, refuses badly, makes up facts confidently, and emits the wrong format for a chat product. Post-training fixes that.

The canonical recipe since InstructGPT (Ouyang et al. 2022, NeurIPS) and the production deployment of ChatGPT, Claude, and Gemini is:

  1. Supervised fine-tuning (SFT) on human-written or distilled (instruction, response) pairs. Teaches format, persona, and basic helpfulness.
  2. Preference tuning on (prompt, chosen, rejected) pairs derived from human or model judges. Used to be PPO-RLHF with a separate reward model; since 2024 the field has largely moved to DPO and its variants because they are simpler, more stable, and competitive on quality.
  3. Optional reasoning RL (DeepSeek-R1, OpenAI o1/o3, Gemini Thinking) using verifiable rewards (math/code/proof correctness) with PPO-family algorithms like GRPO — this stage produced the 2024-25 step change in reasoning.
  4. Optional safety tuning (Constitutional AI, red-team RLAIF, refusal shaping) interleaved with the above.

The goal is alignment: the model should be helpful (do what the user asked), honest (don’t fabricate), harmless (don’t help with catastrophic misuse), and formatted (use the chat template the product expects). The trade-offs between these dimensions are real and are the entire reason post-training is hard.

This note is a working reference for the techniques, the libraries, the benchmarks, the cost envelopes, and the pitfalls — not a tutorial.


2. Supervised Fine-Tuning (SFT)

SFT is the workhorse. Conceptually it is the same teacher-forced next-token cross-entropy loss used in pre-training, but the data is now formatted as (prompt, response) pairs and the loss is usually masked to apply only to the response tokens — the model should not be penalized for predicting the prompt it was given. Optimization is on a much smaller corpus.

Data scale. Effective SFT corpora range from a few thousand high-quality examples (LIMA, Zhou et al. 2023, showed 1k carefully chosen examples produced a competitive chat model on a 65B base) to roughly one million for production-grade instruction tuning. Quality dominates quantity: a smaller corpus of carefully curated, diverse, high-skill responses outperforms a noisy million-example dump.

Canonical SFT datasets.

  • FLAN (Wei et al. 2022, Google) and FLAN-v2 / the Flan Collection (Longpre et al. 2023) — academic-task instruction mixtures, large scale, very influential.
  • OpenAssistant Conversations (LAION 2023) — crowdsourced multi-turn dialogue with human ratings.
  • Anthropic HH (Helpful + Harmless, Bai et al. 2022) — used both for SFT and as preference data.
  • GPT-4 distillation corpora — Alpaca, Vicuna ShareGPT, WizardLM, UltraChat, OpenOrca, distilled from frontier teachers. Legally murky (terms-of-service violation depending on the teacher) but ubiquitous in OSS.
  • NoRobots (HuggingFace 2023) — purely human-written, no model output.
  • Tulu series (AI2) — open, well-documented mixes.
  • Domain-specific: MetaMathQA, MAmmoTH (math), CodeAlpaca, Magicoder, OSS Instruct (code), MedInstruct, Bio-Instruct, etc.

Hyperparameters that matter for SFT.

  • Epochs: 1-3. More than 3 epochs on a small corpus generally overfits and produces a brittle model that parrots training responses.
  • Learning rate: 1e-5 to 5e-5 for full fine-tune of a transformer; LoRA can tolerate 1e-4 to 5e-4 because only a tiny number of parameters are being trained. Always lower than pre-train (which is typically 1e-4 to 3e-4 on the large models at the start of training, decayed to ~1e-5).
  • Schedule: cosine decay with a short linear warmup (3-10% of steps) is standard. Constant-with-warmup is acceptable for tiny fine-tunes.
  • Batch size: as large as the GPU allows. Gradient accumulation is used freely to reach an effective batch size of 32-256.
  • Sequence packing: pack multiple short examples into one sequence with attention masking between them. Roughly 2-4x throughput gain on instruction data, which is heavily right-skewed in length.
  • Loss masking: critical. Mask out the prompt tokens. Mask out padding. Mask out special tokens you don’t want the model to over-emit.
  • Chat template: the SFT data must be rendered into the exact chat template the model will serve with. A common, painful bug is to SFT in one template and deploy with another.
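
The loss-masking rule above can be sketched in a few lines. This is a minimal illustration, not any particular library's collator: build_labels and pad_id are hypothetical names, and the sketch assumes labels are aligned one-to-one with input_ids and that the training loss performs the next-token shift and skips the ignore index (-100 is the default ignore_index of PyTorch's CrossEntropyLoss).

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(prompt_ids, response_ids, max_len, pad_id=0):
    """Return (input_ids, labels) so that only response tokens contribute to the loss.

    build_labels and pad_id are illustrative names, not part of any library.
    """
    input_ids = (list(prompt_ids) + list(response_ids))[:max_len]
    # Prompt positions are masked; response positions keep their token ids.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids))[:max_len]
    n_pad = max_len - len(input_ids)
    # Padding is appended on the right and masked as well.
    input_ids = input_ids + [pad_id] * n_pad
    labels = labels + [IGNORE_INDEX] * n_pad
    return input_ids, labels

example_inputs, example_labels = build_labels([11, 12, 13], [21, 22], max_len=8)
```

Special tokens that should not be over-emitted can be masked the same way, by overwriting their label positions with IGNORE_INDEX.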

Diagnostics. A healthy SFT run shows training loss decreasing smoothly from roughly 1.5-2.5 down to roughly 0.5-1.0 over 1-3 epochs. Loss going below ~0.3 is a strong sign of memorization, especially on small corpora. Always hold out a validation split and a small held-out preference / chat-quality eval — loss is a poor proxy for the thing you care about.


3. Parameter-efficient fine-tuning (PEFT)

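Full fine-tuning of a 70B model requires roughly 1.2 TB of GPU memory for optimizer state + gradients + activations + weights at bf16 with Adam. That is prohibitive outside a real cluster. PEFT methods freeze the base model and train a small additional set of parameters that perturb the base behavior. PEFT cuts the trainable-parameter count by ~99% or more, which removes almost all gradient and optimizer-state memory; the frozen base weights and the activations still have to fit, so total memory typically falls severalfold rather than by 99%. Compute drops more modestly, because the forward pass and the backward pass through the frozen layers still run. For most downstream tasks, PEFT recovers 95-100% of the full-fine-tune quality.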
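
The memory figure can be checked with back-of-envelope arithmetic. The sketch below assumes a common mixed-precision Adam layout of 16 bytes per parameter (bf16 weights and gradients, plus fp32 master weights and two fp32 Adam moments); that breakdown is an assumption, not something stated above, and activations are excluded.

```python
# Assumed per-parameter byte budget for mixed-precision Adam (illustrative).
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_first_moment": 4,
    "fp32_adam_second_moment": 4,
}

def training_state_bytes(n_params: int) -> int:
    """Weights + gradients + optimizer state; activations are not counted."""
    return n_params * sum(BYTES_PER_PARAM.values())

def to_terabytes(n_bytes: int) -> float:
    return n_bytes / 1e12

# 70e9 parameters * 16 bytes = 1.12e12 bytes, i.e. about 1.12 TB before activations.
state_tb = to_terabytes(training_state_bytes(70_000_000_000))
```

Activation memory depends on batch size, sequence length, and checkpointing, which is why the total lands somewhat above this figure.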

3.1 LoRA — Low-Rank Adaptation

LoRA (Hu et al. 2021, Microsoft) is the dominant PEFT method. For each chosen weight matrix W in R^(d_out × d_in), it learns two small matrices A in R^(r × d_in) and B in R^(d_out × r) such that the effective weight during fine-tuning is W + ΔW = W + B·A, with rank r ≪ min(d_in, d_out). Only A and B are trained; W is frozen.

Typical configuration:

  • Rank r: 4-64. Higher rank gives more capacity but more parameters. r=8 or r=16 is the most common starting point. Rank above 64 rarely helps for instruction-style fine-tunes.
  • Alpha (scaling factor): the actual update applied is (α / r) · B · A. Setting α = 2r is a common convention; it holds the multiplier at 2 as r changes, so the learning rate needs less retuning when you sweep rank.
  • Targets: which weight matrices to adapt. Common choice is all linear projections in attention (q_proj, k_proj, v_proj, o_proj). More aggressive recipes also target the MLP gate/up/down projections, which roughly doubles trainable parameters and is often worth it.
  • Dropout: 0.05 to 0.1 inside the LoRA adapters.
  • Initialization: A is initialized with Kaiming or similar; B is zero so that ΔW = 0 at start and the model behaves identically to the base.
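
A minimal sketch of the configuration above, using plain Python lists instead of a tensor library. The function names are hypothetical, and the small Gaussian init for A stands in for "Kaiming or similar"; the point is to show that ΔW = 0 at initialization and how few parameters an adapter adds.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_init(d_out, d_in, r, seed=0):
    """A is small random (assumed Gaussian here), B is zero, so B @ A starts at zero."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B

def effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    scale = alpha / len(A)
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters in one adapter: r * d_in for A plus d_out * r for B."""
    return r * (d_in + d_out)

# A 4096 x 4096 projection at r=16 trains 131072 parameters,
# about 0.78% of the 16777216 parameters in the frozen matrix.
adapter_params = lora_param_count(4096, 4096, 16)
```

Because the adapter is just two small matrices per target layer, it can be stored and swapped independently of the base weights.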

A LoRA adapter for a 7B model is typically 10-200 MB on disk and can be swapped at inference time, which makes LoRA the basis of multi-tenant model serving (vLLM, SGLang, Predibase, LoRAX all support hot-swapping adapters).

3.2 QLoRA — Quantized LoRA

QLoRA (Dettmers et al. 2023, NeurIPS) made fine-tuning a 65B model feasible on a single 48 GB GPU. Three techniques stacked:

  1. 4-bit NormalFloat (NF4) quantization of the frozen base weights, with block-wise quantization constants.
  2. Double quantization of the quantization constants themselves, saving another ~0.4 bits per parameter on average.
  3. Paged optimizers that swap optimizer state to CPU memory during gradient checkpointing spikes, preventing out-of-memory crashes.

LoRA adapters are trained at higher precision (bf16) on top of the quantized base. The base stays frozen and quantized; only the LoRA matrices are updated. End-to-end quality matches 16-bit LoRA on most benchmarks.
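
The double-quantization saving can be reproduced with simple arithmetic. The block sizes and bit widths below (64 weights per block with a 32-bit constant; constants re-quantized to 8 bits in groups of 256 with one 32-bit second-level constant) are the values recalled from the QLoRA paper and should be treated as assumptions here; the function names are illustrative.

```python
def constant_overhead_bits(block_size, const_bits):
    """Bits per parameter spent on one quantization constant per block."""
    return const_bits / block_size

def double_quant_overhead_bits(block_size, quantized_const_bits,
                               second_block_size, second_const_bits):
    """Overhead when the per-block constants are themselves quantized in groups."""
    first_level = quantized_const_bits / block_size
    second_level = second_const_bits / (block_size * second_block_size)
    return first_level + second_level

plain = constant_overhead_bits(64, 32)                  # 0.5 bits per parameter
doubled = double_quant_overhead_bits(64, 8, 256, 32)    # about 0.127 bits per parameter
saving = plain - doubled                                # about 0.373 bits per parameter
```

Under these assumed settings the saving is roughly 0.37 bits per parameter, consistent with the ~0.4 figure quoted above.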

QLoRA is the practical default for hobbyist and small-team fine-tuning. The bitsandbytes library is the canonical implementation; Hugging Face PEFT wraps it.

3.3 DoRA — Weight-Decomposed Low-Rank Adaptation

DoRA (Liu et al. 2024, NVIDIA) decomposes each weight into magnitude and direction, applies LoRA only to the direction, and learns the magnitude as a separate small vector. Reported to close most of the quality gap between LoRA and full fine-tune for the same parameter count. Supported in PEFT.

3.4 AdaLoRA

AdaLoRA (Zhang et al. 2023) starts with a uniform rank and adaptively re-allocates rank budget across layers using an importance score during training. Layers that need more capacity get more rank; others shrink. Useful when you have a tight parameter budget and don’t want to hand-tune r per layer.

3.5 Prefix tuning, P-tuning, Prompt tuning

A different family. Instead of perturbing the weight matrices, these methods prepend a small number of trainable continuous “soft prompt” vectors to the input (and sometimes to each transformer layer’s keys/values). The base model weights are entirely frozen.

  • Prompt tuning (Lester et al. 2021) — soft prompt at the embedding layer only. Works at very large scale (10B+) but underperforms LoRA on smaller models.
  • Prefix tuning (Li & Liang 2021) — soft prefix in the key-value cache of every layer. More capacity than prompt tuning.
  • P-tuning v2 (Liu et al. 2022) — refinement of prefix tuning with deep prompts and reparameterization tricks for training stability.

Soft-prompt methods have largely been displaced by LoRA in practice because LoRA is more stable and usually gives better quality at similar parameter counts.

3.6 (IA)^3 — Infused Adapter by Inhibiting and Amplifying Inner Activations

(IA)^3 (Liu et al. 2022) is even more parameter-efficient than LoRA. It learns three scaling vectors per transformer block (one each for keys, values, and the MLP intermediate activations) and uses them to rescale those activations elementwise. Parameter count is on the order of 0.01% of the base model. Strong on T0-style multitask benchmarks; less common than LoRA today but useful when memory is extremely constrained.

3.7 The PEFT library

Hugging Face’s peft library standardizes LoRA, QLoRA, DoRA, AdaLoRA, prefix tuning, prompt tuning, (IA)^3, and several newer methods behind a common interface. It integrates with transformers, trl, accelerate, and bitsandbytes. It is the de-facto OSS PEFT stack.


4. Instruction tuning

Instruction tuning is a special case of SFT where the data is explicitly formatted as (instruction, optional input, output). It dates back to T5’s text-to-text multitask framing (Raffel et al. 2020) and was made central by FLAN (Wei et al. 2022): the surprising finding was that fine-tuning on a broad mix of instruction-formatted tasks made a model dramatically better at following unseen instructions zero-shot.

Key datasets in chronological order:

  • FLAN, FLAN-T5, FLAN-v2 (Google, 2021-2022) — academic NLP tasks reformatted as instructions, ~1.8k task variants.
  • Self-Instruct (Wang et al. 2022) — bootstrap instructions from a small seed by prompting a strong LM; the technique behind Alpaca.
  • Alpaca (Taori et al. 2023, Stanford) — 52k self-instruct-generated examples distilled from GPT-3.5 (text-davinci-003). Cheap to fine-tune (~$100 of compute at the time) and kicked off the OSS instruction-tuning boom.
  • Vicuna / ShareGPT (LMSYS 2023) — real ChatGPT conversations scraped from share links.
  • WizardLM (Xu et al. 2023) — evol-instruct, recursively rewriting prompts to increase difficulty.
  • OpenAssistant (LAION 2023) — large, human-annotated, multi-turn, multi-lingual.
  • NoRobots (HuggingFace 2023) — 10k human-only examples, no LM output.
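  • UltraChat (Ding et al. 2023, Tsinghua) / UltraFeedback (Cui et al. 2023, Tsinghua) — large-scale teacher distillation; UltraChat supplies SFT dialogues and UltraFeedback supplies GPT-4-rated preference data.
  • Orca (Mukherjee et al. 2023) and Orca 2 (Mitra et al. 2023), both Microsoft — explanation-traced GPT-4 distillation.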
  • Magpie (Xu et al. 2024) — self-synthesized instructions from a tuned model itself.
  • Tulu mix (AI2 2023-2024) — well-documented open recipe combining many of the above.

A general rule of thumb for production SFT mixes is roughly 80% general instruction data + 20% domain-specific. Heavier weighting on domain data tends to cause catastrophic forgetting of general capability; lighter weighting underdelivers on the target task.


5. Reward modeling

For PPO-RLHF you need a reward model (RM) that produces a scalar preference score for any (prompt, response) pair. The classical recipe:

  1. Start from the SFT model.
  2. Replace the LM head with a linear scalar head (one logit, no softmax).
  3. For each prompt, sample 4-10 completions from the SFT model.
  4. Show pairs of completions to human annotators (or to a model judge) and ask which is better.
  5. Fit the Bradley-Terry pairwise preference model (Bradley & Terry 1952): the probability that completion y_w is preferred over y_l is σ(r(x, y_w) − r(x, y_l)). Train the RM with the negative log-likelihood of the observed preferences.
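
Step 5 reduces to a one-line loss per comparison. The sketch below takes scalar RM scores as plain floats; in a real pipeline they would come from the scalar head on the SFT backbone. Function names are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_pairwise_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen completion beats the rejected one."""
    return -log_sigmoid(r_chosen - r_rejected)

def rm_batch_loss(score_pairs):
    """Mean loss over a batch of (r_chosen, r_rejected) score pairs."""
    return sum(bt_pairwise_loss(rw, rl) for rw, rl in score_pairs) / len(score_pairs)
```

Only the difference of the two scores enters the loss, so the RM's absolute scale is unconstrained; this is one reason raw RM scores are often normalized before being used as RL rewards.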

A common recommendation is that the RM be at least the size of the policy it scores, and ideally larger, though practice varies: InstructGPT paired a 6B RM with a 175B policy. Undersized RMs are easier to game in the RL phase. The RM is the most expensive and most brittle artifact in the RLHF pipeline; its mistakes show up as reward hacking downstream.

Open RM datasets: Anthropic HH (Bai et al. 2022), OpenAssistant rankings, UltraFeedback, HelpSteer (Wang et al. 2023, NVIDIA), Nectar (Berkeley), PKU SafeRLHF.


6. PPO RLHF

The InstructGPT / ChatGPT recipe (Ouyang et al. 2022; building on Stiennon et al. 2020’s summarization RLHF work and Christiano et al. 2017’s preference learning work) is:

  1. Sample a prompt x from a prompt distribution.
  2. Sample a completion y from the current policy π_θ.
  3. Score (x, y) with the reward model r(x, y).
  4. Add a per-token KL penalty against the reference (SFT) policy π_ref: the reward at token t is − β · log(π_θ(y_t | x, y_<t) / π_ref(y_t | x, y_<t)), with the scalar RM score r(x, y) added at the final token of the completion.
  5. Optimize π_θ with Proximal Policy Optimization (PPO, Schulman et al. 2017) on advantages estimated from these rewards.
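
Steps 3 and 4 can be sketched as a reward-shaping function. One common implementation convention, assumed here, applies the KL penalty at every token and adds the scalar RM score once, at the final token of the completion. The function name and argument layout are illustrative.

```python
def shaped_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token rewards for one completion.

    logp_policy[t] and logp_ref[t] are the log-probabilities that the current
    policy and the frozen reference assign to the sampled token y_t.
    """
    # KL penalty term at every position: -beta * log(pi_theta / pi_ref).
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    if rewards:
        # Assumed convention: the sequence-level RM score lands on the last token.
        rewards[-1] += rm_score
    return rewards
```

For example, with beta = 0.1, a token where the policy is 0.5 nats more confident than the reference is charged 0.05 reward.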

PPO’s clipped surrogate objective. PPO defines a probability ratio r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) (not to be confused with the reward r(x, y)) and maximizes the expectation of min(r_t · A_t, clip(r_t, 1 − ε, 1 + ε) · A_t), where A_t is an advantage estimate (usually GAE, Schulman et al. 2016) and ε is typically 0.1 to 0.3. The clip removes the incentive to move the policy far from π_θ_old in a single update.
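
A minimal sketch of the clipped surrogate for a batch of tokens, working from log-probabilities so the ratio is computed as exp(log π_θ − log π_θ_old). Advantages are taken as given; estimating them (for example with GAE) is outside this sketch, and the function name is illustrative.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate over tokens; this quantity is maximized."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Taking the min makes the objective a pessimistic bound on the unclipped term.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

With a positive advantage, raising the ratio beyond 1 + ε earns nothing extra; with a negative advantage, pushing it below 1 − ε earns nothing extra either.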

KL regularization. β is the most important RLHF hyperparameter. β=0.05 to 0.5 is the practical range. Too small and the policy drifts into degenerate high-reward responses that exploit RM blind spots. Too large and the policy barely changes from SFT.

Failure modes.

  • Reward hacking — the policy finds tricks the RM rewards but humans hate: excessive hedging, refusal-shaped responses, formatted-but-empty answers, obsequious sycophancy.
  • Mode collapse — the policy converges to a narrow style and stops exploring.
  • Length bias — RMs trained on human comparisons tend to prefer longer responses; PPO amplifies this until responses are 3-5x longer than SFT.
  • Training instability — PPO is finicky; gradient explosions, policy divergence, and KL blowups are common.
  • Expense — every PPO step requires policy rollouts, RM forward passes, reference-model forward passes for the KL term, and forward/backward passes through a value (critic) network. End-to-end RLHF on a frontier model is 3-10x more compute than SFT.

Implementations.

  • HuggingFace TRL (von Werra et al. 2020+) — the most-used OSS RLHF stack; supports PPO, DPO, GRPO, KTO, ORPO.
  • trlX (CarperAI / Stability) — earlier OSS PPO library, now largely superseded by TRL.
  • DeepSpeed-Chat (Microsoft) — end-to-end SFT + RM + PPO pipeline with ZeRO-3.
  • NVIDIA NeMo-Aligner — Megatron-LM-backed RLHF for very large models.
  • OpenRLHF — Ray-based scalable RLHF with vLLM-backed rollouts.
  • verl (ByteDance 2024) and Verifiers — modern RLHF/RLVR frameworks (two separate projects).

7. Direct Preference Optimization (DPO)

DPO (Rafailov et al. 2023, NeurIPS) was the field-shaking paper of 2023. The observation: the KL-regularized RL problem that PPO approximates has a closed-form optimal policy. Inverting that relation expresses the reward in terms of the policy and the reference model, and substituting it into the Bradley-Terry preference loss eliminates both the explicit reward model and the PPO loop.

Concretely, DPO trains the policy π_θ directly on (prompt x, chosen y_w, rejected y_l) triples by minimizing

L_DPO(θ) = − E [ log σ( β · log(π_θ(y_w | x) / π_ref(y_w | x)) − β · log(π_θ(y_l | x) / π_ref(y_l | x)) ) ]

where π_ref is the frozen SFT model and β is the KL coefficient (typically 0.01 to 0.5). One loss, no RM, no rollouts, no PPO.
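
The loss above is short enough to write out directly. The sketch below works on sequence-level log-probabilities, i.e. the sum of token log-probs of each response under the policy and under the frozen reference; how those are computed is left out, and the function and argument names are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple.

    pi_logp_* are log pi_theta(y | x); ref_logp_* are log pi_ref(y | x).
    """
    chosen_logratio = pi_logp_w - ref_logp_w
    rejected_logratio = pi_logp_l - ref_logp_l
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))
```

At initialization the policy equals the reference, both log-ratios are zero, and the loss is log 2 for every triple; training lowers it by widening the gap between the chosen and rejected log-ratios.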

DPO is dramatically simpler than PPO, more stable, and competitive on most chat-quality benchmarks. It became the mainstream open-source preference-tuning method through 2024 and remains so today. Llama 3 (Meta 2024), Zephyr (HuggingFace 2023), and most open chat models use DPO or a close variant.

Caveats: DPO can over-shrink rejected probabilities (often dragging the chosen response’s likelihood down with them), tends to exploit response length, and is sensitive to the quality of π_ref. The slate of variants below addresses these.


8. Variants of preference optimization

The post-DPO field has produced a steady stream of refinements.

  • IPO — Identity Preference Optimization (Azar et al. 2023, DeepMind) — replaces the log-sigmoid in DPO with an L2 loss against a target preference margin, regularizing against overfit to noisy preferences.
  • KTO — Kahneman-Tversky Optimization (Ethayarajh et al. 2024) — drops the pairwise structure and uses only binary thumbs-up / thumbs-down labels, with a loss derived from prospect theory. Useful when you have labels but no paired comparisons (e.g., production thumbs-up data).
  • ORPO — Odds-Ratio Preference Optimization (Hong et al. 2024) — fuses SFT and preference optimization into a single stage with one loss; no reference model needed. Strong empirical results, fashionable in 2024-25 OSS recipes.
  • SimPO — Simple Preference Optimization (Meng et al. 2024) — removes the reference model entirely, using a length-normalized margin objective. Lower memory, comparable quality.
  • CPO — Contrastive Preference Optimization (Xu et al. 2024) — combines DPO with an SFT term on the chosen response.
  • NCA — Noise Contrastive Alignment (Chen et al. 2024) — re-derives preference optimization from noise-contrastive estimation; better-behaved gradients than DPO on noisy data.
  • GRPO — Group Relative Policy Optimization (Shao et al. 2024, DeepSeek) — a PPO variant that estimates the baseline as the mean reward over a group of K completions from the same prompt, eliminating the value/critic network entirely. Halves the memory footprint of PPO and is the algorithm behind DeepSeek-Math and DeepSeek-R1.
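
The group-relative baseline in GRPO can be sketched as follows. Subtracting the group mean is what the bullet above describes; dividing by the group standard deviation is an additional normalization recalled from the DeepSeekMath paper and is an assumption here, as are the function name and the eps value.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Advantages for K completions sampled from the same prompt.

    Each completion's advantage is its reward minus the group mean,
    scaled by the group standard deviation (assumed normalization).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, which is why prompts that are always solved or never solved add little training signal.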

Practical guidance: for general chat alignment, DPO or ORPO are the standard starting points. For verifiable-reward reasoning tasks (math, code), GRPO is the workhorse. KTO is the right tool when you only have unpaired binary feedback.


9. Constitutional AI

Constitutional AI (Bai et al. 2022, Anthropic) is a two-phase recipe that replaces most of the human-labeled preference work with model self-critique against a written constitution.

SL phase (Critique-Revise). Given a prompt that might elicit harmful output, sample an initial response from a helpful-only model. Then prompt the same model to critique its response against a constitutional principle (e.g., “identify ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal”). Then prompt it to revise. The revised response replaces the original in the SFT corpus. Iterate.

RL phase (RLAIF — RL from AI Feedback). Generate paired responses to prompts; use the model itself (or a separate judge model) to rank each pair against constitutional principles; train an RM on these AI preferences and run PPO against it, or apply DPO directly to the AI-labeled pairs.

Anthropic published a constitution that draws on the UN Universal Declaration of Human Rights, Apple’s Terms of Service, DeepMind’s Sparrow rules, and a set of principles written in-house. The exact text matters less than the process. The technique dramatically reduces the human label cost of safety tuning and produces models that can articulate why they refused.

Subsequent work (collective Constitutional AI, Anthropic 2023) crowdsourced the constitution itself.


10. RLAIF and LLM-as-judge

RLAIF (RL from AI Feedback) generalizes Constitutional AI’s RL phase: use a capable model in place of a human annotator to produce preference labels. Lee et al. 2023 (Google) showed RLAIF reaches comparable quality to RLHF on summarization at a fraction of the labeling cost.

Caveat: bias compounding. The judge model’s biases (verbosity, position, self-preference, formality preference) propagate into the RM and then into the trained policy. Mitigations include using a stronger judge than the policy, randomizing pair order, multiple judges with majority vote, and length-normalizing rewards.

LLM-as-judge for evaluation. Distinct from RLAIF but related: using a strong LM (typically GPT-4 or Claude) to compare two model outputs and pick a winner is now standard for cheap, fast benchmarking (MT-Bench, AlpacaEval). Zheng et al. 2023 (LMSYS) showed LLM-judge agreement with humans is roughly 80%, comparable to inter-human agreement.


11. Reasoning RL — the 2024-25 step change

Through 2024 and into 2025, the highest-impact post-training development was applying RL with verifiable rewards (RLVR) to elicit chain-of-thought reasoning in math, code, and proof.

OpenAI o1 (announced Sep 2024, codenamed Strawberry and earlier rumored as Q*) was the first widely deployed reasoning model. Google Gemini 2.0 Flash Thinking followed in Dec 2024. DeepSeek-R1 (released Jan 2025) was the first open-weights reasoning model to reach comparable benchmark results, and it shipped with a technical report describing the recipe. Anthropic Claude 3.7 Sonnet (Feb 2025) and OpenAI o3 (Apr 2025) extended the approach.

Recipe (DeepSeek-R1).

  1. Cold-start SFT on a small set of high-quality chain-of-thought traces.
  2. GRPO with rule-based verifiable rewards — math problems with known answers and code problems checked by compiling and running test cases. No reward model. The model is rewarded only for producing a correct final answer (and lightly for format).
  3. The model spontaneously learns to produce long internal chains of thought, try multiple approaches, backtrack, and self-correct.
  4. A second SFT + RLHF pass on the resulting model to add general helpfulness, safety, and format compliance.
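
A rule-based reward of the kind used in step 2 can be as simple as the sketch below. Everything specific in it is hypothetical: the \boxed{...} answer format, the exact-string comparison, and the 1.0 / 0.1 reward weights are illustrative choices, not the values used by any published recipe.

```python
import re

# Matches \boxed{...} with no nested braces; nested answers such as
# \boxed{\frac{1}{2}} are deliberately out of scope for this sketch.
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None."""
    matches = BOXED_PATTERN.findall(text)
    return matches[-1].strip() if matches else None

def rule_based_reward(completion, reference_answer, format_bonus=0.1):
    """Small bonus for a parseable answer, plus 1.0 if it matches the reference."""
    answer = extract_boxed(completion)
    if answer is None:
        return 0.0
    reward = format_bonus
    if answer == reference_answer.strip():
        reward += 1.0
    return reward
```

Real verifiers usually normalize answers (numeric equivalence, units, whitespace) before comparing, and code rewards run the completion against unit tests in a sandbox instead of matching strings.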

Process reward models (PRM) vs outcome reward models (ORM). PRMs (Lightman et al. 2023, OpenAI — “Let’s Verify Step by Step”) score each reasoning step; ORMs score only the final answer. PRMs are more sample-efficient but require step-level labels; outcome labels are cheap to get from automated verifiers. DeepSeek-R1 showed that outcome-only rewards from rule-based verifiers, applied at sufficient scale, work surprisingly well.

Test-time compute scaling (long chain-of-thought, self-consistency, best-of-N, MCTS, tree-of-thought) trades inference compute for accuracy and is the deployment counterpart to reasoning RL training.
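
Two of the simplest test-time strategies, self-consistency and best-of-N, fit in a few lines. The sketch assumes the sampled completions (or their extracted final answers) are already available; score_fn stands in for a reward model or verifier and is a hypothetical callable.

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over the final answers of N sampled reasoning chains.

    Entries that are None (no parseable answer) are ignored.
    """
    counts = Counter(a for a in final_answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def best_of_n(candidates, score_fn):
    """Return the candidate that score_fn rates highest."""
    return max(candidates, key=score_fn)
```

Self-consistency needs no scorer but only works when answers can be compared for equality; best-of-N works on free-form text but is only as good as the scorer.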


12. Distillation

Distillation trains a smaller “student” model to imitate a larger “teacher” model. Three flavors:

  • Hard-label distillation — generate (prompt, response) pairs from the teacher and SFT the student on them. This is what Alpaca, Vicuna, Orca, WizardLM, and most modern “small open” models did. Simple, effective, and the basis of the entire OSS distillation ecosystem.
  • Soft-label / response-based distillation — for each token, match the teacher’s full probability distribution (or a temperature-softened version) rather than just the argmax. Captures more information per example but requires logit access. Hinton et al. 2015 introduced the soft-label idea.
  • Feature distillation — match intermediate representations. DistilBERT (Sanh et al. 2019) is an early example: it adds a cosine loss on hidden states to its soft-label loss.
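
The soft-label flavor can be sketched for a single token position as the KL divergence between temperature-softened teacher and student distributions. The T² scaling follows the convention in Hinton et al. 2015; the function names and the default temperature are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """T^2 * KL(teacher || student) over one token's vocabulary distribution."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return (temperature ** 2) * kl
```

Hard-label distillation is the special case where the teacher distribution is replaced by a one-hot vector on the sampled token, which is why it needs only text and no logit access.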

Modern examples: DistilBERT, DistilGPT2 (HuggingFace 2019); LLaMA 3.2 1B/3B (Meta 2024), pruned from Llama 3.1 8B and distilled from the 8B and 70B models; Gemma 2 2B/9B (Google 2024) distilled using on-policy distillation against a larger teacher; MiniLM, Phi-3 mini (Microsoft). Frontier-lab distillation of their own large models into deployable small models is now standard.

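Legal note: distilling from commercial APIs often violates their ToS, since many providers prohibit using outputs to train competing models. Distilling permissively-licensed open models is generally fine, subject to the license terms.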


13. Mixture-of-Experts fine-tuning

MoE models (Mixtral 8x7B, Mixtral 8x22B, DeepSeek-V2/V3, Qwen MoE, Snowflake Arctic, Databricks DBRX, Grok-1) sparsely activate only a few experts per token. Fine-tuning them is harder than dense models:

  • Routing collapse — fine-tuning can cause all tokens to route to a small subset of experts. Auxiliary load-balancing loss is critical.
  • Expert specialization — domain fine-tuning sometimes benefits from fine-tuning only a subset of experts.
  • LoRA on MoE — apply LoRA to the router, to expert FFNs, or both. Support for MoE-aware LoRA varies by library and model architecture, so check before committing to a recipe.
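
One widely used form of the auxiliary load-balancing loss is the Switch-Transformer-style term N · Σ_i f_i · P_i, where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i. The sketch below assumes top-1 routing and is one possible formulation, not the loss of any specific model listed above; in training it is multiplied by a small coefficient and added to the main loss.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss for top-1 routing (assumed formulation).

    router_probs: one probability vector over experts per token.
    assignments:  index of the expert each token was dispatched to.
    Returns 1.0 for perfectly uniform routing and num_experts for full collapse.
    """
    n_tokens = len(router_probs)
    # f[i]: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / n_tokens
    # P[i]: mean router probability mass given to expert i.
    P = [sum(p[i] for p in router_probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))
```

Monitoring f during fine-tuning is a cheap way to catch routing collapse early, before it shows up in eval metrics.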

DeepSeek-V3 (Dec 2024) and the subsequent reasoning RL that produced R1 demonstrated that MoE models hold up well through post-training, but the engineering effort is higher than for dense models.


14. Continued pretraining

Before SFT, you can do an additional pass of next-token-prediction training on domain-specific raw text. This is called continued pretraining or domain-adaptive pretraining.

Examples:

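  • Meditron (EPFL 2023), PMC-LLaMA — continued PT of Llama-family models on biomedical literature. (BioMedLM was trained from scratch on PubMed, and Med-PaLM adapts Flan-PaLM with instruction prompt tuning, so neither is continued PT in the strict sense.)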
  • Code LLaMA (Meta 2023) — Llama 2 continued PT on code.
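  • DeepSeek-Coder-V2 — continued PT of a DeepSeek-V2 checkpoint on code and math. (The original DeepSeek-Coder and StarCoder were trained from scratch on code.)
  • BloombergGPT — finance; trained from scratch on a mix of financial and general text, so it is a domain-pretraining example, not continued PT.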
  • Llemma (EleutherAI 2023) — math.

Recipe: typically 50B-500B tokens of domain data, learning rate around 1/10 of the original pretrain peak, optionally mixed with general data to avoid catastrophic forgetting (usually 10-30% general data in the mix).

Continued pretraining followed by SFT followed by preference tuning is the full pipeline for high-stakes domain-specific deployments.


15. Safety and red-teaming

Safety tuning is not optional for any consumer-facing deployment. It is typically interleaved with the helpfulness post-training above.

Datasets.

  • Anthropic HH-RLHF — paired helpful + harmless preference data.
  • PKU SafeRLHF — safety-annotated preference data.
  • HarmBench (Mazeika et al. 2024) — standardized harmful-behavior eval.
  • AdvBench (Zou et al. 2023) — adversarial prompt set.
  • JailbreakBench (Chao et al. 2024) — jailbreak evaluation suite.
  • ToxiGen (Hartvigsen et al. 2022) — generated toxic statements eval.

Techniques.

  • Refusal SFT — explicit (harmful prompt, polite refusal with reasoning) pairs.
  • Constitutional / RLAIF on harmlessness principles.
  • Preference tuning on safety pairs — DPO or PPO where, for a harmful prompt, the refusal is the chosen response and the harmful completion is the rejected one.
  • Circuit breakers / representation engineering (Zou et al. 2024) — train the model so that internal representations associated with harmful outputs are rerouted, which interrupts generation instead of relying on a learned refusal. More robust to jailbreaks than refusal SFT alone.

The refusal balance. Over-refusal (refusing benign requests like “how do I kill a python process”) is a real failure mode and is measured by benchmarks like XSTest (Röttger et al. 2024) and OR-Bench (Cui et al. 2024). The trade-off between under-refusal and over-refusal is fundamental and cannot be eliminated, only positioned.

Governance scaffolding.

  • Model cards (Mitchell et al. 2019) — published transparency artifacts describing intended use, evaluations, and limitations.
  • Anthropic Responsible Scaling Policy — capability thresholds (ASL levels) that trigger safety reviews.
  • OpenAI Preparedness Framework — analogous capability thresholds and mitigations.
  • DeepMind Frontier Safety Framework (FSF) — capability tiers + safety mitigations.
  • EU AI Act + NIST AI RMF + UK AISI evaluations — external regulatory layer.

16. Evaluation

Post-training is impossible to do well without measurement. The benchmark landscape is fragmented and rapidly contaminating; treat any single number skeptically.

16.1 Automated knowledge + reasoning

  • MMLU (Hendrycks et al. 2021) — 57-task multiple-choice exam. Saturated on frontier models (>90% accuracy by 2024); contamination is widespread.
  • MMLU-Pro (Wang et al. 2024) — harder MMLU with 10 options instead of 4 and harder questions. The 2024-25 default knowledge benchmark.
  • BIG-Bench / BIG-Bench Hard (BBH) (Srivastava et al. 2022) — 200+ diverse tasks; BBH is the 23-task hard subset.
  • GPQA-Diamond (Rein et al. 2023) — graduate-level physics/chem/bio multiple-choice, designed to resist contamination. The current default reasoning benchmark.
  • MATH (Hendrycks et al. 2021) — 12.5k competition math problems.
  • GSM8K (Cobbe et al. 2021) — 8.5k grade-school math word problems. Saturated.
  • HumanEval (Chen et al. 2021, OpenAI) — 164 Python coding problems. Saturated.
  • MBPP (Austin et al. 2021) — Python programming benchmark.
  • SWE-Bench / SWE-Bench Verified (Jimenez et al. 2024) — real GitHub issues; the leading agentic-coding benchmark.
  • BFCL — Berkeley Function-Calling Leaderboard (Yan et al. 2024) — tool use eval.
  • HELM (Liang et al. 2022, Stanford) — holistic eval suite.
  • Open LLM Leaderboard (HuggingFace) — community-run aggregate.

16.2 Chat quality

  • MT-Bench (Zheng et al. 2023, LMSYS) — 80 multi-turn prompts judged by GPT-4.
  • AlpacaEval / AlpacaEval 2 (Li et al. 2023; length-controlled variant Dubois et al. 2024, Stanford) — pairwise judged comparison with length-controlled winrates.
  • Chatbot Arena / LMSYS Arena → lmarena.ai — crowdsourced pairwise comparisons producing an Elo leaderboard. The single most influential chat benchmark; the LMSYS rebrand to lmarena.ai happened in 2024.
  • Arena Hard (Li et al. 2024) — hard subset of arena prompts with an LLM judge for cheap iteration.

16.3 Safety + truthfulness

  • HarmBench — refusal of harmful behaviors.
  • TruthfulQA (Lin et al. 2022) — adversarial truthfulness eval.
  • BBQ — Bias Benchmark for QA (Parrish et al. 2022) — social-bias evaluation.
  • ToxiGen — toxic-generation evaluation.
  • XSTest, OR-Bench — over-refusal evaluation.

16.4 Domain

  • MedQA, USMLE, MultiMedQA — medical.
  • LegalBench — legal reasoning.
  • FinanceBench — finance.

17. Synthetic data generation

Almost all modern post-training pipelines rely heavily on synthetic data — data generated by another model (or by the model itself) rather than written by humans. Major techniques:

  • Self-Instruct (Wang et al. 2022) — seed a strong LM with a few human-written instructions, ask it to generate more.
  • Alpaca / WizardLM / Orca — distillation from a stronger teacher.
  • Evol-Instruct (WizardLM) — recursively rewrite prompts to be harder.
  • Persona-driven generation — Anthropic and Microsoft have published on prompting a teacher to adopt many persona / style axes to produce diverse outputs.
  • Rejection sampling self-distillation — sample many completions from the model itself, score with an RM or verifier, keep the top-k, SFT on those. Used in Llama 2/3 and many frontier recipes.
  • Constitutional / critique-revise — see section 9.
  • Verifier-filtered code/math — sample, run the verifier, keep correct. Backbone of the reasoning RL data flywheel.
  • MAmmoTH, MetaMathQA, OpenMathInstruct — synthetic math corpora.
  • Magpie — self-synthesize instructions by prompting an aligned model with only the chat-template start tokens.
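
Rejection-sampling self-distillation and verifier filtering share the same loop: sample, score, keep the best. The sketch below is a minimal version in which generate_fn and score_fn are hypothetical callables standing in for the model's sampler and for an RM or verifier.

```python
def rejection_sample(prompts, generate_fn, score_fn, n_samples=8, keep_top_k=1):
    """Build an SFT set from the highest-scoring samples per prompt.

    generate_fn(prompt) -> completion string (hypothetical sampler).
    score_fn(prompt, completion) -> float (hypothetical RM or verifier score).
    """
    sft_pairs = []
    for prompt in prompts:
        candidates = [generate_fn(prompt) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda c: score_fn(prompt, c), reverse=True)
        sft_pairs.extend((prompt, completion) for completion in ranked[:keep_top_k])
    return sft_pairs
```

With a binary verifier, a variant keeps every completion whose score passes instead of a fixed top-k, and drops prompts with no passing sample.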

Synthetic data flywheels are now arguably the dominant input to frontier post-training; humans curate, design templates, and audit rather than write data line by line.


18. Frameworks and tools

The OSS post-training ecosystem in 2025-26 is dense. The most-used:

  • HuggingFace TRL — SFT, PPO, DPO, GRPO, KTO, ORPO, IPO. Integrated with transformers, peft, accelerate, bitsandbytes. The default choice for almost everything OSS.
  • Axolotl (OpenAccess-AI-Collective) — config-driven YAML wrapper around TRL + DeepSpeed; great for reproducible runs.
  • LLaMA-Factory (Zheng et al. 2024, Beihang University) — popular toolkit with a web GUI, supports nearly every method.
  • Unsloth — custom Triton kernels for LoRA / QLoRA giving 2-5x speedups and 60-80% memory reduction. The hobbyist favorite.
  • DeepSpeed-Chat (Microsoft) — full SFT + RM + PPO pipeline with ZeRO.
  • Megatron-LM + Megatron-Core (NVIDIA) — large-scale tensor + pipeline parallel; the backbone of most frontier-lab training stacks.
  • NVIDIA NeMo Aligner / NeMo Curator / NeMo Skills — RLHF + data curation on Megatron.
  • OpenRLHF — Ray-based scalable RLHF framework (PPO, GRPO, and related algorithms) with vLLM-backed generation.
  • verl (ByteDance 2024) — modern PPO/GRPO framework with hybrid engine (vLLM for rollouts + Megatron/FSDP for training).
  • TorchTune (PyTorch official) — Meta’s reference fine-tuning stack.
  • PEFT (HuggingFace) — adapter library.
  • bitsandbytes — 8-bit and 4-bit quantization primitives.

Managed / API:

  • OpenAI fine-tuning API — managed SFT and DPO.
  • Anthropic fine-tuning — managed (in private preview / limited access for Claude variants).
  • Google Vertex fine-tuning — managed for Gemini.
  • Together AI, Fireworks, Modal, Anyscale, Lambda, RunPod, Replicate — managed GPU compute for OSS training.
  • Predibase, LoRAX — managed LoRA fine-tuning and multi-adapter serving.

19. Compute scale 2024-26

Rough order-of-magnitude rented-GPU costs. Numbers move with the spot market and with kernel-level efficiency improvements (Unsloth, FlashAttention-3, torch.compile), so treat as ballpark.

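  • SFT a 7B model on 100k examples — full fine-tune at bf16, ~12-24 hours on a single A100 80GB or H100 when paired with memory savers (8-bit or paged optimizer, gradient checkpointing, or CPU offload), since weights, gradients, and standard AdamW state alone exceed 80 GB; faster on a multi-GPU node with ZeRO / FSDP. ~$100-300.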
  • LoRA on a 7B model — ~$20-60.
  • LoRA / QLoRA on a 70B model — QLoRA (4-bit base weights) fits on a single H100 80GB; bf16 LoRA needs 2-4x A100 80GB. 1-3 days either way. ~$200-1000.
  • Full fine-tune 70B model — 8-32 H100s for several days. ~$5k-20k.
  • Full fine-tune 405B model — large cluster needed. ~$100k-500k.
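  • DPO is comparable to SFT in order of magnitude, but needs more memory and compute per example: a frozen reference model sits alongside the policy, and each step scores both the chosen and the rejected response.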
  • PPO RLHF on a 70B — 3-10x SFT cost due to rollouts + RM + reference model. ~$50k-500k+ for a frontier-style run.
  • Reasoning RL on a 70B — heavy rollout cost (long chains of thought, many samples per prompt). ~$200k-2M+.
  • Frontier-lab reasoning RL (o1, R1, Gemini Thinking) — budgets are mostly undisclosed; estimates run to tens of millions of dollars of compute, in some cases approaching the cost of the original pretrain.
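
The figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes mixed-precision AdamW (2 bytes each for bf16 weights and gradients, 12 bytes per parameter for an fp32 master copy plus two Adam moments), ignores activations entirely, and uses a hypothetical $3 per GPU-hour rental rate. The function names are illustrative, not from any library.

```python
GPU_MEM_GB = 80.0  # A100 80GB / H100 class card


def full_finetune_state_gb(
    params_b: float,
    weight_bytes: float = 2.0,   # bf16 weights
    grad_bytes: float = 2.0,     # bf16 gradients
    optim_bytes: float = 12.0,   # assumed: fp32 master weights + 2 fp32 Adam moments
) -> float:
    """Weights + gradients + optimizer state in GB; activations are NOT included."""
    # params in billions * bytes per param = gigabytes (1e9 bytes)
    return params_b * (weight_bytes + grad_bytes + optim_bytes)


def quantized_base_gb(params_b: float, bits: int = 4) -> float:
    """Frozen base weights for QLoRA-style training, ignoring quantization constants."""
    return params_b * bits / 8


def run_cost_usd(n_gpus: int, hours: float, usd_per_gpu_hour: float) -> float:
    """Rented-GPU cost: GPUs x wall-clock hours x hourly rate."""
    return n_gpus * hours * usd_per_gpu_hour


if __name__ == "__main__":
    state_7b = full_finetune_state_gb(7)
    print(f"7B full fine-tune state: {state_7b:.0f} GB "
          f"(fits in one {GPU_MEM_GB:.0f} GB GPU: {state_7b <= GPU_MEM_GB})")
    print(f"70B 4-bit base weights: {quantized_base_gb(70):.0f} GB")
    # Hypothetical run: 16 GPUs for 4 days at an assumed $3 per GPU-hour.
    print(f"70B full fine-tune, 16 GPUs x 96 h: ${run_cost_usd(16, 96, 3.0):,.0f}")
```

Under these assumptions a 7B full fine-tune carries about 112 GB of state before activations, which is why single-GPU runs lean on 8-bit optimizers or offload, while a 4-bit 70B base is about 35 GB and leaves room for adapters and activations on one 80 GB card.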

Storage and labeling are smaller line items but not negligible: a high-quality preference dataset of 100k pairs from professional annotators is typically $50k-500k depending on domain.


20. Common pitfalls

A non-exhaustive list of footguns that show up in every team’s first post-training run.

  • Catastrophic forgetting. Heavy SFT on a narrow domain destroys general capability. Mitigation: mix 10-30% general data into domain SFT; use lower learning rates; use LoRA instead of full fine-tune; evaluate general benchmarks (MMLU, GPQA) alongside domain benchmarks.
  • Reward hacking in PPO. The policy finds patterns the RM rewards but humans don’t want. Mitigation: KL regularization; RM ensembling; periodic human spot-checks of policy outputs; cap reward at a percentile.
  • Mode collapse / length bias in DPO. Heavy DPO training can drive responses to a narrow style and a skewed length distribution: sometimes very short, and often needlessly verbose. Mitigation: length-normalized variants (SimPO), higher β (a stronger pull toward the reference), fewer epochs (DPO is usually only 1-2 epochs), or mix in an SFT term (ORPO, CPO).
  • Overfitting to a small SFT set. Loss collapsing below ~0.3 on a small corpus is memorization, not learning. Mitigation: hold out a validation split, watch held-out chat eval rather than train loss, stop early.
  • Eval contamination. Pre-train and SFT corpora often include test sets (MMLU, GSM8K, HumanEval prompts have all been found in web crawls). Mitigation: contamination scans (n-gram match against eval data), use resistant benchmarks (GPQA-Diamond, SWE-Bench Verified), report Arena Elo.
  • Tokenizer mismatch. LLaMA-2 and LLaMA-3 have different tokenizers; Qwen, Mistral, and Gemma all differ. SFT data tokenized with the wrong tokenizer silently produces garbage. Always re-tokenize for the target model.
  • Chat template drift. SFT data must use the exact chat template the model serves with (special tokens, role markers, BOS/EOS). A common bug is to fine-tune in one template and deploy with another, causing the model to emit role markers or refuse to stop generating.
  • Loss masking errors. Forgetting to mask the prompt out of the loss causes the model to learn to predict prompts as well as responses, degrading instruction-following.
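  • β tuned in the wrong direction. Both PPO and DPO have a β / KL coefficient, and in both a larger β keeps the policy closer to the reference while a smaller β lets it drift further in pursuit of reward. Tune it as the first hyperparameter, not the last; it dominates everything else.
  • Truncated / unpacked sequences. Without sequence packing on right-skewed instruction data, you waste 2-4x compute on padding. Separately, check that the max-length cutoff is not silently truncating responses and dropping their EOS token.
  • Reference model drift in DPO. π_ref must be the exact SFT checkpoint the policy is initialized from (and, for on-policy preference data, the one that sampled the responses), not a stale snapshot.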
  • Mixing chat and base models. Fine-tuning a chat-tuned model further with SFT data formatted for a base model produces incoherent outputs.
  • Insufficient eval. Training without a held-out preference eval, a general-capability eval, and a safety eval is flying blind. Single-number benchmarks lie; trust deltas across many benchmarks.
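
Three of the data-side pitfalls above (loss masking, padding waste, eval contamination) are cheap to check before a run. The sketch below is illustrative only: the function names are hypothetical, documents are assumed to be pre-tokenized lists, and the contamination scan is a plain n-gram overlap rather than a production deduplication pipeline.

```python
from typing import List, Sequence, Set, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_prompt_labels(input_ids: Sequence[int], prompt_len: int) -> List[int]:
    """Labels for one (prompt, response) example: prompt positions are ignored by the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


def padding_waste(lengths: Sequence[int], batch_size: int) -> float:
    """Tokens processed / real tokens when each batch is padded to its longest sequence."""
    padded = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        padded += max(batch) * len(batch)
    return padded / sum(lengths)


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence (empty if the sequence is shorter than n)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    train_docs: Sequence[Sequence[str]],
    eval_docs: Sequence[Sequence[str]],
    n: int = 8,
) -> float:
    """Fraction of eval documents that share at least one n-gram with the training set."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for doc in eval_docs if ngrams(doc, n) & train_grams)
    return hits / len(eval_docs)


if __name__ == "__main__":
    print(mask_prompt_labels([5, 6, 7, 8, 9], prompt_len=3))
    print(round(padding_waste([10, 100, 20, 90], batch_size=2), 2))
    train = ["the quick brown fox jumps over the lazy dog today".split()]
    evals = [
        "quick brown fox jumps over the lazy dog".split(),
        "completely unrelated question about matrices and eigenvalues here ok".split(),
    ]
    print(contamination_rate(train, evals, n=5))
```

A `padding_waste` value well above 1 on a sample of the training set is the signal to turn on packing or length-bucketed batching, and any nonzero `contamination_rate` against a reported benchmark deserves a manual look at the matching documents.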

21. Cross-references


22. Citations

  • Ouyang et al. 2022, NeurIPS — Training language models to follow instructions with human feedback (InstructGPT).
  • Stiennon et al. 2020, NeurIPS — Learning to summarize from human feedback.
  • Christiano et al. 2017, NeurIPS — Deep reinforcement learning from human preferences.
  • Schulman et al. 2017 — Proximal Policy Optimization Algorithms (PPO).
  • Schulman et al. 2016, ICLR — High-Dimensional Continuous Control Using Generalized Advantage Estimation (GAE).
  • Rafailov et al. 2023, NeurIPS — Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO).
  • Azar et al. 2023, DeepMind — A General Theoretical Paradigm to Understand Learning from Human Preferences (IPO).
  • Ethayarajh et al. 2024 — KTO: Model Alignment as Prospect Theoretic Optimization.
  • Hong et al. 2024 — ORPO: Monolithic Preference Optimization without Reference Model.
  • Meng et al. 2024 — SimPO: Simple Preference Optimization with a Reference-Free Reward.
  • Shao et al. 2024, DeepSeek — DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO).
  • DeepSeek-AI 2025 — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
  • Bai et al. 2022, Anthropic — Constitutional AI: Harmlessness from AI Feedback.
  • Bai et al. 2022, Anthropic — Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (HH-RLHF).
  • Lee et al. 2023, Google — RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback.
  • Hu et al. 2021, Microsoft, ICLR 2022 — LoRA: Low-Rank Adaptation of Large Language Models.
  • Dettmers et al. 2023, NeurIPS — QLoRA: Efficient Finetuning of Quantized LLMs.
  • Liu et al. 2024, NVIDIA — DoRA: Weight-Decomposed Low-Rank Adaptation.
  • Zhang et al. 2023 — AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning.
  • Lester et al. 2021 — The Power of Scale for Parameter-Efficient Prompt Tuning.
  • Li & Liang 2021 — Prefix-Tuning: Optimizing Continuous Prompts for Generation.
  • Liu et al. 2022 — P-Tuning v2.
  • Liu et al. 2022 — Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning ((IA)^3).
  • Touvron et al. 2023, Meta — Llama 2: Open Foundation and Fine-Tuned Chat Models (RLHF details).
  • Wei et al. 2022 — Finetuned Language Models Are Zero-Shot Learners (FLAN).
  • Wang et al. 2022 — Self-Instruct: Aligning Language Models with Self-Generated Instructions.
  • Taori et al. 2023, Stanford — Alpaca.
  • Xu et al. 2023 — WizardLM: Empowering Large Language Models to Follow Complex Instructions.
  • Mukherjee et al. 2023, Microsoft — Orca: Progressive Learning from Complex Explanation Traces of GPT-4.
  • Cui et al. 2023, Tsinghua — UltraFeedback.
  • Zhou et al. 2023 — LIMA: Less Is More for Alignment.
  • Lightman et al. 2023, OpenAI — Let’s Verify Step by Step (PRM).
  • Bradley & Terry 1952 — Rank Analysis of Incomplete Block Designs.
  • Hinton et al. 2015 — Distilling the Knowledge in a Neural Network.
  • Sanh et al. 2019 — DistilBERT.
  • Hendrycks et al. 2021 — MMLU and MATH.
  • Cobbe et al. 2021, OpenAI — GSM8K.
  • Chen et al. 2021, OpenAI — HumanEval.
  • Rein et al. 2023 — GPQA.
  • Wang et al. 2024 — MMLU-Pro.
  • Jimenez et al. 2024 — SWE-Bench.
  • Liang et al. 2022, Stanford — HELM.
  • Zheng et al. 2023, LMSYS — MT-Bench, Chatbot Arena, LLM-as-judge. Arena rebranded to lmarena.ai in 2024.
  • Dubois et al. 2023, Stanford — AlpacaEval.
  • Mazeika et al. 2024 — HarmBench.
  • Zou et al. 2023 — AdvBench.
  • Chao et al. 2024 — JailbreakBench.
  • Lin et al. 2022 — TruthfulQA.
  • Mitchell et al. 2019 — Model Cards.
  • von Werra et al. 2020+, HuggingFace — TRL library.