Reinforcement Learning Theory — Math Reference
1. At a glance
Reinforcement learning (RL) is the formal study of agents that learn to maximize a cumulative scalar reward signal through repeated interaction with an environment. Unlike supervised learning (which fits a function to labeled examples) or unsupervised learning (which discovers structure in unlabeled data), RL is defined by a closed loop: the agent’s actions shape the distribution of future states it sees, and the only ground truth is a scalar reward — usually sparse, often delayed, and rarely fully informative.
This note covers the mathematical theory of RL: Markov decision processes, Bellman equations, dynamic programming, temporal-difference learning, Q-learning, policy gradients, actor-critic methods, model-based RL, exploration, offline RL, imitation learning, and the connection to RLHF. Companion notes:
[[Robotics/rl-for-control]]— applications-focused (continuous control, sim-to-real, robot learning).[[Compute/fine-tuning-rlhf]]— the LLM application (reward modeling, PPO/DPO/GRPO on language models).[[Math/probability-fundamentals]]— the probabilistic substrate (expectations, conditional probability, Markov chains).[[Math/convex-optimization]]— when policies become parametric the loss landscape matters.
The single best entry point to the field remains Sutton + Barto’s Reinforcement Learning: An Introduction (2nd ed., 2018, free PDF online); the more measure-theoretic / optimization-flavored reference is Bertsekas Reinforcement Learning and Optimal Control (2019, 4 vols).
2. MDP (Markov Decision Process)
The standard mathematical object is the Markov decision process, a tuple (S, A, P, R, γ):
- S — state space. Discrete (e.g., grid cell), continuous (e.g., joint angles), or structured (e.g., images, graphs). Tabular RL assumes |S| finite; deep RL approximates over arbitrary S.
- A — action space. Discrete (Atari joystick), continuous (motor torques), or hybrid (e.g., select tool + continuous parameters).
- P(s’ | s, a) — transition kernel. A probability distribution over next states given current state-action. Encodes environment dynamics.
- R(s, a) (or R(s, a, s’)) — reward function. Scalar feedback received after the transition. Usually bounded.
- γ ∈ [0, 1) — discount factor. Trades immediate vs. future reward; γ = 0 is myopic, γ → 1 is far-sighted. Mathematically, γ < 1 guarantees that the infinite-horizon return is finite when R is bounded.
The defining structural assumption is the Markov property: the next state distribution depends only on the current state and action, not on history:
P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, …) = P(S_{t+1} | S_t, A_t).
When this fails (e.g., the agent only observes a noisy projection of true state), the problem becomes a partially observable MDP (POMDP) — adds an observation kernel Ω(o | s) and forces the agent to maintain a belief state (posterior over s). POMDPs are vastly harder; most practical “POMDP” methods either approximate the belief with a recurrent neural network or assume the observation is a Markovian sufficient statistic of the past few frames (Atari frame-stacking, recurrent DQN).
A policy π is a (possibly stochastic) mapping from states to action distributions: π(a | s). The agent’s job is to find a policy that maximizes expected discounted return.
3. Returns and value functions
The return at time t is the discounted sum of future rewards:
G_t = R_{t+1} + γ · R_{t+2} + γ² · R_{t+3} + … = Σ_{k=0}^∞ γ^k · R_{t+k+1}.
Two value functions encode “how good is it to be here / to do this”:
- State-value function V^π(s) = E^π [G_t | S_t = s] — expected return starting from state s and following policy π.
- Action-value function Q^π(s, a) = E^π [G_t | S_t = s, A_t = a] — expected return after taking action a in state s, then following π.
Their relationship: V^π(s) = Σ_a π(a | s) · Q^π(s, a).
The advantage function A^π(s, a) = Q^π(s, a) − V^π(s) measures how much better action a is than the average action under π. Advantage estimation is the central trick in modern policy gradients (it reduces variance without introducing bias — the baseline subtracts a state-dependent constant that has zero expected gradient).
An optimal policy πmaximizes V^π(s) at every s simultaneously (it exists for every finite MDP). Its value functions V = V^{π*} and Q* = Q^{π*} are unique even though π* itself need not be unique.
4. Bellman equations
The defining recursion of RL. For any policy π:
Bellman expectation equation for V: V^π(s) = Σ_a π(a | s) · [R(s, a) + γ · Σ_{s’} P(s’ | s, a) · V^π(s’)].
Bellman expectation for Q: Q^π(s, a) = R(s, a) + γ · Σ_{s’} P(s’ | s, a) · Σ_{a’} π(a’ | s’) · Q^π(s’, a’).
For the optimal policy (Bellman optimality equations):
V*(s) = max_a [R(s, a) + γ · Σ_{s’} P(s’ | s, a) · V*(s’)] = max_a Q*(s, a). Q*(s, a) = R(s, a) + γ · Σ_{s’} P(s’ | s, a) · max_{a’} Q*(s’, a’).
These are self-consistency conditions, not algorithms. The optimal-V equation is a fixed-point equation for the Bellman optimality operator T*:
(T* V)(s) = max_a [R(s, a) + γ · Σ_{s’} P(s’ | s, a) · V(s’)].
Tis a γ-contraction in the sup norm: ||T V − T*W||∞ ≤ γ · ||V − W||∞. By the Banach fixed-point theorem, V is its unique fixed point, and iterating T converges geometrically. This single fact — contraction — is the engine behind both value iteration and the convergence proofs of Q-learning.
5. Dynamic programming
When P and R are known and S is small enough to tabulate, DP solves the MDP exactly. The two classical algorithms (Bellman 1957 Dynamic Programming; Howard 1960 Dynamic Programming and Markov Processes):
Policy iteration
Alternates two steps:
- Policy evaluation — given π, compute V^π by solving the linear system V^π = R_π + γ · P_π · V^π (here R_π and P_π are policy-averaged). Either by direct inversion (O(|S|³)) or by iterating the Bellman expectation operator to convergence.
- Policy improvement — set π’(s) = argmax_a Q^π(s, a). The policy improvement theorem says π’ is at least as good as π everywhere, strictly better somewhere unless π was already optimal.
Iterate until π stabilizes. Converges in a finite number of iterations for finite MDPs (because there are only finitely many deterministic policies).
Value iteration
Skip the full evaluation step and just iterate the Bellman optimality operator:
V_{k+1}(s) = max_a [R(s, a) + γ · Σ_{s’} P(s’ | s, a) · V_k(s’)].
Geometric convergence: ||V_k − V*||∞ ≤ γ^k · ||V_0 − V||*∞. Stop when ||V_{k+1} − V_k||∞ < ε (1 − γ) / γ, which bounds ||V_k − V||*∞ < ε.
Modified / generalized policy iteration
Most practical algorithms (including all of deep RL) live on a spectrum between PI and VI: a few sweeps of evaluation, a step of improvement, repeat. Sutton + Barto call this generalized policy iteration and argue it’s the unifying abstraction.
6. Model-based vs model-free
Two philosophies dominate:
- Model-based RL — learn an approximation of P and R from data, then plan (DP, model-predictive control, Monte Carlo tree search) inside the learned model. Sample-efficient; vulnerable to model bias. AlphaZero, MuZero, Dreamer.
- Model-free RL — never explicitly learn dynamics; directly learn V, Q, or π from interaction. Robust to model errors; sample-hungry. DQN, PPO, SAC.
A growing middle ground uses learned models for short-horizon imagination while still learning a model-free policy: Dyna (Sutton 1991), Dreamer family.
7. Monte Carlo methods
Estimate V^π or Q^π by sampling complete episodes and averaging the observed returns:
V(s) ← (1/N(s)) · Σ_{episodes containing s} G_t.
Pros: unbiased, no bootstrapping bias. Cons: high variance, requires episodic tasks, only updates at episode end. Mostly historical interest now, but reappears in REINFORCE-style policy gradients and in offline batch evaluation.
8. TD learning (Temporal Difference)
The other foundational idea — Sutton’s 1988 paper “Learning to Predict by the Methods of Temporal Differences” introduced bootstrapping: update an estimate using another estimate rather than waiting for the actual return.
TD(0) update for V: V(S_t) ← V(S_t) + α · [R_{t+1} + γ · V(S_{t+1}) − V(S_t)].
The bracketed quantity δ_t = R_{t+1} + γ · V(S_{t+1}) − V(S_t) is the TD error. TD(0) trades MC’s high variance for bootstrapping bias (you update V toward a target that is itself a noisy estimate).
TD(λ) interpolates between TD(0) (λ = 0) and MC (λ = 1) using eligibility traces — a decaying memory of recently visited states that all get updated by the current TD error. Provides a unified view of online and offline returns.
Convergence: TD(0) with linear function approximation under on-policy data converges to a unique fixed point (Tsitsiklis + Van Roy 1997). Off-policy + function approximation + bootstrapping is the deadly triad (Sutton + Barto §11) and can diverge.
9. SARSA
The on-policy TD control algorithm. Named for the quintuple (S, A, R, S’, A’) in its update:
Q(S_t, A_t) ← Q(S_t, A_t) + α · [R_{t+1} + γ · Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)],
where A_{t+1} is drawn from the current behavior policy (usually ε-greedy w.r.t. Q). SARSA converges to the optimal policy under the same conditions as Q-learning if the policy is GLIE (greedy in the limit with infinite exploration). Being on-policy makes it more conservative around dangerous states — it accounts for the exploration noise it will actually use.
10. Q-learning
Watkins’s 1989 Cambridge PhD thesis introduced off-policy TD control:
Q(S_t, A_t) ← Q(S_t, A_t) + α · [R_{t+1} + γ · max_{a’} Q(S_{t+1}, a’) − Q(S_t, A_t)].
The crucial difference from SARSA is max_{a'} instead of Q(S_{t+1}, A_{t+1}) — the target uses the greedy policy, but actions are still chosen by the behavior policy (e.g., ε-greedy). This decouples learning from behavior: you can learn the optimal Q function while exploring with any sufficiently rich behavior policy.
Convergence (Watkins + Dayan 1992, Tsitsiklis 1994): tabular Q-learning converges w.p. 1 to Q* if every (s, a) is visited infinitely often and step sizes α_t satisfy the Robbins–Monro conditions (Σ α_t = ∞, Σ α_t² < ∞).
Q-learning over-estimates action values because of the max operator (Jensen’s inequality on a noisy estimator). This motivates Double Q-learning, Double DQN, and clipped-double critics in TD3/SAC.
11. Function approximation
For anything beyond toy problems, Q (or V or π) must be approximated by a parametric family — historically linear (tile coding, Fourier basis, RBFs), now almost always a deep neural network. The “deep RL” era.
Deep Q-Network (DQN)
Mnih et al. 2013 NIPS Deep Learning Workshop / 2015 Nature — Atari from raw pixels. Two key tricks:
- Experience replay — store transitions (s, a, r, s’) in a buffer, sample minibatches uniformly. Decorrelates updates and reuses data.
- Target network — a second copy Q_θ⁻ of the network, updated every C steps, used to compute the TD target. Stabilizes training by preventing the moving target.
Loss: L(θ) = E[(r + γ · max_{a’} Q_{θ⁻}(s’, a’) − Q_θ(s, a))²].
Double DQN
Van Hasselt, Guez, Silver 2016 — addresses the over-estimation bias by decoupling action selection from evaluation:
target = r + γ · Q_{θ⁻}(s’, argmax_{a’} Q_θ(s’, a’)).
The online network picks the action; the target network evaluates it.
Dueling DQN
Wang et al. 2016 — split the network into a state-value stream V(s) and an advantage stream A(s, a), recombined as Q(s, a) = V(s) + A(s, a) − mean_a A(s, a). Helps when the action choice doesn’t matter much in many states.
Prioritized experience replay
Schaul et al. 2016 — sample transitions with probability proportional to |TD error|^α, with importance-sampling correction. Focuses learning on surprising transitions.
Rainbow
Hessel et al. 2018 — combines Double, Dueling, Prioritized Replay, Multi-step targets, Distributional Q (C51), and NoisyNets. The reference Q-learning baseline.
Distributional RL (C51, QR-DQN, IQN)
Bellemare, Dabney, Munos 2017 A Distributional Perspective on Reinforcement Learning — learn the distribution of returns Z^π(s, a), not just its mean Q^π(s, a). The distributional Bellman operator is also a contraction (in Wasserstein metric). C51 parameterizes Z as a categorical distribution over 51 atoms; QR-DQN uses quantile regression; IQN samples quantiles.
Conservative + Implicit Q-Learning (offline)
For offline / batch RL where exploration isn’t possible:
- CQL (Kumar et al. 2020) — adds a regularizer that pushes Q-values down on out-of-distribution actions, preventing the policy from exploiting Q-function errors.
- IQL (Kostrikov et al. 2021) — never queries Q on out-of-distribution actions at all; uses expectile regression on V and a separate weighted-BC step for policy extraction.
12. Policy gradient methods
Instead of learning Q and acting greedily, directly parameterize π_θ(a | s) and ascend the gradient of the expected return J(θ) = E^{π_θ}[G_0].
Policy gradient theorem
Sutton, McAllester, Singh, Mansour 2000 — despite the dependence of the state distribution on θ, the gradient has a remarkably clean form:
∇_θ J(θ) = E^{π_θ} [Σ_t ∇_θ log π_θ(A_t | S_t) · Q^{π_θ}(S_t, A_t)].
Or equivalently with any baseline b(s) that doesn’t depend on a:
∇_θ J(θ) = E [Σ_t ∇_θ log π_θ(A_t | S_t) · (Q^{π_θ}(S_t, A_t) − b(S_t))].
REINFORCE
Williams 1992 — the canonical Monte Carlo policy gradient:
∇J ≈ Σ_t ∇_θ log π_θ(A_t | S_t) · G_t.
Unbiased but extreme variance — single trajectory returns can be wildly noisy. Subtracting a baseline (often the running mean of G_t, or V̂(S_t)) helps a lot.
Actor-Critic
Use a learned value function (the critic) as the baseline and/or as a substitute for the unbiased G_t. Trades MC’s variance for TD’s bias. The actor updates π_θ; the critic updates V_φ or Q_φ.
A2C / A3C
Mnih et al. 2016 — Asynchronous Methods for Deep Reinforcement Learning. Uses the advantage A(s, a) ≈ r + γ · V(s’) − V(s) as the policy gradient weight. A3C ran many actors asynchronously; A2C is the synchronous variant that empirically performs as well on GPUs.
TRPO (Trust Region Policy Optimization)
Schulman et al. 2015 — large policy updates can collapse performance irreversibly. TRPO constrains the KL between old and new policies:
maximize_θ E [(π_θ(a | s) / π_{θ_old}(a | s)) · A^{π_{θ_old}}(s, a)] subject to E [KL(π_{θ_old}(· | s) || π_θ(· | s))] ≤ δ.
Solved with conjugate gradient + line search. Sound but expensive.
PPO (Proximal Policy Optimization)
Schulman et al. 2017 — the workhorse of modern RL. Replaces the hard KL constraint with a clipped surrogate:
L^CLIP(θ) = E [min(r_t(θ) · Â_t, clip(r_t(θ), 1 − ε, 1 + ε) · Â_t)],
where r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t). Simple SGD, no second-order machinery, multiple epochs of minibatch updates per rollout. PPO with generalized advantage estimation (GAE, Schulman et al. 2016) became the default for both continuous control and RLHF. See [[Compute/fine-tuning-rlhf]] for the LM-fine-tuning use.
DDPG (Deep Deterministic Policy Gradient)
Lillicrap et al. 2015 — for continuous action spaces. Deterministic policy μ_θ(s); critic Q_φ(s, a); off-policy with replay buffer; uses the deterministic policy gradient theorem (Silver et al. 2014):
∇_θ J = E [∇a Q_φ(s, a)|{a = μ_θ(s)} · ∇_θ μ_θ(s)].
Notoriously brittle; sensitive to hyperparameters and exploration noise.
TD3 (Twin Delayed DDPG)
Fujimoto, van Hoof, Meger 2018 — fixes DDPG with three tricks:
- Clipped double critics: two Q-networks, take the min in the target.
- Delayed policy updates: update π less often than Q.
- Target policy smoothing: add noise to the target action.
SAC (Soft Actor-Critic)
Haarnoja et al. 2018 — maximum-entropy actor-critic. Augments the reward with the policy entropy:
J(π) = E [Σ_t (r(s_t, a_t) + α · H(π(· | s_t)))].
The optimal policy under this objective is π*(a | s) ∝ exp(Q*(s, a) / α). SAC alternates soft policy evaluation, soft policy improvement, and (in SAC v2) automatic tuning of α to hit a target entropy. Empirically dominant for continuous control.
13. Maximum entropy RL
Generalizes the reward objective to include policy entropy. Theoretical foundation (Ziebart 2010, Levine 2018 Reinforcement Learning and Control as Probabilistic Inference):
- Encourages exploration without needing ad-hoc bonuses.
- Recovers the soft Bellman equation Q_soft(s, a) = r + γ · E[α · log Σ_{a’} exp(Q_soft(s’, a’) / α)].
- Connects RL to probabilistic inference (control as inference): the optimal policy is the posterior over actions given optimality observations.
- Yields robustness to reward perturbations (the policy doesn’t commit fully to a single mode).
14. Model-based RL
Learn a model of dynamics and use it for planning, imagination, or both.
MuZero
Schrittwieser et al. 2019 — Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Successor to AlphaZero. Learns a latent dynamics model (no need to reconstruct observations) jointly with value, policy, and reward. Plans with MCTS in latent space. Achieves AlphaZero-level performance in Go/chess/shogi (which AlphaZero needed perfect rules for) and SOTA on Atari.
Dreamer / DreamerV3
Hafner et al. 2020 Dream to Control; 2023 DreamerV3 — recurrent world model + actor-critic trained on imagined rollouts in latent space. DreamerV3 (2023) solves a remarkable diversity of tasks (Atari, ProcGen, DMC, Minecraft diamond collection) with a single set of hyperparameters.
PlaNet
Hafner et al. 2019 — recurrent state-space model (RSSM) combining stochastic + deterministic latent dynamics. Foundation for the Dreamer family.
PETS (Probabilistic Ensembles with Trajectory Sampling)
Chua et al. 2018 — ensemble of probabilistic dynamics models; plans with the cross-entropy method (CEM). Strong sample efficiency on MuJoCo.
TD-MPC2
Hansen et al. 2024 — scaling model-based RL with implicit world models + MPC; competitive across hundreds of tasks with a single architecture.
15. Exploration
The fundamental problem: an agent that only exploits what it knows will never discover better strategies; an agent that explores too much never reaps the reward.
- ε-greedy — with probability ε act randomly, else greedy. Simplest possible; surprisingly hard to beat in practice for stochastic environments.
- Boltzmann / softmax — π(a | s) ∝ exp(Q(s, a) / τ). Temperature τ controls exploration; higher Q values are favored more smoothly than under ε-greedy.
- Upper Confidence Bound (UCB) — Auer, Cesa-Bianchi, Fischer 2002 — pick argmax_a [Q̂(s, a) + c · √(log N / N(s, a))]. Optimism in the face of uncertainty.
- Thompson Sampling — Thompson 1933 (yes, that early), revived by Russo, Van Roy, Osband. Maintain a posterior over Q (or model parameters); each step sample from the posterior and act greedily w.r.t. the sample. Bayesian, often empirically best in bandits.
- Intrinsic motivation — augment the extrinsic reward with a curiosity bonus. ICM (Pathak et al. 2017) rewards prediction error of a forward model in a self-supervised feature space; RND (Burda et al. 2018, Random Network Distillation) rewards distance from a frozen random network’s output — both essentially novelty signals.
- Count-based bonuses — Bellemare et al. 2016 Unifying Count-Based Exploration; use a density model to define “pseudo-counts” N̂(s) and bonus 1/√N̂(s).
- Information-gain bonuses — VIME (Houthooft et al. 2016) — Bayesian variational information gain on a dynamics model.
- Go-Explore (Ecoffet et al. 2019) — explicit archive of visited states + return-to-explore. SOTA on Montezuma’s Revenge.
16. Offline RL (batch RL)
Learn purely from a fixed dataset of trajectories — no environment interaction. The dominant practical setting for safety-critical applications (medicine, autonomous driving). Took off after 2019.
Core difficulty: distributional shift. The dataset contains transitions from a behavior policy π_β; the learned policy π may produce actions outside the support of π_β where Q estimates are unreliable. Naive Q-learning extrapolates wildly and the policy exploits the over-estimates.
Algorithms:
- CQL (Kumar et al. 2020) — adds Σ_s log Σ_a exp(Q(s, a)) − E_{a ~ π_β} [Q(s, a)] regularizer; pushes Q down on out-of-dataset actions.
- IQL (Kostrikov et al. 2021) — expectile regression on V; never queries Q outside dataset support; policy extracted by advantage-weighted BC.
- BCQ (Fujimoto et al. 2019) — Batch-Constrained Q-learning; restricts actions to a learned generative model of the behavior policy.
- BRAC (Wu et al. 2019) — explicit policy regularization toward π_β (KL divergence).
- AWR / AWAC (Peng et al. 2019 / Nair et al. 2020) — advantage-weighted regression; weighted behavior cloning.
- Decision Transformer (Chen et al. 2021) — reframes RL as sequence modeling: a transformer is trained to predict actions conditioned on (return-to-go, state) tokens. Simple supervised learning, no Bellman backups, surprisingly competitive.
- Trajectory Transformer (Janner et al. 2021) — autoregressive over full trajectories, decodes via beam search.
17. Imitation learning
When demonstrations exist, learn from them directly.
- Behavioral cloning (BC) — supervised learning on (state, action) pairs. Simple; suffers from covariate shift — small errors compound as the policy visits states unlike any in the demonstration set.
- DAgger (Dataset Aggregation) — Ross, Gordon, Bagnell 2011. Iteratively query the expert on states the current policy visits; retrain. Provably reduces compounding error to O(T) instead of O(T²). Requires an interactive expert.
- GAIL (Generative Adversarial Imitation Learning) — Ho + Ermon 2016. A discriminator tries to distinguish agent trajectories from expert trajectories; the agent is rewarded for fooling the discriminator. Implicitly recovers a reward function.
- AIRL (Fu, Luo, Levine 2018) — adversarial IRL with a more interpretable reward decomposition.
- Inverse RL (IRL) — recover a reward function from demonstrations under which the demonstrator is (approximately) optimal. Ng + Russell 2000, Ziebart’s maximum-entropy IRL 2008. Ill-posed (many rewards explain any policy); MaxEnt resolves the ambiguity.
18. Multi-agent RL (MARL)
When multiple agents share an environment, each agent’s environment is non-stationary from its own perspective (other agents’ policies are changing).
- Cooperative — common reward; e.g., StarCraft micromanagement.
- Competitive — zero-sum; e.g., chess, Go.
- Mixed motives — general-sum; e.g., social dilemmas (Leibo et al. 2017).
Algorithmic frameworks:
- Independent Q-learning — each agent runs Q-learning treating others as part of the environment. Simple, often surprisingly effective, but the deadly triad lurks.
- Centralized training, decentralized execution — a central critic sees all agents’ info during training; agents act independently at test time. MADDPG (Lowe et al. 2017), COMA (Foerster et al. 2018), QMIX (Rashid et al. 2018), MAPPO (Yu et al. 2022).
- Self-play — agent plays against past versions of itself. Foundation of AlphaZero, OpenAI Five (Dota 2), AlphaStar (StarCraft II).
- Game-theoretic equilibria — Nash, correlated, coarse-correlated. Minimax-Q (Littman 1994), Nash-Q (Hu + Wellman 2003), policy-space response oracles (PSRO, Lanctot et al. 2017).
19. Hierarchical RL (HRL)
Decompose long-horizon problems into nested levels of temporal abstraction.
- Options framework — Sutton, Precup, Singh 1999. An option is a triple (initiation set, internal policy, termination condition) — a temporally extended action. SMDP theory generalizes Bellman equations to options.
- Option-Critic — Bacon, Harb, Precup 2017 — learn options end-to-end with policy gradients.
- Feudal RL / FuN — Vezhnevets et al. 2017. A manager sets goal vectors in latent space; a worker learns to achieve them.
- HIRO (Nachum et al. 2018) — off-policy hierarchical RL with goal relabeling.
- HAC (Hierarchical Actor-Critic, Levy et al. 2019) — universal value functions at multiple levels of hierarchy.
20. RLHF (RL from Human Feedback)
The bridge from classical RL to modern LLM alignment. Covered in depth in [[Compute/fine-tuning-rlhf]]. Compressed view:
- Supervised fine-tuning (SFT) on demonstration data.
- Reward model (RM) training on human pairwise preferences using the Bradley–Terry model: P(y_w ≻ y_l | x) = σ(r_θ(x, y_w) − r_θ(x, y_l)).
- PPO (or DPO, GRPO, RLOO, REMAX) on the LM against the learned reward, with a KL penalty to the SFT policy to prevent reward hacking and mode collapse.
Notable papers: Christiano et al. 2017 (deep RL from preferences, originally Atari + MuJoCo); Stiennon et al. 2020 (summarization); Ouyang et al. 2022 InstructGPT; Bai et al. 2022 Constitutional AI. Direct Preference Optimization (Rafailov et al. 2023) collapses RM training and PPO into a single supervised objective — a major simplification. Group Relative Policy Optimization (GRPO, DeepSeekMath 2024) replaces PPO’s critic with a Monte-Carlo group baseline; widely used in reasoning-model RL.
21. Theoretical results
A non-exhaustive tour of the formal guarantees:
- Bellman optimality + contraction — already covered. The single most important theorem.
- Convergence of tabular Q-learning — Watkins + Dayan 1992; Tsitsiklis 1994 (more general). Requires Robbins–Monro step sizes and infinite visitation.
- Convergence of TD with linear FA on-policy — Tsitsiklis + Van Roy 1997. Off-policy can diverge (Baird’s counterexample 1995).
- Policy gradient theorem — Sutton et al. 2000.
- Performance difference lemma — Kakade + Langford 2002. Bounds the value gap between two policies in terms of advantage; foundation of TRPO/PPO theory.
- PAC-MDP analysis — Strehl et al. 2009. Number of suboptimal steps before an algorithm is “probably approximately correct” on the MDP.
- Regret bounds for tabular RL — Jaksch, Ortner, Auer 2010 UCRL2; near-optimal regret O(D · S · √(A · T)) where D is MDP diameter.
- Sample complexity lower bounds — Azar, Munos, Kappen 2013; Jin et al. 2018 — minimax lower bounds for tabular MDPs and linear MDPs.
- Convergence of natural policy gradient + NPG/TRPO — Agarwal, Kakade, Lee, Mahajan 2021 On the Theory of Policy Gradient Methods.
22. Benchmarks and environments
A non-exhaustive list of standard testbeds:
- Gymnasium (Farama Foundation, successor to OpenAI Gym) — Atari (via ALE), classic control (CartPole, Pendulum, Acrobot), MuJoCo (HalfCheetah, Hopper, Walker2d, Ant, Humanoid), Box2D (LunarLander, BipedalWalker).
- DeepMind Control Suite (DMC) — continuous control on MuJoCo with consistent reward structure; widely used in image-based RL benchmarks.
- DeepMind Lab — 3D maze / navigation tasks.
- bsuite (Behaviour Suite) — Osband et al. 2019; diagnostic environments for credit assignment, exploration, scale, memory.
- MuJoCo — Todorov, Erez, Tassa 2012; now open-source under DeepMind. The standard continuous-control physics engine.
- Brax — Freeman et al. 2021. JAX-native, GPU/TPU-accelerated MuJoCo-style physics; massive throughput.
- Isaac Gym / Isaac Lab — NVIDIA; GPU-parallelized robotics simulation with thousands of environments per GPU. Foundation of large-scale sim-to-real legged locomotion work.
- MetaWorld — Yu et al. 2020. 50 robotic manipulation tasks for multi-task and meta-learning.
- Procgen — Cobbe et al. 2020. Procedurally generated arcade environments to test generalization.
- RLBench — James et al. 2020. Robot manipulation benchmark on CoppeliaSim.
- ALE (Arcade Learning Environment) — Bellemare et al. 2013. The original Atari benchmark, 57 games.
- Crafter — Hafner 2021. Simplified Minecraft-like 2D environment; explicit achievement hierarchy.
- Habitat — Savva et al. 2019; Habitat 2.0 (2021), Habitat 3.0 (2023). Photorealistic indoor scenes for embodied AI.
- iGibson — Stanford; large-scale interactive indoor scenes.
- Minecraft / MineRL / Minedojo — open-ended; MineRL competition 2019-2022; Minedojo (Fan et al. 2022) provides a foundation-model-friendly API.
- PettingZoo — Terry et al. 2021. Multi-agent analog of Gym; classic + Atari + MPE + Butterfly + SISL.
- NetHack Learning Environment — Küttler et al. 2020. Sparse-reward, procedurally generated dungeon crawler.
23. Libraries
Modern open-source RL libraries:
- Stable-Baselines3 (SB3) — Raffin et al. 2021. PyTorch reference implementations of PPO, SAC, TD3, DDPG, A2C, DQN. Tested, documented, the go-to for “I want a working baseline”.
- RLlib — part of Ray (Anyscale). Distributed; production-ready; many algorithms; somewhat heavyweight.
- CleanRL — Huang et al. 2022. Single-file, pedagogically clear implementations. The “read the source to understand the algorithm” library.
- TorchRL — PyTorch core. Modular, GPU-native, includes vectorized environments and replay buffers.
- Acme — Hoffman et al. 2020 (DeepMind). Modular agent framework with Reverb (distributed replay).
- Tianshou — Weng et al. 2022. PyTorch, modular, fast.
- JAX-RL — multiple ecosystems: RLax (DeepMind utilities), PureJaxRL (Lu et al.), Brax-based pipelines. End-to-end JAX is currently the fastest path for large-scale RL.
- Dopamine — Google 2018; reference for value-based RL research (DQN, C51, Rainbow, IQN).
24. Successes and AI milestones
A non-exhaustive timeline of high-water marks:
- TD-Gammon — Tesauro 1992-1995. TD(λ) + MLP on backgammon. Reached world-class play with self-play; the first major demonstration of RL at the championship level. Quietly anticipated the deep-RL era by two decades.
- Atari DQN — Mnih et al. 2013/2015. Single architecture playing 49 Atari games from pixels, superhuman on many. The catalyst for the modern deep-RL boom.
- AlphaGo — Silver et al. 2016 Nature. Defeated Lee Sedol 4-1. Hybrid SL (from human games) + RL (self-play) + MCTS.
- AlphaGo Zero — Silver et al. 2017. Pure self-play; no human data. Surpassed AlphaGo in 3 days.
- AlphaZero — Silver et al. 2018 Science. Generalization of AlphaGo Zero to chess and shogi; defeated Stockfish and Elmo.
- MuZero — Schrittwieser et al. 2019. AlphaZero without knowing the rules; planning with a learned dynamics model.
- OpenAI Five — Berner et al. 2019. 5v5 Dota 2; defeated the world champions OG in 2019.
- AlphaStar — Vinyals et al. 2019 Nature. Grandmaster-level StarCraft II via league self-play + LSTM + pointer networks.
- GT Sophy — Wurman et al. 2022 Nature. Superhuman Gran Turismo racing via SAC + curriculum + multi-agent self-play.
- Tesla FSD, Waymo, Cruise — production self-driving stacks use IL + RL components (especially for behavioral planning and simulation training).
- ChatGPT / InstructGPT — Ouyang et al. 2022. RLHF on GPT-3 — the breakthrough that made LLMs broadly useful.
- AlphaProof / AlphaGeometry 2 — DeepMind 2024. RL agents for IMO-level math; achieved silver-medal level on the 2024 IMO.
- DeepSeek-R1, OpenAI o1 / o3 — 2024-2025. RL on chain-of-thought for math and code; large-scale GRPO / outcome reward training.
25. Pitfalls
A field-tested list of failure modes:
- Reward hacking — agent finds a degenerate strategy that maximizes the proxy reward but not the intended behavior. CoastRunners famously circled in a lagoon collecting bonuses (Amodei + Clark 2016, Faulty Reward Functions in the Wild). Pervasive in RLHF (sycophancy, length bias, refusal hacking).
- Sparse reward — when rewards are 0 almost everywhere, naïve algorithms get no learning signal. Mitigations: reward shaping (Ng, Harada, Russell 1999 — potential-based shaping preserves optimal policy), curriculum, intrinsic motivation, hindsight experience replay (Andrychowicz et al. 2017).
- Non-stationarity — in multi-agent or continual settings, the “environment” includes other learners. Convergence guarantees evaporate.
- Sample inefficiency — DQN on Atari needs 200M frames; PPO on humanoid locomotion needs tens of millions of steps; even Dreamer needs millions. Compared to humans this is grotesque.
- Sim-to-real gap — policies trained in simulation often fail on real hardware due to mis-modeled dynamics, latency, sensor noise. Domain randomization (Tobin et al. 2017), system identification, adversarial perturbations.
- Catastrophic forgetting — in continual RL the agent forgets old skills when learning new ones; replay, EWC (Kirkpatrick 2017), progressive networks (Rusu 2016).
- Deadly triad — function approximation + bootstrapping + off-policy data can diverge (Baird 1995). Target networks and conservative updates mitigate but don’t eliminate.
- Reward design difficulty — specifying a reward that incentivizes the truly desired behavior is harder than it looks; cf. specification gaming, Goodhart’s law.
- Brittleness and high variance across seeds — Henderson et al. 2018 Deep RL That Matters. Many “improvements” are within seed variance. Always report multiple seeds with confidence intervals.
- Distribution shift in offline RL — the policy must not query Q on out-of-distribution actions. Driving force behind CQL, IQL, BCQ.
26. Cross-references
[[Math/_index]]— math reference family index.[[Math/probability-fundamentals]]— expectations, conditional probability, Markov chains.[[Math/bayesian-inference]]— Thompson sampling, control as inference, posterior sampling for RL.[[Math/convex-optimization]]— natural gradient, trust regions, KL constraints.[[Math/information-theory]]— entropy bonuses, KL penalties, mutual-information exploration.[[Math/mcmc-sampling]]— model-based planning sometimes uses MCMC; preference data uses Bradley–Terry.[[Robotics/rl-for-control]]— applications to robot control (sim-to-real, legged locomotion, manipulation).[[Robotics/Tier3/control-algorithms]]— see the RL section.[[Compute/fine-tuning-rlhf]]— RLHF for language models (RM training, PPO, DPO, GRPO).[[Compute/transformer-architecture]]— the policy class for modern RLHF and Decision Transformer.
27. Citations
Foundational textbooks:
- Sutton, R. S., + Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. Free PDF at http://incompleteideas.net/book/the-book-2nd.html. The canonical introduction.
- Bertsekas, D. P. (2019). Reinforcement Learning and Optimal Control. Athena Scientific. More optimization-flavored.
- Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan + Claypool. Compact theoretical treatment.
Classical theory:
- Bellman, R. (1957). Dynamic Programming. Princeton University Press.
- Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press.
- Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3.
- Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge.
- Watkins, C. J. C. H., + Dayan, P. (1992). Q-learning. Machine Learning 8.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8.
- Tsitsiklis, J. N., + Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE TAC 42.
- Sutton, R. S., McAllester, D., Singh, S., + Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NIPS.
Deep RL:
- Mnih, V., et al. (2013). Playing Atari with deep reinforcement learning. arXiv:1312.5602.
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature 518.
- Van Hasselt, H., Guez, A., + Silver, D. (2016). Deep reinforcement learning with double Q-learning. AAAI.
- Wang, Z., et al. (2016). Dueling network architectures for deep reinforcement learning. ICML.
- Schaul, T., et al. (2016). Prioritized experience replay. ICLR.
- Bellemare, M. G., Dabney, W., + Munos, R. (2017). A distributional perspective on reinforcement learning. ICML.
- Hessel, M., et al. (2018). Rainbow: combining improvements in deep reinforcement learning. AAAI.
Policy gradient + actor-critic:
- Schulman, J., Levine, S., Moritz, P., Jordan, M., + Abbeel, P. (2015). Trust region policy optimization. ICML.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., + Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
- Schulman, J., et al. (2016). High-dimensional continuous control using generalized advantage estimation. ICLR.
- Mnih, V., et al. (2016). Asynchronous methods for deep reinforcement learning. ICML.
- Lillicrap, T. P., et al. (2015). Continuous control with deep reinforcement learning (DDPG). ICLR.
- Fujimoto, S., van Hoof, H., + Meger, D. (2018). Addressing function approximation error in actor-critic methods (TD3). ICML.
- Haarnoja, T., Zhou, A., Abbeel, P., + Levine, S. (2018). Soft actor-critic. ICML.
Model-based:
- Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529.
- Silver, D., et al. (2017). Mastering the game of Go without human knowledge. Nature 550.
- Silver, D., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362.
- Schrittwieser, J., et al. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588.
- Hafner, D., et al. (2019). Learning latent dynamics for planning from pixels (PlaNet). ICML.
- Hafner, D., et al. (2020). Dream to control (Dreamer). ICLR.
- Hafner, D., et al. (2023). Mastering diverse domains through world models (DreamerV3). arXiv:2301.04104.
- Chua, K., Calandra, R., McAllister, R., + Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models (PETS). NeurIPS.
Offline + imitation:
- Kumar, A., Zhou, A., Tucker, G., + Levine, S. (2020). Conservative Q-learning for offline reinforcement learning. NeurIPS.
- Kostrikov, I., Nair, A., + Levine, S. (2021). Offline reinforcement learning with implicit Q-learning. ICLR 2022.
- Chen, L., et al. (2021). Decision transformer: reinforcement learning via sequence modeling. NeurIPS.
- Ross, S., Gordon, G., + Bagnell, J. A. (2011). A reduction of imitation learning and structured prediction to no-regret online learning (DAgger). AISTATS.
- Ho, J., + Ermon, S. (2016). Generative adversarial imitation learning. NeurIPS.
RLHF + LLMs:
- Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
- Stiennon, N., et al. (2020). Learning to summarize from human feedback. NeurIPS.
- Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback (InstructGPT). NeurIPS.
- Bai, Y., et al. (2022). Constitutional AI: harmlessness from AI feedback. arXiv:2212.08073.
- Rafailov, R., et al. (2023). Direct preference optimization: your language model is secretly a reward model. NeurIPS.
- Shao, Z., et al. (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models (GRPO).
Theory + analysis:
- Kakade, S., + Langford, J. (2002). Approximately optimal approximate reinforcement learning. ICML.
- Strehl, A. L., Li, L., + Littman, M. L. (2009). Reinforcement learning in finite MDPs: PAC analysis. JMLR.
- Jaksch, T., Ortner, R., + Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. JMLR.
- Agarwal, A., Kakade, S. M., Lee, J. D., + Mahajan, G. (2021). On the theory of policy gradient methods. JMLR.
- Henderson, P., et al. (2018). Deep reinforcement learning that matters. AAAI.