Reinforcement Learning for Robotics Control — PPO, SAC, Sim-to-Real

Scope. Classical model-based control ([[Robotics/pid-control]], [[Robotics/state-space-lqr]], MPC) designs a controller from a known plant model. RL instead learns the controller from interaction data with a plant (real or simulated) by maximising a scalar reward. This note is the robotics-applied counterpart: which algorithm to pick, how to design rewards that survive contact with hardware, how to bridge the sim-to-real gap, and case studies of learned policies actually deployed on Cassie, ANYmal, champion-level drone racing, and ALOHA bimanual manipulators. RL theory (MDP foundations, policy-gradient derivations, exploration theory) lives in [[Robotics/rl-for-control]] (planned). Foundational deep-learning machinery lives in [[Robotics/computer-vision-robotics]] (planned).

1. At a glance

RL for control means: learn a policy $π (a ∣ s)$ — typically a small neural network — that maximises expected cumulative reward by trial-and-error in a (usually simulated) plant. The policy is then deployed (zero-shot or after fine-tuning) on the real robot. As of 2024-2026 RL is in production on:

Legged locomotion. Cassie outdoor running (Siekmann 2021, OSU), ANYmal-C blind stairs (Lee 2020), Spot research, Unitree Go2 default firmware ≥ 1.0.20, MIT Mini Cheetah, every recent quadruped paper. See [[Robotics/legged-robotics]] §3.1 for the canonical PPO recipe.
Drone racing. Swift (Kaufmann 2023 Nature) beat three human champions on a fixed track using vision-based PPO at 100 Hz.
Bimanual manipulation. Stanford ALOHA / Mobile ALOHA (Zhao 2023, Fu 2024): imitation-trained ACT (Action Chunking Transformer) policies on a roughly 32k USD bimanual robot.
Generalist VLAs. RT-2 (Brohan 2023, Google), Pi-0 (Physical Intelligence 2024), Octo (Berkeley 2024), Gemini Robotics (2025): vision-language-action models trained mainly by imitation on human demonstrations, with RL or preference-based (DPO) fine-tuning as an optional last stage.
Industrial pick-and-place. Covariant Brain, Berkshire Grey, Mujin — proprietary RL-fine-tuned grasp policies trained in sim.

Why RL instead of model-based control. Three reasons stack:

Contact-rich tasks. Robust gait on rubble, deformable-object manipulation, gripper closing on irregular geometry — situations where rigid-body MPC with friction cones underestimates the dynamics. RL absorbs the mismatch into the policy via domain randomisation.
Vision-in-the-loop. Mapping raw RGB or depth to action through a single end-to-end neural net is more direct than the segment → object-pose → IK → trajectory pipeline. RL+CNN is one differentiable graph.
Reward as spec. “Walk 1 m/s, don’t fall, minimise joint accelerations” is easier to write than to derive a Hamilton-Jacobi-Bellman solution. The reward is the contract; the policy is the implementation.

Why not RL. Equally important:

Sample efficiency. Real-robot RL is data-starved: a manipulator collects 1 episode/s; a quadruped 10 episodes/s in sim, 0.1 episodes/s on real hardware. Anything that solves with a known model (flat-ground mobile-base path tracking — PID + MPC) should use the model.
Reward hacking. The policy will find every exploit. A walking-reward bug → robot scoots on its belly. A clearance-reward bug → drone flips upside-down between gates because the reward only checks the ring centre.
Sim-to-real gap. The dominant failure mode. A policy that wins in sim falls over on real ground because friction, motor latency, sensor noise, and unmodelled compliance differ from the simulator.
Brittleness. Out-of-distribution states (unseen terrain, unseen object) → undefined behaviour. Model-based controllers degrade gracefully; learned policies can be unstable.

Where it sits in the design stack. Sensors → state representation (raw obs or learned feature) → policy network $π_{θ}$ → low-level joint PD or torque → plant. The policy typically outputs target joint positions (not torques) at 50 Hz; an underlying stiff PD closes the inner loop at 1 kHz. See [[Robotics/legged-robotics]] §3.1: nearly every deployed RL locomotion controller uses this two-layer pattern rather than commanding torques end-to-end.
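The two-rate structure can be sketched on a single joint. This is a toy illustration only: the plant is a pure inertia with a made-up value, the `policy` callable is a hypothetical stand-in for the network, and the gains are borrowed from the quadruped example later in this note.

```python
def pd_torque(q_target, q, qd, kp=25.0, kd=0.5):
    """Joint-level PD law: tau = Kp * (q_target - q) - Kd * qd."""
    return kp * (q_target - q) - kd * qd


def run_two_rate_loop(policy, duration_s=2.0, policy_hz=50, pd_hz=1000,
                      inertia=0.05):
    """Simulate one joint: the policy refreshes the position target at
    policy_hz, while the PD law and the toy plant run at pd_hz.

    `inertia` (kg*m^2) is an assumed illustrative value, not a real robot's.
    """
    dt = 1.0 / pd_hz
    steps_per_action = pd_hz // policy_hz  # 20 inner steps per policy step
    q, qd, q_target = 0.0, 0.0, 0.0
    for k in range(int(duration_s * pd_hz)):
        if k % steps_per_action == 0:
            q_target = policy(q, qd)       # slow outer loop (50 Hz)
        tau = pd_torque(q_target, q, qd)   # fast inner loop (1 kHz)
        qd += (tau / inertia) * dt         # semi-implicit Euler integration
        q += qd * dt
    return q


# A constant "policy" that always asks for 0.3 rad.
final_q = run_two_rate_loop(lambda q, qd: 0.3)
```

The policy never sees torque: it only moves the set-point, and the stiff inner loop absorbs high-frequency disturbances between policy steps.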

Hybrid model-based + RL is the 2024-2026 frontier. Pure model-based MPC ([[Robotics/legged-robotics]] Di Carlo 2018) is interpretable and stable; pure RL is robust to model error but opaque. The hybrid pattern (model-based MPC for the planning skeleton plus an RL residual policy for the contact and disturbance terms) aims to combine the strengths of both. ANYmal-C’s “robust” mode and Boston Dynamics’ “park controller” gait are cited as examples of this pattern.

First ask before applying RL.

Can a calibrated dynamics model + MPC solve this? If yes, use it: cheaper, debuggable, no sim-to-real gap.
Do you have a fast simulator (Isaac Lab, MuJoCo, Genesis)? If not, consider imitation learning from demonstrations instead.
Can you write a reward whose maximum corresponds to your actual goal? If you can’t, neither can the agent.
Do you have a real-robot validation budget? Plan for 2–10× sim hours of real-time evaluation before a policy is shippable.

2. First principles

2.1 Markov Decision Process

The problem is formalised as a Markov Decision Process $(S, A, P, R, γ)$ :

$S$ — state space (continuous, $R^{n}$ for robotics: joint pos/vel, base orientation, contact sensors).
$A$ — action space (continuous, $R^{m}$ : target joint positions, torques, or motor PWM).
$P (s^{'} ∣ s, a)$ — stochastic transition kernel; the simulator (or real robot).
$R (s, a)$ — scalar reward, the only signal the agent gets.
$γ \in [0, 1)$ — discount factor. $γ = 0.99$ at 50 Hz gives an effective horizon of ~ 2 s; $γ = 0.995$ gives ~ 4 s. Robot RL uses $γ \in [0.95, 0.999]$ .
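The horizon figures quoted for $γ$ follow from the rule of thumb that rewards are discounted over roughly $1 / (1 - γ)$ steps. A small check, assuming that rule and a fixed policy rate (the function name is illustrative):

```python
def effective_horizon_s(gamma, policy_hz):
    """Approximate look-ahead in seconds: 1 / (1 - gamma) steps at policy_hz."""
    horizon_steps = 1.0 / (1.0 - gamma)
    return horizon_steps / policy_hz


horizon_099 = effective_horizon_s(0.99, 50)    # about 2 s
horizon_0995 = effective_horizon_s(0.995, 50)  # about 4 s
```

The same $γ$ means a different horizon at a different control rate, so $γ$ should be revisited whenever the policy frequency changes.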

Markov property: $P$ depends only on $(s, a)$ , not on history. Robots violate this (motor temperature, unobservable terrain) → either expand the state to include hidden variables or use a recurrent policy.

2.2 Value functions and the Bellman equation

The state-value function $V^{π} (s) = E_{π} [\sum_{t = 0}^{\infty} γ^{t} R_{t} ∣ s_{0} = s]$ is expected return from $s$ under policy $π$ .

The action-value (Q) function $Q^{π} (s, a) = E_{π} [\sum_{t = 0}^{\infty} γ^{t} R_{t} ∣ s_{0} = s, a_{0} = a]$ .

The Bellman expectation equation for $V^{π}$ :

$V^{π} (s) = E_{a \sim π} [R (s, a) + γ E_{s^{'}} [V^{π} (s^{'})]]$

The Bellman optimality equation for $Q^{*}$ (Bellman 1957):

$Q^{*} (s, a) = R (s, a) + γ E_{s^{'}} [\max_{a^{'}} Q^{*} (s^{'}, a^{'})]$

Q-learning chases the latter by sampling; policy-gradient methods instead estimate $\nabla_{θ} J (θ)$ , the gradient of expected return with respect to the policy parameters, directly (§2.3).

2.3 Policy gradient theorem (Sutton 2000)

For a parametric policy $π_{θ} (a ∣ s)$ the gradient of expected return decomposes as

$\nabla_{θ} J (θ) = E_{s \sim d^{π}, a \sim π_{θ}} [\nabla_{θ} \log π_{θ} (a ∣ s) \cdot A^{π} (s, a)]$

where $d^{π}$ is the on-policy state distribution and $A^{π} (s, a) = Q^{π} (s, a) - V^{π} (s)$ is the advantage. Subtracting $V$ is a control variate — does not bias the estimator, but reduces variance by orders of magnitude. The estimator in practice uses Generalised Advantage Estimation (Schulman 2016):

$A_{t}^{GAE (γ, λ)} = \sum_{l = 0}^{\infty} (γλ)^{l} δ_{t + l}, \quad δ_{t} = R_{t} + γ V (s_{t + 1}) - V (s_{t})$

$λ \in [0.9, 0.97]$ trades bias for variance. PPO and A2C both consume GAE.
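In practice the infinite sum is evaluated backwards over a finite rollout using the recursion $A_{t} = δ_{t} + γλ A_{t + 1}$ . A minimal sketch, assuming a single rollout segment with no episode termination inside it and a bootstrap value `last_value` for the state after the final step (both simplifications):

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one rollout segment.

    rewards[t], values[t] are R_t and V(s_t); last_value is V(s_T) used to
    bootstrap past the end of the segment. Assumes no `done` inside it.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lam * gae                      # A_t recursion
        advantages[t] = gae
        next_value = values[t]
    return advantages


# With a zero critic and lam = 1, GAE reduces to the discounted return.
adv = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=0.5, lam=1.0)
```

Setting `lam=0` instead gives the one-step TD residual, which is the low-variance, high-bias end of the trade-off.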

2.4 On-policy vs off-policy

On-policy algorithms (REINFORCE, A2C/A3C, PPO, TRPO) use only trajectories generated by the current $π_{θ}$ . Each gradient step requires fresh rollouts → bad sample efficiency, but stable training. Default choice for sim-heavy robotics where rollouts are cheap.

Off-policy algorithms (DQN, DDPG, TD3, SAC) reuse old transitions stored in a replay buffer. Better sample efficiency (5–50× fewer environment steps to convergence) but trickier to stabilise. Default choice for real-robot RL where every interaction costs wall-clock.

2.5 PPO clipped objective (Schulman 2017)

The workhorse for robotics. Define the policy ratio $r_{t} (θ) = π_{θ} (a_{t} ∣ s_{t}) / π_{θ_{old}} (a_{t} ∣ s_{t})$ . The clipped surrogate objective (to be maximised) is

$L^{CLIP} (θ) = E_{t} [min (r_{t} (θ) A_{t}, clip (r_{t} (θ), 1 - ϵ, 1 + ϵ) A_{t})]$

with $ϵ = 0.2$ default. The clip prevents single updates from moving the policy too far from $π_{θ_{old}}$ , achieving (cheaply) what TRPO does with a KL trust region. The full PPO objective combines this with a value-function MSE and an entropy bonus:

$L^{PPO} (θ) = - L^{CLIP} + c_{1} (V_{θ} (s) - V_{target})^{2} - c_{2} H (π_{θ})$

Defaults: $c_{1} = 0.5$ , $c_{2} = 0.01$ . The entropy bonus prevents premature collapse to a deterministic policy.
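The clipping behaviour is easiest to see on individual samples. The sketch below evaluates the clipped objective and the combined loss on plain Python lists; the function names are illustrative, and a real implementation would operate on framework tensors so that gradients can flow.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                        # r_t(theta)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip(r, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def ppo_loss(clip_objective, values, value_targets, entropy, c1=0.5, c2=0.01):
    """Loss to minimise: -L_CLIP + c1 * value MSE - c2 * entropy."""
    mse = sum((v - t) ** 2 for v, t in zip(values, value_targets)) / len(values)
    return -clip_objective + c1 * mse - c2 * entropy


# Ratio 1.5 with positive advantage: the gain is capped at 1.2 * A.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with negative advantage: no cap, the full penalty applies.
uncapped = ppo_clip_objective([math.log(1.5)], [0.0], [-1.0])
```

The asymmetry is the point: the objective never rewards moving further than the clip range in the helpful direction, but it always charges the full cost of a harmful move.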

2.6 SAC maximum-entropy objective (Haarnoja 2018)

Soft Actor-Critic maximises return + a policy-entropy bonus:

$J (π) = \sum_{t} E_{(s_{t}, a_{t}) \sim π} [R (s_{t}, a_{t}) + α H (π (\cdot ∣ s_{t}))]$

$α$ is the temperature, learned adaptively to track a target entropy $H_{target} = - dim (A)$ (that is, −1 nat per action dimension, a heuristic default). The maximum-entropy formulation acts as automatic exploration and gives SAC excellent sample efficiency on manipulation tasks where exploration matters more than stability.
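The adaptive temperature can be sketched as one gradient step on $J (α) = E [- α (\log π (a ∣ s) + H_{target})]$ , parameterised through $\log α$ so that $α$ stays positive. This is a hand-written illustration with a hypothetical function name; library implementations differ in detail (some optimise $\log α$ in place of $α$ inside the loss).

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on the temperature objective.

    log_probs are log pi(a|s) for a batch of sampled actions; the batch
    entropy estimate is -mean(log_probs).
    """
    alpha = math.exp(log_alpha)
    mean_log_prob = sum(log_probs) / len(log_probs)
    # d/d(log_alpha) of  -alpha * (mean_log_prob + target_entropy)
    grad = -alpha * (mean_log_prob + target_entropy)
    return log_alpha - lr * grad


# 8-D action space, so the target entropy is -8 nats.
# Entropy estimate -10 < -8: policy too deterministic, temperature rises.
raised = update_log_alpha(0.0, [10.0] * 4, target_entropy=-8.0)
# Entropy estimate -2 > -8: policy more random than needed, temperature falls.
lowered = update_log_alpha(0.0, [2.0] * 4, target_entropy=-8.0)
```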

2.7 Model-based RL

Instead of model-free trial-and-error, learn the transition $\hat{P} (s^{'} ∣ s, a)$ from data and plan in the learned model. Three families:

PETS (Chua 2018) — probabilistic ensembles of MLPs; cross-entropy method for planning. State-of-the-art at 100k–1M sample budget.
MuZero (Schrittwieser 2020) — learned latent dynamics + MCTS; mastered Atari + Go + chess + shogi with one architecture.
DreamerV3 (Hafner 2023) — world models in latent space from pixels; one set of hyperparameters across 150+ tasks, no env-specific tuning.

Model-based achieves 10–100× sample efficiency vs model-free at the cost of architectural complexity and bias from model error.

2.8 Imitation learning alternative

When demonstrations are cheap and rewards are hard to design, imitate:

Behaviour cloning (BC) — supervised regression $π (a ∣ s)$ on expert $(s, a)$ pairs. Brittle to compounding error (Ross 2011).
DAgger (Ross 2011) — iteratively query the expert on policy rollouts. Fixes compounding error if expert is available online.
GAIL (Ho 2016) — adversarial: a discriminator distinguishes expert from policy trajectories; policy fools it. No reward needed.
Diffusion Policy (Chi 2023) — predict action sequence via denoising diffusion. SOTA on visuomotor manipulation; handles multi-modal demonstrations.
ACT — Action Chunking Transformer (Zhao 2023) — predicts $k$ -step action chunks via transformer. Used in ALOHA / Mobile ALOHA; 50 demos suffice for many tasks.

2.9 Offline RL

When online interaction is too expensive or too risky (real robots, surgical and other safety-critical systems), learn from a fixed dataset $D = \{(s, a, r, s^{'})\}$ . The naive approach (fit Q on $D$ , then maximise) fails because of distributional shift: the learned Q over-estimates value for out-of-distribution actions. Two dominant fixes:

Conservative Q-Learning (CQL, Kumar 2020): penalise Q for out-of-dataset actions: $L_{CQL} = L_{TD} + α E_{s \sim D} [\log \sum_{a} \exp Q (s, a) - E_{a \sim D} [Q (s, a)]]$ . The bracketed term is a log-sum-exp (a soft maximum of Q over all actions) minus Q on dataset actions; minimising it pushes Q down on out-of-distribution actions relative to dataset actions.
Implicit Q-Learning (IQL, Kostrikov 2021): never evaluates Q on out-of-distribution actions. Fits $V (s)$ via expectile regression (an upper expectile of $Q (s, a)$ over dataset actions) and uses $V$ as the bootstrap target. Policy extraction is via advantage-weighted regression. A strong, widely used baseline on D4RL.
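Both regularisers are small enough to write out for a single state. The sketch below assumes a discrete action set for the CQL term (continuous-action CQL approximates the log-sum-exp by sampling) and an expectile level `tau=0.7` chosen only for illustration; the function names are not from any library.

```python
import math


def cql_penalty(q_values, dataset_action):
    """log-sum-exp of Q over all actions minus Q of the dataset action."""
    m = max(q_values)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[dataset_action]


def expectile_loss(q, v, tau=0.7):
    """Asymmetric squared error used by IQL to fit V towards an upper
    expectile of Q: errors with Q > V are weighted tau, others 1 - tau."""
    diff = q - v
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


# Penalty is small when the dataset action already has the highest Q ...
in_dist = cql_penalty([5.0, 0.0, 0.0], dataset_action=0)
# ... and large when some other (possibly unseen) action looks better.
out_dist = cql_penalty([5.0, 0.0, 0.0], dataset_action=1)
```

The CQL term is non-negative by construction, so it can only lower Q on actions the dataset does not support; the expectile loss achieves a similar effect without ever querying such actions.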

For real-robot RL, the common pattern is: collect $10^{3}$ to $10^{4}$ teleoperated episodes offline + train IQL or BC → deploy as warm start → fine-tune online with SAC.

2.10 Sim-to-real gap

The single largest engineering problem in robotics RL. The simulator is a wrong model: friction, motor latency, sensor noise, link compliance, contact impulses, sensor calibration — all differ from reality. Four families of mitigation:

Domain randomisation (Tobin 2017): randomise simulator parameters (friction, mass, motor strength, latency) each episode so the policy learns a single $π$ that works across the distribution. The current default.
Asymmetric actor-critic (Pinto 2018): critic sees privileged sim info (full state), actor sees only deployable sensors. Speeds up training without leaking privileged info into deployment.
Teacher-student distillation (Lee 2020): train a teacher with privileged terrain/state info in sim, then distill it into a student that sees only a history of deployable (in Lee 2020, proprioceptive) observations. The ANYmal blind-stairs recipe.
Real-data fine-tuning: collect a small dataset on real hardware and fine-tune with offline RL (CQL, IQL) or behaviour cloning. A related approach, RMA (Kumar 2021), trains an “adaptation module” in simulation that infers environment parameters online from a short observation history.

3. Practical math & worked examples

Example A — PPO loss for a quadruped (numbers)

Quadruped trained with PPO at 50 Hz policy rate, 1 kHz inner PD. Observation $s \in R^{45}$ : 12 joint pos + 12 joint vel + 3 base ang-vel + 3 gravity-in-body + 3 commanded velocity + 12 last-action.

Action $a \in R^{12}$ : target joint positions, fed to inner PD with $K_{p} = 25$ N·m/rad, $K_{d} = 0.5$ N·m·s/rad.

Policy network: 3-layer MLP, hidden sizes 512 → 256 → 128, ELU activation. Output head produces $(μ, \log σ)$ ; action sampled from $N (μ, σ^{2})$ during training, $μ$ used at deployment.

PPO hyperparameters (Margolis 2024 / legged_gym defaults):

clip_ε         = 0.2
γ              = 0.99
λ (GAE)        = 0.95
PPO epochs     = 5
mini-batches   = 4
learning rate  = 1e-3 (with adaptive decay on KL)
entropy coef   = 0.01
value coef     = 1.0
max grad norm  = 1.0
parallel envs  = 4096    (Isaac Gym, RTX 4090)
steps / env    = 24      → batch of 98 304 transitions per PPO update

Wall-clock training budget: 12 000 PPO iterations × ~ 1.5 s/iter ≈ 5 hours on a single RTX 4090 to reach robust trot on flat ground; 12 hours to add stairs and slopes. Total environment steps: $1.2 \times 10^{9}$ . This is orders of magnitude more interaction than a real robot could practically provide, which is why every legged RL pipeline trains in sim.
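The budget figures can be re-derived from the hyperparameters listed above. The only added assumption is that each environment step corresponds to one 50 Hz policy step when converting to equivalent real-robot time.

```python
num_envs = 4096
steps_per_env = 24
iterations = 12_000
seconds_per_iter = 1.5
policy_hz = 50

# Transitions collected per PPO update.
batch = num_envs * steps_per_env

# Total interaction over the whole run.
total_steps = batch * iterations

# Wall-clock training time on the GPU.
train_hours = iterations * seconds_per_iter / 3600.0

# The same interaction collected on one real robot running non-stop.
robot_days = total_steps / policy_hz / 86_400.0

print(batch, total_steps, train_hours, round(robot_days))
```

Five hours of GPU time stands in for most of a year of uninterrupted operation on a single robot, before counting resets and repairs.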

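Compute breakdown. GPU peak: ~ 280 W. Simulator step (Isaac Gym physics + PD): 0.6 ms per batched step at 4096 envs → ~ 1 700 batched steps/s → ~ 6.8 M env-steps/s aggregate raw physics throughput. The end-to-end rate implied by the training budget above is lower, 98 304 transitions / 1.5 s ≈ 65 k policy steps/s, since one 50 Hz policy step spans several physics/PD substeps plus observation, reward and reset logic. Policy forward + backward: 0.4 ms / minibatch. PPO update: ~ 80 ms wall-clock per iteration (20 gradient steps = 5 epochs × 4 mini-batches).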

Example B — Reward shaping for a manipulator reach task

Task: 7-DoF arm (Franka FR3) reach end-effector to target $p^{*} \in R^{3}$ .

Sparse reward: $R_{t} = \begin{cases} + 10 & \text{if } ∥ p_{ee} - p^{*} ∥ < 0.02 \text{ m} \\ 0 & \text{otherwise} \end{cases}$ With this, PPO requires $\sim 5 \times 10^{7}$ env-steps to first success.

Shaped reward decomposed into three weighted terms:

$R_{t} = R_{task} + R_{smooth} + R_{safety}$

$R_{task} = - ∥ p_{ee} - p^{*} ∥_{2}^{2} + 10 \cdot 1 [∥ p_{ee} - p^{*} ∥ < 0.02]$

$R_{smooth} = - α ∥ \dot{q} ∥_{2}^{2} - β ∥ a_{t} - a_{t - 1} ∥_{2}^{2}$

$R_{safety} = - γ_{coll} 1 [collision] - γ_{lim} \sum_{j} (ReLU (∣ q_{j} ∣ - q_{j, max}))^{2}$

With $α = 10^{- 3}$ , $β = 10^{- 2}$ , $γ_{coll} = 100$ , $γ_{lim} = 10$ . With the shaped reward PPO converges in $\sim 3 \times 10^{6}$ env-steps, about 17× fewer than the sparse reward needs for its first success.

The danger: shaping can create a different optimum from the sparse reward. Over-weighting $R_{smooth}$ produces a stationary policy (“don’t move, get 0 instead of negative”). Standard remedy: anneal $α, β$ from 0 over the first $10^{6}$ steps so the agent first learns the task, then learns to do it gently.
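A sketch of the shaped reward with the annealing remedy folded in. The weights are the ones quoted above; the success bonus of 10, the linear ramp shape, and all function and argument names are assumptions made for illustration.

```python
import math


def anneal(step, final_weight, ramp_steps=1_000_000):
    """Linear ramp from 0 to final_weight over the first ramp_steps."""
    return final_weight * min(1.0, step / ramp_steps)


def shaped_reward(p_ee, p_goal, qd, action, prev_action, q, q_max,
                  collided, step, alpha=1e-3, beta=1e-2,
                  gamma_coll=100.0, gamma_lim=10.0, bonus=10.0):
    """R_task + R_smooth + R_safety for the reach task."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(p_ee, p_goal))
    r_task = -dist_sq + (bonus if math.sqrt(dist_sq) < 0.02 else 0.0)

    # Smoothness weights start at zero so the task is learned first.
    a_w = anneal(step, alpha)
    b_w = anneal(step, beta)
    action_rate = sum((a - b) ** 2 for a, b in zip(action, prev_action))
    r_smooth = -a_w * sum(v * v for v in qd) - b_w * action_rate

    # Hinge penalty: only joints beyond their limit contribute.
    excess = sum(max(0.0, abs(qj) - lim) ** 2 for qj, lim in zip(q, q_max))
    r_safety = -gamma_coll * (1.0 if collided else 0.0) - gamma_lim * excess

    return r_task + r_smooth + r_safety


zeros7 = [0.0] * 7
limits = [2.8] * 7  # hypothetical joint limits in rad
at_goal = shaped_reward([0.5, 0.0, 0.4], [0.5, 0.0, 0.4], zeros7, zeros7,
                        zeros7, zeros7, limits, collided=False, step=0)
```

Logging each of the three terms separately during training makes it much easier to spot the stationary-policy failure: the smoothness term sits at zero while the task term stops improving.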

Example C — SAC for a Franka pick-and-place (off-policy real-robot setup)

Task: 7-DoF Franka FR3 picks a cube from a $0.3 \times 0.3$ m workspace and places it on a target pad. Real-robot training (no sim).

Observation $s \in R^{29}$ : 7 joint pos + 7 joint vel + 3 ee pos + 4 ee quat + 3 cube pos + 4 cube quat + 1 gripper width.

Action $a \in R^{8}$ : 7 joint-velocity targets + 1 gripper command, all in $[- 1, 1]$ after tanh squash.

Hyperparameters (SB3 SAC defaults plus Franka-specific):

γ                = 0.99
τ (target update) = 0.005
batch size        = 256
replay buffer     = 1 000 000 transitions
learning rate     = 3e-4 (actor, critic, alpha)
target entropy    = -dim(A) = -8
gradient steps    = 1 per env step
warmup steps      = 10 000 (random policy)
policy network    = 2 × 256 MLP, ReLU
Q-network         = 2 × 256 MLP, ReLU (twin)

At a real-robot rate of ~ 1 episode/30 s (5 s reset + 25 s rollout), $10^{5}$ env-steps takes ~ 14 hours. Convergence to 80 % pick success: $\sim 5 \times 10^{4}$ steps. With batch size 256 and one gradient step per env step, each stored transition is drawn into on the order of 200 mini-batches over such a run (256 samples × $4 \times 10^{4}$ post-warmup updates, spread over $5 \times 10^{4}$ transitions); this data reuse is the source of SAC’s sample-efficiency advantage over PPO.

Safety wrappers (mandatory): cartesian-velocity clip at 0.5 m/s; joint-torque clip at 80 % rated; force-sensor termination if $F_{z} > 30$ N during free motion; auto-recovery routine that pushes the gripper to a safe home pose on termination.
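The first three wrappers reduce to a few lines each. This sketch uses the limits stated above; the function names and the example rated torque are hypothetical, and a real wrapper would sit between the policy and the robot driver.

```python
import math


def clip_cartesian_velocity(v_xyz, v_max=0.5):
    """Scale the end-effector velocity so its norm never exceeds v_max (m/s),
    preserving direction."""
    speed = math.sqrt(sum(c * c for c in v_xyz))
    if speed <= v_max:
        return list(v_xyz)
    scale = v_max / speed
    return [c * scale for c in v_xyz]


def clip_joint_torques(torques, rated, fraction=0.8):
    """Clamp each joint torque to +/- fraction of its rated value."""
    return [max(-fraction * r, min(fraction * r, t))
            for t, r in zip(torques, rated)]


def should_terminate(force_z, in_free_motion, f_limit=30.0):
    """End the episode if an unexpected contact force appears in free motion."""
    return in_free_motion and force_z > f_limit


safe_v = clip_cartesian_velocity([3.0, 4.0, 0.0])   # norm 5 -> norm 0.5
safe_tau = clip_joint_torques([100.0], [87.0])      # 87 N*m is a made-up rating
```

Scaling the whole velocity vector, as opposed to clipping each axis, keeps the commanded direction intact, which matters when the policy is steering towards a grasp.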

Example D — Domain-randomisation budget for an ANYmal-class quadruped

Target: zero-shot transfer from Isaac Gym to real ANYmal. The DR distribution (Lee 2020 / Margolis 2024):

Parameter	Sim default	Randomisation
Ground friction $μ$	1.0	$U (0.4, 1.2)$
Restitution $e$	0.0	$U (0.0, 0.3)$
Payload mass added to base	0 kg	$U (- 1, + 3)$ kg
Centre-of-mass offset	0	$U (- 5, + 5)$ cm × 3 axes
Motor $K_{p}$ scale	1.0	$U (0.8, 1.2)$
Motor $K_{d}$ scale	1.0	$U (0.8, 1.2)$
Motor torque saturation	33.5 N·m	$U (0.8, 1.2) \times 33.5$
Observation latency	0 ms	$U (0, 25)$ ms
IMU gyro noise	0	$N (0, 0.1)$ rad/s
IMU accel noise	0	$N (0, 0.5)$ m/s²
Push velocity (every 8 s)	—	$U (- 1, + 1)$ m/s × 3 axes

Without DR: sim2real success rate < 20 %, robot falls within 10 s. With DR as above: $\sim 90\%$ success on flat terrain, $\sim 60\%$ on stairs without an adaptation module, $\sim 95\%$ with RMA-style online adaptation (Kumar 2021).
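Sampling the table once per episode might look like the sketch below. The dictionary keys are illustrative names, not simulator API fields, and the IMU entries are interpreted as noise standard deviations, which is an assumption about the table's $N (0, \cdot)$ notation.

```python
import random


def sample_dr_params(rng):
    """Draw one set of domain-randomisation parameters for an episode."""
    return {
        "friction": rng.uniform(0.4, 1.2),
        "restitution": rng.uniform(0.0, 0.3),
        "payload_kg": rng.uniform(-1.0, 3.0),
        "com_offset_m": [rng.uniform(-0.05, 0.05) for _ in range(3)],
        "kp_scale": rng.uniform(0.8, 1.2),
        "kd_scale": rng.uniform(0.8, 1.2),
        "torque_limit_nm": rng.uniform(0.8, 1.2) * 33.5,
        "obs_latency_s": rng.uniform(0.0, 0.025),
        "gyro_noise_std": 0.1,    # rad/s, applied per step as N(0, std)
        "accel_noise_std": 0.5,   # m/s^2, applied per step as N(0, std)
        "push_velocity_mps": [rng.uniform(-1.0, 1.0) for _ in range(3)],
    }


def noisy_gyro(true_rate, params, rng):
    """Add zero-mean Gaussian noise to one gyro axis reading."""
    return true_rate + rng.gauss(0.0, params["gyro_noise_std"])


rng = random.Random(0)  # seeded so runs are reproducible
episode_params = sample_dr_params(rng)
```

Keeping the sampler in one place makes targeted refinement straightforward: widening a range after a failed hardware test is a one-line change.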

4. Design heuristics

4.1 Algorithm selection

Situation	Pick	Rationale
Continuous action, sim available, lots of rollouts	PPO	Stable, scales to 1000s of parallel envs, the field default
Continuous action, sample-efficient needed (real robot)	SAC or TD3	Off-policy + entropy = best sample efficiency on manipulation
Discrete action (board games, Atari, gait selection)	Rainbow DQN	Discrete value-based with all the bells
Expert demonstrations available	ACT, Diffusion Policy	Imitation beats RL when demos exist
Few hundred episodes only (real robot)	Offline RL: IQL, CQL	Avoids querying the env; learns from fixed data
Pixel input, hard exploration	DreamerV3	World model + latent rollouts; SOTA on hard exploration
Trajectory optimisation under known model	iLQR / DDP	Not RL; the right tool when the model is good
Hybrid model-based + learned residual	MPC + RL residual	Best of both, the 2025-26 frontier

4.2 Reward engineering — the hardest part

Reward design dominates RL success. Five rules from the legged-robot literature (Margolis 2024, Lee 2020, Hwangbo 2019):

Decompose into task / safety / smoothness / regularisation. Each term has its own weight; tune in isolation, then together.
Use squared L2 for tracking, hinge / ReLU for limits, exponential for soft success. $r = \exp (- ∥ v - v^{*} ∥^{2} / σ^{2})$ saturates near the goal, avoiding unbounded gradients.
Curriculum the weights. Anneal smoothness penalties from 0 over $10^{6}$ steps; otherwise the agent prefers stillness to risk.
Penalise action rate, not just joint acceleration. $∥ a_{t} - a_{t - 1} ∥^{2}$ (the change in commanded action between consecutive policy steps) enforces smooth action sequences and is the single most important addition for sim-to-real.
Add survival bonus. $+ 0.15$ per step alive. Pushes the agent toward not-falling before it has the gait figured out.
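Three of these rules combine into a compact per-step reward. In the sketch below the kernel width `sigma` and the action-rate weight are placeholder values, and the function names are illustrative.

```python
import math


def velocity_tracking_reward(v, v_cmd, sigma=0.25):
    """Exponential kernel: 1.0 at perfect tracking, decaying smoothly to 0."""
    err_sq = sum((a - b) ** 2 for a, b in zip(v, v_cmd))
    return math.exp(-err_sq / sigma ** 2)


def step_reward(v, v_cmd, action, prev_action, alive=True,
                action_rate_weight=0.01, survival_bonus=0.15):
    """Tracking kernel, action-rate penalty and survival bonus in one scalar."""
    action_rate = sum((a - b) ** 2 for a, b in zip(action, prev_action))
    reward = velocity_tracking_reward(v, v_cmd)
    reward -= action_rate_weight * action_rate
    if alive:
        reward += survival_bonus
    return reward


perfect = step_reward([1.0, 0.0], [1.0, 0.0], [0.2] * 12, [0.2] * 12)
```

Because the kernel is bounded in $(0, 1]$ , the tracking term can never swamp the penalties, which keeps the relative weights meaningful across the whole state space.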

Anti-patterns:

Hand-crafted “if-else” reward → reward hacking guaranteed.
All-or-nothing sparse reward in long-horizon tasks → no learning signal.
Reward with unbounded magnitude (e.g. $1/∥ p - p^{*} ∥$ ) → numerical instability, exploding gradients.

4.3 Observation and action design

Observation space — what the policy actually sees:

Joint pos + joint vel — universal; always include.
Base ang-vel + gravity-in-body-frame — for floating-base robots. Don’t include base linear velocity; it’s noisy and unobservable. Let the policy infer from history.
Last action — $a_{t - 1}$ in the observation makes the policy implicitly recurrent and stabilises training.
Commanded velocity / goal — the user input.
Contact sensors — 0/1 per foot; helps gait-aware policies.
Privileged in sim only (friction, payload, terrain heights). The critic sees these, the actor does not (asymmetric AC). For deployment, train an adaptation module in sim that estimates them from observation history.

Action space — what the policy outputs:

Target joint positions to a stiff PD (the legged default). Stable, smooth, easy to clip. $K_{p} = 20$ – $80$ N·m/rad, $K_{d} = 0.5$ – $2.0$ N·m·s/rad.
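Joint torques directly: harder to train and more sim-to-real fragile, but allows fully exploiting actuator bandwidth. Torque-level control is what MIT used for Mini Cheetah backflips (Katz 2019), although those came from offline trajectory optimisation rather than RL.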
Target end-effector velocity / Cartesian wrench — for manipulation. Add IK or operational-space mapping inside the env.
PWM duty — never. Too low-level; couples policy to actuator electronics.

Normalisation. Always normalise observations to ~ $N (0, 1)$ : running mean / running std on each component, frozen at deploy. Scale actions to $[- 1, 1]$ with a tanh squash on the policy output. Skipping this is one of the most common causes of unstable training in beginner pipelines.
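A running normaliser with a freeze switch can be sketched with Welford's online update. This is a plain-Python illustration with hypothetical names; RL libraries ship their own vectorised equivalents.

```python
import math


class RunningNormalizer:
    """Per-component running mean/variance, frozen for deployment."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim   # running sum of squared deviations
        self.eps = eps
        self.frozen = False

    def update(self, obs):
        """Fold one observation into the statistics (no-op once frozen)."""
        if self.frozen:
            return
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Return (obs - mean) / std using the current statistics."""
        out = []
        for i, x in enumerate(obs):
            # Fall back to unit variance until two samples have been seen.
            var = self.m2[i] / self.count if self.count > 1 else 1.0
            out.append((x - self.mean[i]) / math.sqrt(var + self.eps))
        return out


norm = RunningNormalizer(dim=1)
for sample in ([0.0], [2.0], [4.0]):
    norm.update(sample)
norm.frozen = True          # deployment: statistics no longer change
norm.update([1000.0])       # ignored
centred = norm.normalize([2.0])
```

The frozen statistics must be exported alongside the policy weights; a policy deployed with different normalisation constants sees inputs it was never trained on.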

4.4 Sim choice

For RL specifically:

Isaac Lab / Isaac Gym (NVIDIA): GPU physics + GPU policy, 4096+ parallel envs on one RTX 4090, 100 k env-steps/s aggregate. The default for legged-robot RL since 2022.
MuJoCo (Google DeepMind; free since 2021, Apache-2.0 open source since 2022): CPU-based core engine with accurate contact; the MJX port runs on GPU via JAX. The default for manipulation and analytical-baseline papers.
Genesis (CMU/Sea AI Lab, Dec 2024): GPU physics, with a claimed ~ 100× speed-up over single-CPU MuJoCo; rapidly adopted in 2025.
Brax (Google, JAX): fully differentiable simulator; differentiable through contacts.
RaiSim (ETH): single-CPU but very fast on contact; the original ANYmal training stack.
MuJoCo MPC / MuJoCo Playground (DeepMind, 2022 / 2025): combined sim + RL training infrastructure with predictive-sampling MPC integration.

GPU sims (Isaac, Genesis, Brax) only beat CPU sims if your policy is small enough that data-loading is not the bottleneck. A 1M-param MLP at 4096 envs is comfortably GPU-bound; a 50M-param CNN starts saturating PCIe.

4.5 Sim-to-real bridge

Five techniques, in order of how much they buy you:

Domain randomisation on the parameters that matter most: friction, motor strength, latency, base mass. The single biggest lever.
Asymmetric actor-critic: privileged info to the critic only.
Teacher-student distillation (Lee 2020): train a teacher with full info; distill to a deployable student via DAgger. Used for ANYmal blind stairs.
RMA-style online adaptation (Kumar 2021): train a context encoder that maps a short observation history to a latent representation of the environment; deploy and adapt online.
Real-data fine-tuning: collect 100–1000 real episodes; fine-tune with IQL or CQL. Used by Berkeley Octo for the last 5 % of performance.

A policy that fails sim-to-real almost always reveals a missing randomisation: motor delay is the most common omission, sensor latency the second.

4.6 Safety in real-robot RL

Action limits at the env layer, not the policy. Clip target joint positions to a safe envelope before passing to the inner PD.
Termination on safety violation — base angle > 60°, base height < 0.3 m, joint torque > 95 % rated for > 100 ms.
Recovery policy — separate, conservative policy invoked when the main policy violates safety; resets the robot to a known-safe pose.
Soft barriers in the reward rather than hard constraints — penalise approach to limits before they are hit.
Shielded RL / constrained-MDP for hard guarantees: Achiam 2017 CPO, Stooke 2020 PID Lagrangian. Used by Tesla for FSD planning.
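The termination rule above needs a little state, because the torque condition only trips after it has persisted for 100 ms. A minimal sketch using the thresholds from the list; the class name and the rated-torque value in the example are hypothetical.

```python
class SafetyMonitor:
    """Episode-termination check for a legged robot."""

    def __init__(self, rated_torque, dt, max_tilt_deg=60.0, min_height_m=0.3,
                 torque_fraction=0.95, torque_window_s=0.1):
        self.rated_torque = rated_torque
        self.dt = dt
        self.max_tilt_deg = max_tilt_deg
        self.min_height_m = min_height_m
        self.torque_fraction = torque_fraction
        self.torque_window_s = torque_window_s
        self.over_steps = 0   # consecutive steps above the torque threshold

    def step(self, tilt_deg, base_height_m, joint_torques):
        """Return True if the episode must be terminated on this step."""
        limit = self.torque_fraction * self.rated_torque
        if any(abs(t) > limit for t in joint_torques):
            self.over_steps += 1
        else:
            self.over_steps = 0   # the overload must be continuous
        torque_violation = self.over_steps * self.dt > self.torque_window_s
        return (tilt_deg > self.max_tilt_deg
                or base_height_m < self.min_height_m
                or torque_violation)


monitor = SafetyMonitor(rated_torque=33.5, dt=0.02)   # 50 Hz check
flags = [monitor.step(5.0, 0.5, [33.0] * 12) for _ in range(10)]
```

A brief torque spike is tolerated, a sustained one is not; that distinction avoids terminating on every foot impact while still protecting the actuators.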

4.7 Curriculum learning patterns

Most non-trivial robotics RL pipelines fail without curriculum. Four patterns dominate:

Terrain curriculum (Lee 2020 / legged_gym). 10 difficulty levels of terrain (flat → slopes → stairs → gaps → rough). Each env is assigned a level; advances when episode return exceeds 80 % of max, demotes when below 30 %. Rolling success rate determines progression.
Command curriculum. Start with slow commanded velocities ( $∥ v^{*} ∥ < 0.5$ m/s) and widen to the full range ( $∥ v^{*} ∥ < 3$ m/s) over $10^{6}$ steps. Prevents the policy from over-committing to high-speed gaits before slow-speed control is solid.
Reward curriculum. Anneal penalty terms from 0 to target weight over $10^{6}$ steps. Smoothness, action-rate, joint-acceleration penalties are common annealing targets.
Task curriculum — task is broken into subgoals (reach → grasp → lift → place). Promote when prior subgoal is reliably solved. Used in OpenAI Rubik’s cube (2019) and Mobile ALOHA training.

The wrong curriculum gets stuck. Common pathologies: a fixed-pace schedule advances faster than the policy learns (collapse); a success-triggered schedule with too-high a threshold (gets stuck on a hard level forever). The fix is adaptive difficulty: tie the schedule to a moving-average success rate with both promotion and demotion thresholds.
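The adaptive fix can be sketched as a small state machine over a moving-average success rate. The thresholds reuse the 80 % / 30 % figures from the terrain curriculum; the window length, the reset-after-change rule, and the class name are illustrative choices.

```python
from collections import deque


class AdaptiveCurriculum:
    """Promote or demote a difficulty level from a rolling success rate."""

    def __init__(self, num_levels=10, window=100,
                 promote_at=0.8, demote_at=0.3):
        self.level = 0
        self.max_level = num_levels - 1
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.results = deque(maxlen=window)

    def report(self, success):
        """Record one episode outcome and return the (possibly new) level."""
        self.results.append(1.0 if success else 0.0)
        if len(self.results) < self.results.maxlen:
            return self.level             # not enough evidence yet
        rate = sum(self.results) / len(self.results)
        if rate > self.promote_at and self.level < self.max_level:
            self.level += 1
            self.results.clear()          # re-measure at the new level
        elif rate < self.demote_at and self.level > 0:
            self.level -= 1
            self.results.clear()
        return self.level


curriculum = AdaptiveCurriculum(window=10)
for _ in range(10):
    curriculum.report(True)
level_after_wins = curriculum.level       # promoted once
for _ in range(10):
    curriculum.report(False)
level_after_losses = curriculum.level     # demoted back
```

Clearing the window after each change prevents results from an easier level from triggering a second, unearned promotion.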

4.8 Hyperparameter sensitivity

Robotics RL is famously sensitive to hyperparameters. Henderson 2018 (“Deep RL that matters”) showed that the same algorithm with different seeds can differ in final performance by 50 %+. The hyperparameters most worth tuning, in order:

Learning rate (typical PPO range $10^{- 4}$ to $3 \times 10^{- 3}$ ); use KL-adaptive decay.
Entropy coefficient — 0.0 to 0.05; if policy collapses, raise; if returns plateau without converging, lower.
Batch size (mini-batch within PPO) — larger = lower variance, slower wall-clock; ~ 16 k for legged, ~ 4 k for manipulation.
GAE λ — 0.9 to 0.97; lower = less variance, more bias.
Discount γ — 0.95 to 0.999; longer horizons need higher γ but also slower training.
Clip ratio ε — PPO default 0.2; rarely worth tuning.

For SAC the analogues are target entropy, replay-buffer size, and the number of gradient steps per env step (UTD, the update-to-data ratio). Higher UTD is more sample-efficient but less stable.

4.9 When NOT to use RL

Flat-ground mobile-base path tracking → [[Robotics/pid-control]] + [[Engineering/mpc-control]].
Linear known plant (drone hover) → [[Robotics/state-space-lqr]].
Trajectory tracking with a known model and constraints → MPC.
Single-task manipulation with a fixed object → scripted pose + force control.
Anywhere the dynamics model is good and the cost is convex → use the model.

RL pays off where contact richness, vision, or task generality exceed what a model can capture.

4.10 Anatomy of a deployable RL pipeline

The end-to-end pipeline for a sim-trained, real-deployed RL policy has roughly twelve stages. Skipping any of them is the most common failure mode in undergraduate research projects:

1. URDF / MJCF validation. Inertias, link masses, joint limits, joint friction, joint damping must match the real robot to within ~ 10 %. Use the mjcf_to_urdf round-trip and visualise; pay particular attention to inertia tensors (often wildly wrong if generated from CAD without manual checking).
2. Actuator model. First-order torque dynamics: $\dot{τ} = (τ_{cmd} - τ) / τ_{a}$ with $τ_{a} \approx 5$ ms for QDD. Add current limit + torque-speed curve. Without this, sim-to-real fails on aggressive maneuvers.
3. Observation pipeline. Define the exact observation vector that will be available at deploy. Add Gaussian noise + delay during training to match the real stack.
4. Reward design + scalar-balance check. Histogram every reward component over a random rollout; ensure no component dominates by more than 10×.
5. PD baseline. Verify that the inner PD on top of the policy can hold position against a 10 N disturbance. If not, the policy will fight the PD.
6. Sanity train on a simpler task. Stand-up reward only; verify policy learns to stand within $5 \times 10^{7}$ env-steps.
7. Full-task training with curriculum and domain randomisation.
8. Sim deploy test. Evaluate in a held-out sim with stronger DR than training; success rate $> 80\%$ is the minimum bar.
9. Real-robot setup. Real-time policy inference (~ 2 ms budget at 50 Hz); action-clipping safety wrapper; emergency-stop on safety violation.
10. Real-robot evaluation. 50-episode baseline test; record everything (joint logs, video, IMU). Compare to sim distribution.
11. Targeted DR refinement if sim-to-real gap is wide. Common: motor latency was 0 in sim but 8 ms in reality; widening DR closes the gap.
12. Production wrap. ROS 2 node or equivalent integration, watchdog, telemetry, recovery policy.

Items 1–4 are pre-training; items 5–8 are sim development; items 9–12 are deployment. A typical first-time pipeline takes 2–3 months of engineer-time; mature labs (ETH RSL, MIT Biomimetic, Berkeley Robotics) reduce this to ~ 2 weeks for an incremental policy.
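The first-order actuator model from the pipeline above discretises exactly when the commanded torque is held constant over a simulation step. A minimal sketch with the 5 ms time constant quoted there; current limits and the torque-speed curve are left out, and the function name is illustrative.

```python
import math


def actuator_step(tau, tau_cmd, dt, tau_a=0.005):
    """Advance the lag d(tau)/dt = (tau_cmd - tau) / tau_a by one step.

    Exact for a command held constant over dt, so it stays stable even
    when dt is not small compared with tau_a.
    """
    blend = 1.0 - math.exp(-dt / tau_a)
    return tau + blend * (tau_cmd - tau)


# Step response: after one time constant (5 x 1 ms) the output has covered
# a fraction 1 - 1/e of the commanded 10 N*m.
tau = 0.0
for _ in range(5):
    tau = actuator_step(tau, tau_cmd=10.0, dt=0.001)
```

Dropping this lag from the simulator lets the policy rely on instantaneous torque changes that the real motor cannot deliver, which shows up first in aggressive manoeuvres.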

5. Components & sourcing

5.1 RL libraries

Library	Framework	Algorithms	Best for	License
Stable Baselines 3	PyTorch	PPO, SAC, TD3, A2C, DQN, HER	Education, baselines	MIT
rl_games	PyTorch	PPO, SAC	Isaac Gym / Lab default backend	MIT
Tianshou	PyTorch	30+ algorithms	Research benchmarks	MIT
RLlib (Ray)	PyTorch / TF	All major	Distributed, multi-agent	Apache 2
Acme (DeepMind)	JAX / TF	All major	DeepMind-style research	Apache 2
CleanRL	PyTorch	Single-file impls	Reproducibility, teaching	MIT
Brax (Google)	JAX	PPO, SAC, ES	Differentiable-sim RL	Apache 2
PufferLib	PyTorch	High-throughput	Maximising env steps/s	MIT
TorchRL (Meta)	PyTorch	All major	Modular, integrated with PyTorch	MIT
JaxRL / SBX	JAX	PPO, SAC	Speed	MIT

For robotics specifically, rsl_rl + Isaac Gym / Isaac Lab + legged_gym (leggedrobotics/legged_gym, ETH) is the canonical legged PPO pipeline, with rl_games as the other common Isaac backend; SB3 + Gymnasium + MuJoCo is the manipulation-research default; CleanRL is for “I want to read every line of the algorithm.”

5.2 Simulators

Simulator	Vendor	Strength	Notes
Isaac Lab	NVIDIA, 2024	GPU-parallel, photoreal	Successor to Isaac Gym; current default
MuJoCo	DeepMind (Apache 2 since 2022)	Contact accuracy	Industry standard for RL papers
Brax	Google	Differentiable, JAX	Faster than MuJoCo on GPU
Genesis	CMU + Sea AI Lab, 2024	Fastest GPU physics	~ 100× MuJoCo, claimed
PyBullet	OSS (Python bindings for Bullet Physics)	Free, slow	Education and legacy code
SAPIEN	UCSD	Articulated objects	Manipulation-focused
Gazebo (Garden, Harmonic)	Open Robotics	ROS 2 integration	Whole-system tests; weaker contact
Drake	TRI / MIT	Hydroelastic contact	Formal-methods-friendly
RaiSim	ETH	Single-CPU legged	Commercial license
MuJoCo MPC / Playground	DeepMind, 2022 / 2025	Sim + MPC + RL	Predictive-sampling integration

5.3 Robot model libraries

MuJoCo Menagerie (DeepMind) — validated MJCF models for 30+ robots: Spot, Go2, H1, ANYmal, Cassie, UR5, Franka FR3, Aloha. The reference distribution.
Isaac Lab assets — USD models for all major commercial robots.
Robosuite (UT Austin) — manipulation tasks + objects + several arms.
Robocasa (NVIDIA 2024) — kitchen-scale manipulation tasks.

5.4 Benchmarks

Benchmark	Domain	Notes
DeepMind Control Suite	Continuous control	Standard MuJoCo-based baselines since 2018
Meta-World	Manipulation	50 tasks, multi-task RL
D4RL	Offline RL	The canonical fixed-dataset benchmark
RLBench	Manipulation	100 PyRep / CoppeliaSim tasks
ManiSkill (UCSD)	Manipulation, real-transfer	Pick-place, peg insertion
Habitat (Meta)	Navigation	Indoor photorealistic
CARLA	Driving	Urban driving sim
Procgen	Generalisation	Procedurally-generated 2D
NetHack Learning Environment	Hard exploration	Among the hardest unsolved RL benchmarks

5.5 Real-robot RL frameworks

Isaac Manipulator + cuMotion (NVIDIA 2024) — production manipulation policy with optional RL fine-tuning; GPU-parallel collision-free motion + learned grasp.
PSL / Pearl (Berkeley) — academic real-robot RL.
ALOHA / Mobile ALOHA datasets — Zhao 2023, Fu 2024; open-source low-cost bimanual + the Mobile ALOHA wheeled base.
Open-X-Embodiment (collaboration, 2023–24): dataset of 1M+ robot episodes pooled from 21 institutions across 22 robot embodiments.

5.6 Foundation models for robotics (VLA — Vision-Language-Action)

Model	Org	Year	Notes
RT-2	Google DeepMind	2023	First VLA; PaLI-X and PaLM-E backbones, language → action
OpenVLA	Stanford	2024	Open 7B-param VLA, fine-tunable
Octo	Berkeley	2024	Generalist policy, 800k+ episodes, transformer
Pi-0 ( $π_{0}$ )	Physical Intelligence	2024	Flow-matching VLA; deployed on ALOHA + Franka
Gemini Robotics	Google	2025	Multimodal LM + manipulation head
Helix	Figure AI	2025	Bimanual mobile humanoid foundation model
GR00T	NVIDIA	2024	Humanoid foundation model

These are not strictly RL: most are imitation-trained on teleoperation data, with RL or DPO fine-tuning used only for the final increment of performance. RL fine-tuning of pretrained VLAs is emerging as a leading paradigm for general manipulation.

6. Reference data

6.1 Algorithm comparison

Algorithm	On/off-policy	Action	Sample efficiency	Stability	Default for
REINFORCE	on	both	very low	very low	teaching
A2C / A3C	on	both	low	low	distributed teaching
PPO	on	both	medium	high	legged, drone, sim-heavy
TRPO	on	both	medium	high	predecessor to PPO
DDPG	off	continuous	medium	low	superseded by TD3/SAC
TD3	off	continuous	high	medium	manipulation, real-robot
SAC	off	continuous	high	high	manipulation, sample-limited
DQN	off	discrete	medium	medium	discrete control, games
Rainbow DQN	off	discrete	high	high	Atari, discrete RL benchmark
MuZero	model-based	discrete + continuous	very high	medium	games, planning
PETS	model-based	continuous	very high	medium	low-data control
DreamerV3	model-based	both	very high	high	pixel-based, generalist

6.2 Simulator comparison

Simulator	GPU?	Parallel envs	Contact accuracy	Cost
Isaac Lab	yes	4096+	medium-high (PhysX)	free
MuJoCo	no (CPU; GPU via MJX)	hundreds	excellent	free (Apache 2)
Brax	yes (JAX)	1024+	medium	free
Genesis	yes	thousands	high	free
PyBullet	no	dozens	medium	free
RaiSim	no (multithread CPU)	dozens	excellent	commercial
SAPIEN	partial	hundreds	medium-high	free
Gazebo	no	one	medium (ODE/DART)	free
Drake	no	one	excellent (hydroelastic)	free

6.3 Common reward components (legged + manipulation)

Term	Form	Typical weight	Purpose
Linear vel tracking	$exp (- ∥ v - v^{*} ∥^{2} /0.25)$	$+ 1.0$	task
Angular vel tracking	$exp (- (ω_{z} - ω_{z}^{*})^{2} /0.25)$	$+ 0.5$	task
Action rate	$- ∥ a_{t} - a_{t - 1} ∥_{2}^{2}$	$- 0.01$	smoothness
Joint acceleration	$- ∥ \ddot{q} ∥_{2}^{2}$	$- 2.5 \times 1 0^{- 7}$	smoothness
Joint torque	$- ∥ τ ∥_{2}^{2}$	$- 1 \times 1 0^{- 5}$	energy
Foot air-time	$+ \sum_{i} (t_{air, i} - 0.5)$ if $0.2 < t_{air} < 0.5$	$+ 1.0$	gait
Foot contact force	$- \sum_{i} ∥ F_{i} ∥^{2}$	$- 1 \times 1 0^{- 3}$	impact gentle
Survival	$+ 1$ per step alive	$+ 0.15$	early bootstrap
Base height	$- (h_{b} - h^{*})^{2}$	$- 1.0$	posture
Termination	$- 200$ if fallen	flat penalty	safety
Reach success	$+ 10$ if $∥ p_{ee} - p^{*} ∥ < 0.02$	sparse	task success
Collision	$- 100$ flag	$- 100$	safety
Joint-limit hinge	$- (\mathrm{ReLU}(\lvert q \rvert - q_{\max}))^{2}$
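
As a rough illustration of how rows of this table combine into one scalar, the sketch below evaluates the linear-velocity tracking kernel, the action-rate penalty and the torque penalty. The function and argument names are hypothetical. The table carries a minus sign in both the form and the weight of each penalty row; the sketch assumes the intent is a penalty and applies the sign once.

```python
import math

def sq_norm(v):
    """Squared Euclidean norm of a sequence of floats."""
    return sum(x * x for x in v)

def legged_reward(v, v_cmd, action, prev_action, torque):
    """Three rows of the table as one weighted sum (all names are illustrative)."""
    vel_err = [a - b for a, b in zip(v, v_cmd)]
    act_diff = [a - b for a, b in zip(action, prev_action)]
    terms = {
        # Linear velocity tracking: exp(-||v - v*||^2 / 0.25), weight 1.0
        "track": 1.0 * math.exp(-sq_norm(vel_err) / 0.25),
        # Action rate: squared change in action, penalised with magnitude 0.01
        "action_rate": -0.01 * sq_norm(act_diff),
        # Joint torque: squared torque norm, penalised with magnitude 1e-5
        "torque": -1e-5 * sq_norm(torque),
    }
    terms["total"] = sum(terms.values())
    return terms
```

Returning the individual terms alongside the total makes it easy to log each component separately, which is what the reward-bounds check in §7.3 relies on.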

6.4 Sim-to-real techniques

Technique	What it does	When it helps most
Domain randomisation (dynamics)	Vary friction, mass, $K_{p}$ , latency	Locomotion on unknown terrain
Domain randomisation (visual)	Vary lighting, texture	RGB-input policies
Asymmetric actor-critic	Critic sees privileged info	Speeds training without leak
Teacher–student distillation	Student deployable; teacher uses privileged	ANYmal blind stairs (Lee 2020)
RMA (rapid motor adaptation)	Online adaptation module	Mid-deployment terrain change
System identification	Calibrate sim to real	Manipulation, drone hover
Real-data fine-tuning (offline RL)	Polish in real	Last 5 % manipulation perf
Action / observation noise	Train with the noise you’ll see	Always
Action smoothing (low-pass) at deploy	Reject HF policy noise	Vibration-prone hardware
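
A minimal sketch of the dynamics-randomisation row, assuming parameters are resampled once per episode and that latency is emulated by delaying actions by a whole number of control steps. The 0 to 25 ms latency range follows §7.2; the friction, mass and gain ranges, and all class and function names, are placeholders rather than values from any cited paper.

```python
import random
from collections import deque

def sample_dynamics(rng):
    """Draw one set of per-episode dynamics parameters (ranges are placeholders)."""
    return {
        "friction": rng.uniform(0.4, 1.2),
        "mass_scale": rng.uniform(0.9, 1.1),
        "kp_scale": rng.uniform(0.85, 1.15),
        "latency_ms": rng.uniform(0.0, 25.0),
    }

class DelayedActions:
    """Hold actions back by a fixed number of control steps to emulate latency."""

    def __init__(self, latency_ms, dt_ms, default_action):
        self.delay_steps = int(round(latency_ms / dt_ms))
        # Pre-fill with a neutral action so the first outputs are well defined.
        self.queue = deque([default_action] * self.delay_steps)

    def step(self, action):
        """Push the newest action and return the one that is now due."""
        self.queue.append(action)
        return self.queue.popleft()
```

With a 10 ms control period and 20 ms of latency, the wrapper returns the default action twice before the first policy action reaches the simulated motors.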

6.5 Offline RL algorithm comparison

Algorithm	Year	Key idea	Use when
BC	1990s	Supervised regression on $(s, a)$	Pure imitation, low compounding error
BCQ	2019	Constrain policy to dataset support	First offline RL with neural nets
CQL	2020	Penalise OOD-Q via log-sum-exp	Stable, common D4RL baseline
IQL	2021	Expectile regression, no OOD eval	Strong D4RL baseline, lower hyperparameter sensitivity
AWAC	2020	Advantage-weighted policy update	Offline → online transition (offline pretraining, then online fine-tuning)
TD3+BC	2021	TD3 + behaviour-clone regulariser	Cheap, often competitive with CQL
Diffusion-QL	2022	Diffusion policy + Q-learning	Multi-modal action distributions
Decision Transformer	2021	Conditional sequence model on return	Transformer-native, good with long traj
CRR	2020	Critic-regularised regression	DeepMind production offline
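
The "expectile regression" entry for IQL refers to an asymmetric squared loss on the difference u = Q(s, a) − V(s). A small sketch of that loss follows; the function name is hypothetical and tau = 0.7 is only an illustrative default.

```python
def expectile_loss(diffs, tau=0.7):
    """Mean of |tau - 1(u < 0)| * u^2 over the differences u = Q - V."""
    total = 0.0
    for u in diffs:
        # Positive errors are weighted by tau, negative ones by (1 - tau).
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(diffs)
```

With tau above 0.5 the loss penalises under-estimating V more than over-estimating it, so V drifts toward the upper range of Q values seen in the dataset without ever evaluating out-of-distribution actions.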

6.6 Hyperparameter starter pack (PPO, legged robot)

γ = 0.99, λ = 0.95
clip = 0.2, value clip = 0.2
lr = 1e-3 with KL-adaptive decay (target KL = 0.01)
n_envs = 4096 (Isaac Lab)
n_steps = 24
n_epochs = 5, n_minibatches = 4
entropy_coef = 0.01 (annealed to 0.001 over 5 000 iter)
value_coef = 1.0
grad_clip = 1.0
hidden = [512, 256, 128], ELU
action_scale = 0.25 (target joint delta in rad)
init_log_std = 0.0
network init = orthogonal (gain = sqrt(2))
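
The starter pack can be captured as a plain config object, together with one possible KL-adaptive learning-rate rule. This is a sketch: the field names, the 1.5× step factor and the learning-rate bounds are assumptions, not values given in this note.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    gamma: float = 0.99
    gae_lambda: float = 0.95
    clip: float = 0.2
    value_clip: float = 0.2
    lr: float = 1e-3
    target_kl: float = 0.01
    n_envs: int = 4096
    n_steps: int = 24
    n_epochs: int = 5
    n_minibatches: int = 4
    entropy_coef: float = 0.01
    value_coef: float = 1.0
    grad_clip: float = 1.0
    hidden: tuple = (512, 256, 128)
    action_scale: float = 0.25
    init_log_std: float = 0.0

    @property
    def batch_size(self):
        """Transitions collected per PPO iteration."""
        return self.n_envs * self.n_steps

    @property
    def minibatch_size(self):
        return self.batch_size // self.n_minibatches

def adapt_lr(lr, kl, target_kl, factor=1.5, lr_min=1e-5, lr_max=1e-2):
    """Shrink lr when the measured KL overshoots the target, grow it when far below."""
    if kl > 2.0 * target_kl:
        return max(lr_min, lr / factor)
    if 0.0 < kl < 0.5 * target_kl:
        return min(lr_max, lr * factor)
    return lr
```

With the defaults, each iteration collects 4096 × 24 = 98 304 transitions, split into mini-batches of 24 576.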

6.7 Training budgets (deployed robotics RL pipelines)

System	Algorithm	Sim	Parallel envs	Env-steps	Wall-clock	GPU
ANYmal blind stairs (Lee 2020)	PPO teacher + DAgger student	RaiSim	200	$\sim 4 \times 1 0^{8}$	4 days	none (CPU)
MIT Mini Cheetah (Margolis 2023)	PPO	Isaac Gym	4096	$1 - 2 \times 1 0^{9}$	5–12 h	1× RTX 3090
Unitree Go2 (default 2024 firmware)	PPO + RMA	Isaac Gym	4096	$\sim 1 0^{9}$	8–12 h	1× RTX 4090
Swift drone (Kaufmann 2023)	PPO + sim-real residual	Custom Flightmare	100	$\sim 5 \times 1 0^{7}$ + 50 real episodes	24 h sim + 1 day real	1× RTX
Cassie sim2real running (Siekmann 2021)	PPO + recurrent	MuJoCo	64	$\sim 1 0^{9}$	7 days	1× V100
ALOHA pick-place (Zhao 2023)	ACT (imitation)	n/a (real only)	1	50 demos	hours	1× RTX 3090
RoboPianist (Zakka 2023)	SAC	MuJoCo	64	$\sim 1 0^{8}$	2 days	1× A100

7. Failure modes & debugging

7.1 Symptom → cause table

Symptom	Likely cause	Fix
Policy exploits a reward bug	Reward hacking	Redesign reward; add safety term; freeze incentive that’s exploited
Policy collapses to constant action	Mode collapse / entropy too low	Raise entropy coef (PPO); raise temperature $α$ (SAC); larger batch
Sim works, real falls	Sim-to-real gap	Harder DR on the parameter that differs; add asymmetric AC; teacher-student
Catastrophic forgetting during fine-tune	Distribution shift	Replay buffer of old states; LoRA-style param-efficient fine-tune
Curriculum stalls at level N	Wrong difficulty schedule	Use adaptive curriculum (success-rate triggered)
Random training crash mid-run	Optimiser instability	Save checkpoints every 100 iters; LR warmup
Return variance huge in PPO	Batch too small or LR too high	Larger batch (≥ 32k); more PPO epochs (5–10); KL-adaptive LR
Sparse reward never trains	No learning signal	BC warm-start; curiosity bonus; reward shaping
Actions saturate (always at $\pm 1$ )	Output not squashed; action scale wrong	Tanh on output; rescale env’s action limits
Privileged info leaks to student	Observation spec error	Strict obs-spec assertion in env wrapper
Perfect-torque simulator artifact	Idealised motor model	Add motor latency, current limit, friction model
Non-reproducible across GPUs	Non-deterministic CUDA	`torch.backends.cudnn.deterministic = True`; seed everything
VRAM OOM at 4096 envs	Replay or trajectory buffer too big	Reduce envs or grad-accumulate
Slow eval blocks training	Eval too often	Eval only every ~100 iterations, in a single env
Policy oscillates at deploy	HF noise in policy output	Low-pass filter actions (1st-order, $τ = 30$ ms)
Sudden gait change mid-episode	Reward shifts as curriculum advances	Anneal reward weights over $1 0^{6}$ steps
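
The low-pass fix in the table can be sketched as a discrete first-order filter applied to each action dimension. The 30 ms time constant comes from the table; the 20 ms control period and the class name are assumptions for illustration.

```python
class ActionLowPass:
    """First-order low-pass filter: y <- y + alpha * (x - y)."""

    def __init__(self, tau_s=0.030, dt_s=0.020):
        # Backward-Euler discretisation of a filter with time constant tau_s.
        self.alpha = dt_s / (tau_s + dt_s)
        self.state = None

    def __call__(self, action):
        if self.state is None:
            # Start from the first action to avoid a transient at reset.
            self.state = list(action)
        else:
            self.state = [y + self.alpha * (x - y)
                          for x, y in zip(action, self.state)]
        return list(self.state)
```

The filter adds phase lag, so it is worth including the same filter in simulation during training rather than adding it only at deployment.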

7.2 Detailed failure modes

Reward hacking. The agent will find every reward bug. Lee 2020 (ANYmal) initially saw the policy “trot” by oscillating its base — exploiting an angular-velocity-tracking term that didn’t constrain CoM motion. Fix: add a base-pitch penalty and a survival bonus; require both directional velocity and progress.
Mode collapse. A SAC policy with too-low entropy temperature converges to a single trajectory regardless of state. Fix: set $H_{target} = - dim (A)$ (auto-temperature) or raise $α$ manually. For PPO, raise the entropy coefficient from 0.01 to 0.05 and check that policy std doesn’t collapse below 0.05.
Sim-to-real failure — actuator latency. The most common omission. Real motors have 1–25 ms torque-delivery latency from current-command to actual torque; ideal simulators give zero. A policy that depends on instantaneous torque response will oscillate on hardware. Fix: include $U (0, 25)$ ms action latency in DR.
Catastrophic forgetting at fine-tune. Fine-tuning a sim-trained policy on real episodes causes it to forget sim-only skills (e.g. recovery from large pushes). Fix: keep a replay buffer of sim-collected states; mix sim + real in fine-tune batches at 4:1.
Curriculum stuck. A fixed terrain progression schedule fails when the policy plateaus at a level. Fix: use a success-rate-triggered curriculum (graduate to next level when running success ≥ 70 %); demote on regress.
High-variance PPO returns. Default PPO with 1024 parallel envs has noisy advantage estimates. Fix: increase parallel envs to 4096+, use 5 PPO epochs over 4 mini-batches, and clip the value-function update with the same $ϵ = 0.2$ as the policy.
Sparse-reward training fails. Cannot bootstrap with no signal. Fix: behaviour-clone the first $1 0^{5}$ steps from a scripted policy; add a curiosity bonus (RND, Burda 2018); use Hindsight Experience Replay (HER, Andrychowicz 2017) for goal-conditioned tasks.
Action saturation. Network output unbounded; tanh missing. Fix: always apply tanh squash at the policy output; scale action range explicitly in env.
Privileged info leaks. In asymmetric actor-critic, the actor can accidentally receive privileged data through parameters shared with the critic. Fix: separate the actor and critic networks completely; add an explicit `assert obs_actor.shape == expected` in the env step.
Sim-only artifacts. Perfect contact restitution = 0 in default MuJoCo; real foot strikes have $e \approx 0.1$ – $0.3$ . Real-world impacts feel “harder” than sim. Fix: randomise restitution; add foot compliance to the URDF.
Non-reproducibility. Same code, different machine, different result. Fix: seed Python, NumPy, PyTorch, CUDA; pin simulator version; record git hash + container image; the canonical Stable Baselines 3 / CleanRL reproducibility checklists are good references.
VRAM exhaustion. With 4096 parallel envs + large observation vectors + a transformer policy, even an RTX 4090 (24 GB) runs out. Fix: reduce envs to 2048; gradient-accumulate over 2 micro-batches; use bfloat16 for the simulator output.
Slow eval drags training. Evaluating in single env every iteration takes seconds; with 4096-env training one PPO iter is < 2 s — eval becomes the bottleneck. Fix: eval only every 100 iterations, in a single env.
Deployed policy oscillates. HF noise in policy output causes motor whine and torque chatter. Fix: 1st-order low-pass filter on actions with $τ \approx 30$ ms; or train with explicit action-rate penalty.
Gait spontaneously changes mid-episode. Cause: the reward weights shift abruptly when the curriculum advances mid-training. Fix: anneal the weights smoothly over $1 0^{6}$ steps; never apply them as a step change.
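
A success-rate-triggered curriculum, as recommended under "Curriculum stuck", might look like the sketch below. The 70 % promotion threshold is from the text; the 30 % demotion threshold, the 100-episode window and the class name are assumptions.

```python
from collections import deque

class SuccessRateCurriculum:
    """Promote or demote the terrain level from a running success rate."""

    def __init__(self, n_levels, promote_at=0.70, demote_at=0.30, window=100):
        self.n_levels = n_levels
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.level = 0
        self.results = deque(maxlen=window)

    def report(self, success):
        """Record one episode outcome and return the level to use next."""
        self.results.append(1.0 if success else 0.0)
        if len(self.results) < self.results.maxlen:
            return self.level  # not enough evidence at this level yet
        rate = sum(self.results) / len(self.results)
        if rate >= self.promote_at and self.level < self.n_levels - 1:
            self.level += 1
            self.results.clear()  # statistics restart at the new level
        elif rate <= self.demote_at and self.level > 0:
            self.level -= 1
            self.results.clear()
        return self.level
```

Clearing the window after each level change keeps outcomes from the previous difficulty from triggering an immediate second promotion or demotion.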

7.3 Quick debug procedure for a stuck pipeline

Verify env determinism. Set seed, take 100 steps with fixed actions; confirm rollouts identical across runs.
Check reward bounds. Histogram every reward component over a random rollout. Any > $1 0^{2}$ in magnitude dominates; rescale.
Check action distribution. At $5 \times 1 0^{5}$ steps, action std should be $> 0.1$ on every dim. If not → entropy too low.
Check value-function fit. $V_{θ} (s)$ predicting return should have $R^{2} > 0.5$ ; if not, value-loss coef may be too low.
Sanity-train on a known task. Cartpole or Pendulum; PPO/SAC should converge in $1 0^{5}$ steps. If not, the codebase is broken.
Profile env-step rate. Should be $> 1 0^{4}$ steps/s/env in MuJoCo, $> 1 0^{5}$ aggregate in Isaac. If not, env or sim is the bottleneck.
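
Two of the checks above reduce to a few lines of arithmetic. The sketch below computes the value-function R² from logged returns and predictions, and flags reward components whose magnitude exceeds the 10² limit. Function names and the logging format (a dict of per-component sample lists) are hypothetical.

```python
def r_squared(returns, values):
    """Coefficient of determination of value predictions against returns."""
    mean_ret = sum(returns) / len(returns)
    ss_tot = sum((g - mean_ret) ** 2 for g in returns)
    ss_res = sum((g - v) ** 2 for g, v in zip(returns, values))
    if ss_tot == 0.0:
        return float("nan")  # constant returns: R^2 is undefined
    return 1.0 - ss_res / ss_tot

def dominant_components(component_samples, limit=100.0):
    """Names of reward components whose largest magnitude exceeds the limit."""
    return [name for name, xs in component_samples.items()
            if max(abs(x) for x in xs) > limit]
```

For example, a termination penalty of −200 logged next to a tracking reward of at most 1 would be flagged as dominant and is a candidate for rescaling.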

8. Case studies

Case A — ANYmal blind locomotion over challenging terrain (Lee et al. 2020, Science Robotics)

Hwangbo 2019 (Science Robotics) showed PPO-trained policies on ANYmal could replace ETH’s hand-engineered controller; Lee 2020 extended this to blind traversal of stairs, rubble, slippery slopes with only proprioceptive sensing.

Method. Teacher-student distillation:

Teacher trained in RaiSim with PPO using privileged observations: terrain heightmap around each foot, friction at each contact, contact states, external disturbances. With this info, an MLP policy learns robust gait in $\sim 4 \times 1 0^{8}$ env-steps (~ 4 days on a 20-core CPU; no GPU).
Student — TCN (temporal convolutional network) trained via DAgger to clone teacher’s actions from a 50-step history of proprioception only (joint pos, joint vel, IMU, last action). No terrain info, no contact info, no friction.

The student’s TCN implicitly estimates the privileged variables from history — a context-encoder pattern that became the template for all subsequent legged RL.
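
The DAgger loop behind the student can be illustrated on a toy scalar system. This is a deliberately simplified sketch and not the paper's setup: the student here sees the same scalar state as the teacher (no history, no TCN), the policy is a single linear gain, the rollouts start from the untrained student, and all names and constants are made up for illustration.

```python
import random

def teacher(x):
    """Stand-in expert: proportional controller on the state."""
    return -0.8 * x

def fit_gain(data):
    """Least-squares gain w for a = w * x over (x, a) pairs."""
    num = sum(x * a for x, a in data)
    den = sum(x * x for x, a in data)
    return num / den if den > 0.0 else 0.0

def dagger(n_iters=5, horizon=20, seed=0):
    """Roll out the student, label visited states with the teacher, refit."""
    rng = random.Random(seed)
    data, w = [], 0.0
    for _ in range(n_iters):
        x = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            # The teacher labels the state that the student actually visited...
            data.append((x, teacher(x)))
            # ...but the student's own action drives the rollout forward.
            x = x + w * x + rng.gauss(0.0, 0.01)
        # Refit on the aggregated dataset from all iterations so far.
        w = fit_gain(data)
    return w
```

The key property is that training states come from the student's own rollouts, so the student is corrected on the states its mistakes lead to rather than only on states the teacher would visit.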

Deployment. On real ANYmal-C, the blind student traversed:

20 km of forest trail
14 cm step stairs
25° slopes
Rubble piles
Slippery wet leaves
Snow

Failure modes on real: ice (the friction estimate from history wasn’t enough; later fixed with explicit friction estimation, Bloesch 2018).

Why it matters. This paper established the teacher-student pattern, the privileged-info-in-sim asymmetric approach, and the outdoor robustness benchmark every later legged paper compared against. It also demonstrated that a 5M-parameter neural net could replace a thousand-line C++ MPC + WBC pipeline for legged locomotion.

Case B — Champion-level drone racing (Kaufmann et al. 2023, Nature)

UZH and Intel built Swift, a 1-kg quadrotor that beat three human world-champion FPV pilots on a fixed indoor track. Swift used pure vision (one 640×480 onboard camera) + IMU, no external markers, no motion capture.

Method. Three networks, trained separately:

Gate-pose estimator — supervised CNN from RGB to relative gate corners; trained on $\sim 1 0^{7}$ photorealistic images from Flightmare.
State predictor — Kalman-like RNN fusing IMU + gate detections to 6-DOF pose at 100 Hz.
Control policy — PPO-trained MLP from estimated pose + body rates to motor commands at 100 Hz. Trained in custom physics sim with realistic motor dynamics for $\sim 5 \times 1 0^{7}$ env-steps.

Sim-to-real bridge. Two phases:

Phase 1 — sim training with DR on motor torque (±15 %), drag, mass (±10 %), latency (±5 ms).
Phase 2 — real-world residual fine-tuning: collect 50 real episodes, fit a residual model of (sim → real) discrepancy, retrain the policy with the residual injected into the simulator.
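
The Phase 2 idea (fit a model of the sim-to-real discrepancy, then inject it into the simulator) can be sketched on a scalar system. The linear residual and the toy dynamics below are stand-ins chosen for brevity, not the model class used in the paper, and every name is hypothetical.

```python
def sim_step(x, u):
    """Nominal simulator: a stand-in scalar model."""
    return 0.9 * x + 0.10 * u

def fit_residual(transitions):
    """Least-squares fit of x_real_next - sim_step(x, u) ~ c_x * x + c_u * u."""
    sxx = sxu = suu = sxr = sur = 0.0
    for x, u, x_next_real in transitions:
        r = x_next_real - sim_step(x, u)
        sxx += x * x
        sxu += x * u
        suu += u * u
        sxr += x * r
        sur += u * r
    # Solve the 2x2 normal equations in closed form.
    det = sxx * suu - sxu * sxu
    c_x = (sxr * suu - sur * sxu) / det
    c_u = (sur * sxx - sxr * sxu) / det
    return c_x, c_u

def corrected_sim_step(x, u, coeffs):
    """Simulator with the fitted residual injected."""
    c_x, c_u = coeffs
    return sim_step(x, u) + c_x * x + c_u * u
```

The policy is then retrained against `corrected_sim_step`, so the real robot is only needed for the handful of data-collection episodes.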

Result. On 25 head-to-head races vs human champions (Alex Vanover, Thomas Bitmatta, Marvin Schäpper), Swift won 15. Best human lap: 17.956 s; Swift best lap: 17.465 s.

Why it matters. First demonstration of RL beating humans on a high-dimensional, vision-in-the-loop, real-physical-world task. The (sim → real residual) bridge is now standard for drones; the architecture is roughly the same as ALOHA’s perception stack but with a different action head.

Case C — Mobile ALOHA bimanual mobile manipulation (Fu, Zhao et al. 2024, Stanford)

Stanford’s ALOHA project (Zhao 2023, Fu 2024) is the canonical low-cost-bimanual / imitation-learning success story. Mobile ALOHA puts the original USD 32k ALOHA bimanual arms on a wheeled base, for a total cost of USD 40k vs USD 300k+ for comparable commercial hardware.

Method — ACT (Action Chunking Transformer).

Input: 4 RGB cameras (640×480) + arm joint positions + wheel velocities.
Architecture: ResNet-18 image encoder per camera + transformer encoder + transformer decoder.
Output: 50-step chunk of (14-DoF arm joints + 2-DoF wheel velocities) at 50 Hz; execute first 25, re-plan.
Training: supervised on $\sim 50$ teleoperated demonstrations per task. 1× RTX 3090, ~ 5 h.
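
The "execute first 25, re-plan" pattern in the output step is a receding-horizon loop around the policy. A minimal sketch follows, assuming the policy returns a list of 50 actions per query; `policy`, `get_obs` and `apply_action` are hypothetical callables supplied by the caller.

```python
def run_chunked(policy, get_obs, apply_action, n_steps,
                chunk_len=50, execute_len=25):
    """Query for a chunk, execute its first part, then re-plan from a fresh observation."""
    executed = 0
    n_queries = 0
    while executed < n_steps:
        chunk = policy(get_obs())
        if len(chunk) != chunk_len:
            raise ValueError("policy returned a chunk of unexpected length")
        n_queries += 1
        for action in chunk[:execute_len]:
            if executed >= n_steps:
                break
            apply_action(action)
            executed += 1
    return n_queries
```

Executing only part of each chunk trades inference cost against reactivity: a shorter `execute_len` re-plans more often from fresh observations, while a longer one calls the transformer less frequently.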

Tasks demonstrated. Cooking shrimp; storing a 3-prong pot; using an elevator; rinsing a wine glass; pushing chairs; entering a room; bimanual wiping.

Why it matters. The success-with-50-demos result reframed the field: for many household tasks, imitation learning with a powerful enough policy class (transformer + action chunking) beats RL. The dataset and code are open-source (https://mobile-aloha.github.io/) and have spawned a generation of USD 30k–50k bimanual research robots.

RL connection. Mobile ALOHA itself is imitation, but the Physical Intelligence $π_{0}$ model (Black 2024) extends the same architecture with RL fine-tuning + flow-matching action prediction, generalising across 8+ robot embodiments from the Open-X-Embodiment dataset.

Case D — RoboPianist dexterous bimanual piano (Zakka et al. 2023, Google DeepMind + Berkeley)

A pair of simulated 23-DoF Shadow Hands playing piano via SAC. Not a real-robot deployment but a milestone in high-dimensional contact-rich RL: 46 actuated DoF, simultaneous coordination, fingertip-key contacts switching at ~ 10 Hz.

Method.

Simulator: MuJoCo MJX (JAX-accelerated MuJoCo, 2023).
Algorithm: SAC with state-only observation (joint pos, vel, fingertip pos, MIDI note schedule lookahead).
Reward decomposition: note-hit accuracy + key-press timing + energy penalty + finger-collision penalty.
Curriculum: training songs ranked by difficulty (Czerny exercises → Chopin études); progress via success threshold.
Training: $1 0^{8}$ env-steps per piece, 2 days on a single A100.

Result. Plays 150+ pieces from the MIDI dataset at human-comparable accuracy. Open-sourced (https://kzakka.com/robopianist/).

Why it matters. Shows that SAC scales to ~ 50-dim action spaces with rich contact when given a good simulator and a decomposed reward. The same recipe (SAC + MJX + decomposed reward + curriculum) is now a template for in-hand reorientation (Chen 2022, a successor to OpenAI’s Rubik’s cube work).

9. Cross-references

Robotics (this library):

[[Robotics/pid-control]] — the workhorse alternative for tasks where dynamics are known and contact-poor.
[[Robotics/state-space-lqr]] — model-based optimal control; the “control-theory” answer for linear plants.
[[Robotics/legged-robotics]] — RL is now the default locomotion controller; this note is the algorithmic side.
planned [[Robotics/impedance-control]] — Cartesian compliance; pairs well with RL for contact-rich tasks.
planned [[Robotics/trajectory-generation]] — provides $r (t)$ as RL observation or as a residual baseline.
[[Robotics/dynamics-rigid-body]] — basis of the simulator physics RL trains in.
planned [[Robotics/multirotor-design]] — drone-racing RL platform (Swift, Champion).
planned [[Robotics/manipulator-design]] — Franka / UR / ALOHA hardware for arm RL.
planned [[Robotics/computer-vision-robotics]] — perception inputs to vision-based RL.
planned [[Robotics/bayesian-estimation]] — state estimation that feeds the policy observation.

Engineering (foundations):

planned [[Robotics/rl-for-control]] — MDP foundations, policy-gradient derivations, exploration theory.
planned [[Robotics/computer-vision-robotics]] — neural-net machinery (MLP, ResNet, transformer).
planned [[Engineering/mpc-control]] — the model-based hybrid partner; MPC + RL residual is the 2025-26 frontier.
[[Engineering/classical-control]] — what RL replaces when models exist; what it wraps when they don’t.
planned [[Engineering/realtime-embedded]] — deploying a policy on a 1 ms control loop.

Languages:

planned [[Languages/Tier3/robotics-control]] — ros2_control + control_toolbox integration with learned policies.
planned [[Languages/Tier3/genai-llm-runtime]] — VLA models (RT-2, Pi-0, OpenVLA) deployment.

10. Citations

Foundational books

Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free PDF at http://incompleteideas.net/book/the-book-2nd.html. The canonical reference.
Bertsekas, D. P. (2019). Reinforcement Learning and Optimal Control. Athena Scientific.
Kober, J., Bagnell, J. A. & Peters, J. (2013). “Reinforcement learning in robotics: A survey.” Int. J. Robot. Res. 32(11), 1238–1274. The foundational survey paper for robotics-specific RL.
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. For the neural-net backbone.

Foundational papers — algorithms

Williams, R. J. (1992). “Simple statistical gradient-following algorithms for connectionist reinforcement learning.” Machine Learning 8(3-4), 229–256. REINFORCE.
Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. (2000). “Policy Gradient Methods for Reinforcement Learning with Function Approximation.” NIPS. The policy-gradient theorem.
Mnih, V. et al. (2015). “Human-level control through deep reinforcement learning.” Nature 518, 529–533. DQN.
Mnih, V. et al. (2016). “Asynchronous Methods for Deep Reinforcement Learning.” ICML. A3C / A2C.
Schulman, J., Levine, S., Moritz, P., Jordan, M. I. & Abbeel, P. (2015). “Trust Region Policy Optimization.” ICML. TRPO.
Schulman, J., Moritz, P., Levine, S., Jordan, M. I. & Abbeel, P. (2016). “High-Dimensional Continuous Control Using Generalized Advantage Estimation.” ICLR. GAE.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). “Proximal Policy Optimization Algorithms.” arXiv:1707.06347. PPO — the workhorse for robotics.
Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. (2018). “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” ICML. SAC.
Fujimoto, S., van Hoof, H. & Meger, D. (2018). “Addressing Function Approximation Error in Actor-Critic Methods.” ICML. TD3.
Chua, K., Calandra, R., McAllister, R. & Levine, S. (2018). “Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models.” NeurIPS. PETS.
Schrittwieser, J. et al. (2020). “Mastering Atari, Go, chess and shogi by planning with a learned model.” Nature 588, 604–609. MuZero.
Hafner, D., Pasukonis, J., Ba, J. & Lillicrap, T. (2023). “Mastering Diverse Domains through World Models.” arXiv:2301.04104. DreamerV3.

Foundational papers — imitation and hybrid

Ross, S., Gordon, G. & Bagnell, D. (2011). “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” AISTATS. DAgger.
Ho, J. & Ermon, S. (2016). “Generative Adversarial Imitation Learning.” NeurIPS. GAIL.
Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B. & Song, S. (2023). “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS. Diffusion policies.
Zhao, T. Z., Kumar, V., Levine, S. & Finn, C. (2023). “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.” RSS. ACT, ALOHA.
Fu, Z., Zhao, T. Z. & Finn, C. (2024). “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation.” arXiv:2401.02117.

Foundational papers — robotics RL deployments

Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V. & Hutter, M. (2019). “Learning agile and dynamic motor skills for legged robots.” Science Robotics 4(26). The first RL-controlled ANYmal.
Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V. & Hutter, M. (2020). “Learning quadrupedal locomotion over challenging terrain.” Science Robotics 5(47). ANYmal blind stairs.
Kaufmann, E., Bauersfeld, L., Loquercio, A., Müller, M., Koltun, V. & Scaramuzza, D. (2023). “Champion-level drone racing using deep reinforcement learning.” Nature 620, 982–987. Swift.
Siekmann, J., Godse, Y., Fern, A. & Hurst, J. (2021). “Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition.” ICRA. Cassie outdoor running.
Margolis, G. B. & Agrawal, P. (2023). “Walk These Ways: Tuning Robot Control for Generalization with Multiplicity of Behavior.” CoRL 2022 (proceedings 2023).
Kumar, A., Fu, Z., Pathak, D. & Malik, J. (2021). “RMA: Rapid Motor Adaptation for Legged Robots.” RSS. Adaptation module.
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W. & Abbeel, P. (2017). “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World.” IROS.
Pinto, L., Andrychowicz, M., Welinder, P., Zaremba, W. & Abbeel, P. (2018). “Asymmetric Actor Critic for Image-Based Robot Learning.” RSS.

VLA and foundation models

Brohan, A. et al. (2023). “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv:2307.15818. Google DeepMind.
Black, K. et al. (2024). “$π_{0}$: A Vision-Language-Action Flow Model for General Robot Control.” arXiv:2410.24164. Physical Intelligence.
Octo Model Team (2024). “Octo: An Open-Source Generalist Robot Policy.” arXiv:2405.12213. Berkeley.
Open X-Embodiment Collaboration (2023). “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” arXiv:2310.08864.

Software and documentation

Stable Baselines 3 documentation — https://stable-baselines3.readthedocs.io/.
Isaac Lab documentation — https://isaac-sim.github.io/IsaacLab/.
MuJoCo documentation — https://mujoco.org/.
Genesis (Genesis-Embodied-AI 2024) — https://github.com/Genesis-Embodied-AI/Genesis.
leggedrobotics/legged_gym (ETH) — https://github.com/leggedrobotics/legged_gym. The canonical PPO + Isaac Gym pipeline.
CleanRL — https://github.com/vwxyzjn/cleanrl. Single-file reference implementations.
rl_games — https://github.com/Denys88/rl_games. GPU-vectorised PPO library; one of the training backends supported by Isaac Lab.
Crocoddyl (LAAS-CNRS) — DDP solver with RL hooks.
MuJoCo MPC / Playground (DeepMind 2024) — https://github.com/google-deepmind/mujoco_mpc.

Standards

ISO 10218-1:2011. Robots and robotic devices — Safety requirements — Part 1: Industrial robots. Constrains where a learned policy may be deployed.
ISO/TS 15066:2016. Collaborative robots — Safety requirements. Force / power limits in collaborative operation — design constraint for any RL-controlled cobot.
