Reinforcement Learning for Robotics Control — PPO, SAC, Sim-to-Real
Scope. Classical model-based control ([[Robotics/pid-control]], [[Robotics/state-space-lqr]], MPC) designs a controller from a known plant model. RL learns the controller from interaction data with a plant (real or simulated) by maximising a scalar reward. This note is the robotics-applied counterpart: which algorithm to pick, how to design rewards that survive contact with hardware, how to bridge the sim-to-real gap, and case studies of RL policies actually deployed on Cassie, ANYmal, champion-level drone racing, and ALOHA bimanual manipulators. RL theory (MDP foundations, policy-gradient derivations, exploration theory) lives in [[Robotics/rl-for-control]] (planned). Foundational deep-learning machinery lives in [[Robotics/computer-vision-robotics]] (planned).
1. At a glance
RL for control means: learn a policy — typically a small neural network — that maximises expected cumulative reward by trial-and-error in a (usually simulated) plant. The policy is then deployed (zero-shot or after fine-tuning) on the real robot. As of 2024-2026 RL is in production on:
- Legged locomotion. Cassie outdoor running (Siekmann 2021, OSU), ANYmal-C blind stairs (Lee 2020), Spot research, Unitree Go2 default firmware ≥ 1.0.20, MIT Mini Cheetah, every recent quadruped paper. See [[Robotics/legged-robotics]] §3.1 for the canonical PPO recipe.
- Drone racing. Swift (Kaufmann 2023 Nature) beat three human champions on a fixed track using vision-based PPO at 100 Hz.
- Bimanual manipulation. Stanford ALOHA / Mobile ALOHA (Zhao 2023, Fu 2024) — imitation-trained ACT (Action Chunking Transformer) on $32 k bimanual robot.
- Generalist VLAs. RT-2 (Brohan 2023, Google), Pi-0 (Physical Intelligence 2024), Octo (Berkeley 2024), Gemini Robotics (2025): vision-language-action models, mostly imitation-trained on human demonstrations with optional RL or DPO fine-tuning (see §5.6).
- Industrial pick-and-place. Covariant Brain, Berkshire Grey, Mujin — proprietary RL-fine-tuned grasp policies trained in sim.
Why RL instead of model-based control. Three reasons stack:
- Contact-rich tasks. Robust gait on rubble, deformable-object manipulation, gripper closing on irregular geometry — situations where rigid-body MPC with friction cones underestimates the dynamics. RL absorbs the mismatch into the policy via domain randomisation.
- Vision-in-the-loop. Mapping raw RGB or depth to action through a single end-to-end neural net is more direct than the segment → object-pose → IK → trajectory pipeline. RL+CNN is one differentiable graph.
- Reward as spec. “Walk 1 m/s, don’t fall, minimise joint accelerations” is easier to write than to derive a Hamilton-Jacobi-Bellman solution. The reward is the contract; the policy is the implementation.
Why not RL. Equally important:
- Sample efficiency. Real-robot RL is data-starved: a manipulator collects 1 episode/s; a quadruped 10 episodes/s in sim, 0.1 episodes/s on real hardware. Anything that solves with a known model (flat-ground mobile-base path tracking — PID + MPC) should use the model.
- Reward hacking. The policy will find every exploit. A walking-reward bug → robot scoots on its belly. A clearance-reward bug → drone flips upside-down between gates because the reward only checks the ring centre.
- Sim-to-real gap. The dominant failure mode. A policy that wins in sim falls over on real ground because friction, motor latency, sensor noise, and unmodelled compliance differ from the simulator.
- Brittleness. Out-of-distribution states (unseen terrain, unseen object) → undefined behaviour. Model-based controllers degrade gracefully; learned policies can be unstable.
Where it sits in the design stack. Sensors → state representation (raw obs or learned feature) → policy network → low-level joint PD or torque → plant. The policy typically outputs target joint positions (not torques) at 50 Hz; an underlying stiff PD closes the inner loop at 1 kHz. See [[Robotics/legged-robotics]] §3.1: nearly every successfully deployed RL locomotion controller uses this two-layer pattern rather than commanding torques end to end (direct-torque policies, §4.3, are the exception).
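The two-rate structure described above can be sketched on a single joint. This is a minimal illustration, not any robot's firmware: the gains `kp`, `kd` and the joint `inertia` are hypothetical values chosen only so the toy loop is well damped.

```python
def pd_torque(q_target, q, qd, kp, kd):
    """Joint-space PD law: tau = kp * (q_target - q) - kd * qd."""
    return kp * (q_target - q) - kd * qd


def run_policy_step(q, qd, q_target, kp=40.0, kd=1.0, inertia=0.05,
                    inner_hz=1000, policy_hz=50):
    """Hold one policy action while the inner PD loop ticks on a 1-DoF joint.

    kp, kd and inertia are illustrative assumptions, not robot data.
    """
    dt = 1.0 / inner_hz
    for _ in range(inner_hz // policy_hz):   # 20 inner ticks per policy step
        tau = pd_torque(q_target, q, qd, kp, kd)
        qd += (tau / inertia) * dt           # semi-implicit Euler integration
        q += qd * dt
    return q, qd


# One second of control: 50 policy steps, each holding the same position target.
q, qd = 0.0, 0.0
for _ in range(50):
    q, qd = run_policy_step(q, qd, q_target=0.5)
```

In a real stack the policy would emit a fresh `q_target` on every outer iteration; the inner loop is what keeps the joint stable between policy updates.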
Hybrid model-based + RL is the 2024-2026 frontier. Pure model-based MPC ([[Robotics/legged-robotics]] Di Carlo 2018) is interpretable and stable; pure RL is robust to model error but opaque. The hybrid pattern (model-based MPC for the planning skeleton + an RL residual policy for the contact and disturbance terms) aims to combine the strengths of both. ANYmal-C’s “robust” mode is exactly this; so is Boston Dynamics’ “park controller” gait.
First ask before applying RL.
- Can a calibrated dynamics model + MPC solve this? If yes → use it: cheaper, debuggable, no sim-to-real gap.
- If not, do I have a fast simulator (Isaac Lab, MuJoCo, Genesis)? If no → consider imitation learning from demonstrations instead.
- Can I write a reward whose maximum corresponds to my actual goal? If you can’t, neither can the agent.
- Do I have a real-robot validation budget? Plan for 2–10× sim hours of real-time evaluation before a policy is shippable.
2. First principles
2.1 Markov Decision Process
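The problem is formalised as a Markov Decision Process $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$:
- $\mathcal{S}$: state space (continuous, $\mathcal{S} \subseteq \mathbb{R}^n$; for robotics: joint pos/vel, base orientation, contact sensors).
- $\mathcal{A}$: action space (continuous, $\mathcal{A} \subseteq \mathbb{R}^m$: target joint positions, torques, or motor PWM).
- $P(s' \mid s, a)$: stochastic transition kernel; the simulator (or real robot).
- $r(s, a)$: scalar reward, the only signal the agent gets.
- $\gamma \in [0, 1)$: discount factor. The effective horizon is $1/(1-\gamma)$ steps, so $\gamma = 0.99$ at 50 Hz gives ~ 2 s and $\gamma = 0.995$ gives ~ 4 s. Robot RL typically uses $\gamma \approx 0.99$ (the value in the worked examples of §3).
Markov property: $s_{t+1}$ depends only on $(s_t, a_t)$, not on history. Robots violate this (motor temperature, unobservable terrain) → either expand the state to include hidden variables or use a recurrent policy.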
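The horizon figures quoted for the discount factor (about 2 s and 4 s at 50 Hz) follow from the usual rule of thumb that a discount $\gamma$ looks roughly $1/(1-\gamma)$ steps ahead. A small sketch, with function names that are purely illustrative:

```python
def effective_horizon_s(gamma, policy_hz):
    """Approximate look-ahead in seconds: 1 / (1 - gamma) steps at policy_hz."""
    return 1.0 / (1.0 - gamma) / policy_hz


def gamma_for_horizon(horizon_s, policy_hz):
    """Inverse mapping: the discount whose effective horizon is horizon_s."""
    return 1.0 - 1.0 / (horizon_s * policy_hz)


horizon_a = effective_horizon_s(0.99, 50)    # about 2 s
horizon_b = effective_horizon_s(0.995, 50)   # about 4 s
```

The inverse form is handy when the task dictates the horizon (for example, a gait cycle or a reach duration) and the discount has to be chosen to match.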
2.2 Value functions and the Bellman equation
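The state-value function $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$ is expected return from $s$ under policy $\pi$.
The action-value (Q) function $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$.
The Bellman expectation equation for $V^\pi$:
$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ V^\pi(s') \right] \right]$$
The Bellman optimality equation for $Q^*$ (Bellman 1957):
$$Q^*(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ \max_{a'} Q^*(s', a') \right]$$
Q-learning chases the latter by sampling; policy-gradient methods estimate $\nabla_\theta J(\theta)$ directly.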
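To make the Bellman optimality backup concrete, here is a sketch of exact Q-value iteration on a two-state toy MDP. The MDP (`P_TOY`, `R_TOY`) is invented for illustration; real robot problems are continuous and need function approximation, so this only shows the fixed-point structure.

```python
def q_value_iteration(P, R, gamma, iters=500):
    """Repeated Bellman optimality backups on a tabular MDP.

    P[s][a] is a list of (prob, next_state); R[s][a] is the reward.
    """
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        Q = [[R[s][a] + gamma * sum(p * max(Q[s2]) for p, s2 in P[s][a])
              for a in range(n_a)]
             for s in range(n_s)]
    return Q


# Toy MDP (hypothetical): action 0 = stay, action 1 = switch state.
# Staying in state 1 pays reward 1; everything else pays 0.
P_TOY = [[[(1.0, 0)], [(1.0, 1)]],
         [[(1.0, 1)], [(1.0, 0)]]]
R_TOY = [[0.0, 0.0],
         [1.0, 0.0]]

Q_TOY = q_value_iteration(P_TOY, R_TOY, gamma=0.9)
```

With $\gamma = 0.9$ the fixed point is $Q^*(1, \text{stay}) = 1/(1-0.9) = 10$ and $Q^*(0, \text{switch}) = 0.9 \times 10 = 9$, which the iteration reaches to numerical precision.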
2.3 Policy gradient theorem (Sutton 2000)
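For a parametric policy $\pi_\theta(a \mid s)$ the gradient of expected return $J(\theta)$ decomposes as
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A^\pi(s, a) \right]$$
where $d^\pi$ is the on-policy state distribution and $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ is the advantage. Subtracting $V^\pi(s)$ is a control variate: it does not bias the estimator, but reduces variance, often by orders of magnitude. The estimator in practice uses Generalised Advantage Estimation (Schulman 2016):
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
$\lambda \in [0, 1]$ trades bias for variance. PPO and A2C both consume GAE.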
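Generalised Advantage Estimation is an exponentially weighted sum of one-step TD residuals, and is usually computed with a single backward pass over a rollout. The sketch below assumes one uninterrupted trajectory segment (no episode boundary inside it); production code also masks the bootstrap at terminations.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Backward-recursive GAE over one rollout segment.

    rewards[t], values[t] are r_t and V(s_t); last_value bootstraps V(s_T).
    Assumes no episode termination inside the segment.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running               # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting `lam=0` reduces this to the one-step TD advantage (low variance, more bias); `lam=1` gives the Monte Carlo return minus the baseline (unbiased given the bootstrap, high variance).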
2.4 On-policy vs off-policy
On-policy algorithms (REINFORCE, A2C/A3C, PPO, TRPO) use only trajectories generated by the current . Each gradient step requires fresh rollouts → bad sample efficiency, but stable training. Default choice for sim-heavy robotics where rollouts are cheap.
Off-policy algorithms (DQN, DDPG, TD3, SAC) reuse old transitions stored in a replay buffer. Better sample efficiency (5–50× fewer environment steps to convergence) but trickier to stabilise. Default choice for real-robot RL where every interaction costs wall-clock.
2.5 PPO clipped objective (Schulman 2017)
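The workhorse for robotics. Define the policy ratio $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$. The clipped surrogate objective (maximised; its negative is the loss) is
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( \rho_t(\theta)\, \hat{A}_t,\; \operatorname{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t \right) \right]$$
with $\epsilon = 0.2$ default. The clip prevents single updates from moving the policy too far from $\pi_{\theta_{\text{old}}}$, achieving (cheaply) what TRPO does with a KL trust region. The full PPO objective combines this with a value-function MSE and an entropy bonus:
$$L(\theta) = \mathbb{E}_t\left[ L^{\text{CLIP}}_t(\theta) - c_1 \left( V_\theta(s_t) - V_t^{\text{targ}} \right)^2 + c_2\, \mathcal{H}\left[\pi_\theta(\cdot \mid s_t)\right] \right]$$
Typical values: $c_1$ between 0.5 and 1.0, $c_2 \approx 0.01$ (Example A uses $c_1 = 1.0$, $c_2 = 0.01$). The entropy bonus prevents premature collapse to a deterministic policy.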
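The clipping behaviour is easiest to see on individual samples. The sketch below evaluates only the clipped policy term (not the value or entropy terms) from per-sample log-probabilities and advantages; it is a scalar illustration, whereas real implementations operate on batched tensors with autograd.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                          # pi_new / pi_old
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)           # pessimistic bound
    return total / len(advantages)
```

With a positive advantage, a ratio of 1.5 is capped at 1.2, so there is no incentive to push the probability further. With a negative advantage and the same ratio, the unclipped term (the more pessimistic one) is kept, so the policy is still penalised for having raised the probability of a bad action.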
2.6 SAC maximum-entropy objective (Haarnoja 2018)
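Soft Actor-Critic maximises return + a policy-entropy bonus:
$$J(\pi) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \alpha\, \mathcal{H}\left[\pi(\cdot \mid s_t)\right] \right) \right]$$
$\alpha$ is the temperature, learned adaptively to track a target entropy $\bar{\mathcal{H}} = -\dim(\mathcal{A})$ (minus one nat per action dimension, a useful default). The maximum-entropy formulation acts as automatic exploration and gives SAC excellent sample efficiency on manipulation tasks where exploration matters more than stability.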
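The adaptive temperature can be sketched as one gradient step on $J(\alpha) = \mathbb{E}[-\alpha(\log \pi(a \mid s) + \bar{\mathcal{H}})]$, parameterised through $\log \alpha$ so that $\alpha$ stays positive. This is a hand-derived scalar sketch under that assumption; library implementations differ in detail (some differentiate a loss that is linear in $\log \alpha$ instead).

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on J(alpha) w.r.t. log_alpha.

    log_probs are log pi(a|s) for freshly sampled actions; target_entropy
    is typically -dim(A), e.g. -8 for an 8-dimensional action.
    """
    alpha = math.exp(log_alpha)
    mean_logp = sum(log_probs) / len(log_probs)
    grad = -alpha * (mean_logp + target_entropy)   # dJ / d(log_alpha)
    return log_alpha - lr * grad
```

When the policy's entropy (the negative mean log-probability) falls below the target, the step raises $\alpha$ and strengthens the entropy bonus; when entropy is above the target, $\alpha$ shrinks.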
2.7 Model-based RL
Instead of model-free trial-and-error, learn the transition model $\hat{P}(s' \mid s, a)$ from data and plan in the learned model. Three families:
- PETS (Chua 2018) — probabilistic ensembles of MLPs; cross-entropy method for planning. State-of-the-art at 100k–1M sample budget.
- MuZero (Schrittwieser 2020) — learned latent dynamics + MCTS; mastered Atari + Go + chess + shogi with one architecture.
- DreamerV3 (Hafner 2023) — world models in latent space from pixels; one set of hyperparameters across 150+ tasks, no env-specific tuning.
Model-based achieves 10–100× sample efficiency vs model-free at the cost of architectural complexity and bias from model error.
2.8 Imitation learning alternative
When demonstrations are cheap and rewards are hard to design, imitate:
- Behaviour cloning (BC): supervised regression on expert $(s, a)$ pairs. Brittle to compounding error (Ross 2011).
- DAgger (Ross 2011) — iteratively query the expert on policy rollouts. Fixes compounding error if expert is available online.
- GAIL (Ho 2016) — adversarial: a discriminator distinguishes expert from policy trajectories; policy fools it. No reward needed.
- Diffusion Policy (Chi 2023) — predict action sequence via denoising diffusion. SOTA on visuomotor manipulation; handles multi-modal demonstrations.
- ACT, the Action Chunking Transformer (Zhao 2023): predicts $k$-step action chunks via a transformer. Used in ALOHA / Mobile ALOHA; 50 demos suffice for many tasks.
2.9 Offline RL
When online interaction is too expensive or risky (real robots, surgical and other safety-critical systems), learn from a fixed dataset $\mathcal{D}$. The naive approach (fit Q on $\mathcal{D}$, then maximise) fails because of distributional shift: the learned Q over-estimates value for out-of-distribution actions. Two dominant fixes:
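- Conservative Q-Learning (CQL, Kumar 2020): penalise Q for out-of-dataset actions:
$$\min_Q\; \alpha\, \mathbb{E}_{s \sim \mathcal{D}}\left[ \log \sum_a \exp Q(s, a) - \mathbb{E}_{a \sim \mathcal{D}}\left[ Q(s, a) \right] \right] + \tfrac{1}{2}\, \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[ \left( Q(s, a) - \left( r + \gamma\, \mathbb{E}_{a' \sim \pi}\left[ \bar{Q}(s', a') \right] \right) \right)^2 \right]$$
  Here $\alpha$ is the CQL penalty weight (not the SAC temperature) and $\bar{Q}$ is the target network. The first bracket is a log-sum-exp soft maximum of Q over all actions minus Q on dataset actions; minimising it pushes Q down on out-of-distribution actions.
- Implicit Q-Learning (IQL, Kostrikov 2021): never evaluates Q on out-of-distribution actions. Fits $V(s)$ via expectile regression (an upper expectile of $Q(s, a)$ over dataset actions) and uses $V(s')$ as the bootstrap target. Policy extraction via advantage-weighted regression. State-of-the-art on D4RL at publication.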
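The expectile regression at the heart of IQL is an asymmetric squared loss: errors where the target exceeds the estimate are weighted by $\tau$, the rest by $1-\tau$. The sketch below shows the loss and, as a sanity check, fits a single scalar to samples by gradient descent. The value `tau=0.7` and the helper name `fit_expectile` are illustrative assumptions.

```python
def expectile_loss(q_values, v_values, tau=0.7):
    """Mean of |tau - 1[u < 0]| * u^2 with u = q - v (asymmetric L2)."""
    total = 0.0
    for q, v in zip(q_values, v_values):
        u = q - v
        weight = tau if u > 0 else 1.0 - tau
        total += weight * u * u
    return total / len(q_values)


def fit_expectile(samples, tau=0.7, lr=0.1, steps=2000):
    """Gradient descent on the expectile loss for one scalar estimate v."""
    v = 0.0
    for _ in range(steps):
        grad = sum((tau if q > v else 1.0 - tau) * (v - q)
                   for q in samples) / len(samples)
        v -= lr * 2.0 * grad
    return v
```

For samples `[0.0, 1.0]`, `tau=0.5` recovers the mean (0.5) while `tau=0.9` settles at 0.9: the higher the expectile, the closer the fit sits to the best in-dataset value, which is how IQL approximates a maximum without querying unseen actions.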
For real-robot RL, the common pattern is: collect teleoperated episodes offline → train IQL or BC → deploy as warm start → fine-tune online with SAC.
2.10 Sim-to-real gap
The single largest engineering problem in robotics RL. The simulator is a wrong model: friction, motor latency, sensor noise, link compliance, contact impulses, sensor calibration — all differ from reality. Four families of mitigation:
- Domain randomisation (Tobin 2017): randomise simulator parameters (friction, mass, motor strength, latency) each episode so the policy learns a single $\pi$ that works across the distribution. The current default.
- Asymmetric actor-critic (Pinto 2018) — critic sees privileged sim info (full state), actor sees only deployable sensors. Speeds up training without leaking privileged info into deployment.
- Teacher–student distillation (Lee 2020) — train teacher with privileged terrain/state info in sim, distill to a student that sees only proprioceptive + exteroceptive sensors. ANYmal blind-stairs recipe.
- Real-data fine-tuning — collect a small dataset on real hardware, fine-tune with offline RL (CQL, IQL) or behaviour cloning. RMA (Kumar 2021) trains an “adaptation module” online from a short history.
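Of the four mitigations above, domain randomisation is the simplest to sketch: draw a fresh set of physical parameters at every episode reset. The parameter names and ranges in `DR_RANGES` are hypothetical placeholders, not values from any cited paper, and applying them to a simulator is left out because that step is simulator-specific.

```python
import random

# Hypothetical ranges for illustration only; tune per robot and simulator.
DR_RANGES = {
    "friction": (0.4, 1.25),
    "payload_kg": (-1.0, 3.0),
    "motor_strength_scale": (0.9, 1.1),
    "obs_latency_ms": (0.0, 20.0),
}


def sample_episode_params(rng, ranges=DR_RANGES):
    """Draw one uniform sample per parameter; call once per episode reset."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


rng = random.Random(0)              # seeded so runs are reproducible
params = sample_episode_params(rng)
```

Passing an explicit `random.Random` instance keeps the randomisation reproducible and independent of any other random state in the training process.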
3. Practical math & worked examples
Example A — PPO loss for a quadruped (numbers)
Quadruped trained with PPO at 50 Hz policy rate, 1 kHz inner PD. Observation $o_t \in \mathbb{R}^{45}$: 12 joint pos + 12 joint vel + 3 base ang-vel + 3 gravity-in-body + 3 commanded velocity + 12 last-action.
Action $a_t \in \mathbb{R}^{12}$: target joint positions, fed to the inner PD with gains $K_p$ (N·m/rad) and $K_d$ (N·m·s/rad).
Policy network: 3-layer MLP, hidden sizes 512 → 256 → 128, ELU activation. The output head produces the action mean $\mu_\theta(o_t)$; the action is sampled from $\mathcal{N}(\mu_\theta(o_t), \sigma^2)$ during training, and $\mu_\theta(o_t)$ is used at deployment.
PPO hyperparameters (Margolis 2024 / legged_gym defaults):
clip_ε = 0.2
γ = 0.99
λ (GAE) = 0.95
PPO epochs = 5
mini-batches = 4
learning rate = 1e-3 (with adaptive decay on KL)
entropy coef = 0.01
value coef = 1.0
max grad norm = 1.0
parallel envs = 4096 (Isaac Gym, RTX 4090)
steps / env = 24 → batch of 98 304 transitions per PPO update

Wall-clock training budget: 12 000 PPO iterations × ~ 1.5 s/iter ≈ 5 hours on a single RTX 4090 to reach robust trot on flat ground; 12 hours to add stairs and slopes. Total environment steps: 12 000 × 98 304 ≈ $1.2 \times 10^9$. This is 4 orders of magnitude more interaction than the real robot could ever provide, which explains why every legged RL pipeline trains in sim.
Compute breakdown. GPU peak: ~ 280 W. Simulator step (Isaac Gym physics + PD): 0.6 ms per batched step at 4096 envs → ≈ 1 700 steps/s per env → ≈ 6.8 M env-steps/s aggregate. Policy forward + backward: 0.4 ms / minibatch. PPO update: ~ 80 ms wall-clock per iteration (20 grad steps).
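The batch size and training budget in Example A follow from a few multiplications, spelled out below using only the figures quoted in the example. The last line is an added back-of-envelope conversion (an assumption of this sketch): how long one physical robot would need to run at the 50 Hz policy rate to generate the same number of steps.

```python
num_envs, steps_per_env = 4096, 24
iterations, sec_per_iter = 12_000, 1.5
ppo_epochs, mini_batches = 5, 4
policy_hz = 50

batch = num_envs * steps_per_env                    # transitions per PPO update
total_steps = batch * iterations                    # environment steps in the run
train_hours = iterations * sec_per_iter / 3600.0    # wall-clock budget
grad_steps_per_iter = ppo_epochs * mini_batches     # gradient steps per update

# Equivalent non-stop operating time for a single real robot at 50 Hz.
robot_years = total_steps / policy_hz / (3600 * 24 * 365)

print(batch, total_steps, train_hours, grad_steps_per_iter, round(robot_years, 2))
```

Roughly nine months of uninterrupted single-robot operation, compressed into about five hours of GPU time, is the practical argument for training legged policies in simulation.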
Example B — Reward shaping for a manipulator reach task
Task: 7-DoF arm (Franka FR3) reach end-effector to target position $p^* \in \mathbb{R}^3$.
Sparse reward: With this, PPO requires env-steps to first success.
Shaped reward decomposed into three weighted terms:
With , , , . With shaped reward PPO converges in env-steps — 17× faster than sparse.
The danger: the shaping creates a different optimum from the sparse reward. Over-weighting the penalty terms produces a stationary policy (“don’t move, get 0 instead of negative”). Standard remedy: anneal the penalty weights from 0 over the early part of training so the agent first learns the task, then learns to do it gently.
Example C — SAC for a Franka pick-and-place (off-policy real-robot setup)
Task: 7-DoF Franka FR3 picks a cube from a m workspace and places it on a target pad. Real-robot training (no sim).
Observation $o_t \in \mathbb{R}^{29}$: 7 joint pos + 7 joint vel + 3 ee pos + 4 ee quat + 3 cube pos + 4 cube quat + 1 gripper width.
Action $a_t \in \mathbb{R}^{8}$: 7 joint-velocity targets + 1 gripper command, all in $[-1, 1]$ after tanh squash.
Hyperparameters (SB3 SAC defaults plus Franka-specific):
γ = 0.99
τ (target update) = 0.005
batch size = 256
replay buffer = 1 000 000 transitions
learning rate = 3e-4 (actor, critic, alpha)
target entropy = -dim(A) = -8
gradient steps = 1 per env step
warmup steps = 10 000 (random policy)
policy network = 2 × 256 MLP, ReLU
Q-network = 2 × 256 MLP, ReLU (twin)

At a real-robot rate of ~ 1 episode/30 s (5 s reset + 25 s rollout), env-steps takes ~ 14 hours. Convergence to 80 % pick success: steps. Off-policy data reuse means each transition contributes to ~ 50 gradient updates over the run, which is the source of SAC’s sample efficiency advantage over PPO.
Safety wrappers (mandatory): Cartesian-velocity clip at 0.5 m/s; joint-torque clip at 80 % rated; force-sensor termination if the measured force exceeds a set threshold during free motion; auto-recovery routine that moves the gripper to a safe home pose on termination.
Example D — Domain-randomisation budget for an ANYmal-class quadruped
Target: zero-shot transfer from Isaac Gym to real ANYmal. The DR distribution (Lee 2020 / Margolis 2024):
| Parameter | Sim default | Randomisation |
|---|---|---|
| Ground friction | 1.0 | |
| Restitution | 0.0 | |
| Payload mass added to base | 0 kg | kg |
| Centre-of-mass offset | 0 | cm × 3 axes |
| Motor $K_p$ scale | 1.0 | |
| Motor $K_d$ scale | 1.0 | |
| Motor torque saturation | 33.5 N·m | |
| Observation latency | 0 ms | ms |
| IMU gyro noise | 0 | rad/s |
| IMU accel noise | 0 | m/s² |
| Push velocity (every 8 s) | — | m/s × 3 axes |
Without DR: sim2real success rate < 20 %, robot falls within 10 s. With DR as above: success on flat terrain, on stairs without an adaptation module, with RMA-style online adaptation (Kumar 2021).
4. Design heuristics
4.1 Algorithm selection
| Situation | Pick | Rationale |
|---|---|---|
| Continuous action, sim available, lots of rollouts | PPO | Stable, scales to 1000s of parallel envs, the field default |
| Continuous action, sample-efficient needed (real robot) | SAC or TD3 | Off-policy + entropy = best sample efficiency on manipulation |
| Discrete action (board games, Atari, gait selection) | Rainbow DQN | Discrete value-based with all the bells |
| Expert demonstrations available | ACT, Diffusion Policy | Imitation beats RL when demos exist |
| Few hundred episodes only (real robot) | Offline RL: IQL, CQL | Avoids querying the env; learns from fixed data |
| Pixel input, hard exploration | DreamerV3 | World model + latent rollouts; SOTA on hard exploration |
| Trajectory optimisation under known model | iLQR / DDP | Not RL; the right tool when the model is good |
| Hybrid model-based + learned residual | MPC + RL residual | Best of both, the 2025-26 frontier |
4.2 Reward engineering — the hardest part
Reward design dominates RL success. Five rules from the legged-robot literature (Margolis 2024, Lee 2020, Hwangbo 2019):
- Decompose into task / safety / smoothness / regularisation. Each term has its own weight; tune in isolation, then together.
- Use squared L2 for tracking, hinge / ReLU for limits, exponential for soft success. An exponential kernel such as $\exp(-\lVert e \rVert^2 / \sigma)$ is bounded and saturates near the goal, avoiding unbounded gradients.
- Curriculum the weights. Anneal smoothness penalties from 0 over steps; otherwise the agent prefers stillness to risk.
- Penalise jerk, not just acceleration. enforces smooth action sequences and is the single most important addition for sim-to-real.
- Add survival bonus. per step alive. Pushes the agent toward not-falling before it has the gait figured out.
Anti-patterns:
- Hand-crafted “if-else” reward → reward hacking guaranteed.
- All-or-nothing sparse reward in long-horizon tasks → no learning signal.
- Reward with unbounded magnitude (e.g. ) → numerical instability, exploding gradients.
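The reward rules in this section can be combined into one small function: a bounded exponential tracking term, a squared hinge on joint limits, an action-rate penalty whose weight is annealed from zero, and a constant survival bonus. All weights, the kernel width `sigma`, and the annealing length are hypothetical placeholders, not tuned values from the cited papers.

```python
import math


def tracking_reward(v_cmd, v, sigma=0.25):
    """Exponential kernel: 1 at zero error, bounded in (0, 1]."""
    return math.exp(-((v_cmd - v) ** 2) / sigma)


def limit_penalty(q, q_max):
    """Squared hinge: zero inside the joint limit, quadratic beyond it."""
    return -(max(0.0, abs(q) - q_max) ** 2)


def annealed_weight(step, target, anneal_steps):
    """Linear ramp from 0 to target over anneal_steps, then constant."""
    return target * min(1.0, step / anneal_steps)


def total_reward(v_cmd, v, q, q_max, a, a_prev, step,
                 w_track=1.0, w_rate=0.01, w_alive=0.1, anneal_steps=1_000_000):
    """Sum of task, safety, smoothness and survival terms (weights are illustrative)."""
    rate_penalty = -sum((x - y) ** 2 for x, y in zip(a, a_prev))
    return (w_track * tracking_reward(v_cmd, v)
            + limit_penalty(q, q_max)
            + annealed_weight(step, w_rate, anneal_steps) * rate_penalty
            + w_alive)
```

Because every term is either bounded or only mildly quadratic, the total stays in a predictable range, which makes the per-component histogram check much easier to read.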
4.3 Observation and action design
Observation space — what the policy actually sees:
- Joint pos + joint vel — universal; always include.
- Base ang-vel + gravity-in-body-frame — for floating-base robots. Don’t include base linear velocity; it’s noisy and unobservable. Let the policy infer from history.
- Last action: including $a_{t-1}$ in the observation makes the policy implicitly recurrent and stabilises training.
- Commanded velocity / goal — the user input.
- Contact sensors — 0/1 per foot; helps gait-aware policies.
- Privileged in sim only — friction, payload, terrain heights. Critic sees this, actor does not (asymmetric AC). At deploy time, train an adaptation module to estimate these from history.
Action space — what the policy outputs:
- Target joint positions to a stiff PD (the legged default). Stable, smooth, easy to clip. – N·m/rad, – N·m·s/rad.
- Joint torques directly — harder to train, more sim-to-real fragile, but allows fully exploiting actuator bandwidth. Used by MIT for backflips (Katz 2019).
- Target end-effector velocity / Cartesian wrench — for manipulation. Add IK or operational-space mapping inside the env.
- PWM duty — never. Too low-level; couples policy to actuator electronics.
Normalisation. Always normalise observations to approximately zero mean and unit variance: running-mean / running-std on each component, frozen at deploy. Actions scaled to $[-1, 1]$ with tanh squash on the policy output. Skipping this is a leading cause of unstable training in beginner pipelines.
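A running normaliser of the kind described above can be sketched with Welford's online update. The class name `RunningNorm` and the `frozen` flag are illustrative; RL libraries ship their own equivalents, usually vectorised over a batch of environments.

```python
import math


class RunningNorm:
    """Per-component running mean / std (Welford), freezable for deployment."""

    def __init__(self, dim, eps=1e-8):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim      # running sum of squared deviations
        self.eps = eps
        self.frozen = False

    def update(self, obs):
        if self.frozen:            # statistics are fixed once deployed
            return
        self.n += 1
        for i, x in enumerate(obs):
            d = x - self.mean[i]
            self.mean[i] += d / self.n
            self.m2[i] += d * (x - self.mean[i])

    def __call__(self, obs):
        var = [m / max(self.n, 1) for m in self.m2]   # population variance
        return [(x - mu) / math.sqrt(v + self.eps)
                for x, mu, v in zip(obs, self.mean, var)]


norm = RunningNorm(dim=1)
for sample in ([1.0], [3.0]):
    norm.update(sample)
norm.frozen = True                 # deploy-time: stop adapting
normalised = norm([3.0])
```

Freezing matters: if the statistics kept adapting on the robot, the policy would see a slowly drifting input distribution that it was never trained on.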
4.4 Sim choice
For RL specifically:
- Isaac Lab / Isaac Gym (NVIDIA) — GPU physics + GPU policy, 4096+ parallel envs on one RTX 4090, 100 k env-steps/s aggregate. The default for legged-robot RL since 2022.
- MuJoCo (Google DeepMind; free since 2021, open-sourced under Apache 2.0 in 2022): CPU-based core with accurate contact (MuJoCo 3 adds MJX, a JAX port that runs on GPU). The default for manipulation and analytical-baseline papers.
- Genesis (CMU/Sea AI Lab, Dec 2024) — GPU physics, ~ 100× MuJoCo single-CPU; rapidly adopted in 2025.
- Brax (Google, JAX) — fully differentiable simulator; differentiable through contacts.
- RaiSim (ETH) — single-CPU but very fast on contact; the original ANYmal training stack.
- MuJoCo MPC / MuJoCo Playground (DeepMind 2024) — combined sim + RL training infrastructure with predictive-sampling MPC integration.
GPU sims (Isaac, Genesis, Brax) only beat CPU sims if your policy is small enough that data-loading is not the bottleneck. A 1M-param MLP at 4096 envs is comfortably GPU-bound; a 50M-param CNN starts saturating PCIe.
4.5 Sim-to-real bridge
Five techniques, in order of how much they buy you:
- Domain randomisation on the parameters that matter most: friction, motor strength, latency, base mass. The single biggest lever.
- Asymmetric actor-critic: privileged info to the critic only.
- Teacher-student distillation (Lee 2020): train a teacher with full info; distill to a deployable student via DAgger. Used for ANYmal blind stairs.
- RMA-style online adaptation (Kumar 2021): train a context encoder that maps a short observation history to a latent representation of the environment; deploy and adapt online.
- Real-data fine-tuning: collect 100–1000 real episodes; fine-tune with IQL or CQL. Used by Berkeley Octo for the last 5 % of performance.
A policy that fails sim-to-real almost always reveals a missing randomisation: motor delay is the most common omission, sensor latency the second.
4.6 Safety in real-robot RL
- Action limits at the env layer, not the policy. Clip target joint positions to a safe envelope before passing to the inner PD.
- Termination on safety violation — base angle > 60°, base height < 0.3 m, joint torque > 95 % rated for > 100 ms.
- Recovery policy — separate, conservative policy invoked when the main policy violates safety; resets the robot to a known-safe pose.
- Soft barriers in the reward rather than hard constraints — penalise approach to limits before they are hit.
- Shielded RL / constrained-MDP for hard guarantees: Achiam 2017 CPO, Stooke 2020 PID Lagrangian.
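The first two items in this list (limits applied at the env layer, termination on violation) amount to a thin wrapper between policy and robot. The sketch below uses the thresholds quoted in the list as defaults; the function names and argument layout are assumptions, and a real system would add debouncing, logging, and a hardware emergency stop.

```python
def clip_targets(q_target, q_min, q_max):
    """Clamp each target joint position into its safe envelope."""
    return [min(max(q, lo), hi) for q, lo, hi in zip(q_target, q_min, q_max)]


def should_terminate(base_tilt_deg, base_height_m, torque_frac, overload_ms,
                     max_tilt_deg=60.0, min_height_m=0.3,
                     max_torque_frac=0.95, max_overload_ms=100.0):
    """True if any safety condition is violated.

    torque_frac is peak joint torque as a fraction of rated torque;
    overload_ms is how long it has stayed above max_torque_frac.
    """
    overloaded = torque_frac > max_torque_frac and overload_ms > max_overload_ms
    return (base_tilt_deg > max_tilt_deg
            or base_height_m < min_height_m
            or overloaded)


safe_targets = clip_targets([2.0, -2.0, 0.1], [-1.0] * 3, [1.0] * 3)
```

Keeping the clamp outside the policy means the safety envelope holds even when the network outputs something it never produced in training.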
4.7 Curriculum learning patterns
Most non-trivial robotics RL pipelines fail without curriculum. Four patterns dominate:
- Terrain curriculum (Lee 2020 / legged_gym). 10 difficulty levels of terrain (flat → slopes → stairs → gaps → rough). Each env is assigned a level; it advances when episode return exceeds 80 % of max and is demoted when below 30 %. Rolling success rate determines progression.
- Command curriculum: start with a narrow range of slow commanded velocities, then widen to the full range as training progresses. Prevents the policy from over-committing to high-speed gaits before slow-speed control is solid.
- Reward curriculum — anneal penalty terms from 0 to target weight over steps. Smoothness, action-rate, joint-acceleration penalties are common annealing targets.
- Task curriculum — task is broken into subgoals (reach → grasp → lift → place). Promote when prior subgoal is reliably solved. Used in OpenAI Rubik’s cube (2019) and Mobile ALOHA training.
The wrong curriculum gets stuck. Common pathologies: a fixed-pace schedule advances faster than the policy learns (collapse); a success-triggered schedule with too-high a threshold (gets stuck on a hard level forever). The fix is adaptive difficulty: tie the schedule to a moving-average success rate with both promotion and demotion thresholds.
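An adaptive schedule with both promotion and demotion thresholds fits in a few lines. The sketch uses the 80 % / 30 % thresholds and the 10 levels from the terrain-curriculum description; driving it from a single episode's return is a simplification, and a moving average over recent episodes would be the more robust input.

```python
def update_level(level, episode_return, max_return, num_levels=10,
                 promote_frac=0.8, demote_frac=0.3):
    """Move an environment up or down one difficulty level, or keep it."""
    frac = episode_return / max_return
    if frac > promote_frac:
        return min(level + 1, num_levels - 1)   # harder terrain, capped at the top
    if frac < demote_frac:
        return max(level - 1, 0)                # easier terrain, floored at 0
    return level
```

The gap between the two thresholds acts as hysteresis: an environment whose returns hover in the middle stays where it is instead of oscillating between levels.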
4.8 Hyperparameter sensitivity
Robotics RL is famously sensitive to hyperparameters. Henderson 2018 (“Deep RL that matters”) showed that the same algorithm with different seeds can differ in final performance by 50 %+. The hyperparameters most worth tuning, in order:
- Learning rate — typical PPO range to ; use KL-adaptive decay.
- Entropy coefficient — 0.0 to 0.05; if policy collapses, raise; if returns plateau without converging, lower.
- Batch size (mini-batch within PPO) — larger = lower variance, slower wall-clock; ~ 16 k for legged, ~ 4 k for manipulation.
- GAE λ — 0.9 to 0.97; lower = less variance, more bias.
- Discount γ — 0.95 to 0.999; longer horizons need higher γ but also slower training.
- Clip ratio ε — PPO default 0.2; rarely worth tuning.
For SAC the analogue is target entropy + replay-buffer size + ratio of gradient steps per env step (UTD, “updates to data”). Higher UTD = more sample-efficient but unstable.
4.9 When NOT to use RL
- Flat-ground mobile-base path tracking → [[Robotics/pid-control]] + [[Engineering/mpc-control]].
- Linear known plant (drone hover) → [[Robotics/state-space-lqr]].
- Trajectory tracking with a known model and constraints → MPC.
- Single-task manipulation with a fixed object → scripted pose + force control.
- Anywhere the dynamics model is good and the cost is convex → use the model.
RL pays off where contact richness, vision, or task generality exceed what a model can capture.
4.10 Anatomy of a deployable RL pipeline
The end-to-end pipeline for a sim-trained, real-deployed RL policy has roughly twelve stages. Skipping any of them is the most common failure mode in undergraduate research projects:
- URDF / MJCF validation. Inertias, link masses, joint limits, joint friction, joint damping must match the real robot to within ~ 10 %. Use the mjcf_to_urdf round-trip and visualise; pay particular attention to inertia tensors (often wildly wrong if generated from CAD without manual checking).
- Actuator model. First-order torque dynamics: $T_m \dot{\tau} + \tau = \tau_{\text{cmd}}$, with a millisecond-scale time constant $T_m$ for QDD (quasi-direct-drive) actuators. Add current limit + torque-speed curve. Without this, sim-to-real fails on aggressive maneuvers.
- Observation pipeline. Define the exact observation vector that will be available at deploy. Add Gaussian noise + delay during training to match the real stack.
- Reward design + scalar-balance check. Histogram every reward component over a random rollout; ensure no component dominates by more than 10×.
- PD baseline. Verify that the inner PD on top of the policy can hold position against a 10 N disturbance. If not, the policy will fight the PD.
- Sanity train on a simpler task. Stand-up reward only; verify policy learns to stand within env-steps.
- Full-task training with curriculum and domain randomisation.
- Sim deploy test. Evaluate in a held-out sim with stronger DR than training; success rate is the minimum bar.
- Real-robot setup. Real-time policy inference (~ 2 ms budget at 50 Hz); action-clipping safety wrapper; emergency-stop on safety violation.
- Real-robot evaluation. 50-episode baseline test; record everything (joint logs, video, IMU). Compare to sim distribution.
- Targeted DR refinement if sim-to-real gap is wide. Common: motor latency was 0 in sim but 8 ms in reality; widening DR closes the gap.
- Production wrap. ROS 2 node or equivalent integration, watchdog, telemetry, recovery policy.
Items 1–4 are pre-training; items 5–8 are sim development; items 9–12 are deployment. A typical first-time pipeline takes 2–3 months of engineer-time; mature labs (ETH RSL, MIT Biomimetic, Berkeley Robotics) reduce this to ~ 2 weeks for an incremental policy.
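Two of the pipeline stages above, the first-order actuator model and the delayed observation pipeline, are small enough to sketch directly. The torque lag uses the exact discretisation of a first-order system; the 5 ms time constant in the example and the class name `DelayLine` are illustrative assumptions, not measured values.

```python
import math
from collections import deque


def torque_lag_step(tau, tau_cmd, dt, time_constant):
    """Exact one-step update of T * dtau/dt + tau = tau_cmd."""
    return tau_cmd + (tau - tau_cmd) * math.exp(-dt / time_constant)


class DelayLine:
    """Fixed delay of delay_steps samples, e.g. to model observation latency."""

    def __init__(self, delay_steps, initial):
        self.buf = deque([initial] * (delay_steps + 1), maxlen=delay_steps + 1)

    def push(self, x):
        self.buf.append(x)      # oldest sample falls off the left end
        return self.buf[0]      # value pushed delay_steps calls ago


# Step response after one time constant (5 x 1 ms steps, T = 5 ms, hypothetical).
tau = 0.0
for _ in range(5):
    tau = torque_lag_step(tau, 1.0, dt=0.001, time_constant=0.005)

delay = DelayLine(delay_steps=2, initial=0.0)
delayed = [delay.push(x) for x in (1.0, 2.0, 3.0, 4.0)]
```

After one time constant the torque has covered about 63 % of the commanded step, and the delay line returns each sample two calls after it was pushed. Randomising both the time constant and the delay length per episode folds these models into the domain-randomisation stage.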
5. Components & sourcing
5.1 RL libraries
| Library | Language | Algorithms | Best for | License |
|---|---|---|---|---|
| Stable Baselines 3 | PyTorch | PPO, SAC, TD3, A2C, DQN, HER | Education, baselines | MIT |
| rl_games | PyTorch | PPO, SAC | Isaac Gym / Lab default backend | MIT |
| Tianshou | PyTorch | 30+ algorithms | Research benchmarks | MIT |
| RLlib (Ray) | PyTorch / TF | All major | Distributed, multi-agent | Apache 2 |
| Acme (DeepMind) | JAX / TF | All major | DeepMind-style research | Apache 2 |
| CleanRL | PyTorch | Single-file impls | Reproducibility, teaching | MIT |
| Brax (Google) | JAX | PPO, SAC, ES | Differentiable-sim RL | Apache 2 |
| PufferLib | PyTorch | High-throughput | Maximising env steps/s | MIT |
| TorchRL (Meta) | PyTorch | All major | Modular, integrated with PyTorch | MIT |
| JaxRL / SBX | JAX | PPO, SAC | Speed | MIT |
For robotics specifically, rl_games + Isaac Lab + legged_gym (leggedrobotics/legged_gym, ETH) is the canonical PPO pipeline; SB3 + Gymnasium + MuJoCo is the manipulation-research default; CleanRL is for “I want to read every line of the algorithm.”
5.2 Simulators
| Simulator | Vendor | Strength | Notes |
|---|---|---|---|
| Isaac Lab | NVIDIA, 2024 | GPU-parallel, photoreal | Successor to Isaac Gym; current default |
| MuJoCo | DeepMind (free since 2021, Apache 2 since 2022) | Contact accuracy | Industry standard for RL papers |
| Brax | Google | Differentiable, JAX | Faster than MuJoCo on GPU |
| Genesis | CMU + Sea AI Lab, 2024 | Fastest GPU physics | ~ 100× MuJoCo, claimed |
| PyBullet | OSS (Bullet Physics fork) | Free, slow | Education and legacy code |
| SAPIEN | UCSD | Articulated objects | Manipulation-focused |
| Gazebo (Garden, Harmonic) | Open Robotics | ROS 2 integration | Whole-system tests; weaker contact |
| Drake | TRI / MIT | Hydroelastic contact | Formal-methods-friendly |
| RaiSim | ETH | Single-CPU legged | Commercial license |
| MuJoCo MPC / Playground | DeepMind 2024 | Sim + MPC + RL | Predictive-sampling integration |
5.3 Robot model libraries
- MuJoCo Menagerie (DeepMind) — validated MJCF models for 30+ robots: Spot, Go2, H1, ANYmal, Cassie, UR5, Franka FR3, Aloha. The reference distribution.
- Isaac Lab assets — USD models for all major commercial robots.
- Robosuite (UT Austin) — manipulation tasks + objects + several arms.
- Robocasa (NVIDIA 2024) — kitchen-scale manipulation tasks.
5.4 Benchmarks
| Benchmark | Domain | Notes |
|---|---|---|
| DeepMind Control Suite | Continuous control | MuJoCo-based continuous-control baselines (2018) |
| Meta-World | Manipulation | 50 tasks, multi-task RL |
| D4RL | Offline RL | The canonical fixed-dataset benchmark |
| RLBench | Manipulation | 100 PyRep / CoppeliaSim tasks |
| ManiSkill (UCSD) | Manipulation, real-transfer | Pick-place, peg insertion |
| Habitat (Meta) | Navigation | Indoor photorealistic |
| CARLA | Driving | Urban driving sim |
| Procgen | Generalisation | Procedurally-generated 2D |
| NetHack Learning Environment | Hard exploration | Among the hardest open RL benchmarks |
5.5 Real-robot RL frameworks
- Isaac Manipulator + cuMotion (NVIDIA 2024) — production manipulation policy with optional RL fine-tuning; GPU-parallel collision-free motion + learned grasp.
- PSL / Pearl (Berkeley) — academic real-robot RL.
- ALOHA / Mobile ALOHA datasets — Zhao 2023, Fu 2024; open-source low-cost bimanual + the Mobile ALOHA wheeled base.
- Open-X-Embodiment (collaboration, 2023–24) — 22-institution dataset of 1M+ teleoperated episodes across 22 robot embodiments.
5.6 Foundation models for robotics (VLA — Vision-Language-Action)
| Model | Org | Year | Notes |
|---|---|---|---|
| RT-2 | Google DeepMind | 2023 | First VLA; PaLM-E backbone, language → action |
| OpenVLA | Stanford | 2024 | Open 7B-param VLA, fine-tunable |
| Octo | Berkeley | 2024 | Generalist policy, 800k+ episodes, transformer |
| Pi-0 ($\pi_0$) | Physical Intelligence | 2024 | Flow-matching VLA; deployed on ALOHA + Franka |
| Gemini Robotics | Google DeepMind | 2025 | Multimodal LM + manipulation head |
| Helix | Figure AI | 2025 | Bimanual mobile humanoid foundation model |
| GR00T | NVIDIA | 2024 | Humanoid foundation model |
These are not strictly RL — most are imitation-trained on teleoperation, with RL or DPO fine-tuning for the last few percent. The RL community is converging on RL fine-tuning of pretrained VLAs as the dominant paradigm for general manipulation.
6. Reference data
6.1 Algorithm comparison
| Algorithm | On/off-policy | Action | Sample efficiency | Stability | Default for |
|---|---|---|---|---|---|
| REINFORCE | on | both | very low | very low | teaching |
| A2C / A3C | on | both | low | low | distributed teaching |
| PPO | on | both | medium | high | legged, drone, sim-heavy |
| TRPO | on | both | medium | high | predecessor to PPO |
| DDPG | off | continuous | medium | low | superseded by TD3/SAC |
| TD3 | off | continuous | high | medium | manipulation, real-robot |
| SAC | off | continuous | high | high | manipulation, sample-limited |
| DQN | off | discrete | medium | medium | discrete control, games |
| Rainbow DQN | off | discrete | high | high | Atari, discrete RL benchmark |
| MuZero | model-based | discrete + continuous | very high | medium | games, planning |
| PETS | model-based | continuous | very high | medium | low-data control |
| DreamerV3 | model-based | both | very high | high | pixel-based, generalist |
6.2 Simulator comparison
| Simulator | GPU? | Parallel envs | Contact accuracy | Cost |
|---|---|---|---|---|
| Isaac Lab | yes | 4096+ | medium-high (PhysX) | free |
| MuJoCo | no (CPU) | hundreds | excellent | free (Apache 2) |
| Brax | yes (JAX) | 1024+ | medium | free |
| Genesis | yes | thousands | high | free |
| PyBullet | no | dozens | medium | free |
| RaiSim | no (multithread CPU) | dozens | excellent | commercial |
| SAPIEN | partial | hundreds | medium-high | free |
| Gazebo | no | one | medium (ODE/DART) | free |
| Drake | no | one | excellent (hydroelastic) | free |
6.3 Common reward components (legged + manipulation)
| Term | Form | Typical weight | Purpose |
|---|---|---|---|
| Linear vel tracking | | | task |
| Angular vel tracking | | | task |
| Action rate | | | smoothness |
| Joint acceleration | | | smoothness |
| Joint torque | | | energy |
| Foot air-time | if | | gait |
| Foot contact force | | | gentle impact |
| Survival | per step alive | | early bootstrap |
| Base height | | | posture |
| Termination | if fallen | flat penalty | safety |
| Reach success | if | sparse | task success |
| Collision | flag | | safety |
| Joint-limit hinge | $-(\mathrm{ReLU}(\lvert q \rvert - q_{\max}))^2$ | | |
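The joint-limit hinge in the last row can be sketched in a few lines. This is a minimal illustration, assuming symmetric limits per joint (as the $\lvert q \rvert$ form implies) and a plain sum over joints; the function name and the example values are hypothetical, and any weight on the term is left to the caller.

```python
def joint_limit_penalty(q, q_max):
    """Sum over joints of -(ReLU(|q_i| - q_max_i))^2.

    Zero while every joint stays inside its symmetric limit,
    quadratically negative once a joint exceeds it.
    """
    total = 0.0
    for qi, limit in zip(q, q_max):
        excess = max(0.0, abs(qi) - limit)  # ReLU of the limit violation
        total -= excess ** 2
    return total


# Hypothetical joint angles (rad) against a 1.5 rad limit on both joints.
inside = joint_limit_penalty([0.5, -1.0], [1.5, 1.5])
outside = joint_limit_penalty([2.0, -1.0], [1.5, 1.5])
print(inside, outside)
```

Because the penalty is exactly zero inside the limits, it does not bias the gait until a joint actually approaches its stop.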
6.4 Sim-to-real techniques
| Technique | What it does | When it helps most |
|---|---|---|
| Domain randomisation (dynamics) | Vary friction, mass, , latency | Locomotion on unknown terrain |
| Domain randomisation (visual) | Vary lighting, texture | RGB-input policies |
| Asymmetric actor-critic | Critic sees privileged info | Speeds training without leak |
| Teacher–student distillation | Student deployable; teacher uses privileged | ANYmal blind stairs (Lee 2020) |
| RMA (rapid motor adaptation) | Online adaptation module | Mid-deployment terrain change |
| System identification | Calibrate sim to real | Manipulation, drone hover |
| Real-data fine-tuning (offline RL) | Polish in real | Last 5 % manipulation perf |
| Action / observation noise | Train with the noise you’ll see | Always |
| Action smoothing (low-pass) at deploy | Reject HF policy noise | Vibration-prone hardware |
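Latency randomisation needs a simulator-side mechanism that actually delays actions. One simple option is a FIFO between the policy and the actuator model, sketched below. The class name is hypothetical, the delay is expressed in control ticks (converting from milliseconds depends on the control rate), and resampling `delay_steps` per episode is an assumption about how the randomisation would be wired in.

```python
from collections import deque


class DelayedActuator:
    """FIFO that applies each commanded action `delay_steps` control ticks late."""

    def __init__(self, delay_steps, initial_action):
        # Pre-fill so the first `delay_steps` ticks apply the initial action.
        self._queue = deque([initial_action] * delay_steps)

    def step(self, commanded_action):
        """Queue the new command and return the action that reaches the motor now."""
        self._queue.append(commanded_action)
        return self._queue.popleft()


actuator = DelayedActuator(delay_steps=2, initial_action=0.0)
applied = [actuator.step(a) for a in [1.0, 2.0, 3.0, 4.0]]
print(applied)  # [0.0, 0.0, 1.0, 2.0]
```

With `delay_steps=0` the queue is empty and each command passes straight through, so the same wrapper covers the zero-latency case.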
6.5 Offline RL algorithm comparison
| Algorithm | Year | Key idea | Use when |
|---|---|---|---|
| BC | 1990s | Supervised regression on demonstration $(s, a)$ pairs | Pure imitation, low compounding error |
| BCQ | 2019 | Constrain policy to dataset support | First offline RL with neural nets |
| CQL | 2020 | Penalise OOD-Q via log-sum-exp | Stable, common D4RL baseline |
| IQL | 2021 | Expectile regression, no OOD eval | Current SOTA, lower hyperparameter sensitivity |
| AWAC | 2020 | Advantage-weighted policy update | Online → offline transition |
| TD3+BC | 2021 | TD3 + behaviour-clone regulariser | Cheap, often competitive with CQL |
| Diffusion-QL | 2022 | Diffusion policy + Q-learning | Multi-modal action distributions |
| Decision Transformer | 2021 | Conditional sequence model on return | Transformer-native, good with long traj |
| CRR | 2020 | Critic-regularised regression | DeepMind production offline |
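The "expectile regression" entry for IQL refers to an asymmetric squared loss, $L_\tau(u) = \lvert \tau - \mathbb{1}(u < 0) \rvert \, u^2$. A minimal sketch follows; the function name is hypothetical and the default $\tau = 0.7$ is only an illustrative value, not a recommendation from this note.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff^2.

    With tau > 0.5, positive differences are penalised more than negative
    ones, which pushes the fitted value toward an upper expectile.
    """
    weight = (1.0 - tau) if diff < 0 else tau
    return weight * diff ** 2


# Same magnitude, opposite sign: the positive error costs more when tau > 0.5.
print(expectile_loss(1.0), expectile_loss(-1.0))
```

Setting `tau=0.5` recovers an ordinary (half-weighted) squared error, which is a quick way to sanity-check an implementation.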
6.6 Hyperparameter starter pack (PPO, legged robot)
γ = 0.99, λ = 0.95
clip = 0.2, value clip = 0.2
lr = 1e-3 with KL-adaptive decay (target KL = 0.01)
n_envs = 4096 (Isaac Lab)
n_steps = 24
n_epochs = 5, n_minibatches = 4
entropy_coef = 0.01 (annealed to 0.001 over 5 000 iter)
value_coef = 1.0
grad_clip = 1.0
hidden = [512, 256, 128], ELU
action_scale = 0.25 (target joint delta in rad)
init_log_std = 0.0
network init = orthogonal (gain = sqrt(2))
6.7 Training budgets (deployed robotics RL pipelines)
| System | Algorithm | Sim | Parallel envs | Env-steps | Wall-clock | GPU |
|---|---|---|---|---|---|---|
| ANYmal blind stairs (Lee 2020) | PPO teacher + DAgger student | RaiSim | 200 | | 4 days | none (CPU) |
| MIT Mini Cheetah (Margolis 2023) | PPO | Isaac Gym | 4096 | | 5–12 h | 1× RTX 3090 |
| Unitree Go2 (default 2024 firmware) | PPO + RMA | Isaac Gym | 4096 | | 8–12 h | 1× RTX 4090 |
| Swift drone (Kaufmann 2023) | PPO + sim-real residual | Custom Flightmare | 100 | + 50 real episodes | 24 h sim + 1 day real | 1× RTX |
| Cassie sim2real running (Siekmann 2021) | PPO + recurrent | MuJoCo | 64 | | 7 days | 1× V100 |
| ALOHA pick-place (Zhao 2023) | ACT (imitation) | n/a (real only) | 1 | 50 demos | hours | 1× RTX 3090 |
| RoboPianist (Zakka 2023) | SAC | MuJoCo | 64 | | 2 days | 1× A100 |
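For the KL-adaptive learning-rate decay listed in the 6.6 starter pack, one common rule is to shrink the rate when the measured policy KL overshoots the target and grow it when the KL is well below. The sketch below uses the starter pack's target KL of 0.01 and initial rate of 1e-3; the 2× / 0.5× thresholds, the 1.5 factor, the rate bounds, and the KL readings are assumptions chosen for illustration.

```python
def adapt_learning_rate(lr, measured_kl, target_kl=0.01,
                        factor=1.5, lr_min=1e-5, lr_max=1e-2):
    """Shrink lr when the policy moved too far, grow it when it barely moved."""
    if measured_kl > 2.0 * target_kl:
        return max(lr_min, lr / factor)
    if measured_kl < 0.5 * target_kl:
        return min(lr_max, lr * factor)
    return lr


# Batch arithmetic for the starter pack: n_envs * n_steps samples per iteration.
samples_per_iter = 4096 * 24            # 98304 transitions
minibatch_size = samples_per_iter // 4  # 24576 per mini-batch

lr = 1e-3
for kl in [0.030, 0.012, 0.002]:        # hypothetical per-update KL measurements
    lr = adapt_learning_rate(lr, kl)
print(samples_per_iter, minibatch_size, lr)
```

The dead band between the two thresholds keeps the rate from oscillating on every update.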
7. Failure modes & debugging
7.1 Symptom → cause table
| Symptom | Likely cause | Fix |
|---|---|---|
| Policy exploits a reward bug | Reward hacking | Redesign reward; add safety term; freeze incentive that’s exploited |
| Policy collapses to constant action | Mode collapse / entropy too low | Raise entropy coef; raise temperature (SAC); larger batch |
| Sim works, real falls | Sim-to-real gap | Harder DR on the parameter that differs; add asymmetric AC; teacher-student |
| Catastrophic forgetting during fine-tune | Distribution shift | Replay buffer of old states; LoRA-style param-efficient fine-tune |
| Curriculum stalls at level N | Wrong difficulty schedule | Use adaptive curriculum (success-rate triggered) |
| Random training crash mid-run | Optimiser instability | Save checkpoints every 100 iters; LR warmup |
| Return variance huge in PPO | Batch too small or LR too high | Larger batch (≥ 32k); more PPO epochs (5–10); KL-adaptive LR |
| Sparse reward never trains | No learning signal | BC warm-start; curiosity bonus; reward shaping |
| Actions saturate (always at the action limits) | Output not squashed; action scale wrong | Tanh on output; rescale env’s action limits |
| Privileged info leaks to student | Observation spec error | Strict obs-spec assertion in env wrapper |
| Perfect-torque simulator artifact | Idealised motor model | Add motor latency, current limit, friction model |
| Non-reproducible across GPUs | Non-deterministic CUDA | torch.backends.cudnn.deterministic = True; seed everything |
| VRAM OOM at 4096 envs | Replay or trajectory buffer too big | Reduce envs or grad-accumulate |
| Slow eval blocks training | Eval too often | Eval only every 100 iterations, in a single env |
| Policy oscillates at deploy | HF noise in policy output | Low-pass filter actions (1st-order, time constant in the ms range) |
| Sudden gait change mid-episode | Reward shifts as curriculum advances | Anneal reward weights over steps |
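The success-rate-triggered curriculum named in the table can be sketched as a small state machine. The 70 % promotion threshold is the one this note gives for the curriculum fix; the class name, the 30 % demotion threshold, and the window length are assumptions.

```python
from collections import deque


class SuccessRateCurriculum:
    """Promote or demote a difficulty level from a running success rate."""

    def __init__(self, max_level, promote_at=0.7, demote_at=0.3, window=50):
        self.level = 0
        self.max_level = max_level
        self.promote_at = promote_at
        self.demote_at = demote_at
        self._results = deque(maxlen=window)

    def record(self, success):
        """Log one episode outcome and return the (possibly updated) level."""
        self._results.append(1.0 if success else 0.0)
        if len(self._results) < self._results.maxlen:
            return self.level  # not enough episodes at this level yet
        rate = sum(self._results) / len(self._results)
        if rate >= self.promote_at and self.level < self.max_level:
            self.level += 1
            self._results.clear()  # start a fresh window at the new level
        elif rate <= self.demote_at and self.level > 0:
            self.level -= 1
            self._results.clear()
        return self.level


cur = SuccessRateCurriculum(max_level=3, window=10)
for _ in range(10):
    cur.record(True)
print(cur.level)  # 1
```

Clearing the window after each level change avoids promoting twice on the same batch of episodes.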
7.2 Detailed failure modes
- Reward hacking. The agent will find every reward bug. Lee 2020 (ANYmal) initially saw the policy “trot” by oscillating its base — exploiting an angular-velocity-tracking term that didn’t constrain CoM motion. Fix: add a base-pitch penalty and a survival bonus; require both directional velocity and progress.
- Mode collapse. A SAC policy with too-low entropy temperature converges to a single trajectory regardless of state. Fix: enable automatic temperature tuning (auto-temperature) or raise the temperature manually. For PPO, raise the entropy coefficient from 0.01 to 0.05 and check that policy std doesn’t collapse below 0.05.
- Sim-to-real failure — actuator latency. The most common omission. Real motors have 1–25 ms torque-delivery latency from current-command to actual torque; ideal simulators give zero. A policy that depends on instantaneous torque response will oscillate on hardware. Fix: include randomised action latency covering that range in DR.
- Catastrophic forgetting at fine-tune. Fine-tuning a sim-trained policy on real episodes causes it to forget sim-only skills (e.g. recovery from large pushes). Fix: keep a replay buffer of sim-collected states; mix sim + real in fine-tune batches at 4:1.
- Curriculum stuck. A fixed terrain progression schedule fails when the policy plateaus at a level. Fix: use a success-rate-triggered curriculum (graduate to next level when running success ≥ 70 %); demote on regress.
- High-variance PPO returns. Default PPO with 1024 parallel envs has noisy advantage estimates. Fix: increase parallel envs to 4096+, use 5 PPO epochs over 4 mini-batches, and clip the value-function update with the same clip range $\epsilon$ as the policy.
- Sparse-reward training fails. Cannot bootstrap with no signal. Fix: behaviour-clone the first steps from a scripted policy; add a curiosity bonus (RND, Burda 2018); use Hindsight Experience Replay (HER, Andrychowicz 2017) for goal-conditioned tasks.
- Action saturation. Network output unbounded; tanh missing. Fix: always apply tanh squash at the policy output; scale action range explicitly in env.
- Privileged info leaks. In asymmetric actor-critic, the actor accidentally gets privileged data via shared parameters. Fix: separate actor and critic networks completely; add an explicit `assert obs_actor.shape == expected` in the env step.
- Sim-only artifacts. Perfect contact restitution = 0 in default MuJoCo; real foot strikes have non-zero restitution. Real-world impacts feel “harder” than sim. Fix: randomise restitution; add foot compliance to the URDF.
- Non-reproducibility. Same code, different machine, different result. Fix: seed Python, NumPy, PyTorch, CUDA; pin simulator version; record git hash + container image; the canonical Stable Baselines 3 / CleanRL reproducibility checklists are good references.
- VRAM exhaustion. With 4096 parallel envs + large observation vectors + a transformer policy, even an RTX 4090 (24 GB) runs out. Fix: reduce envs to 2048; gradient-accumulate over 2 micro-batches; use bfloat16 for the simulator output.
- Slow eval drags training. Evaluating in single env every iteration takes seconds; with 4096-env training one PPO iter is < 2 s — eval becomes the bottleneck. Fix: eval only every 100 iterations, in a single env.
- Deployed policy oscillates. HF noise in policy output causes motor whine and torque chatter. Fix: 1st-order low-pass filter on actions with a time constant in the ms range; or train with explicit action-rate penalty.
- Gait spontaneously changes mid-episode. Curriculum reward shift mid-training. Fix: anneal reward weights smoothly over steps, never as a step change.
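The first-order low-pass filter recommended for deployed-policy oscillation can be sketched as a discrete exponential smoother. The class name is hypothetical, and the time constant and control period in the example are placeholder values, not figures from this note; they would need tuning against the hardware's control rate.

```python
class ActionLowPass:
    """Discrete first-order low-pass filter applied per action dimension."""

    def __init__(self, tau_s, dt_s):
        # Backward-Euler discretisation: alpha = dt / (tau + dt).
        self.alpha = dt_s / (tau_s + dt_s)
        self._state = None

    def __call__(self, action):
        if self._state is None:
            # Initialise on the first action to avoid a start-up transient.
            self._state = list(action)
        else:
            self._state = [s + self.alpha * (a - s)
                           for s, a in zip(self._state, action)]
        return list(self._state)


# Placeholder values: 20 ms time constant at a 50 Hz control loop.
filt = ActionLowPass(tau_s=0.02, dt_s=0.02)
print(filt([0.0]), filt([1.0]), filt([1.0]))  # [0.0] [0.5] [0.75]
```

A larger time constant rejects more high-frequency noise but adds phase lag, which acts like extra actuator latency from the policy's point of view.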
7.3 Quick debug procedure for a stuck pipeline
- Verify env determinism. Set seed, take 100 steps with fixed actions; confirm rollouts identical across runs.
- Check reward bounds. Histogram every reward component over a random rollout. Any > in magnitude dominates; rescale.
- Check action distribution. At steps, action std should be on every dim. If not → entropy too low.
- Check value-function fit. predicting return should have ; if not, value-loss coef may be too low.
- Sanity-train on a known task. Cartpole or Pendulum; PPO/SAC should converge in steps. If not, the codebase is broken.
- Profile env-step rate. Should be steps/s/env in MuJoCo, aggregate in Isaac. If not, env or sim is the bottleneck.
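For the value-function check in the procedure above, one widely used scalar is explained variance, $1 - \mathrm{Var}[R - V] / \mathrm{Var}[R]$: 1 means a perfect fit and 0 means no better than predicting a constant. The sketch below is an assumption about how to quantify "fit"; the function name is hypothetical and the acceptable threshold is task-dependent.

```python
def explained_variance(returns, value_preds):
    """1 - Var[returns - value_preds] / Var[returns]; NaN if returns are constant."""
    n = len(returns)
    mean_ret = sum(returns) / n
    var_ret = sum((r - mean_ret) ** 2 for r in returns) / n
    if var_ret == 0.0:
        return float("nan")
    residuals = [r - v for r, v in zip(returns, value_preds)]
    mean_res = sum(residuals) / n
    var_res = sum((e - mean_res) ** 2 for e in residuals) / n
    return 1.0 - var_res / var_ret


# Hypothetical rollout returns against a perfect and a constant critic.
print(explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(explained_variance([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

Values that stay near zero or go negative late in training usually point at the critic rather than the policy.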
8. Case studies
Case A — ANYmal blind locomotion over challenging terrain (Lee et al. 2020, Science Robotics)
Hwangbo 2019 (Science Robotics) showed PPO-trained policies on ANYmal could replace ETH’s hand-engineered controller; Lee 2020 extended this to blind traversal of stairs, rubble, slippery slopes with only proprioceptive sensing.
Method. Teacher-student distillation:
- Teacher trained in RaiSim with PPO using privileged observations: terrain heightmap around each foot, friction at each contact, contact states, external disturbances. With this info, an MLP policy learns robust gait in env-steps (~ 4 days on a 20-core CPU; no GPU).
- Student — TCN (temporal convolutional network) trained via DAgger to clone teacher’s actions from a 50-step history of proprioception only (joint pos, joint vel, IMU, last action). No terrain info, no contact info, no friction.
The student’s TCN implicitly estimates the privileged variables from history — a context-encoder pattern that became the template for all subsequent legged RL.
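The student's input, a 50-step window of proprioception, can be maintained with a fixed-length buffer. This is a minimal sketch: the class name is hypothetical, and zero-padding the window at episode start is an assumption rather than something stated for the original system.

```python
from collections import deque


class ProprioHistory:
    """Rolling window of the last `horizon` proprioceptive observations."""

    def __init__(self, obs_dim, horizon=50):
        zero = [0.0] * obs_dim
        # Zero-padded at episode start (assumption).
        self._frames = deque([zero] * horizon, maxlen=horizon)

    def push(self, obs):
        """Append the newest observation and return the window, oldest first."""
        self._frames.append(list(obs))
        return [list(frame) for frame in self._frames]


hist = ProprioHistory(obs_dim=3, horizon=50)
window = hist.push([0.1, 0.2, 0.3])
print(len(window), window[-1])  # 50 [0.1, 0.2, 0.3]
```

The returned list has shape (horizon, obs_dim), which is the layout a temporal convolution would consume after conversion to a tensor.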
Deployment. On real ANYmal-C, the blind student traversed:
- 20 km of forest trail
- 14 cm step stairs
- 25° slopes
- Rubble piles
- Slippery wet leaves
- Snow
Failure modes on real: ice (the friction estimate from history wasn’t enough; later fixed with explicit friction estimation, Bloesch 2018).
Why it matters. This paper established the teacher-student pattern, the privileged-info-in-sim asymmetric approach, and the outdoor robustness benchmark every later legged paper compared against. It also demonstrated that a 5M-parameter neural net could replace a thousand-line C++ MPC + WBC pipeline for legged locomotion.
Case B — Champion-level drone racing (Kaufmann et al. 2023, Nature)
UZH and Intel built Swift, a 1-kg quadrotor that beat three human world-champion FPV pilots on a fixed indoor track. Swift used pure vision (one 640×480 onboard camera) + IMU, no external markers, no motion capture.
Method. Three networks, trained separately:
- Gate-pose estimator — supervised CNN from RGB to relative gate corners; trained on photorealistic images from Flightmare.
- State predictor — Kalman-like RNN fusing IMU + gate detections to 6-DOF pose at 100 Hz.
- Control policy — PPO-trained MLP from estimated pose + body rates to motor commands at 100 Hz. Trained in custom physics sim with realistic motor dynamics for env-steps.
Sim-to-real bridge. Two phases:
- Phase 1 — sim training with DR on motor torque (±15 %), drag, mass (±10 %), latency (±5 ms).
- Phase 2 — real-world residual fine-tuning: collect 50 real episodes, fit a residual model of (sim → real) discrepancy, retrain the policy with the residual injected into the simulator.
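The Phase 1 randomisation ranges can be written as a per-episode sampler. The sketch below assumes uniform distributions and a hypothetical nominal latency; the ±15 %, ±10 %, and ±5 ms ranges and the 1 kg nominal mass come from the text above. Drag is omitted because no range is given for it.

```python
import random


def sample_dynamics(rng, nominal_mass_kg=1.0, nominal_latency_ms=20.0):
    """Draw one set of randomised dynamics parameters for an episode.

    nominal_latency_ms is a placeholder; uniform sampling is an assumption.
    """
    return {
        "motor_torque_scale": rng.uniform(0.85, 1.15),                # +/- 15 %
        "mass_kg": nominal_mass_kg * rng.uniform(0.90, 1.10),         # +/- 10 %
        "latency_ms": nominal_latency_ms + rng.uniform(-5.0, 5.0),    # +/- 5 ms
    }


rng = random.Random(0)  # fixed seed so a training run can be reproduced
params = sample_dynamics(rng)
print(sorted(params))
```

Passing an explicit `random.Random` instance keeps the randomisation stream separate from any other random state in the training loop.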
Result. In 25 head-to-head races vs human champions (Alex Vanover, Thomas Bitmatta, Marvin Schäpper), Swift won 15. Best human lap: 17.956 s; Swift best lap: 17.465 s.
Why it matters. First demonstration of RL beating humans on a high-dimensional, vision-in-the-loop, real-physical-world task. The (sim → real residual) bridge is now standard for drones; the architecture is roughly the same as ALOHA’s perception stack but with a different action head.
Case C — Mobile ALOHA bimanual mobile manipulation (Fu, Zhao et al. 2024, Stanford)
Stanford’s ALOHA project (Zhao 2023, Fu 2024) is the canonical low-cost-bimanual / imitation-learning success story. Mobile ALOHA puts the original ALOHA arms on a mobile base at a cost on the order of \$40 k, versus \$300 k+ for comparable commercial hardware.
Method — ACT (Action Chunking Transformer).
- Input: 4 RGB cameras (640×480) + arm joint positions + wheel velocities.
- Architecture: ResNet-18 image encoder per camera + transformer encoder + transformer decoder.
- Output: 50-step chunk of (14-DoF arm joints + 2-DoF wheel velocities) at 50 Hz; execute first 25, re-plan.
- Training: supervised on 50 teleoperated demonstrations per task. 1× RTX 3090, ~ 5 h.
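The "predict 50, execute 25, re-plan" loop described in the Output bullet can be sketched as follows. The function and callback names are hypothetical, and the dummy policy only labels actions so the re-planning points are visible; it stands in for the trained transformer.

```python
def run_chunked(policy, observe, act, total_steps, chunk_len=50, execute_len=25):
    """Query the policy for a chunk, execute its first part, then re-plan."""
    step = 0
    while step < total_steps:
        chunk = policy(observe())  # list of chunk_len actions
        assert len(chunk) == chunk_len
        for action in chunk[:execute_len]:
            if step >= total_steps:
                break
            act(action)
            step += 1
    return step


calls = []
executed = []


def dummy_policy(obs):
    """Stand-in policy: tags each action with (query index, index in chunk)."""
    calls.append(obs)
    return [(len(calls), i) for i in range(50)]


run_chunked(dummy_policy, observe=lambda: len(executed),
            act=executed.append, total_steps=60)
print(len(calls), executed[24], executed[25])  # 3 (1, 24) (2, 0)
```

Executing only part of each chunk trades inference calls against reactivity: the second half of every prediction is discarded in favour of a plan made from a fresher observation.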
Tasks demonstrated. Cooking shrimp; storing a 3-prong pot; using an elevator; rinsing a wine glass; pushing chairs; entering a room; bimanual wiping.
Why it matters. The success-with-50-demos result reframed the field: for many household tasks, imitation learning with a powerful enough policy class (transformer + action chunking) beats RL. The dataset and code are open-source (https://mobile-aloha.github.io/) and have spawned a generation of bimanual research robots in the ~\$50 k class.
RL connection. Mobile ALOHA itself is imitation, but the Physical Intelligence $\pi_0$ model (Black 2024) extends the same architecture with RL fine-tuning + flow-matching action prediction, generalising across 8+ robot embodiments from the Open-X-Embodiment dataset.
Case D — RoboPianist dexterous bimanual piano (Zakka et al. 2023, Google DeepMind + Berkeley)
A pair of simulated 23-DoF Shadow Hands playing piano via SAC. Not a real-robot deployment but a milestone in high-dimensional contact-rich RL: 46 actuated DoF, simultaneous coordination, fingertip-key contacts switching at ~ 10 Hz.
Method.
- Simulator: MuJoCo MJX (JAX-accelerated MuJoCo, 2023).
- Algorithm: SAC with state-only observation (joint pos, vel, fingertip pos, MIDI note schedule lookahead).
- Reward decomposition: note-hit accuracy + key-press timing + energy penalty + finger-collision penalty.
- Curriculum: training songs ranked by difficulty (Czerny exercises → Chopin études); progress via success threshold.
- Training: env-steps per piece, 2 days on a single A100.
Result. Plays 150+ pieces from the MIDI dataset at human-comparable accuracy. Open-sourced (https://kzakka.com/robopianist/).
Why it matters. Proves SAC scales to ~ 50-dim action spaces with rich contact when given a good simulator and a decomposed reward. The same architecture (SAC + MJX + decomposed reward + curriculum) is now the template for in-hand reorientation (Chen 2022 OpenAI Rubik’s cube heir).
9. Cross-references
Robotics (this library):
- [[Robotics/pid-control]] — the workhorse alternative for tasks where dynamics are known and contact-poor.
- [[Robotics/state-space-lqr]] — model-based optimal control; the “control-theory” answer for linear plants.
- [[Robotics/legged-robotics]] — RL is now the default locomotion controller; this note is the algorithmic side.
- planned [[Robotics/impedance-control]] — Cartesian compliance; pairs well with RL for contact-rich tasks.
- planned [[Robotics/trajectory-generation]] — provides reference trajectories as RL observation or as a residual baseline.
- [[Robotics/dynamics-rigid-body]] — basis of the simulator physics RL trains in.
- planned [[Robotics/multirotor-design]] — drone-racing RL platform (Swift, Champion).
- planned [[Robotics/manipulator-design]] — Franka / UR / ALOHA hardware for arm RL.
- planned [[Robotics/computer-vision-robotics]] — perception inputs to vision-based RL.
- planned [[Robotics/bayesian-estimation]] — state estimation that feeds the policy observation.
Engineering (foundations):
- planned [[Robotics/rl-for-control]] — MDP foundations, policy-gradient derivations, exploration theory.
- planned [[Robotics/computer-vision-robotics]] — neural-net machinery (MLP, ResNet, transformer).
- planned [[Engineering/mpc-control]] — the model-based hybrid partner; MPC + RL residual is the 2025-26 frontier.
- [[Engineering/classical-control]] — what RL replaces when models exist; what it wraps when they don’t.
- planned [[Engineering/realtime-embedded]] — deploying a policy on a 1 ms control loop.
Languages:
- planned [[Languages/Tier3/robotics-control]] — ros2_control + control_toolbox integration with learned policies.
- planned [[Languages/Tier3/genai-llm-runtime]] — VLA models (RT-2, Pi-0, OpenVLA) deployment.
10. Citations
Foundational books
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free PDF at http://incompleteideas.net/book/the-book-2nd.html. The canonical reference.
- Bertsekas, D. P. (2019). Reinforcement Learning and Optimal Control. Athena Scientific.
- Kober, J., Bagnell, J. A. & Peters, J. (2013). “Reinforcement learning in robotics: A survey.” Int. J. Robot. Res. 32(11), 1238–1274. The foundational survey paper for robotics-specific RL.
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. For the neural-net backbone.
Foundational papers — algorithms
- Williams, R. J. (1992). “Simple statistical gradient-following algorithms for connectionist reinforcement learning.” Machine Learning 8(3-4), 229–256. REINFORCE.
- Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. (2000). “Policy Gradient Methods for Reinforcement Learning with Function Approximation.” NIPS. The policy-gradient theorem.
- Mnih, V. et al. (2015). “Human-level control through deep reinforcement learning.” Nature 518, 529–533. DQN.
- Mnih, V. et al. (2016). “Asynchronous Methods for Deep Reinforcement Learning.” ICML. A3C / A2C.
- Schulman, J., Levine, S., Moritz, P., Jordan, M. I. & Abbeel, P. (2015). “Trust Region Policy Optimization.” ICML. TRPO.
- Schulman, J., Moritz, P., Levine, S., Jordan, M. I. & Abbeel, P. (2016). “High-Dimensional Continuous Control Using Generalized Advantage Estimation.” ICLR. GAE.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). “Proximal Policy Optimization Algorithms.” arXiv:1707.06347. PPO — the workhorse for robotics.
- Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. (2018). “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” ICML. SAC.
- Fujimoto, S., van Hoof, H. & Meger, D. (2018). “Addressing Function Approximation Error in Actor-Critic Methods.” ICML. TD3.
- Chua, K., Calandra, R., McAllister, R. & Levine, S. (2018). “Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models.” NeurIPS. PETS.
- Schrittwieser, J. et al. (2020). “Mastering Atari, Go, chess and shogi by planning with a learned model.” Nature 588, 604–609. MuZero.
- Hafner, D., Pasukonis, J., Ba, J. & Lillicrap, T. (2023). “Mastering Diverse Domains through World Models.” arXiv:2301.04104. DreamerV3.
Foundational papers — imitation and hybrid
- Ross, S., Gordon, G. & Bagnell, D. (2011). “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” AISTATS. DAgger.
- Ho, J. & Ermon, S. (2016). “Generative Adversarial Imitation Learning.” NeurIPS. GAIL.
- Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B. & Song, S. (2023). “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS. Diffusion policies.
- Zhao, T. Z., Kumar, V., Levine, S. & Finn, C. (2023). “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.” RSS. ACT, ALOHA.
- Fu, Z., Zhao, T. Z. & Finn, C. (2024). “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation.” arXiv:2401.02117.
Foundational papers — robotics RL deployments
- Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V. & Hutter, M. (2019). “Learning agile and dynamic motor skills for legged robots.” Science Robotics 4(26). The first RL-controlled ANYmal.
- Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V. & Hutter, M. (2020). “Learning quadrupedal locomotion over challenging terrain.” Science Robotics 5(47). ANYmal blind stairs.
- Kaufmann, E., Bauersfeld, L., Loquercio, A., Müller, M., Koltun, V. & Scaramuzza, D. (2023). “Champion-level drone racing using deep reinforcement learning.” Nature 620, 982–987. Swift.
- Siekmann, J., Godse, Y., Fern, A. & Hurst, J. (2021). “Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition.” ICRA. Cassie outdoor running.
- Margolis, G. B. & Agrawal, P. (2023). “Walk These Ways: Tuning Robot Control for Generalization with Multiplicity of Behavior.” CoRL 2022 (proceedings 2023).
- Kumar, A., Fu, Z., Pathak, D. & Malik, J. (2021). “RMA: Rapid Motor Adaptation for Legged Robots.” RSS. Adaptation module.
- Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W. & Abbeel, P. (2017). “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World.” IROS.
- Pinto, L., Andrychowicz, M., Welinder, P., Zaremba, W. & Abbeel, P. (2018). “Asymmetric Actor Critic for Image-Based Robot Learning.” RSS.
VLA and foundation models
- Brohan, A. et al. (2023). “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv:2307.15818. Google DeepMind.
- Black, K. et al. (2024). “$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control.” arXiv:2410.24164. Physical Intelligence.
- Octo Model Team (2024). “Octo: An Open-Source Generalist Robot Policy.” arXiv:2405.12213. Berkeley.
- Open X-Embodiment Collaboration (2023). “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” arXiv:2310.08864.
Software and documentation
- Stable Baselines 3 documentation — https://stable-baselines3.readthedocs.io/.
- Isaac Lab documentation — https://isaac-sim.github.io/IsaacLab/.
- MuJoCo documentation — https://mujoco.org/.
- Genesis (Genesis-Embodied-AI 2024) — https://github.com/Genesis-Embodied-AI/Genesis.
- leggedrobotics/legged_gym (ETH) — https://github.com/leggedrobotics/legged_gym. The canonical PPO + Isaac Gym pipeline.
- CleanRL — https://github.com/vwxyzjn/cleanrl. Single-file reference implementations.
- rl_games — https://github.com/Denys88/rl_games. Isaac Lab default backend.
- Crocoddyl (LAAS-CNRS) — DDP solver with RL hooks.
- MuJoCo MPC / Playground (DeepMind 2024) — https://github.com/google-deepmind/mujoco_mpc.
Standards
- ISO 10218-1:2011. Robots and robotic devices — Safety requirements — Part 1: Industrial robots. Constrains where a learned policy may be deployed.
- ISO/TS 15066:2016. Collaborative robots — Safety requirements. Force / power limits in collaborative operation — design constraint for any RL-controlled cobot.