Robot Learning & RL — From PPO Locomotion to VLA Foundation Models

The 2018-2026 maturation of reinforcement learning for physical robots: model-free policy gradient methods (PPO, SAC, TD3) trained at massive parallelism in GPU simulators (Isaac Lab, Brax, MuJoCo MJX), model-based methods (Dreamer, PILCO, PETS) for sample-efficient real-world learning, sim-to-real bridges (domain randomization, RMA — Rapid Motor Adaptation), and the explosion of robot foundation models — RT-1, RT-2, RT-X, OpenVLA, Octo, π0, π0.5, GR00T, Helix — that consolidated cross-embodiment manipulation into a few canonical recipes. This note is the technical companion to rl-for-control (control-focused), imitation-learning (LfD/diffusion/VLA-focused), and sim-to-real (simulator + DR-focused), with attention to the algorithms-and-platforms intersection: which RL method works on which embodiment, which simulator + training recipe produced which result, and where the field’s open problems live (sample efficiency, safe exploration, long-horizon generalization, cross-embodiment transfer, language conditioning).

1. At a glance

Reinforcement learning trains a policy $π_{θ} (a ∣ s)$ to maximize expected discounted return $J (θ) = E_{τ \sim π_{θ}} [\sum_{t} γ^{t} r_{t}]$ . The robotics-specific challenges are:

Sample efficiency. A real robot generates ~1000 transitions / minute. Atari DQN needed 200M transitions — infeasible on hardware. Solutions: sim, model-based RL, offline RL on logged data, sim-to-real transfer.
Safety during exploration. A random policy on a humanoid falls. On a $\$100\text{k}$ robot it falls expensively. Solutions: simulator training, safety filters (CBFs), constrained MDPs, residual RL over a stable nominal controller.
Reward design. Sparse rewards (“succeed or fail”) give no gradient. Dense rewards bias toward local optima. Reward shaping + curriculum + intrinsic motivation. Or skip rewards: imitation learning.
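Generalization. Train on terrain A, test on terrain B. Cross-embodiment: train on Franka, deploy on UR5. Before 2022 such transfer rarely worked; since 2022, VLAs pretrained on pooled multi-robot data have started to deliver it, usually after some fine-tuning.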
Long horizon. 1000-step tasks compound errors. Hierarchy (options, skills, latent plans) or transformer-based memory.

The 2024-2026 state of practice splits into three regimes:

Locomotion-class. Massively parallel PPO in Isaac Lab / Brax / MuJoCo MJX. 4096-65536 parallel envs. Train for hours, deploy as a closed-loop policy at 50-1000 Hz. Methods: PPO, SAC, RMA, teacher-student distillation. Embodiments: ANYmal, Cassie/Digit, Atlas, Unitree H1/G1/B2, Tesla Optimus, Figure 02, 1X NEO.
Manipulation-class. Imitation learning (BC, ACT, Diffusion Policy) on teleop data, plus residual RL or RL fine-tuning. Methods: Diffusion Policy, ACT, RT-2, OpenVLA, π0. Embodiments: Franka, UR5, xArm, ALOHA, Stretch, Tiago.
Foundation-model-class. Pretrained VLA → LoRA fine-tune for new task / embodiment. Methods: OpenVLA, π0.5, Octo, GR00T, Helix. Recipes: Open-X-Embodiment pretraining + 50-500 task-specific demos.

Where this sits. RL for control is the parent overview; sim-to-real is the deployment bridge; imitation learning is the demonstration-based companion; MPC is the model-based alternative often hybridized with RL; legged locomotion and manipulators are the primary embodiments; vision feeds VLAs.

First ask. Is the action space continuous or discrete? Continuous (joint targets, end-effector pose) → SAC or PPO; PPO is the workhorse. On-policy or off-policy? On-policy (PPO) is simpler but sample-hungry, which is fine in sim. Off-policy (SAC, TD3) for real data. Model-free or model-based? Model-based (Dreamer, PILCO, PETS) for roughly 10× or better sample efficiency, at the cost of model complexity. Do you have demonstrations? If yes, start with BC; fine-tune with RL only if needed. Single task or general? Single → train from scratch. Generalist → fine-tune VLA.

2. First principles — algorithm taxonomy

2.1 The MDP framework

A Markov Decision Process is $(S, A, P, r, γ)$ :

$S$ : state space (joint positions/velocities, sensor readings, images).
$A$ : action space (joint torques, motor targets, end-effector poses).
$P (s^{'} ∣ s, a)$ : transition dynamics.
$r (s, a)$ : reward function.
$γ \in [0, 1)$ : discount factor.

Policy $π (a ∣ s)$ maximizes $J (π) = E [\sum_{t} γ^{t} r_{t}]$ . Value functions: $V^{π} (s) = E_{π} [\sum_{t} γ^{t} r_{t} ∣ s_{0} = s]$ and $Q^{π} (s, a) = E_{π} [\sum_{t} γ^{t} r_{t} ∣ s_{0} = s, a_{0} = a]$ .
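As a concrete reading of the return and value definitions, the sketch below computes the discounted return for every step of one recorded episode. Averaging the first entry over many episodes that start in the same state would give a Monte Carlo estimate of $V^{π} (s)$. The function name and the sample rewards are illustrative assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum_k gamma^k * r_{t+k} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


episode_rewards = [1.0, 0.0, 2.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
```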

2.2 Policy gradient (REINFORCE)

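$\nabla_{θ} J (θ) = E_{π} [\nabla_{θ} \log π_{θ} (a ∣ s) \cdot Q^{π} (s, a)]$

The raw estimator has high variance; subtracting a state-dependent baseline such as $V^{π} (s)$ reduces it without biasing the gradient. This gives the advantage $A (s, a) = Q (s, a) - V (s)$ .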
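A minimal sketch of one policy-gradient sample with a baseline, assuming a softmax policy over discrete actions parameterized directly by its logits. The names `softmax` and `reinforce_grad` are illustrative, and a real implementation would average many such samples and backpropagate through a network.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(logits, action, q_value, baseline):
    """Single-sample estimate of grad_logits [log pi(a) * (Q - V)]."""
    probs = softmax(logits)
    advantage = q_value - baseline
    # For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi(i).
    return [((1.0 if i == action else 0.0) - p) * advantage
            for i, p in enumerate(probs)]


grad = reinforce_grad(logits=[0.0, 0.0], action=1, q_value=3.0, baseline=1.0)
print(grad)  # [-1.0, 1.0]
```

With a positive advantage the gradient raises the logit of the sampled action and lowers the others; a negative advantage does the opposite.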

2.3 DQN — Deep Q-Network (Mnih et al. DeepMind 2013/2015)

The breakthrough that started deep RL. Approximate $Q^{*} (s, a)$ with a deep network, train via Bellman error on a replay buffer:

$L (θ) = E_{(s, a, r, s^{'})} [(r + γ \max_{a^{'}} Q_{θ^{-}} (s^{'}, a^{'}) - Q_{θ} (s, a))^{2}]$

Target network $θ^{-}$ stabilizes training. Atari 49-game superhuman. Discrete actions only; rarely used for robotics control directly but spawned the field.
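The Bellman-error loss can be sketched on plain Python lists as below. The tuple layout and function names are assumptions for illustration, and the terminal-state handling (no bootstrap when `done` is true) is a standard convention that the formula above leaves implicit.

```python
def dqn_td_target(reward, next_q_target, gamma, done):
    """r + gamma * max_a' Q_target(s', a'); no bootstrap on terminal steps."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def dqn_loss(batch, gamma):
    """Mean squared Bellman error over a list of transitions.

    Each item is (q_sa, reward, next_q_target, done): q_sa is the online
    network's Q(s, a); next_q_target lists the target network's Q(s', .).
    """
    errors = [(dqn_td_target(r, nq, gamma, d) - q) ** 2
              for q, r, nq, d in batch]
    return sum(errors) / len(errors)


batch = [
    (1.0, 0.5, [2.0, 1.0], False),  # target = 0.5 + 0.5 * 2.0 = 1.5
    (0.0, 1.0, [9.0, 9.0], True),   # terminal step: target = 1.0
]
loss = dqn_loss(batch, gamma=0.5)   # (0.25 + 1.0) / 2
```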

2.4 A3C / A2C — Asynchronous Advantage Actor-Critic (Mnih 2016)

Parallel actor workers; synchronous (A2C) or asynchronous (A3C). Reduced variance via advantage. Predecessor to PPO.

2.5 TRPO — Trust Region Policy Optimization (Schulman 2015)

Constrains KL divergence between old and new policy:

$\max_{θ} E [\frac{π_{θ}}{π_{θ_{old}}} A]$ subject to $E [KL (π_{θ_{old}}, π_{θ})] \leq δ$

Second-order optimization via natural gradient. Stable but expensive.

2.6 PPO — Proximal Policy Optimization (Schulman 2017)

The workhorse of robotics RL. Approximates TRPO via a clipped objective:

$L^{CLIP} (θ) = E [\min (r_{θ} A, \text{clip} (r_{θ}, 1 - ϵ, 1 + ϵ) A)]$

where $r_{θ} = π_{θ} / π_{θ_{old}}$ . Clip $ϵ$ typically 0.2. First-order SGD; trains stably. Used in OpenAI Dactyl, RMA, Cassie running, the later ANYmal RL controllers, and most Isaac Gym / Isaac Lab legged and humanoid policies.
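To make the clipping behaviour concrete, here is a small sketch of the clipped surrogate evaluated on precomputed probability ratios and advantages. The function names are illustrative; production code would compute the ratios from log-probabilities and maximize the objective with an autodiff framework.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one sample."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Batch mean of the clipped surrogate (to be maximized)."""
    terms = [ppo_clip_term(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# Good action, ratio already above 1 + eps: the gain is capped at 1.2 * A.
capped = ppo_clip_term(1.5, 1.0)
# Bad action whose probability went up: the full penalty (-1.5) is kept.
uncapped = ppo_clip_term(1.5, -1.0)
```

The `min` makes the objective pessimistic: it removes the incentive to push the ratio further once a sample is already past the clip range in the helpful direction, but never hides a move in the harmful direction.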

2.7 DDPG — Deep Deterministic Policy Gradient (Lillicrap 2015)

Off-policy actor-critic for continuous actions. Deterministic policy $μ_{θ} (s)$ + critic $Q_{ϕ} (s, a)$ . Replay buffer + target networks. Sample efficient but brittle to hyperparams.

2.8 TD3 — Twin Delayed DDPG (Fujimoto 2018)

Three additions to DDPG: clipped double-Q (use min of two critics to reduce overestimation), delayed policy updates (update actor less often than critic), target policy smoothing (add noise to next-state action). Stable replacement for DDPG.
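Two of the three additions (clipped double-Q and target policy smoothing) live in the critic target, sketched below for a scalar action. The callables `target_actor`, `target_q1`, `target_q2` are placeholders for target networks, and the noise defaults are commonly used values assumed here for illustration. Delayed policy updates are a property of the training loop (update the actor once every few critic updates) and are not shown.

```python
import random


def td3_target(reward, next_state, gamma, target_actor, target_q1, target_q2,
               noise_std=0.2, noise_clip=0.5, act_limit=1.0, rng=random):
    """Critic target: r + gamma * min(Q1', Q2') at a smoothed next action."""
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    next_action = target_actor(next_state) + noise
    next_action = max(-act_limit, min(act_limit, next_action))
    # Clipped double-Q: take the smaller of the two target critics.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return reward + gamma * q_min


# Deterministic check with the noise switched off.
y = td3_target(1.0, None, 0.9,
               target_actor=lambda s: 0.5,
               target_q1=lambda s, a: a + 1.0,   # 1.5
               target_q2=lambda s, a: 2.0 * a,   # 1.0, the smaller one
               noise_std=0.0)
```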

2.9 SAC — Soft Actor-Critic (Haarnoja 2018)

Maximum-entropy off-policy actor-critic. Objective $J (π) = E [\sum_{t} γ^{t} (r_{t} + α H (π (\cdot ∣ s_{t})))]$ adds an entropy bonus to each step's reward. Two critics + auto-temperature tuning. Sample efficient; the standard choice when you have real-world data and want off-policy reuse.
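A sketch of how the entropy bonus and the temperature enter a SAC-style update, using scalars in place of network outputs. The function names are illustrative; `logp_next` stands for $\log π (a^{'} ∣ s^{'})$ of an action sampled from the current policy, and `target_entropy` is a user-chosen hyperparameter.

```python
import math


def sac_target(reward, gamma, q1_next, q2_next, logp_next, alpha):
    """Soft Bellman target: the entropy term appears as -alpha * log pi."""
    return reward + gamma * (min(q1_next, q2_next) - alpha * logp_next)


def temperature_loss(log_alpha, logp, target_entropy):
    """Loss minimized over log_alpha for automatic temperature tuning.

    When the policy's entropy (-logp) is above the target, minimizing this
    loss shrinks alpha; when it is below, alpha grows.
    """
    return -math.exp(log_alpha) * (logp + target_entropy)


y = sac_target(reward=1.0, gamma=0.5, q1_next=2.0, q2_next=3.0,
               logp_next=-1.0, alpha=0.1)   # 1 + 0.5 * (2.0 + 0.1)
```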

2.10 Model-based RL

Learn a dynamics model $\hat{P} (s^{'} ∣ s, a)$ from data; then plan or distill a policy.

PILCO — Probabilistic Inference for Learning Control (Deisenroth + Rasmussen 2011). Gaussian Process dynamics model + analytic policy gradient. Famously trained a cart-pole in 20 trials. Doesn’t scale to image observations.

PETS — Probabilistic Ensembles with Trajectory Sampling (Chua et al. 2018). Bootstrapped ensemble of dynamics models; MPC over the ensemble. Strong on MuJoCo continuous control. Used in real-world RL.

Dreamer V1/V2/V3 (Hafner-Lillicrap-Norouzi-Ba, DeepMind/Google 2019/2020/2023). Learn a latent dynamics model (RSSM, Recurrent State-Space Model) from pixels; train the policy via imagined rollouts in latent space. Dreamer V3 (2023) was the first algorithm reported to collect diamonds in Minecraft from scratch, without human data or curricula. Robotics deployments: DayDreamer (Wu-Escontrela-Hafner-Goldberg-Abbeel 2022), which learned quadruped walking directly on hardware.

MuZero / EfficientZero (Schrittwieser 2020; Ye-Liu-Kurutach 2021). Learn dynamics + value + policy jointly; plan via MCTS. EfficientZero achieves human-level Atari in 2 hours of data — 100× improvement over Rainbow DQN. Less used in robotics due to discrete-action assumption.

2.11 Hindsight Experience Replay (HER, Andrychowicz 2017)

For sparse-reward goal-conditioned RL. Relabel failed trajectories with achieved-goal as if it were the desired goal. Turns every trajectory into a successful one for some goal. Critical for pushing / grasping tasks.
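The relabeling step can be sketched as follows, assuming the simplest "final" strategy (use the goal actually reached at the end of the episode) and a scalar goal for brevity. The dictionary keys, `her_relabel`, and `sparse_reward` are illustrative names, not an API from any particular library.

```python
def sparse_reward(achieved, desired, tol=0.05):
    """0 on success, -1 otherwise (a common sparse goal-reaching reward)."""
    return 0.0 if abs(achieved - desired) <= tol else -1.0


def her_relabel(trajectory, reward_fn):
    """Copy an episode, pretending its final achieved goal was the target."""
    new_goal = trajectory[-1]["achieved_goal"]
    relabeled = []
    for step in trajectory:
        relabeled.append({
            "obs": step["obs"],
            "action": step["action"],
            "achieved_goal": step["achieved_goal"],
            "desired_goal": new_goal,
            # Rewards are recomputed against the substituted goal.
            "reward": reward_fn(step["achieved_goal"], new_goal),
        })
    return relabeled


# A failed episode: the gripper aimed for 1.0 but only reached 0.5.
episode = [
    {"obs": 0.0, "action": 0.1, "achieved_goal": 0.1, "desired_goal": 1.0, "reward": -1.0},
    {"obs": 0.1, "action": 0.2, "achieved_goal": 0.3, "desired_goal": 1.0, "reward": -1.0},
    {"obs": 0.3, "action": 0.2, "achieved_goal": 0.5, "desired_goal": 1.0, "reward": -1.0},
]
hindsight = her_relabel(episode, sparse_reward)
```

Both the original and the relabeled transitions would normally be stored in the replay buffer, so the agent sees at least some non-negative rewards from the start.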

2.12 Residual RL (Silver-Allen-Tenenbaum-Kaelbling 2018, Johannink-Bahl-Nair 2019)

Decompose the commanded action as $a = π_{nominal} (s) + π_{residual} (s)$ . Nominal is a hand-coded or model-based controller; residual is learned. Sample efficient (only the residual is learned), safe (the nominal controller provides a baseline), interpretable. Used in industrial-arm RL (Universal Robotics’ residual controllers), in surgical RL (Intuitive da Vinci research projects), and in residual whole-body humanoid control.
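A minimal sketch of the composition for a scalar action. The proportional controller, the `residual_scale` factor, and the final clipping are assumptions added for illustration; bounding the residual is a common way to keep the learned part from overriding the nominal controller early in training.

```python
def residual_action(state, nominal, residual, residual_scale=0.1, limit=1.0):
    """a = nominal(s) + scale * residual(s), clipped to the actuator range."""
    action = nominal(state) + residual_scale * residual(state)
    return max(-limit, min(limit, action))


def p_controller(state, setpoint=0.5, gain=2.0):
    """Hand-coded nominal controller: proportional position control."""
    return gain * (setpoint - state)


def learned_residual(state):
    """Stand-in for a trained network; here it always outputs 1.0."""
    return 1.0


a = residual_action(0.25, p_controller, learned_residual)  # 0.5 + 0.1 * 1.0
```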

2.13 Offline RL

Train from a fixed dataset of logged transitions; no environment interaction. Critical when online exploration is unsafe or expensive.

BCQ — Batch-Constrained Q-Learning (Fujimoto 2019). Restricts policy to actions seen in data.
CQL — Conservative Q-Learning (Kumar 2020). Penalizes Q-values of unseen actions; conservative estimate prevents OOD exploitation.
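IQL (Implicit Q-Learning, Kostrikov 2021). Never evaluates OOD actions: fits the value function by expectile regression over in-dataset actions only, then extracts the policy by advantage-weighted regression.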
Decision Transformer (Chen 2021). Cast offline RL as sequence modeling: condition on return-to-go; predict next action with GPT-style transformer.
Trajectory Transformer (Janner-Levine 2021). Similar; jointly predicts states, actions, rewards.
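For the sequence-modeling view used by Decision Transformer, the sketch below turns one logged episode into the interleaved (return-to-go, state, action) stream that a causal transformer would be trained on. The tuple tags and function names are illustrative, and undiscounted returns-to-go are assumed.

```python
def returns_to_go(rewards):
    """R_t = sum of rewards from step t to the end of the episode."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def build_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) per timestep."""
    sequence = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        sequence.extend([("rtg", g), ("state", s), ("action", a)])
    return sequence


seq = build_sequence(states=["s0", "s1", "s2"],
                     actions=["a0", "a1", "a2"],
                     rewards=[1, 0, 2])
print(seq[:3])  # [('rtg', 3.0), ('state', 's0'), ('action', 'a0')]
```

At test time the first return-to-go token is set to the desired episode return and decremented by each observed reward.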

2.14 Cross-embodiment foundation models (VLAs)

The 2022+ wave: transformer / VLM backbones trained on pooled robot data + internet image-text data:

RT-1 (Brohan-Brown-Carbajal et al. Google 2022). 35M-param transformer; ~130k demonstrations covering 700+ tasks, collected with a fleet of 13 Everyday Robots mobile manipulators. First “robot transformer.” Action: tokenized discrete bins, 256 per dimension.
RT-2 (Brohan et al. Google DeepMind 2023). Co-finetune a PaLM-E / PaLI-X vision-language model on robot trajectories with text-tokenized actions. 5B-55B params. Inherits internet semantics (“pick up the can of Pepsi”). Slow inference (1-3 Hz).
Open-X-Embodiment + RT-X (Padalkar et al., 21 institutions 2023). Pool 22 embodiments × 527 skills × ~1M demos. Cross-embodiment pretraining boosts per-embodiment performance.
OpenVLA (Kim-Pertsch-Karamcheti et al. Stanford / TRI 2024). 7B-param VLA on Llama2-7B + DINOv2 + SigLIP fused vision. Open weights. LoRA-fine-tunable on a single RTX 4090.
Octo (Octo Model Team Berkeley 2024). 27M / 93M open-source diffusion-transformer generalist. Faster inference (50 Hz). Trained on Open-X-Embodiment.
π0 + π0.5 (Physical Intelligence 2024/2025). 3B-param flow-matching VLA. Action chunking supports control at up to 50 Hz. Trained across multiple single-arm, bimanual, and mobile platforms. π0.5 targets generalization beyond the training distribution, such as previously unseen homes.
GR00T N1 + N1.5 (NVIDIA 2024-2025). Humanoid-focused foundation model. Trained on human-MoCap + sim + real-robot mixture. Targets Unitree H1/G1, Apptronik Apollo, 1X NEO, Boston Dynamics electric Atlas via Isaac Lab.
Helix (Figure AI 2024-2025). Dual-system (S1 fast 200 Hz, S2 slow 5-10 Hz) onboard Figure 02 humanoid. Closed weights.
Cosmos (NVIDIA 2025). World-foundation-model for synthetic data generation; pair with GR00T for sim-to-real.

3. Locomotion via RL — the canonical wave

3.1 Tan-Zhang-Coumans et al. 2018 Minitaur

Jie Tan, Tingnan Zhang, Erwin Coumans (Google Brain). Trained PPO policy in PyBullet for Minitaur quadruped (Ghost Robotics). DR over friction, mass, motor delay. Successfully transferred to hardware — first non-trivial sim-to-real locomotion RL on a real-world robot. Galvanized the field.

3.2 Hwangbo et al. ETH 2019 ANYmal Science Robotics

Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, et al. (ETH RSL, Hutter group). Trained a TRPO policy in RaiSim (ETH’s in-house sim) for ANYmal. Key insight: actuator networks, i.e. an MLP trained to predict joint torque from commanded position + history, capturing series-elastic actuator dynamics. Transfer 99%-successful. Science Robotics paper “Learning agile and dynamic motor skills for legged robots”, the field’s coming-out party.

3.3 Lee et al. ETH 2020 ANYmal rough terrain Science Robotics

Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, Marco Hutter. Teacher-student distillation: teacher policy has access to terrain heightmap; student gets only proprioception. Student must infer terrain from joint history. Deployed ANYmal across grass, gravel, snow, ice. The teacher-student pattern became standard.

3.4 Kumar-Fu-Pathak-Malik 2021 RMA (Berkeley)

Ashish Kumar, Zipeng Fu, Deepak Pathak, Jitendra Malik. RMA: Rapid Motor Adaptation. Train the teacher with privileged dynamics input $z$ ; train an adaptation module to predict $\hat{z}$ from recent state-action history. Deploy the adaptation module online with no gradient steps. Demonstrated on the Unitree A1 across diverse terrains. Now a default sim-to-real architecture for legged robots.

3.5 Margolis-Yang-Paigwar-Chen-Agrawal 2023 MIT cheetah parkour

Gabriel Margolis, Ge Yang, Kartik Paigwar, Tao Chen, Pulkit Agrawal at MIT + CMU. Trained MIT Mini-Cheetah in Isaac Gym to do parkour: stair climbing, gap jumping, vertical leaps. Two-stage curriculum + DR + reward shaping. Open-sourced via Walk-These-Ways.

3.6 Cheng-Shi-Agarwal-Pathak 2024 CMU extreme parkour

Xuxin Cheng, Kexin Shi, Ananye Agarwal, Deepak Pathak. Extreme parkour: 80 cm leaps, climbing 50 cm walls. Single end-to-end policy from proprioception + depth. Demonstrated Unitree A1 and Mini-Cheetah doing skateboarding-park-style obstacles.

3.7 Zhuang-Fu-Wang-Xu et al. 2024 Robot Parkour

Ziwen Zhuang et al. — separate parallel work, similar vein. Open-source release of parkour policies for Unitree Go1 and A1.

3.8 Humanoid RL 2023-2025

After quadrupeds, the humanoid wave:

OpenAI Dactyl (Andrychowicz et al. 2019/2020). Shadow Hand dexterous in-hand manipulation; PPO + LSTM + ADR (Automatic Domain Randomization).
Cassie / Digit (Agility Robotics + Crowley-Dao-Hurst 2022). 5 km outdoor run; PPO single end-to-end policy.
Atlas DRL (Boston Dynamics 2024). Production RL controller replacing MPC. Major update to commercial fleet.
Unitree H1 + G1 (Unitree 2024+). Open-source RL policies; community-trained checkpoints widely shared.
Stanford HumanPlus (Fu-Zhao-Wu-Wetzstein-Finn 2024). Humanoid shadowing and imitation from human motion data. SMPL retargeting.
OmniH2O (CMU + NVIDIA 2024). H1 whole-body teleop + retargeting.
ALOHA 2 (Google DeepMind) + UMI (Stanford / Columbia) (2024). Low-cost teleoperation and handheld-gripper data collection feeding imitation pipelines.

4. Manipulation via RL

4.1 Levine-Finn-Darrell-Abbeel 2016 GPS

Sergey Levine, Chelsea Finn, Trevor Darrell, Pieter Abbeel (Berkeley). Guided Policy Search — alternate between trajectory optimization (iLQG over local linear dynamics) and policy distillation. Trained PR2 robot to assemble Lego, hang clothes. Early demonstration that deep RL on real robots is feasible.

4.2 Andrychowicz 2017 HER

Marcin Andrychowicz, Filip Wolski, Alex Ray, et al. (OpenAI). Hindsight Experience Replay — relabel failed goals as achieved alternative goals. Solved Fetch-Push, Fetch-Slide, Fetch-Pick-And-Place from sparse rewards.

4.3 OpenAI Dactyl Rubik’s Cube 2019

Ilge Akkaya et al. (OpenAI). Shadow Hand 24-DoF. PPO + LSTM. Automatic Domain Randomization widens the randomization ranges as the policy’s success rate rises. 13,000 years of simulated experience. Solved cube on hardware. Cited as the canonical proof of sim-to-real for dexterous manipulation.

4.4 Kalashnikov QT-Opt Google 2018

Dmitry Kalashnikov et al. (Google Brain). Real-world bin-picking RL. 7 KUKA arms collected over 580,000 real-world grasp attempts across several months. Q-function + cross-entropy method action selection. Achieved 96% grasp success on novel objects. Demonstrated that real-world RL at scale works given enough robots + time.

4.5 RoboCat DeepMind 2023

Konstantinos Bousmalis et al. (DeepMind). Gato-style transformer trained on demonstrations from many robots + tasks. Self-improvement loop: deploy on robot, collect more data, retrain. Cross-embodiment positive transfer. Predecessor to industrial-scale VLAs.

4.6 Residual RL (Johannink-Bahl-Nair-Luo-Kumar-Loskyll-Ojea-Solowjow-Levine 2019)

Tobias Johannink et al. (Siemens + Sergey Levine’s group at UC Berkeley). Residual policy on top of a hand-designed controller for a contact-rich industrial insertion task. Reaches sub-mm precision with minutes of real-world data. The first compelling industrial demonstration of residual RL.

5. Practical math — training compute and sample efficiency

5.1 PPO compute budget for locomotion

Embodiment	Sim	Parallel envs	Steps to converge	Wall-clock	Hardware
MIT Mini-Cheetah	Isaac Gym	4096	1B	4 hr	RTX 3090
Unitree A1	Isaac Gym	4096	1B	4 hr	RTX 3090
ANYmal-C	Isaac Sim	4096	2B	8 hr	RTX 4090
Unitree H1 humanoid	Isaac Lab	4096	5B	24 hr	H100
Cassie (Agility)	MuJoCo	1024	500M	12 hr	A100
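The rows above imply an end-to-end training throughput that is easy to back out. The helper names below are illustrative, and the inputs are simply the table's own figures. End-to-end rates computed this way include policy updates and logging, so they would be expected to sit below a simulator's raw step rate.

```python
def effective_steps_per_second(total_steps, wall_clock_hours):
    """Environment steps per second averaged over the whole training run."""
    return total_steps / (wall_clock_hours * 3600.0)


def steps_per_env(total_steps, num_envs):
    """How many steps each parallel environment contributes."""
    return total_steps / num_envs


# Unitree A1 row: 1B steps, 4096 envs, 4 hours.
a1_rate = effective_steps_per_second(1e9, 4)      # about 69,000 steps/s
a1_per_env = steps_per_env(1e9, 4096)             # about 244,000 steps/env

# Unitree H1 row: 5B steps, 4096 envs, 24 hours.
h1_rate = effective_steps_per_second(5e9, 24)     # about 58,000 steps/s
```

At a 50 Hz control rate, roughly 244,000 steps per environment corresponds to about 81 minutes of simulated time in each of the 4096 environments.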

5.2 Sample efficiency comparison

Algorithm     | Sample efficiency relative to DQN
DQN baseline  | 1.0
Rainbow       | 5×
SAC           | 10×
TD3           | 8×
PETS          | 30× (small tasks)
Dreamer V3    | 50× (some tasks)
MuZero        | 100×
EfficientZero | 500×
Offline RL    | n/a (uses existing data)
Imitation     | 10000× (if demonstrations available)

5.3 Memory and parallelism — Isaac Lab scaling

GPU         | Parallel envs | Step rate
RTX 3060    | 1024          | 250k steps/s
RTX 3090    | 4096          | 1.0M steps/s
RTX 4090    | 8192          | 1.5M steps/s
A100 80GB   | 16384         | 3.0M steps/s
H100 80GB   | 32768         | 6.0M steps/s
H200 144GB  | 65536         | 10M steps/s

5.4 Cross-embodiment data scaling (Open-X-Embodiment)

Trained on	Per-embodiment improvement
Single embodiment, 10k demos	baseline
Add 5 more embodiments (50k mixed)	+15%
Open-X full mixture (1M, 22 embodiments)	+50%
+ LoRA fine-tune (50 task demos)	+60-80%

5.5 VLA inference latency

Model        | Params | Inference rate
RT-2 55B     | 55B    | 1-3 Hz (server)
RT-2 5B      | 5B     | 5 Hz (server)
OpenVLA      | 7B     | 6 Hz (A100)
π0           | 3B     | 50 Hz (H100)
π0.5         | 3B     | 50 Hz (H100)
Octo-Base    | 93M    | 50 Hz (RTX 4090)
Octo-Small   | 27M    | 100 Hz (RTX 4090)
GR00T N1     | 2B     | 50 Hz (Jetson Thor)
Helix S1     | small  | 200 Hz (Figure 02 onboard)
Helix S2     | large  | 5-10 Hz (Figure 02 onboard)

5.6 Reward shaping for locomotion

Standard reward terms for a quadruped:

$r = w_{v} \cdot v_{forward} - w_{θ} \cdot |\dot{θ}_{yaw}|^{2} - w_{τ} \cdot \|τ\|^{2} - w_{a} \cdot \|\dot{a}\|^{2} - w_{o} \cdot \|\dot{ω}_{b}\|^{2} + w_{s}$

Typical weights: $w_{v} = 1.0$ , $w_{θ} = 0.5$ , $w_{τ} = 0.0001$ , $w_{a} = 0.001$ , $w_{o} = 0.05$ , survival bonus $w_{s} = 0.5$ per step. Different teams use 8-30 reward terms; a handful of well-tuned terms (roughly 5-7) is often reported to be sufficient.
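The reward formula translates directly into code. The sketch below uses the typical weights quoted above; the argument names and the 12-joint example are assumptions, with `action_rate` standing for $\dot{a}$ and `base_ang_acc` for $\dot{ω}_{b}$.

```python
DEFAULT_WEIGHTS = {"v": 1.0, "theta": 0.5, "tau": 0.0001,
                   "a": 0.001, "o": 0.05, "s": 0.5}


def squared_norm(xs):
    return sum(x * x for x in xs)


def locomotion_reward(v_forward, yaw_rate, torques, action_rate,
                      base_ang_acc, weights=None):
    """Forward-velocity reward minus smoothness/effort penalties plus bonus."""
    w = weights or DEFAULT_WEIGHTS
    return (w["v"] * v_forward
            - w["theta"] * yaw_rate ** 2
            - w["tau"] * squared_norm(torques)
            - w["a"] * squared_norm(action_rate)
            - w["o"] * squared_norm(base_ang_acc)
            + w["s"])  # survival bonus paid every step


# 1 m/s straight ahead, 10 N·m on each of 12 joints, perfectly smooth actions.
r = locomotion_reward(v_forward=1.0, yaw_rate=0.0, torques=[10.0] * 12,
                      action_rate=[0.0] * 12, base_ang_acc=[0.0] * 3)
# 1.0 - 0.0001 * 1200 + 0.5 = 1.38
```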

5.7 Curriculum design

Standard curriculum stages:

Flat ground, low velocity command.
Flat ground, full velocity range.
Small terrain perturbations.
Stairs (up + down).
Slopes 0-30°.
Random rough terrain.
Push disturbances.
Payload mass changes.

Advance to next stage when success rate at current stage > 90% over 200 episodes.
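The advancement rule can be sketched as a small bookkeeping class. The class and stage names are illustrative; the 90% threshold and 200-episode window are the values stated above, and clearing the window after each promotion is an added assumption so that every stage is judged only on its own episodes.

```python
from collections import deque


class Curriculum:
    """Advance one stage when the windowed success rate exceeds a threshold."""

    def __init__(self, stages, threshold=0.9, window=200):
        self.stages = stages
        self.threshold = threshold
        self.window = window
        self.index = 0
        self.results = deque(maxlen=window)

    def record(self, success):
        """Log one episode outcome and return the stage for the next episode."""
        self.results.append(bool(success))
        window_full = len(self.results) == self.window
        rate = sum(self.results) / self.window
        last_stage = self.index == len(self.stages) - 1
        if window_full and rate > self.threshold and not last_stage:
            self.index += 1
            self.results.clear()  # judge the new stage on fresh episodes
        return self.stages[self.index]


stages = ["flat_slow", "flat_full", "small_perturbations", "stairs",
          "slopes", "rough", "pushes", "payload"]
curriculum = Curriculum(stages)
```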

6. Benchmark suites & simulators

6.1 Manipulation benchmarks

Suite	Authors	Embodiment	Tasks	Notes
RLBench	James et al. Imperial 2020	Franka	100+	CoppeliaSim-based
robosuite	Zhu-Lin et al. Stanford 2020	Franka, Sawyer	9 base	MuJoCo
ManiSkill / ManiSkill 3	Mu et al. UCSD 2021 / 2024	Various	30+	SAPIEN; large + curated
MetaWorld	Yu et al. Stanford 2019	Sawyer	50	Meta-RL focus
D’Claw / D’Hand (ROBEL)	Ahn-Yu et al. 2019	Custom	dexterous	Real-world testbed
PerAct	Shridhar et al. UW / NVIDIA 2022	Franka	language	Voxel-based; perceiver
Calvin	Mees et al. Freiburg 2022	Franka	language	Long-horizon language tasks
VLABench	various 2024	mixed	VLA eval	Standardized VLA eval
LIBERO	Liu et al. UT Austin 2023	Franka	continual	Lifelong learning

6.2 Locomotion benchmarks

Suite	Authors	Embodiment	Notes
DM Control Suite	Tassa et al. DeepMind 2018	MuJoCo humanoid + cheetah + etc	Continuous control standard
HumanoidBench	Sferrazza-Huang-Lin Berkeley 2024	Humanoid	Whole-body benchmark
Walk These Ways	Margolis-Agrawal MIT 2023	Mini-Cheetah	Open-source parkour
Quadruped-Gym (Isaac Lab)	NVIDIA 2023	ANYmal, Unitree	Production-grade
Locomotion-Bench	various	Cassie, Digit, A1	Cross-embodiment
Loco-Manip Bench	various 2024	Quadruped + arm	Mobile manipulation

6.3 Simulators (cross-reference with sim-to-real)

Simulator	Owner	Strength	RL-class
Isaac Lab / Isaac Sim	NVIDIA	GPU-resident, photorealistic	Manipulation + Locomotion
MuJoCo MJX	DeepMind	JAX, GPU-vectorized	Manipulation + Locomotion
Brax	Google	Pure JAX	Locomotion + research
Genesis	CMU 2024	Unified rigid + soft + fluid	Emerging
robosuite	Stanford ARISE	MuJoCo + manipulation focus	Manipulation
ManiSkill (SAPIEN)	UCSD	Manipulation benchmarks	Manipulation
RLBench	Imperial	CoppeliaSim manipulation	Manipulation
MuJoCo MPC	Howell-Gileadi-Tunyasuvunakool-Zakka-Erez-Tassa DeepMind 2022	Real-time MPC + RL	Hybrid
ROBEL	Ahn-Yu et al. 2019	Real-world cheap testbed	Real-world RL

7. Design heuristics

Start with PPO. It’s the workhorse; try other algorithms only after PPO clearly fails. SAC is the alternative for off-policy needs.
Spend 80% of time on the reward function. A good policy on a bad reward solves the wrong task. Iterate the reward in sim until visual behavior matches intent.
Normalize observations. Subtract mean, divide by std. Otherwise large-magnitude inputs dominate gradients.
Use action smoothing. Penalize $∣∣ \overset{a}{˙} ∣ ∣^{2}$ in reward; clip action change rate. Otherwise sim-only “vibration” exploits emerge.
Train at deployment control rate. A policy trained at 1000 Hz that runs at 50 Hz on hardware fails. Match rates.
Domain randomization is non-negotiable. Skip DR → sim policy fails on hardware. Start with literature defaults; tighten when sysid improves.
Use teacher-student. Privileged-teacher (sees true dynamics) → student (sees only deployable observations). Reliably better than direct student training.
Hand-design observations carefully. Including a feature that’s noisy on hardware but clean in sim → policy keys on it and fails. If in doubt, omit.
Curriculum > flat training. A gradual curriculum (slowly increasing difficulty) typically reaches the same performance faster than curriculum-free training; speedups of 2-10× in wall-clock are commonly quoted.
Save policies often + log everything. RL training is unstable; the best policy is often not the latest. Keep a rolling top-5 and replay videos.
Test the policy on the failure modes you expect. Specifically test stairs, payload, pushes. Headline metrics hide tail-case failures.
Don’t fully fine-tune VLAs by default. Start with LoRA + small learning rate. Full fine-tuning on a small task dataset tends to erode the pretrained priors.
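For the "normalize observations" heuristic in the list above, one common approach is a running mean/variance tracker updated as data streams in. The sketch below uses Welford's online algorithm for a single scalar feature; the class name and `eps` constant are assumptions, and a real pipeline would keep one tracker per observation dimension and freeze the statistics at deployment.

```python
import math


class RunningNorm:
    """Online mean/variance (Welford) for one observation feature."""

    def __init__(self, eps=1e-8):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        # Fall back to unit variance until at least one sample has been seen.
        var = self.m2 / self.n if self.n > 0 else 1.0
        return (x - self.mean) / math.sqrt(var + self.eps)


norm = RunningNorm()
for joint_velocity in [1.0, 2.0, 3.0]:
    norm.update(joint_velocity)
z = norm.normalize(3.0)   # about 1.22 standard deviations above the mean
```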

8. Failure modes & debugging

Reward hacking. Policy maximizes reward proxy but doesn’t accomplish task. Detect: high reward but visually incorrect behavior. Fix: tighter reward design, sparse-reward shaping with intrinsic motivation.
NaN losses. Optimization diverges. Cause: too-large learning rate, normalization issue, unstable critic. Fix: gradient clipping, value-loss clipping, lower LR.
Sim-only exploit. Policy works in sim, fails in real. Cause: physics-engine-specific quirks. Fix: DR + sysid + match control rate + action smoothness.
Catastrophic forgetting. Policy good at task A, retrained on task B, fails A. Fix: conservative updates (small PPO clip range or a KL penalty toward the old policy); replay buffer with old episodes; multi-task training.
Mode collapse. Continuous-action policy outputs same action regardless of state. Cause: low entropy, too-deterministic policy. Fix: entropy bonus, SAC with auto-temperature.
Slow training. PPO not converging after expected steps. Cause: reward poorly shaped, normalization missing, network too small/big. Fix: sanity-check on a toy task first, then normalize observations and advantages and re-check reward scale.
OOD generalization failure. Train on terrain A, fail on terrain B. Fix: more DR, broader curriculum, terrain randomization.
Critic divergence. Value estimates explode. Cause: bootstrapping with bad target. Fix: target network, value-loss clipping, GAE.
Action saturation. Policy commands joint torques at limit always. Fix: torque penalty in reward, clipping in environment, lower action scale.
Real-world inverse-kinematics blow-up. Policy commands end-effector position outside reachable workspace. Fix: kinematic feasibility filter, train with workspace constraints.
Failure to recover from disturbance. Push the robot — it falls. Fix: train with random push disturbances in DR.

9. Case studies

9.1 OpenAI Dactyl Rubik’s Cube (2018-2019)

Akkaya et al. OpenAI. Shadow Hand 24-DoF, MuJoCo sim, PPO + LSTM + ADR. Trained ~13,000 simulated years (8 months wall-clock, ~$3M GPU). Solved cube on hardware ~60% of the time. The single most-cited sim-to-real RL milestone. Later post-mortems noted narrow generalization (works on Cubes, not on novel objects), but the technique (PPO + LSTM memory + ADR) became standard.

9.2 ANYmal-C in production (ANYbotics 2020-2026)

ANYbotics (ETH spin-out) produces ANYmal-C/D for industrial inspection. Initial 2020 controllers were hand-coded MPC. 2022+ RL policies trained in RaiSim, distilled from teacher-student, deployed to customer fleets (BP Mad Dog, Shell Nyhamna, Statkraft, ÖBB rail). >99% session completion rate. Demonstrates that sim-to-real RL is production-grade.

9.3 Cassie 5 km outdoor run (Crowley-Dao-Hurst 2022)

Helei Duan, Jeremy Dao, Jonathan Hurst (Agility Robotics / OSU). Cassie biped completes a 5 km outdoor run on the Oregon State University campus in about 53 minutes (roughly 1.6 m/s average). Single end-to-end RL policy (PPO) trained in MuJoCo + Isaac Gym. DR over friction, mass, motor delay, inclines, pushes. First demonstration of end-to-end RL bipedal running outdoors. Precursor to all subsequent humanoid RL work.

9.4 MIT Mini-Cheetah parkour (Margolis et al. 2023)

Gabriel Margolis, Ge Yang, Kartik Paigwar, Tao Chen, Pulkit Agrawal (MIT). Two-stage curriculum: blind locomotion, then vision-guided. Stairs, gaps up to 60 cm, vertical leaps. Open-sourced “Walk These Ways” (WTW), which became a reference codebase. Sparked the parkour wave (Cheng et al. CMU, Zhuang et al., etc.).

9.5 OpenAI Rubik’s Cube → Tesla Optimus (claimed) / Figure / 1X / Apptronik

The Dactyl + Cassie + ANYmal recipe (PPO + DR + teacher-student, now usually run in Isaac Lab) is what most humanoid programs build on. Tesla Optimus walks outdoors (claimed RL+MPC hybrid); Figure 02 + Helix combines VLA (S2) + fast reactive controller (S1); 1X NEO Gamma uses cable-driven hardware + learned policies; Apptronik Apollo similar. As of 2026 many humanoid OEMs appear to be converging on the NVIDIA Isaac Lab + GR00T pipeline.

9.6 Diffusion Policy + ALOHA (Stanford 2023-2024)

Cheng Chi, Siyuan Feng, Yilun Du et al. (Columbia + TRI + MIT). Diffusion Policy (RSS 2023). Stanford Mobile ALOHA (2024). Bimanual dexterous tasks (folding, cooking, dishwashing) via teleop + imitation policies (ACT, Diffusion Policy) + co-training. Demonstrated 30-minute end-to-end household task completion. Watershed for imitation-learning-based manipulation.

9.7 Open-X-Embodiment + RT-X (21 institutions 2023)

Padalkar et al.: 21 institutions pooled robot data from 34 labs → 1M trajectories, 22 embodiments, 527 skills. RT-1-X trained on the union improved over per-lab RT-1 by ~50%. Established cross-embodiment training as standard. Foundation for OpenVLA, Octo, π0.

9.8 OpenVLA (Stanford / TRI 2024)

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al. (Stanford + TRI + UC Berkeley + Google). 7B-param VLA. Llama2-7B language backbone + DINOv2 + SigLIP fused vision. Trained on OXE. Open weights + LoRA fine-tuning recipe. First widely-deployable open VLA. Single RTX 4090 can LoRA-fine-tune. Powered hundreds of academic + startup deployments through 2025.

9.9 π0 + π0.5 (Physical Intelligence 2024-2025)

Physical Intelligence (PI) was founded by a team including Sergey Levine, Karol Hausman, and Chelsea Finn. π0: 3B-param flow-matching VLA. Trained on demonstrations from 7 robot platforms. Action chunking supports control at up to 50 Hz. Public demos: laundry folding, table bussing, packing groceries. π0.5 (2025): generalization beyond the training distribution, including previously unseen homes. Open weights for research. PI’s commercial demos in 2025 spawned dozens of integrator deployments.

9.10 GR00T N1 + Cosmos (NVIDIA 2024-2025)

GR00T (Generalist Robot 00 Technology): NVIDIA’s humanoid foundation model. Trained on a mixture of human-MoCap data, sim trajectories, real-robot demonstrations. Embodiment partners: Unitree H1/G1, Apptronik Apollo, 1X NEO, Fourier GR-1, Boston Dynamics electric Atlas. Cosmos: NVIDIA’s world-foundation-model for synthetic data generation. GR00T + Cosmos + Isaac Lab + Jetson Thor is NVIDIA’s full-stack humanoid offering — sell the compute and the model to humanoid OEMs lacking in-house RL/IL talent.

9.11 Figure Helix (2025)

Figure AI’s onboard VLA for Figure 02 humanoid. Dual-system: S1 (200 Hz reactive ViT-style) + S2 (5-10 Hz large VLM reasoning). Demo: two humanoids cooperatively put groceries away in a kitchen, with continuous voice instruction. First public humanoid VLA running fully onboard at production rates. Sparked the humanoid foundation model deployment wave at Figure, 1X, Apptronik, NVIDIA.

9.12 Toyota Research Institute Diffusion Policy + Hyperdiffusion (2024-2025)

TRI built a Diffusion Policy variant (sometimes called “Hyperdiffusion”) for robotic manipulation. Trained on ~1000 teleop demos per task. Demonstrated 200+ tasks deployed on TRI’s research arms. Released open-weight checkpoints for several manipulation primitives. Validated diffusion policies in industrial-strength integration.

9.13 Decision Transformer for offline RL (Chen et al. UC Berkeley 2021)

Lili Chen, Kevin Lu, Aravind Rajeswaran et al. Cast offline RL as sequence modeling: condition on return-to-go; predict next action with causal transformer. Strong on D4RL benchmarks. Predecessor to all transformer-based VLA work — same architectural primitive.

9.14 Wu-Escontrela-Hafner-Goldberg-Abbeel 2022: DayDreamer on a real quadruped

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, Pieter Abbeel (UC Berkeley). Trained a DreamerV2 (Hafner 2020) latent-imagination policy directly on a real Unitree A1 quadruped from onboard proprioceptive sensing. Walking learned in about 1 hour of real-world data. Demonstrates model-based RL can be sample-efficient enough for direct real-world training without sim.

10. Safe and constrained RL

10.1 Constrained MDPs

A CMDP adds constraint signals $c_{i} (s, a)$ and budgets $d_{i}$ to the standard MDP. The policy must satisfy $E [\sum_{t} c_{i} (s_{t}, a_{t})] \leq d_{i}$ while maximizing reward. Methods:

CPO — Constrained Policy Optimization (Achiam 2017). Trust-region method with constraint line-search.
Lagrangian methods. Treat constraints with dual variables; jointly optimize. Standard in safe-RL libraries (Safe-RLlib, OmniSafe).
Reward-augmentation. Push constraint violation into the reward; brittle but simple.
CBF — Control Barrier Functions (Ames-Coogan-Egerstedt 2019). Filter actions through a safety layer that guarantees the system stays in a safe set.

Used in industrial deployments where regulatory or physical safety constraints are hard requirements. ANYmal’s RL controller includes CBF-style joint limit + torque limit filters at deployment.
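The Lagrangian approach from the list above reduces to two small pieces: a penalized reward for the policy update and a dual-ascent step on the multiplier. The sketch below shows both for a single constraint; the function names and learning rate are illustrative assumptions.

```python
def penalized_reward(reward, cost, lmbda):
    """Per-step reward seen by the policy optimizer: r - lambda * c."""
    return reward - lmbda * cost


def lagrangian_step(lmbda, episode_cost, budget, lr=0.05):
    """Dual ascent: raise lambda when the cost budget d is exceeded.

    The multiplier is projected back to lambda >= 0, so a satisfied
    constraint eventually stops influencing the reward.
    """
    return max(0.0, lmbda + lr * (episode_cost - budget))


lmbda = 0.0
# Measured episode cost 3.0 against a budget d = 1.0: the penalty grows.
lmbda = lagrangian_step(lmbda, episode_cost=3.0, budget=1.0, lr=0.5)  # 1.0
shaped = penalized_reward(reward=1.0, cost=0.5, lmbda=lmbda)          # 0.5
```

In practice the two updates alternate: a few policy-gradient steps on the penalized reward, then one multiplier step using the latest average episode cost.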

10.2 Safe exploration during training

Recoverable states. Start in safe state; abort and reset when leaving recovery region.
Pretrained safety filter. Train a critic that predicts “will this action lead to a failure within N steps?”; only sample safe actions during exploration.
Human-in-the-loop. Operator vetoes dangerous actions; learns over time. Used in surgical-robot RL projects.

10.3 Risk-sensitive RL

Standard RL optimizes the expectation. Risk-sensitive variants:

CVaR (Conditional Value at Risk): optimize the expected return over the worst $α$ fraction of outcomes (for example, the worst 5% of episodes).
Distributional RL (Bellemare-Dabney-Munos 2017 C51, IQN, QR-DQN). Learn the full return distribution, not just the mean.
Worst-case RL. Minimax over adversarial dynamics. Pinto’s RARL.
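As a small worked example of the CVaR criterion, the sketch below estimates it empirically from a batch of episode returns by averaging the worst `alpha` fraction. The function name and the sample returns are illustrative.

```python
import math


def cvar(returns, alpha):
    """Empirical CVaR: mean of the worst alpha-fraction of returns."""
    k = max(1, math.ceil(alpha * len(returns)))  # always keep >= 1 sample
    worst = sorted(returns)[:k]
    return sum(worst) / k


episode_returns = [10.0, 0.0, 5.0, -5.0]
mean_return = sum(episode_returns) / len(episode_returns)  # 2.5
tail_return = cvar(episode_returns, alpha=0.5)             # (-5 + 0) / 2 = -2.5
```

Two policies with the same mean return can differ sharply in CVaR, which is why the tail statistic is the one to watch when falls are expensive.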

11. Skills, hierarchy, and long-horizon RL

11.1 Options framework (Sutton-Precup-Singh 1999)

An option is a temporally-extended action: $(π_{o}, I_{o}, β_{o})$ where $I_{o}$ is the initiation set, $π_{o}$ the option’s policy, $β_{o}$ the termination function. The high-level policy chooses options; each option runs until termination. Decomposes long-horizon tasks.
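Executing one option can be sketched as a loop over the triple defined above. The tuple ordering follows the text, $(π_{o}, I_{o}, β_{o})$; the `step` callable stands in for the environment dynamics, and the `max_steps` guard is an added assumption.

```python
import random


def run_option(state, option, step, rng=random, max_steps=100):
    """Run option (policy, initiation set, termination fn) until it ends."""
    policy, can_start, beta = option
    if not can_start(state):
        raise ValueError("state is outside the option's initiation set")
    for _ in range(max_steps):
        state = step(state, policy(state))
        # beta(state) is the probability of terminating in the new state.
        if rng.random() < beta(state):
            break
    return state


# Toy 1-D example: "walk right until position 5".
walk_right = (
    lambda s: 1,                        # pi_o: always move +1
    lambda s: s < 5,                    # I_o: may start anywhere left of 5
    lambda s: 1.0 if s >= 5 else 0.0,   # beta_o: terminate on arrival
)
final_state = run_option(0, walk_right, step=lambda s, a: s + a)
```

A high-level policy would call `run_option` repeatedly, choosing among the options whose initiation set contains the current state.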

11.2 Hierarchical RL

HRL via subgoal generation. High-level policy proposes subgoals; low-level policy reaches them. HIRO (Nachum et al. 2018), HAC (Levy 2019).
MAXQ decomposition (Dietterich 2000). Recursive task graph.
Skill-conditioned policies (DIAYN, Eysenbach 2019). Discover skills without a task reward by maximizing mutual information between a latent skill variable and the states the policy visits.

11.3 Diffusion + skill chaining

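Recent trend: a generative policy emits an action chunk and the system re-plans at chunk boundaries, or earlier in receding-horizon fashion. Chunk lengths vary by method: Diffusion Policy predicts about 16 steps and executes about 8, while ACT and π0 use chunks of roughly 50-100 steps. Effectively a 2-level hierarchy without explicit subgoals. Chi et al. Diffusion Policy + ALOHA ACT are the canonical examples (ACT uses a CVAE transformer rather than diffusion); π0 generalizes to flow-matching action chunks.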
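The chunk-then-replan loop can be sketched as follows. This is a schematic under stated assumptions: `predict_chunk` stands in for any chunk-producing policy, `step` for the environment, and the chunk length and execution horizon are arbitrary toy values.

```python
def run_chunked(predict_chunk, step, obs, total_steps, execute_per_chunk):
    """Execute the first few actions of each predicted chunk, then re-plan."""
    if execute_per_chunk < 1:
        raise ValueError("execute_per_chunk must be at least 1")
    t = 0
    replans = 0
    while t < total_steps:
        chunk = predict_chunk(obs)          # one (expensive) policy call
        if not chunk:
            raise ValueError("policy returned an empty chunk")
        replans += 1
        for action in chunk[:execute_per_chunk]:
            obs = step(obs, action)         # many (cheap) control steps
            t += 1
            if t >= total_steps:
                break
    return obs, replans


# Toy setup: the "policy" always predicts 16 unit moves, and 8 are executed.
obs, replans = run_chunked(
    predict_chunk=lambda o: [1.0] * 16,
    step=lambda o, a: o + a,
    obs=0.0, total_steps=20, execute_per_chunk=8,
)
```

Executing fewer actions than the chunk contains trades extra policy calls for faster reaction to disturbances.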

11.4 Language-conditioned policies

LLMs ground task decomposition. SayCan (Brohan-Ichter et al. Google 2022): LLM proposes high-level plan; affordance model picks executable next step; low-level skill policy executes. Code-as-Policies (Liang-Huang-Xia-Xu-Hausman-Ichter-Florence-Zeng 2023): LLM emits Python code calling primitives. VoxPoser (Huang-Wang-Zhang-Li-Wu-Fei-Fei 2023): LLM generates value maps for downstream MPC.
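The SayCan-style selection step reduces to combining two scores per candidate skill. The sketch below uses made-up numbers and a hypothetical `pick_skill` helper; in the real system the first score comes from an LLM's likelihood of the skill description and the second from a learned value function.

```python
def pick_skill(llm_scores, affordances):
    """Choose the skill maximizing (task relevance) x (feasibility now)."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for the instruction "bring me the sponge".
llm_scores = {"pick up sponge": 0.6, "go to table": 0.3, "pick up apple": 0.1}
affordances = {"pick up sponge": 0.1, "go to table": 0.9, "pick up apple": 0.5}
best, combined = pick_skill(llm_scores, affordances)
```

The language model prefers picking up the sponge, but the affordance score is low because the sponge is not yet reachable, so the product selects the navigation skill first.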

12. Foundation-model details

12.1 Action representation

VLA / generalist policies vary in how they emit actions:

Discrete tokens (RT-2, OpenVLA). Discretize each action dimension into 256 bins; the model emits a token sequence per timestep. Compatible with LM training infrastructure.
Diffusion (Octo, Diffusion Policy). Predict noise iteratively; produce continuous action chunks.
Flow matching (π0). Learn the velocity field of a continuous-time flow from noise to action. Sampling integrates that field with a handful of Euler steps (about 10 in π0), typically fewer than the 16-100 denoising steps of diffusion.
Direct regression (ACT, simple BC). Predict actions in a single forward pass. Cheap but cannot model multi-modal distributions.
Mixture-of-Gaussians (Robomimic baselines). K-mode GMM head; multi-modal but limited K.
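As an illustration of the flow-matching entry above, sampling is just Euler integration of a velocity field from noise at t = 0 to an action at t = 1. The field below is a hand-written stand-in that points at a single fixed target action, which is an assumption made so the example is checkable; a trained model would predict the velocity from observations instead.

```python
def sample_action(velocity, noise, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (action)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = [0.25, -0.5, 1.0]  # hypothetical "correct" action


def toy_velocity(x, t):
    """Velocity of the straight-line path from the current point to TARGET."""
    return [(gi - xi) / (1.0 - t) for gi, xi in zip(TARGET, x)]


action = sample_action(toy_velocity, noise=[0.9, -1.3, 0.2])
```

For this straight-line field the Euler integrator lands on the target regardless of step count, which is the intuition behind needing few steps when the learned paths are close to straight.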

12.2 Vision backbones

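DINOv2 (Oquab et al. Meta 2023). Self-supervised ViT; used in OpenVLA, fused with SigLIP features.
SigLIP (Zhai-Mustafa-Kolesnikov-Beyer Google 2023). Sigmoid-loss CLIP variant; generally stronger than CLIP on zero-shot classification and retrieval at comparable scale. Used in OpenVLA and, via PaliGemma, in π0.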
CLIP ViT-L (Radford et al. OpenAI 2021). Original vision-language alignment.
R3M (Nair-Rajeswaran-Kumar-Finn-Gupta Stanford 2022). Pretrained on Ego4D for manipulation.
VC-1 (Majumdar et al. Meta 2023). “Visual cortex” model for embodied AI.
VIP (Ma et al. 2022). Value-implicit pretraining on Ego4D; a separate line of work from VC-1.
EVA / EVA-02 (Fang-Wang-Xie-Sun-Wu-Wang-Cao 2023). Open ViT pretraining.

12.3 Cross-embodiment recipe

The Open-X-Embodiment recipe that became standard:

Tokenize observations: image (ViT patches), text (language-model tokenizer, e.g. the Llama 2 tokenizer in OpenVLA), proprioception (linear projection).
Tokenize actions per dimension to 256 bins; emit one token per dim per timestep.
Encode robot embodiment with a learnable embedding ID; concatenate to observation tokens.
Train with next-token loss on shuffled cross-embodiment data.
At deployment, decode tokens to continuous actions via inverse bin mapping.

The single shared model handles 22 embodiments with the same parameters — cross-embodiment generalization emerges from training distribution diversity.
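The action tokenization and inverse bin mapping in the recipe above can be sketched as uniform binning per action dimension. The bounds and helper names are assumptions for illustration; real pipelines usually derive the bounds from dataset statistics (for example, per-dimension quantiles) rather than fixed limits.

```python
def encode(value, low, high, bins=256):
    """Map a continuous action value to a bin index in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # value == high falls into the last bin


def decode(token, low, high, bins=256):
    """Inverse mapping: return the center of the bin."""
    width = (high - low) / bins
    return low + (token + 0.5) * width


# Round trip for a hypothetical 3-D action normalized to [-1, 1].
action = [-1.0, 0.1234, 1.0]
tokens = [encode(a, -1.0, 1.0) for a in action]
recovered = [decode(t, -1.0, 1.0) for t in tokens]
```

The round-trip error is bounded by half a bin width, here 1/256 for a range of 2, which sets a floor on achievable action precision for token-based heads.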

12.4 Compute budget for VLA training

Model	Params	Pretraining data	Accelerators	Wall-clock
RT-1	35M	130k demos	8 TPU-v3	days
RT-2 5B	5B	RT-1 robot data + internet-scale vision-language data	64 TPU-v4	weeks
OpenVLA 7B	7B	OXE	64 A100	14 days
Octo Base	93M	OXE	TPU v4-128 pod	~14 hours
π0	3B	~1M demos × 7 platforms	256 H100	weeks
GR00T N1	~2B	mixed sim + real + MoCap	1024 H100	weeks

Fine-tuning is dramatically cheaper. OpenVLA LoRA fine-tune: 50-200 demos, 4-24 hours on a single RTX 4090.
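A quick back-of-the-envelope comparison using the OpenVLA row and the fine-tuning figures quoted above. The helper is illustrative, and the ratio mixes A100 and RTX 4090 hours, so it is an order-of-magnitude estimate rather than a like-for-like cost.

```python
def gpu_hours(num_gpus, days):
    """Total accelerator-hours for a run using all GPUs continuously."""
    return num_gpus * days * 24


pretrain_hours = gpu_hours(num_gpus=64, days=14)   # OpenVLA 7B pretraining
finetune_hours_low, finetune_hours_high = 4, 24    # single-GPU LoRA range

ratio_min = pretrain_hours / finetune_hours_high
ratio_max = pretrain_hours / finetune_hours_low
print(f"pretraining: {pretrain_hours} GPU-hours")
print(f"fine-tuning is roughly {ratio_min:.0f}x to {ratio_max:.0f}x cheaper")
```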

13. Practical training tips and reward-design patterns

13.1 Standard locomotion reward template

# Velocity tracking
r_v   = w_v * exp(-||v_cmd - v_actual||^2 / sigma_v)
# Yaw tracking
r_yaw = w_yaw * exp(-||yaw_cmd - yaw_actual||^2 / sigma_yaw)
# Energy / smoothness penalties
r_t   = -w_t * ||tau||^2
r_a   = -w_a * ||a_dot||^2
r_jerk = -w_jerk * ||a_ddot||^2
r_q   = -w_q * ||q - q_default||^2  # nominal posture
r_orient = -w_o * ||base_roll_pitch||^2
# Foot contact / clearance
r_foot = w_foot * sum(foot_clearance > threshold during swing)
# Survival bonus
r_alive = w_alive  # per step alive
# Terminal cost on falls
r_term = -w_fall if base_height < threshold or base_orient_angle > limit

Tune weights iteratively by inspecting policy behavior in sim. Common pitfall: too-high velocity weight → policy ignores torque cost → fast but jittery.
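A runnable subset of the template, as a sketch only: the weights and `sigma_v` are placeholder values, the function name is hypothetical, and the action-rate term uses a plain finite difference between consecutive actions. Returning the individual terms alongside the total makes the weight-tuning loop described above easier, since each term can be logged separately.

```python
import math

# Illustrative weights only; real values are tuned per robot.
WEIGHTS = {"v": 1.0, "tau": 1e-4, "a": 0.01, "alive": 0.2, "fall": 10.0}


def sq_norm(xs):
    return sum(x * x for x in xs)


def locomotion_reward(v_cmd, v_actual, tau, action, prev_action, fallen,
                      w=WEIGHTS, sigma_v=0.25):
    """Return (total, terms) for one control step."""
    vel_err = sq_norm([c - a for c, a in zip(v_cmd, v_actual)])
    terms = {
        "r_v": w["v"] * math.exp(-vel_err / sigma_v),      # velocity tracking
        "r_t": -w["tau"] * sq_norm(tau),                   # torque penalty
        "r_a": -w["a"] * sq_norm(
            [a - p for a, p in zip(action, prev_action)]),  # action rate
        "r_alive": w["alive"],                             # survival bonus
        "r_term": -w["fall"] if fallen else 0.0,           # terminal cost
    }
    return sum(terms.values()), terms


total, terms = locomotion_reward(
    v_cmd=[1.0, 0.0], v_actual=[1.0, 0.0],
    tau=[0.0] * 12, action=[0.1] * 12, prev_action=[0.1] * 12, fallen=False)
```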

13.2 Standard manipulation reward template

# Reach phase
r_reach = -||p_ee - p_goal||
# Grasp phase
r_grasp = +w_grasp if contact and object_lifted
# Place phase
r_place = -||p_obj - p_goal||
# Action smoothness
r_smooth = -w_smooth * ||a_dot||^2
# Collision penalty
r_collision = -w_col if collision_external
# Sparse success
r_success = +R_terminal if task_complete
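One way to turn the phase terms into a single function is to gate them on the grasp state. This is a sketch under assumptions: during the reach phase the target is taken to be the object position, the smoothness term is omitted for brevity, and all weights and tolerances are placeholders.

```python
import math


def manipulation_reward(p_ee, p_obj, p_goal, grasped, collided,
                        w_grasp=1.0, w_col=5.0, r_terminal=10.0,
                        success_tol=0.02):
    """Staged pick-and-place reward for one control step."""
    if not grasped:
        # Reach phase: pull the end effector toward the object.
        reward = -math.dist(p_ee, p_obj)
    else:
        # Grasp bonus plus place phase: pull the object toward the goal.
        reward = w_grasp - math.dist(p_obj, p_goal)
    if collided:
        reward -= w_col
    if grasped and math.dist(p_obj, p_goal) < success_tol:
        reward += r_terminal
    return reward


reach = manipulation_reward((0, 0, 0), (0.3, 0.4, 0), (1, 0, 0),
                            grasped=False, collided=False)
done = manipulation_reward((1, 0, 0), (1, 0, 0), (1, 0, 0),
                           grasped=True, collided=False)
```

Adding the grasp bonus as an offset keeps the reward from dropping when the phase switches, a common source of policies that hover near the object without grasping.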

13.3 PPO hyperparameter defaults (legged locomotion)

learning_rate: 3e-4 (linear decay to 1e-5)
discount gamma: 0.99
GAE lambda: 0.95
clip range: 0.2
value loss coef: 1.0
entropy coef: 0.005
minibatch size: 24576 (4096 envs × 24 rollout steps = 98304 samples per update, split into 4 minibatches)
training epochs: 5
KL target (early-stop): 0.01
network: 3 hidden layers, [512, 256, 128] units, ELU activation
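Two of the defaults above, the clip range and the linear learning-rate decay, are easiest to read as code. This is a per-sample scalar sketch with hypothetical function names; actual implementations apply the same formulas to batched tensors.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO objective for one sample: min of unclipped and clipped terms."""
    clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    return min(ratio * advantage, clipped_ratio * advantage)


def linear_lr(step, total_steps, lr_start=3e-4, lr_end=1e-5):
    """Linearly interpolate the learning rate over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + frac * (lr_end - lr_start)


# A large ratio with positive advantage gets capped at 1 + clip_range...
gain_capped = clipped_surrogate(ratio=1.5, advantage=1.0)
# ...but with negative advantage the pessimistic (unclipped) term is kept.
loss_kept = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

The asymmetry is the point: the objective removes the incentive to move the policy further in a direction that already looks good, but never hides evidence that an update made things worse.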

13.4 SAC hyperparameter defaults (manipulation)

learning_rate: 3e-4
batch size: 256
buffer size: 1e6
gamma: 0.99
tau (target update): 0.005
auto entropy tuning: on
target entropy: -dim(action)
warmup steps: 10000
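Two of these settings are formulas rather than plain numbers. The sketch below shows the Polyak target update controlled by `tau` and the `-dim(action)` target-entropy heuristic, using flat lists of floats as a stand-in for network parameters; the function names are illustrative.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: move target weights a fraction tau toward online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


def target_entropy(action_dim):
    """Common heuristic for automatic entropy tuning."""
    return -float(action_dim)


target = [0.0, 0.0]
online = [1.0, -2.0]
target = soft_update(target, online)   # one step: 0.5% of the way there
h_target = target_entropy(7)           # e.g. a 7-DoF arm
```

With tau = 0.005 the target network tracks the online network with a time constant of roughly 200 updates, which is what stabilizes the bootstrapped Q-targets.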

14. Tooling ecosystem

14.1 Open-source training libraries

Library	Maintainer	Use
Stable-Baselines3	community	Reliable PPO/SAC/TD3 for CPU/GPU
RLlib (Ray)	Anyscale	Production-grade distributed RL
CleanRL	Costa Huang community	Single-file readable implementations
Tianshou	TsinghuaML	Modular PyTorch RL
Acme	DeepMind	Research-focused
Sample Factory	community	Throughput-optimized async PPO
RSL_RL	Robotic Systems Lab ETH	PPO for legged; standard in Isaac Lab
RL Games	Denys Makoviichuk (community)	Used in NVIDIA Isaac Gym / Isaac Lab
skrl	community	Isaac Lab-compatible

14.2 Foundation-model toolkits

Library	Provider	Use
LeRobot	Hugging Face	End-to-end IL pipeline (ACT, Diffusion Policy, π0)
openpi	Physical Intelligence	π0 / π0.5 inference + fine-tune
openvla	Stanford / TRI	OpenVLA inference + LoRA
octo	UC Berkeley	Octo training + checkpoints
robosuite	Stanford	Manipulation benchmarks on MuJoCo
Diffusion Policy ref	Columbia	Reference Diffusion Policy code
TRL	Hugging Face	RLHF library (LLM RL)

14.3 Deployment / inference

Tool	Use
ONNX Runtime	Cross-platform inference
TensorRT	NVIDIA-optimized inference
TorchScript	PyTorch deployment
OpenVINO	Intel CPU/GPU inference
TFLite / Coral	Edge inference (small models)
Triton Inference Server	NVIDIA server-side serving

14.4 Logging and experiment tracking

Tool	Use
Weights & Biases	Cloud experiment tracking
TensorBoard	Local visualization
MLflow	Open-source experiment tracking
Aim	Open-source ML tracking
Neptune	Cloud tracking with collaboration

15. Open challenges (2026)

Sample efficiency in the real world. Sim-to-real works but the sim isn’t always good enough. Real-world RL on legged robots and manipulators is still slow.
Safe exploration. No widely-deployed solution for safe random exploration on expensive humanoids.
Long-horizon multi-task. Tasks > 60 s with multiple skills still require hierarchy + scripting; pure RL fails to assemble the skills end-to-end.
Generalization across embodiments. OXE shows positive transfer but a Franka policy still doesn’t directly work on a UR5 without fine-tuning.
Language conditioning for compositional behaviors. VLAs understand simple commands; compositional (“first do X, then if Y, do Z”) remains brittle.
Reward specification. Hand-designed rewards are brittle; preference learning + LLM-generated rewards (Eureka, Ma et al. 2023; Language to Rewards, Yu et al. 2023) are early-stage.
Real-time inference on edge hardware. Large VLAs need server GPUs. Distillation + dual-system (Helix S1/S2) are partial answers; smaller native models are an open frontier.
Robustness to adversarial perturbations. Policies can be fooled by trivial visual changes. Adversarial training is research-grade, not production-ready.

16. Glossary

A3C / A2C — Asynchronous (or Synchronous) Advantage Actor-Critic (Mnih 2016).
Action chunking — predicting + executing multi-step action sequences (ACT, Diffusion Policy).
ACT — Action Chunking with Transformers — Zhao 2023 ALOHA standard.
ADR — Automatic Domain Randomization — adaptive DR scheduling (OpenAI Dactyl 2019).
BC — Behavior Cloning — supervised learning from demonstrations.
CMDP — Constrained MDP — adds budget constraints to standard MDP.
CPO — Constrained Policy Optimization — Achiam 2017.
CTDE — Centralized Training, Decentralized Execution — MARL paradigm.
CQL — Conservative Q-Learning — offline RL (Kumar 2020).
DAgger — Dataset Aggregation — interactive IL (Ross 2011).
DDPG — Deep Deterministic Policy Gradient — Lillicrap 2015.
Decision Transformer — sequence-modeling offline RL (Chen 2021).
Diffusion Policy — generative-model action policy (Chi 2023).
DR — Domain Randomization — vary sim parameters during training.
Dreamer — model-based latent-imagination RL (Hafner 2019-2023).
GAE — Generalized Advantage Estimation — variance-reduced advantage estimator.
GAIL — Generative Adversarial Imitation Learning — Ho-Ermon 2016.
GPS — Guided Policy Search — Levine-Koltun 2013; end-to-end visuomotor version Levine-Finn-Darrell-Abbeel 2016.
HER — Hindsight Experience Replay — Andrychowicz 2017.
HRL — Hierarchical RL — multi-level policy decomposition.
IL — Imitation Learning — learning from demonstrations.
IQL — Implicit Q-Learning — offline RL (Kostrikov 2021).
IRL — Inverse Reinforcement Learning — infer reward from demonstrations.
LoRA — Low-Rank Adaptation — efficient fine-tuning (Hu 2021).
MDP — Markov Decision Process — standard RL formulation.
MPC — Model Predictive Control — receding-horizon optimization.
MuZero — learned-dynamics + MCTS (Schrittwieser 2020).
OXE — Open-X-Embodiment — cross-embodiment robot dataset (2023).
PETS — Probabilistic Ensembles + Trajectory Sampling — model-based RL (Chua 2018).
PILCO — Gaussian-Process model-based RL (Deisenroth 2011).
PPO — Proximal Policy Optimization — Schulman 2017; the workhorse.
QT-Opt — large-scale real-world Q-learning (Kalashnikov 2018).
RL — Reinforcement Learning — sequential decision-making via reward.
RMA — Rapid Motor Adaptation — Kumar 2021 sim-to-real bridge.
RT-1 / RT-2 / RT-X — Robotics Transformer foundation models (Google 2022-2023).
SAC — Soft Actor-Critic — max-entropy off-policy (Haarnoja 2018).
TD3 — Twin Delayed DDPG — Fujimoto 2018.
TRPO — Trust Region Policy Optimization — Schulman 2015.
VLA — Vision-Language-Action model — multimodal generalist policy.
VLM — Vision-Language Model — image + text foundation model.
