Robot Learning & RL — From PPO Locomotion to VLA Foundation Models

The 2018-2026 maturation of reinforcement learning for physical robots: model-free policy gradient methods (PPO, SAC, TD3) trained at massive parallelism in GPU simulators (Isaac Lab, Brax, MuJoCo MJX), model-based methods (Dreamer, PILCO, PETS) for sample-efficient real-world learning, sim-to-real bridges (domain randomization, RMA — Rapid Motor Adaptation), and the explosion of robot foundation models — RT-1, RT-2, RT-X, OpenVLA, Octo, π0, π0.5, GR00T, Helix — that consolidated cross-embodiment manipulation into a few canonical recipes. This note is the technical companion to rl-for-control (control-focused), imitation-learning (LfD/diffusion/VLA-focused), and sim-to-real (simulator + DR-focused), with attention to the algorithms-and-platforms intersection: which RL method works on which embodiment, which simulator + training recipe produced which result, and where the field’s open problems live (sample efficiency, safe exploration, long-horizon generalization, cross-embodiment transfer, language conditioning).

1. At a glance

Reinforcement learning trains a policy $\pi$ to maximize the expected discounted return $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. The robotics-specific challenges are:

  • Sample efficiency. A real robot generates ~1000 transitions / minute. Atari DQN needed 200M transitions — infeasible on hardware. Solutions: sim, model-based RL, offline RL on logged data, sim-to-real transfer.
  • Safety during exploration. A random policy on a humanoid falls. On a $100k robot it falls expensively. Solutions: simulator training, safety filters (CBFs), constrained MDPs, residual RL over a stable nominal controller.
  • Reward design. Sparse rewards (“succeed or fail”) give no gradient. Dense rewards bias toward local optima. Reward shaping + curriculum + intrinsic motivation. Or skip rewards: imitation learning.
  • Generalization. Train on terrain A, test on terrain B. Cross-embodiment: train on Franka, deploy on UR5. Until 2022 this didn’t generalize; 2022+ VLAs do.
  • Long horizon. 1000-step tasks compound errors. Hierarchy (options, skills, latent plans) or transformer-based memory.

The 2024-2026 state of practice splits into three regimes:

  1. Locomotion-class. Massively parallel PPO in Isaac Lab / Brax / MuJoCo MJX. 4096-65536 parallel envs. Train for hours, deploy as a closed-loop policy at 50-1000 Hz. Methods: PPO, SAC, RMA, teacher-student distillation. Embodiments: ANYmal, Cassie/Digit, Atlas, Unitree H1/G1/B2, Tesla Optimus, Figure 02, 1X NEO.
  2. Manipulation-class. Imitation learning (BC, ACT, Diffusion Policy) on teleop data, plus residual RL or RL fine-tuning. Methods: Diffusion Policy, ACT, RT-2, OpenVLA, π0. Embodiments: Franka, UR5, xArm, ALOHA, Stretch, Tiago.
  3. Foundation-model-class. Pretrained VLA → LoRA fine-tune for new task / embodiment. Methods: OpenVLA, π0.5, Octo, GR00T, Helix. Recipes: Open-X-Embodiment pretraining + 50-500 task-specific demos.

Where this sits. RL for control is the parent overview; sim-to-real is the deployment bridge; imitation learning is the demonstration-based companion; MPC is the model-based alternative often hybridized with RL; legged locomotion and manipulators are the primary embodiments; vision feeds VLAs.

First ask. Is the action space continuous or discrete? Continuous (joint targets, end-effector pose) → SAC or PPO; PPO is the workhorse. On-policy or off-policy? On-policy (PPO) is simpler but sample-hungry, which is fine in sim. Off-policy (SAC, TD3) suits real data. Model-free or model-based? Model-based (Dreamer, PILCO, PETS) buys roughly 10× or better sample efficiency at the cost of model complexity. Do you have demonstrations? If yes, start with BC; fine-tune with RL only if needed. Single task or general? Single → train from scratch. Generalist → fine-tune a VLA.

2. First principles — algorithm taxonomy

2.1 The MDP framework

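A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$:

  • $\mathcal{S}$: state space (joint positions/velocities, sensor readings, images).
  • $\mathcal{A}$: action space (joint torques, motor targets, end-effector poses).
  • $P(s' \mid s, a)$: transition dynamics.
  • $r(s, a)$: reward function.
  • $\gamma \in [0, 1)$: discount factor.

Policy $\pi(a \mid s)$ maximizes $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$. Value functions: $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t} \gamma^t r_t \mid s_0 = s\right]$ and $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$.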
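As a small illustration of the discounted return that the policy maximizes, the sketch below sums one episode's rewards weighted by powers of the discount factor. The function name `discounted_return` is illustrative and not taken from any library.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for one episode, where rewards[0] is r_0."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With `gamma` close to 1 the sum weights distant rewards almost as much as immediate ones; smaller values make the agent more short-sighted.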

2.2 Policy gradient (REINFORCE)

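$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t = \sum_{k \ge t} \gamma^{k-t} r_k$ is the return from step $t$. High variance; subtract a baseline $b(s_t)$ from $G_t$ to reduce it. Advantage $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$.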
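To make the baseline idea concrete, here is a minimal sketch of the Monte Carlo policy-gradient estimate for a stateless softmax policy (a bandit). The helper names and the choice of a constant baseline are assumptions made for illustration; a real implementation would use a state-dependent baseline and automatic differentiation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_gradient(logits, samples, baseline=0.0):
    """Estimate dJ/dlogits from (action_index, return) samples."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, ret in samples:
        advantage = ret - baseline  # baseline subtraction reduces variance
        for k in range(len(logits)):
            # d log pi(action) / d logit_k = 1[k == action] - pi(k)
            indicator = 1.0 if k == action else 0.0
            grad[k] += (indicator - probs[k]) * advantage
    return [g / len(samples) for g in grad]
```

Because the score function sums to zero over actions, subtracting a constant baseline changes the variance of the estimate but not its expectation.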

2.3 DQN — Deep Q-Network (Mnih et al. DeepMind 2013/2015)

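The breakthrough that started deep RL. Approximate $Q^*(s, a)$ with a deep network $Q_\theta(s, a)$, train via Bellman error on a replay buffer $\mathcal{D}$:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\right)^2\right]$$

The target network $Q_{\theta^-}$ stabilizes training. Reached human-level or better scores on many of the 49 Atari games tested. Discrete actions only; rarely used for robotics control directly but spawned the field.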
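The Bellman-error training signal can be sketched without any deep-learning framework. In the snippet below, `q_target` stands in for the frozen target network and is a hypothetical callable that maps a state to a list of per-action Q-values; the batch layout is likewise an assumption.

```python
def dqn_td_targets(batch, q_target, gamma=0.99):
    """Compute r + gamma * max_a' Q_target(s', a') for each transition.

    batch: list of (reward, next_state, done) tuples.
    """
    targets = []
    for reward, next_state, done in batch:
        # No bootstrapping past a terminal state.
        bootstrap = 0.0 if done else gamma * max(q_target(next_state))
        targets.append(reward + bootstrap)
    return targets


def mse_bellman_loss(q_values, targets):
    """Mean squared error between online Q(s, a) and the TD targets."""
    return sum((t - q) ** 2 for q, t in zip(q_values, targets)) / len(targets)
```

Holding `q_target` fixed between periodic syncs is what keeps the regression target from chasing its own updates.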

2.4 A3C / A2C — Asynchronous Advantage Actor-Critic (Mnih 2016)

Parallel actor workers; synchronous (A2C) or asynchronous (A3C). Reduced variance via advantage. Predecessor to PPO.

2.5 TRPO — Trust Region Policy Optimization (Schulman 2015)

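Constrains KL divergence between old and new policy:

$$\max_\theta \; \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t\right]$$

subject to

$$\mathbb{E}_t\left[D_\text{KL}\left(\pi_{\theta_\text{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right)\right] \le \delta$$
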
Second-order optimization via natural gradient. Stable but expensive.

2.6 PPO — Proximal Policy Optimization (Schulman 2017)

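The workhorse of robotics RL. Approximates TRPO via a clipped objective:

$$L^\text{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\; \text{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ is the probability ratio. Clip $\epsilon$ typically 0.2. First-order SGD; trains stably. Used in OpenAI Dactyl, many ANYmal RL controllers, Cassie running, and most Isaac Lab humanoid policies.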
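The clipped surrogate is short enough to write out directly. The sketch below works on plain Python lists of log-probabilities and advantage estimates; the function name and argument layout are illustrative assumptions, and a training loop would maximize this quantity (or minimize its negative) with an autodiff framework.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate over a batch of samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the min makes the objective a pessimistic bound:
        # the policy gains nothing from pushing the ratio past the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For a positive advantage the gain is capped once the ratio exceeds `1 + clip_eps`; for a negative advantage the penalty is not capped when the ratio grows, which is what discourages large destructive updates.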

2.7 DDPG — Deep Deterministic Policy Gradient (Lillicrap 2015)

Off-policy actor-critic for continuous actions. Deterministic policy $\mu_\theta(s)$ + critic $Q_\phi(s, a)$. Replay buffer + target networks. Sample efficient but brittle to hyperparams.

2.8 TD3 — Twin Delayed DDPG (Fujimoto 2018)

Three additions to DDPG: clipped double-Q (use min of two critics to reduce overestimation), delayed policy updates (update actor less often than critic), target policy smoothing (add noise to next-state action). Stable replacement for DDPG.
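The three additions fit into one target computation plus an update schedule. The sketch below assumes a scalar action and treats the target actor and critics as hypothetical callables; the default noise and delay values follow commonly used settings but should be read as assumptions.

```python
import random


def td3_target(reward, next_state, done, actor_target, critic1_target,
               critic2_target, gamma=0.99, noise_std=0.2, noise_clip=0.5,
               action_limit=1.0, rng=random):
    """Bellman target for a scalar-action TD3 critic update."""
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = rng.gauss(0.0, noise_std)
    noise = max(-noise_clip, min(noise_clip, noise))
    next_action = actor_target(next_state) + noise
    next_action = max(-action_limit, min(action_limit, next_action))
    # Clipped double-Q: bootstrap from the smaller of the two target critics.
    q_min = min(critic1_target(next_state, next_action),
                critic2_target(next_state, next_action))
    return reward + (0.0 if done else gamma * q_min)


def should_update_actor(critic_step, policy_delay=2):
    """Delayed policy updates: refresh the actor every `policy_delay` critic steps."""
    return critic_step % policy_delay == 0
```

Both critics regress toward the same target, so the pessimistic `min` counteracts the overestimation that a single bootstrapped critic tends to accumulate.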

2.9 SAC — Soft Actor-Critic (Haarnoja 2018)

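Maximum-entropy off-policy actor-critic. Objective $J(\pi) = \mathbb{E}_{\pi}\left[\sum_t \gamma^t \left(r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\right)\right]$ adds an entropy bonus weighted by the temperature $\alpha$. Two critics + auto-temperature tuning. Sample efficient; the standard choice when you have real-world data and want off-policy reuse.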

2.10 Model-based RL

Learn a dynamics model from data; then plan or distill a policy.

PILCO — Probabilistic Inference for Learning Control (Deisenroth + Rasmussen 2011). Gaussian Process dynamics model + analytic policy gradient. Famously trained a cart-pole in 20 trials. Doesn’t scale to image observations.

PETS — Probabilistic Ensembles with Trajectory Sampling (Chua et al. 2018). Bootstrapped ensemble of dynamics models; MPC over the ensemble. Strong on MuJoCo continuous control. Used in real-world RL.

Dreamer V1/V2/V3 (Hafner-Lillicrap-Norouzi-Ba, DeepMind/Google 2019/2020/2023). Learn a latent dynamics model (RSSM, Recurrent State-Space Model) from pixels; train policy via imagined rollouts in latent space. Dreamer V3 (2023) is the first algorithm to collect diamonds in Minecraft from scratch, without human data or curricula. Robotics deployments: real-quadruped walking (Wu-Escontrela-Hafner-Goldberg-Abbeel 2022, DayDreamer).

MuZero / EfficientZero (Schrittwieser 2020; Ye-Liu-Kurutach 2021). Learn dynamics + value + policy jointly; plan via MCTS. EfficientZero achieves human-level Atari in 2 hours of data — 100× improvement over Rainbow DQN. Less used in robotics due to discrete-action assumption.

2.11 Hindsight Experience Replay (HER, Andrychowicz 2017)

For sparse-reward goal-conditioned RL. Relabel failed trajectories with achieved-goal as if it were the desired goal. Turns every trajectory into a successful one for some goal. Critical for pushing / grasping tasks.
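The relabeling step is simple to sketch. The snippet below uses the "final" strategy (substitute the goal actually reached at the end of the episode) with scalar goals; the dictionary keys and the `sparse_reward` helper are assumptions for illustration, not the layout of any particular library.

```python
def sparse_reward(achieved, desired, tol=1e-6):
    """0 on success, -1 otherwise (the usual sparse goal-reaching reward)."""
    return 0.0 if abs(achieved - desired) <= tol else -1.0


def her_relabel(trajectory, reward_fn=sparse_reward):
    """Relabel every step with the final achieved goal as the desired goal.

    trajectory: list of dicts with keys 'obs', 'action',
    'achieved_goal' (goal reached after the action), and 'desired_goal'.
    """
    new_goal = trajectory[-1]["achieved_goal"]
    relabeled = []
    for step in trajectory:
        new_step = dict(step)  # copy so the original episode stays intact
        new_step["desired_goal"] = new_goal
        new_step["reward"] = reward_fn(step["achieved_goal"], new_goal)
        relabeled.append(new_step)
    return relabeled
```

Both the original and the relabeled transitions go into the replay buffer, so an off-policy learner sees at least one rewarded transition per episode even when the real goal was never reached.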

2.12 Residual RL (Silver-Allen-Tenenbaum-Kaelbling 2018, Johannink-Bahl-Nair 2019)

Decompose policy as $\pi(s) = \pi_\text{nominal}(s) + \pi_\text{residual}(s)$. Nominal is a hand-coded or model-based controller; residual is learned. Sample efficient (only learn the residual), safe (nominal provides a baseline), interpretable. Used in industrial-arm RL (residual controllers on Universal Robots arms), in surgical RL (Intuitive da Vinci research projects), and in residual whole-body humanoid control.
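A minimal sketch of the composition, assuming scalar actions: the learned residual is scaled down and added to the nominal command, and the sum is clipped to the actuator range. The `residual_scale` parameter and the clipping are illustrative design choices, not part of the cited papers' exact formulations.

```python
def make_residual_policy(nominal, residual, residual_scale=0.1, action_limit=1.0):
    """Combine a fixed nominal controller with a learned residual correction."""
    def policy(state):
        # A small residual scale keeps early (untrained) corrections from
        # overwhelming the stable nominal controller.
        action = nominal(state) + residual_scale * residual(state)
        return max(-action_limit, min(action_limit, action))
    return policy
```

For example, `make_residual_policy(lambda s: -2.0 * s, learned_net)` wraps a proportional controller; only `learned_net` (a hypothetical trained function) receives gradient updates.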

2.13 Offline RL

Train from a fixed dataset of logged transitions; no environment interaction. Critical when online exploration is unsafe or expensive.

  • BCQ — Batch-Constrained Q-Learning (Fujimoto 2019). Restricts policy to actions seen in data.
  • CQL — Conservative Q-Learning (Kumar 2020). Penalizes Q-values of unseen actions; conservative estimate prevents OOD exploitation.
  • IQL — Implicit Q-Learning (Kostrikov 2021). Avoids querying OOD actions by using only in-dataset actions.
  • Decision Transformer (Chen 2021). Cast offline RL as sequence modeling: condition on return-to-go; predict next action with GPT-style transformer.
  • Trajectory Transformer (Janner-Levine 2021). Similar; jointly predicts states, actions, rewards.
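The return-to-go conditioning used by Decision Transformer is just a reversed cumulative sum of rewards. A minimal sketch (the function name is illustrative):

```python
def returns_to_go(rewards):
    """rtg[t] = sum of rewards from step t to the end of the episode."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg
```

At training time each token triple is (return-to-go, state, action); at test time the first return-to-go is set to the desired episode return and decremented by each observed reward.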

2.14 Cross-embodiment foundation models (VLAs)

The 2022+ wave: transformer / VLM backbones trained on pooled robot data + internet image-text data:

  • RT-1 (Brohan-Brown-Carbajal et al. Google 2022). 35M-param transformer; 130k demos across 700+ tasks, collected on a fleet of 13 Everyday Robots mobile manipulators. First “robot transformer.” Action: tokenized discrete bins, 256 per dimension.
  • RT-2 (Brohan et al. Google DeepMind 2023). Co-finetune a PaLM-E / PaLI-X vision-language model on robot trajectories with text-tokenized actions. 5B-55B params. Inherits internet semantics (“pick up the can of Pepsi”). Slow inference (1-3 Hz).
  • Open-X-Embodiment + RT-X (Padalkar et al., 50+ labs 2023). Pool 22 embodiments × 527 skills × ~1M demos. Cross-embodiment pretraining boosts per-embodiment performance.
  • OpenVLA (Kim-Pertsch-Karamcheti et al. Stanford / TRI 2024). 7B-param VLA on Llama2-7B + DINOv2 + SigLIP fused vision. Open weights. LoRA-fine-tunable on a single RTX 4090.
  • Octo (Octo Model Team Berkeley 2024). 27M / 93M open-source diffusion-transformer generalist. Faster inference (50 Hz). Trained on Open-X-Embodiment.
  • π0 + π0.5 (Physical Intelligence 2024/2025). 3B-param flow-matching VLA. 50 Hz inference. Demonstrated cross-embodiment zero-shot (Franka data → 1X EVE). π0.5 adds generalization beyond training distribution.
  • GR00T N1 + N1.5 (NVIDIA 2024-2025). Humanoid-focused foundation model. Trained on human-MoCap + sim + real-robot mixture. Targets Unitree H1/G1, Apptronik Apollo, 1X NEO, Boston Dynamics electric Atlas via Isaac Lab.
  • Helix (Figure AI 2024-2025). Dual-system (S1 fast 200 Hz, S2 slow 5-10 Hz) onboard Figure 02 humanoid. Closed weights.
  • Cosmos (NVIDIA 2025). World-foundation-model for synthetic data generation; pair with GR00T for sim-to-real.

3. Locomotion via RL — the canonical wave

3.1 Tan-Bohez-Iscen 2018 Minitaur

Jie Tan, Tingnan Zhang, Erwin Coumans (Google Brain). Trained PPO policy in PyBullet for Minitaur quadruped (Ghost Robotics). DR over friction, mass, motor delay. Successfully transferred to hardware — first non-trivial sim-to-real locomotion RL on a real-world robot. Galvanized the field.

3.2 Hwangbo et al. ETH 2019 ANYmal Science Robotics

Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, et al. (ETH RSL — Hutter). Trained PPO policy in RaiSim (ETH’s in-house sim) for ANYmal. Key insight: actuator networks — train an MLP to predict joint torque from commanded position + history, capturing series-elastic actuator dynamics. Transfer 99%-successful. Science Robotics paper “Learning agile and dynamic motor skills for legged robots” — the field’s coming-out party.

3.3 Lee et al. ETH 2020 ANYmal rough terrain Science Robotics

Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, Marco Hutter. Teacher-student distillation: teacher policy has access to terrain heightmap; student gets only proprioception. Student must infer terrain from joint history. Deployed ANYmal across grass, gravel, snow, ice. The teacher-student pattern became standard.

3.4 Kumar-Fu-Pathak-Malik 2021 RMA (Berkeley)

Ashish Kumar, Zipeng Fu, Deepak Pathak, Jitendra Malik. RMA (Rapid Motor Adaptation). Train teacher with privileged dynamics input $e_t$, encoded into a latent extrinsics vector $z_t$; train adaptation module to predict $\hat{z}_t$ from recent state-action history. Deploy adaptation online with no gradient steps. A1 + Mini Cheetah on diverse terrains. Now the default sim-to-real architecture for legged.

3.5 Margolis-Yang-Paigwar-Chen-Agrawal 2023 MIT cheetah parkour

Gabriel Margolis, Ge Yang, Kartik Paigwar, Tao Chen, Pulkit Agrawal at MIT + CMU. Trained MIT Mini-Cheetah in Isaac Gym to do parkour: stair climbing, gap jumping, vertical leaps. Two-stage curriculum + DR + reward shaping. Open-sourced via Walk-These-Ways.

3.6 Cheng-Shi-Agarwal-Pathak 2024 CMU extreme parkour

Xuxin Cheng, Kexin Shi, Ananye Agarwal, Deepak Pathak. Extreme parkour: 80 cm leaps, climbing 50 cm walls. Single end-to-end policy from proprioception + depth. Demonstrated Unitree A1 and Mini-Cheetah doing skateboarding-park-style obstacles.

3.7 Zhuang-Fu-Wang-Xu et al. 2024 Robot Parkour

Ziwen Zhuang et al. — separate parallel work, similar vein. Open-source release of parkour policies for Unitree Go1 and A1.

3.8 Humanoid RL 2023-2025

After quadrupeds, the humanoid wave:

  • OpenAI Dactyl (Andrychowicz et al. 2019/2020). Shadow Hand dexterous in-hand manipulation; PPO + LSTM + ADR (Automatic Domain Randomization).
  • Cassie / Digit (Agility Robotics + Crowley-Dao-Hurst 2022). 5 km outdoor run; PPO single end-to-end policy.
  • Atlas DRL (Boston Dynamics 2024). Production RL controller replacing MPC. Major update to commercial fleet.
  • Unitree H1 + G1 (Unitree 2024+). Open-source RL policies; community-trained checkpoints widely shared.
  • Stanford HumanPlus (Fu-Zhao-Wu-Wetzstein-Finn 2024). Humanoid imitation from human video. SMPL retargeting.
  • OmniH2O (CMU + NVIDIA 2024). H1 whole-body teleop + retargeting.
  • Stanford ALOHA 2 + UMI (2024). Wearable teleop for humanoid imitation.

4. Manipulation via RL

4.1 Levine-Finn-Darrell-Abbeel 2016 GPS

Sergey Levine, Chelsea Finn, Trevor Darrell, Pieter Abbeel (Berkeley). Guided Policy Search — alternate between trajectory optimization (iLQG over local linear dynamics) and policy distillation. Trained PR2 robot to assemble Lego, hang clothes. Early demonstration that deep RL on real robots is feasible.

4.2 Andrychowicz 2017 HER

Marcin Andrychowicz, Filip Wolski, Alex Ray, et al. (OpenAI). Hindsight Experience Replay — relabel failed goals as achieved alternative goals. Solved Fetch-Push, Fetch-Slide, Fetch-Pick-And-Place from sparse rewards.

4.3 OpenAI Dactyl Rubik’s Cube 2019

Ilge Akkaya et al. (OpenAI). Shadow Hand 24-DoF. PPO + LSTM. Automatic Domain Randomization dynamically expands DR range with success. 13,000 years of simulated experience. Solved cube on hardware. Cited as the canonical proof of sim-to-real for dexterous.

4.4 Kalashnikov QT-Opt Google 2018

Dmitry Kalashnikov et al. (Google Brain). Real-world bin-picking RL. 7 KUKA arms running 24/7 for months; 800,000 grasp trials. Q-function + cross-entropy method action selection. Achieved 96% grasp success on novel objects. Demonstrated that real-world RL at scale works given enough robots + time.

4.5 RoboCat DeepMind 2023

Konstantinos Bousmalis et al. (DeepMind). Gato-style transformer trained on demonstrations from many robots + tasks. Self-improvement loop: deploy on robot, collect more data, retrain. Cross-embodiment positive transfer. Predecessor to industrial-scale VLAs.

4.6 Residual RL — Johannink-Bahl-Nair-Luo-Loquercio-Ross 2019

Tobias Johannink et al. (Sergey Levine + Pieter Abbeel groups). Residual policy on top of impedance controller for industrial insertion task. Reaches sub-mm precision with minutes of real-world data. The first compelling industrial demonstration of residual RL.

5. Practical math — training compute and sample efficiency

5.1 PPO compute budget for locomotion

Embodiment          | Sim       | Parallel envs | Steps to converge | Wall-clock | Hardware
MIT Mini-Cheetah    | Isaac Gym | 4096          | 1B                | 4 hr       | RTX 3090
Unitree A1          | Isaac Gym | 4096          | 1B                | 4 hr       | RTX 3090
ANYmal-C            | Isaac Sim | 4096          | 2B                | 8 hr       | RTX 4090
Unitree H1 humanoid | Isaac Lab | 4096          | 5B                | 24 hr      | H100
Cassie (Agility)    | MuJoCo    | 1024          | 500M              | 12 hr      | A100

5.2 Sample efficiency comparison

Algorithm     | Sample efficiency relative to DQN
DQN baseline  | 1.0
Rainbow       | 5×
SAC           | 10×
TD3           | 8×
PETS          | 30× (small tasks)
Dreamer V3    | 50× (some tasks)
MuZero        | 100×
EfficientZero | 500×
Offline RL    | n/a (uses existing data)
Imitation     | 10000× (if demonstrations available)

5.3 Memory and parallelism — Isaac Lab scaling

GPU         | Parallel envs | Step rate
RTX 3060    | 1024          | 250k steps/s
RTX 3090    | 4096          | 1.0M steps/s
RTX 4090    | 8192          | 1.5M steps/s
A100 80GB   | 16384         | 3.0M steps/s
H100 80GB   | 32768         | 6.0M steps/s
H200 144GB  | 65536         | 10M steps/s

5.4 Cross-embodiment data scaling (Open-X-Embodiment)

Trained on                               | Per-embodiment improvement
Single embodiment, 10k demos             | baseline
Add 5 more embodiments (50k mixed)       | +15%
Open-X full mixture (1M, 22 embodiments) | +50%
+ LoRA fine-tune (50 task demos)         | +60-80%

5.5 VLA inference latency

Model        | Params | Inference rate
RT-2 55B     | 55B    | 1-3 Hz (server)
RT-2 5B      | 5B     | 5 Hz (server)
OpenVLA      | 7B     | 6 Hz (A100)
π0           | 3B     | 50 Hz (H100)
π0.5         | 3B     | 50 Hz (H100)
Octo-Base    | 93M    | 50 Hz (RTX 4090)
Octo-Small   | 27M    | 100 Hz (RTX 4090)
GR00T N1     | 2B     | 50 Hz (Jetson Thor)
Helix S1     | small  | 200 Hz (Figure 02 onboard)
Helix S2     | large  | 5-10 Hz (Figure 02 onboard)

5.6 Reward shaping for locomotion

Standard reward terms for a quadruped:

Each term carries its own weight, plus a survival bonus per step. Different teams use 8-30 reward terms; ablations consistently show 5-7 terms are sufficient with proper tuning.
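A weighted-sum locomotion reward of the kind described here can be sketched as follows. The specific terms (velocity tracking, yaw-rate tracking, torque penalty, action-rate penalty, survival bonus) and every weight value are illustrative assumptions, not the values used by any particular team or by this note.

```python
import math

# Hypothetical weights for illustration only; tune per robot and task.
WEIGHTS = {
    "lin_vel_tracking": 1.0,
    "yaw_rate_tracking": 0.5,
    "torque": -1e-4,
    "action_rate": -0.01,
    "survival": 0.2,
}


def locomotion_reward(cmd_vx, vx, cmd_yaw_rate, yaw_rate, torques,
                      action, prev_action, sigma=0.25):
    """Return (total_reward, per_term_values) for one control step."""
    terms = {
        # Exponential kernels give a bounded reward that peaks at zero error.
        "lin_vel_tracking": math.exp(-((cmd_vx - vx) ** 2) / sigma),
        "yaw_rate_tracking": math.exp(-((cmd_yaw_rate - yaw_rate) ** 2) / sigma),
        # Quadratic penalties (their weights are negative).
        "torque": sum(t * t for t in torques),
        "action_rate": sum((a - b) ** 2 for a, b in zip(action, prev_action)),
        "survival": 1.0,
    }
    total = sum(WEIGHTS[name] * value for name, value in terms.items())
    return total, terms
```

Returning the per-term values alongside the total makes it easy to log each component and see which term dominates when a policy starts exploiting the reward.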

5.7 Curriculum design

Standard curriculum stages:

  1. Flat ground, low velocity command.
  2. Flat ground, full velocity range.
  3. Small terrain perturbations.
  4. Stairs (up + down).
  5. Slopes 0-30°.
  6. Random rough terrain.
  7. Push disturbances.
  8. Payload mass changes.

Advance to next stage when success rate at current stage > 90% over 200 episodes.
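The advancement rule (success rate above 90% over the last 200 episodes) maps onto a fixed-size window. The class below is a minimal sketch under that reading; the class name and the choice to clear the window after each promotion are assumptions.

```python
from collections import deque


class Curriculum:
    """Advance one stage when the windowed success rate exceeds a threshold."""

    def __init__(self, num_stages=8, threshold=0.9, window=200):
        self.stage = 0
        self.num_stages = num_stages
        self.threshold = threshold
        self.results = deque(maxlen=window)

    def record(self, success):
        """Log one episode outcome and return the (possibly updated) stage."""
        self.results.append(bool(success))
        window_full = len(self.results) == self.results.maxlen
        if window_full and self.stage < self.num_stages - 1:
            rate = sum(self.results) / len(self.results)
            if rate > self.threshold:
                self.stage += 1
                # Start a fresh window so old-stage results do not count.
                self.results.clear()
        return self.stage
```

In a massively parallel setup the same logic is often applied per environment or per terrain tile rather than globally, which this sketch does not attempt.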

6. Benchmark suites & simulators

6.1 Manipulation benchmarks

Suite                   | Authors                      | Embodiment     | Tasks     | Notes
RLBench                 | James et al. Imperial 2020   | Franka         | 100+      | CoppeliaSim-based
robosuite               | Zhu-Lin et al. Stanford 2020 | Franka, Sawyer | 9 base    | MuJoCo
ManiSkill / ManiSkill 3 | Mu et al. UCSD 2021 / 2024   | Various        | 30+       | SAPIEN; large + curated
MetaWorld               | Yu et al. Stanford 2019      | Sawyer         | 50        | Meta-RL focus
D’Claw / D’Hand (ROBEL) | Ahn-Yu et al. 2019           | Custom         | dexterous | Real-world testbed
PerAct                  | Shridhar et al. UW 2022      | Franka         | language  | Voxel-based; perceiver
Calvin                  | Mees et al. Freiburg 2022    | Franka         | language  | Long-horizon language tasks
VLABench                | various 2024                 | mixed          | VLA eval  | Standardized VLA eval
LIBERO                  | Liu et al. UT Austin 2023    | Franka         | continual | Lifelong learning

6.2 Locomotion benchmarks

Suite                     | Authors                           | Embodiment                      | Notes
DM Control Suite          | Tassa et al. DeepMind 2018        | MuJoCo humanoid + cheetah + etc | Continuous control standard
HumanoidBench             | Sferrazza-Huang-Lin Berkeley 2024 | Humanoid                        | Whole-body benchmark
Walk These Ways           | Margolis-Agrawal MIT 2023         | Mini-Cheetah                    | Open-source parkour
Quadruped-Gym (Isaac Lab) | NVIDIA 2023                       | ANYmal, Unitree                 | Production-grade
Locomotion-Bench          | various                           | Cassie, Digit, A1               | Cross-embodiment
Loco-Manip Bench          | various 2024                      | Quadruped + arm                 | Mobile manipulation

6.3 Simulators (cross-reference with sim-to-real)

Simulator             | Owner                                    | Strength                    | RL-class
Isaac Lab / Isaac Sim | NVIDIA                                   | GPU-resident, photorealistic | Manipulation + Locomotion
MuJoCo MJX            | DeepMind                                 | JAX, GPU-vectorized         | Manipulation + Locomotion
Brax                  | Google                                   | Pure JAX                    | Locomotion + research
Genesis               | CMU 2024                                 | Unified rigid + soft + fluid | Emerging
robosuite             | Stanford ARISE                           | MuJoCo + manipulation focus | Manipulation
ManiSkill (SAPIEN)    | UCSD                                     | Manipulation benchmarks     | Manipulation
RLBench               | Imperial                                 | CoppeliaSim manipulation    | Manipulation
MuJoCo MPC            | Howell-Lutter-Posa-Tedrake DeepMind 2022 | Real-time MPC + RL          | Hybrid
ROBEL                 | Ahn-Yu et al. 2019                       | Real-world cheap testbed    | Real-world RL

7. Design heuristics

  • Start with PPO. It’s the workhorse; try other algorithms only after PPO clearly fails. SAC is the alternative for off-policy needs.
  • Spend 80% of time on the reward function. A good policy on a bad reward solves the wrong task. Iterate the reward in sim until visual behavior matches intent.
  • Normalize observations. Subtract mean, divide by std. Otherwise large-magnitude inputs dominate gradients.
  • Use action smoothing. Penalize $\|a_t - a_{t-1}\|^2$ in reward; clip action change rate. Otherwise sim-only “vibration” exploits emerge.
  • Train at deployment control rate. A policy trained at 1000 Hz that runs at 50 Hz on hardware fails. Match rates.
  • Domain randomization is non-negotiable. Skip DR → sim policy fails on hardware. Start with literature defaults; tighten when sysid improves.
  • Use teacher-student. Privileged-teacher (sees true dynamics) → student (sees only deployable observations). Reliably better than direct student training.
  • Hand-design observations carefully. Including a feature that’s noisy on hardware but clean in sim → policy keys on it and fails. If in doubt, omit.
  • Curriculum > flat training. Linear curriculum (slowly increase difficulty) beats curriculum-free training by 2-10× in wall-clock.
  • Save policies often + log everything. RL training is unstable; the best policy is often not the latest. Keep a rolling top-5 and replay videos.
  • Test the policy on the failure modes you expect. Specifically test stairs, payload, pushes. Headline metrics hide tail-case failures.
  • Don’t fully fine-tune VLAs by default. Start with LoRA + small learning rate. A full fine-tune on a small task dataset tends to destroy the pretrained priors.
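Of the heuristics above, observation normalization is the easiest to get subtly wrong, so here is a minimal running-statistics sketch using Welford's online update. It handles a single scalar feature for clarity; the class name, the clipping range, and the epsilon are assumptions, and a real pipeline would vectorize this across observation dimensions.

```python
import math


class RunningNorm:
    """Track mean/variance online and normalize observations with them."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        """Welford's update: numerically stable single-pass mean/variance."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x, clip=5.0):
        """Subtract mean, divide by std, and clip outliers."""
        var = self.m2 / self.count if self.count > 0 else 1.0
        z = (x - self.mean) / math.sqrt(var + self.eps)
        return max(-clip, min(clip, z))
```

The statistics gathered in simulation need to be frozen and shipped with the policy; recomputing them on hardware would change the input distribution the network was trained on.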

8. Failure modes & debugging

  • Reward hacking. Policy maximizes reward proxy but doesn’t accomplish task. Detect: high reward but visually incorrect behavior. Fix: tighter reward design, sparse-reward shaping with intrinsic motivation.
  • NaN losses. Optimization diverges. Cause: too-large learning rate, normalization issue, unstable critic. Fix: gradient clipping, value-loss clipping, lower LR.
  • Sim-only exploit. Policy works in sim, fails in real. Cause: physics-engine-specific quirks. Fix: DR + sysid + match control rate + action smoothness.
  • Catastrophic forgetting. Policy good at task A, retrained on task B, fails A. Fix: PPO trust-region; replay buffer with old episodes; multi-task training.
  • Mode collapse. Continuous-action policy outputs same action regardless of state. Cause: low entropy, too-deterministic policy. Fix: entropy bonus, SAC with auto-temperature.
  • Slow training. PPO not converging after expected steps. Cause: reward poorly shaped, normalization missing, network too small/big. Fix: sanity-check on toy task first, scale gradients.
  • OOD generalization failure. Train on terrain A, fail on terrain B. Fix: more DR, broader curriculum, terrain randomization.
  • Critic divergence. Value estimates explode. Cause: bootstrapping with bad target. Fix: target network, value-loss clipping, GAE.
  • Action saturation. Policy commands joint torques at limit always. Fix: torque penalty in reward, clipping in environment, lower action scale.
  • Real-world inverse-kinematics blow-up. Policy commands end-effector position outside reachable workspace. Fix: kinematic feasibility filter, train with workspace constraints.
  • Failure to recover from disturbance. Push the robot — it falls. Fix: train with random push disturbances in DR.

9. Case studies

9.1 OpenAI Dactyl Rubik’s Cube (2018-2019)

Akkaya et al. OpenAI. Shadow Hand 24-DoF, MuJoCo sim, PPO + LSTM + ADR. Trained ~13,000 simulated years (8 months wall-clock, ~$3M GPU). Solved cube on hardware ~60% of the time. The single most-cited sim-to-real RL milestone. Later post-mortems noted narrow generalization (works on Cubes, not on novel objects), but the technique (PPO + LSTM memory + ADR) became standard.

9.2 ANYmal-C in production (ANYbotics 2020-2026)

ANYbotics (ETH spin-out) produces ANYmal-C/D for industrial inspection. Initial 2020 controllers were hand-coded MPC. 2022+ RL policies trained in RaiSim, distilled from teacher-student, deployed to customer fleets (BP Mad Dog, Shell Nyhamna, Statkraft, ÖBB rail). >99% session completion rate. Demonstrates that sim-to-real RL is production-grade.

9.3 Cassie 5 km outdoor run (Crowley-Dao-Hurst 2022)

Helei Duan, Jeremy Dao, Jonathan Hurst (Agility Robotics / OSU). Cassie biped runs 5 km on the Oregon State University campus in about 53 minutes (~1.6 m/s average). Single end-to-end RL policy (PPO) trained in MuJoCo + Isaac Gym. DR over friction, mass, motor delay, inclines, pushes. First demonstration of end-to-end RL bipedal running outdoors. Precursor to all subsequent humanoid RL work.

9.4 MIT Mini-Cheetah parkour (Margolis et al. 2023)

Gabriel Margolis, Ge Yang, Kartik Paigwar, Tao Chen, Pulkit Agrawal (MIT). Two-stage curriculum: blind locomotion, then vision-guided. Stairs, gaps up to 60 cm, vertical leaps. Open-sourced “Walk These Ways” (WTW), which became a reference codebase. Sparked the parkour wave (Cheng et al. CMU, Zhuang et al., etc.).

9.5 OpenAI Rubik’s Cube → Tesla Optimus (claimed) / Figure / 1X / Apptronik

The Dactyl + Cassie + ANYmal recipe (PPO + DR + teacher-student + Isaac Lab) is what every humanoid program now uses. Tesla Optimus walks outdoors (claimed RL+MPC hybrid); Figure 02 + Helix combines VLA (S2) + reactive RL controller (S1); 1X NEO Gamma uses cable-driven hardware + learned policies; Apptronik Apollo similar. As of 2026 most humanoid OEMs are converging on the NVIDIA Isaac Lab + GR00T pipeline.

9.6 Diffusion Policy + ALOHA (Stanford 2023-2024)

Cheng Chi, Siyuan Feng, Yilun Du et al. (Columbia + Stanford). Diffusion Policy (RSS 2023). Stanford Mobile ALOHA (2024). Bimanual dexterous tasks (folding, cooking, dishwashing) via teleop + diffusion-policy + co-training. Demonstrated 30-minute end-to-end household task completion. Watershed for IL+RL hybrid manipulation.

9.7 Open-X-Embodiment + RT-X (50+ labs 2023)

Padalkar et al. + 50+ labs pooled their robot data → 1M trajectories, 22 embodiments, 527 skills. RT-1-X trained on the union improved over per-lab RT-1 by ~50%. Established cross-embodiment training as standard. Foundation for OpenVLA, Octo, π0.

9.8 OpenVLA (Stanford / TRI 2024)

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al. (Stanford + TRI + UC Berkeley + Google). 7B-param VLA. Llama2-7B language backbone + DINOv2 + SigLIP fused vision. Trained on OXE. Open weights + LoRA fine-tuning recipe. First widely-deployable open VLA. Single RTX 4090 can LoRA-fine-tune. Powered hundreds of academic + startup deployments through 2025.

9.9 π0 + π0.5 (Physical Intelligence 2024-2025)

Physical Intelligence (PI) was founded by a team including Sergey Levine, Karol Hausman, and Chelsea Finn. π0: 3B-param flow-matching VLA. Trained on ~1M demonstrations from 7 robot platforms. 50 Hz inference. Public demos: laundry folding, table bussing, packing groceries. Zero-shot generalization to embodiments not in training data (1X EVE, after training only on Franka data). π0.5 (2025): generalization beyond training distribution. Open weights for research. PI’s commercial demos in 2025 spawned dozens of integrator deployments.

9.10 GR00T N1 + Cosmos (NVIDIA 2024-2025)

GR00T (Generalist Robot 00 Technology): NVIDIA’s humanoid foundation model. Trained on a mixture of human-MoCap data, sim trajectories, real-robot demonstrations. Embodiment partners: Unitree H1/G1, Apptronik Apollo, 1X NEO, Fourier GR-1, Boston Dynamics electric Atlas. Cosmos: NVIDIA’s world-foundation-model for synthetic data generation. GR00T + Cosmos + Isaac Lab + Jetson Thor is NVIDIA’s full-stack humanoid offering — sell the compute and the model to humanoid OEMs lacking in-house RL/IL talent.

9.11 Figure Helix (2025)

Figure AI’s onboard VLA for Figure 02 humanoid. Dual-system: S1 (200 Hz reactive ViT-style) + S2 (5-10 Hz large VLM reasoning). Demo: two humanoids cooperatively put groceries away in a kitchen, with continuous voice instruction. First public humanoid VLA running fully onboard at production rates. Sparked the humanoid foundation model deployment wave at Figure, 1X, Apptronik, NVIDIA.

9.12 Toyota Research Institute Diffusion Policy + Hyperdiffusion (2024-2025)

TRI built a Diffusion Policy variant (sometimes called “Hyperdiffusion”) for robotic manipulation. Trained on ~1000 teleop demos per task. Demonstrated 200+ tasks deployed on TRI’s research arms. Released open-weight checkpoints for several manipulation primitives. Validated diffusion policies in industrial-strength integration.

9.13 Decision Transformer for offline RL (Chen et al. UC Berkeley 2021)

Lili Chen, Kevin Lu, Aravind Rajeswaran et al. Cast offline RL as sequence modeling: condition on return-to-go; predict next action with causal transformer. Strong on D4RL benchmarks. Predecessor to all transformer-based VLA work — same architectural primitive.

9.14 Wu-Escontrela-Hafner-Goldberg-Abbeel 2022: Dreamer on real quadruped

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, Pieter Abbeel (DayDreamer). Trained DreamerV2 (Hafner 2020) latent-imagination policy on a real Unitree A1 quadruped from raw images. Walking learned in 1 hour of real-world data. Demonstrates model-based RL can be sample-efficient enough for direct real-world training without sim.

10. Safe and constrained RL

10.1 Constrained MDPs

A CMDP adds constraint signals $c_i(s, a)$ and budgets $d_i$ to the standard MDP. The policy must satisfy $\mathbb{E}_{\pi}\left[\sum_t \gamma^t c_i(s_t, a_t)\right] \le d_i$ while maximizing reward. Methods:

  • CPO — Constrained Policy Optimization (Achiam 2017). Trust-region method with constraint line-search.
  • Lagrangian methods. Treat constraints with dual variables; jointly optimize. Standard in safe-RL libraries (Safe-RLlib, OmniSafe).
  • Reward-augmentation. Push constraint violation into the reward; brittle but simple.
  • CBF — Control Barrier Functions (Ames-Coogan-Egerstedt 2019). Filter actions through a safety layer that guarantees the system stays in a safe set.

Used in industrial deployments where regulatory or physical safety constraints are hard. ANYmal’s RL controller includes CBF-style joint limit + torque limit filters at deployment.
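The Lagrangian approach listed above reduces to two small pieces: a penalized objective for the policy and a projected gradient-ascent step on the multiplier. The sketch below shows both for a single constraint; the function names and learning rate are assumptions, and libraries differ in details such as multiplier parameterization.

```python
def lagrangian_objective(reward_return, cost_return, lmbda, cost_budget):
    """Policy objective: reward minus the multiplier-weighted constraint violation."""
    return reward_return - lmbda * (cost_return - cost_budget)


def update_lagrange_multiplier(lmbda, episode_cost, cost_budget, lr=0.05):
    """Dual ascent step: raise lambda when cost exceeds the budget, lower it otherwise.

    The max(0, .) projection keeps the multiplier non-negative.
    """
    return max(0.0, lmbda + lr * (episode_cost - cost_budget))
```

The policy is trained to maximize `lagrangian_objective` while the multiplier is updated on a slower timescale, so the penalty grows only as long as the constraint keeps being violated.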

10.2 Safe exploration during training

  • Recoverable states. Start in safe state; abort and reset when leaving recovery region.
  • Pretrained safety filter. Train a critic that predicts “will this action lead to a failure within N steps?”; only sample safe actions during exploration.
  • Human-in-the-loop. Operator vetoes dangerous actions; learns over time. Used in surgical-robot RL projects.

10.3 Risk-sensitive RL

Standard RL optimizes the expectation. Risk-sensitive variants:

  • CVaR (Conditional Value at Risk). Optimize the expected return conditional on the worst $\alpha$ fraction of outcomes.
  • Distributional RL (Bellemare-Dabney-Munos 2017 C51, IQN, QR-DQN). Learn the full return distribution, not just the mean.
  • Worst-case RL. Minimax over adversarial dynamics. Pinto’s RARL.
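As a concrete reading of CVaR, the sketch below estimates it empirically from sampled episode returns by averaging the lowest `alpha` fraction. The function name and the rounding-up convention for the tail size are assumptions.

```python
import math


def cvar(returns, alpha=0.1):
    """Mean of the worst alpha-fraction of sampled returns (lower tail)."""
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)
    # Always keep at least one sample in the tail.
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k
```

With `alpha=1.0` this reduces to the ordinary mean, so sweeping `alpha` downward interpolates between risk-neutral and worst-case evaluation.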

11. Skills, hierarchy, and long-horizon RL

11.1 Options framework (Sutton-Precup-Singh 1999)

An option is a temporally-extended action: $o = (\mathcal{I}, \pi_o, \beta)$ where $\mathcal{I}$ is the initiation set, $\pi_o$ the option’s policy, $\beta$ the termination function. The high-level policy chooses options; each option runs until termination. Decomposes long-horizon tasks.
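The three components of an option map directly onto a small data structure plus an execution loop. In this sketch the termination function is deterministic (a simplification of the probabilistic termination in the general framework), and `step_fn` is a hypothetical environment transition function.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Option:
    can_start: Callable[[Any], bool]    # initiation set membership test
    policy: Callable[[Any], Any]        # intra-option policy
    should_stop: Callable[[Any], bool]  # termination condition (deterministic here)


def run_option(option, state, step_fn, max_steps=100):
    """Execute one option until it terminates; return (final_state, steps_taken)."""
    if not option.can_start(state):
        raise ValueError("option cannot be initiated in this state")
    steps = 0
    while steps < max_steps:
        state = step_fn(state, option.policy(state))
        steps += 1
        if option.should_stop(state):
            break
    return state, steps
```

A high-level policy would call `run_option` repeatedly, choosing among the options whose `can_start` test passes in the current state.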

11.2 Hierarchical RL

  • HRL via subgoal generation. High-level policy proposes subgoals; low-level policy reaches them. HIRO (Nachum et al. 2018), HAC (Levy 2019).
  • MAXQ decomposition (Dietterich 2000). Recursive task graph.
  • Skill-conditioned policies (DIAYN, Eysenbach 2019). Discover skills via mutual-information maximization between latent skill and trajectories.

11.3 Diffusion + skill chaining

Recent trend: a diffusion policy generates an action chunk (50-100 steps); the system re-plans at chunk boundaries. Effectively a 2-level hierarchy without explicit subgoals. Chi et al. Diffusion Policy + ALOHA ACT are the canonical examples; π0 generalizes to flow-matching action chunks.

11.4 Language-conditioned policies

LLMs ground task decomposition. SayCan (Brohan-Ichter et al. Google 2022): LLM proposes high-level plan; affordance model picks executable next step; low-level skill policy executes. Code-as-Policies (Liang et al. 2023): LLM emits Python code calling primitives. VoxPoser (Huang et al. 2023): LLM generates value maps for downstream MPC.

12. Foundation-model details

12.1 Action representation

VLA / generalist policies vary in how they emit actions:

  • Discrete tokens (RT-2, OpenVLA). Discretize each action dimension into 256 bins; the model emits a token sequence per timestep. Compatible with LM training infrastructure.
  • Diffusion (Octo, Diffusion Policy). Predict noise iteratively; produce continuous action chunks.
  • Flow matching (π0). Learn the velocity field of a continuous-time flow from noise to action. Faster than diffusion (4-8 steps vs 16-100).
  • Direct regression (ACT, simple BC). Predict actions in a single forward pass. Cheap but cannot model multi-modal distributions.
  • Mixture-of-Gaussians (Robomimic baselines). K-mode GMM head; multi-modal but limited K.
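
To make the flow-matching bullet concrete, the sketch below shows only the sampling side: integrating a velocity field from noise at t=0 to an action at t=1 with a few Euler steps. The velocity field here is a closed-form toy (`oracle_velocity`, a hypothetical name) that stands in for the learned, observation-conditioned network; it is not how π0 or any specific model is implemented.

```python
import numpy as np

def oracle_velocity(x, t, target):
    """Toy velocity field for a straight-line path ending at `target`.

    Stands in for a learned network v(x, t | observation).
    """
    return (target - x) / (1.0 - t)

def sample_action(target, num_steps, rng):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action) with forward Euler."""
    x = rng.standard_normal(target.shape)    # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                           # t stays strictly below 1
        x = x + dt * oracle_velocity(x, t, target)
    return x

rng = np.random.default_rng(0)
target_action = np.array([0.2, -0.5, 0.1])   # hypothetical 3-DoF action
action = sample_action(target_action, num_steps=8, rng=rng)
```

With a learned network the integration is approximate rather than exact, but the loop has the same shape, and the small number of steps is where the speed advantage over many-step diffusion sampling comes from.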

12.2 Vision backbones

  • DINOv2 (Oquab et al. Meta 2023). Self-supervised ViT; used in OpenVLA, where its features are fused with SigLIP features.
  • SigLIP (Zhai-Mustafa-Kolesnikov-Beyer Google 2023). Sigmoid-loss CLIP variant; better than CLIP for retrieval.
  • CLIP ViT-L (Radford et al. OpenAI 2021). Original vision-language alignment.
  • R3M (Nair-Rajeswaran-Kumar-Finn-Gupta Stanford 2022). Pretrained on Ego4D for manipulation.
  • VC-1 (Majumdar et al. Meta 2023) / VIP (Ma et al. 2022). Pretrained visual representations for embodied AI; VC-1 is the “visual cortex” model.
  • EVA / EVA-02 (Fang-Wang-Xie-Sun-Wu-Wang-Cao 2023). Open ViT pretraining.

12.3 Cross-embodiment recipe

The Open-X-Embodiment recipe that became standard:

  1. Tokenize observations: image (ViT patches), text (LLaMA tokens), proprioception (linear projection).
  2. Tokenize actions per dimension to 256 bins; emit one token per dim per timestep.
  3. Encode robot embodiment with a learnable embedding ID; concatenate to observation tokens.
  4. Train with next-token loss on shuffled cross-embodiment data.
  5. At deployment, decode tokens to continuous actions via inverse bin mapping.

The single shared model handles 22 embodiments with the same parameters — cross-embodiment generalization emerges from training distribution diversity.
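
Steps 2 and 5 of the recipe (per-dimension binning and the inverse bin mapping) can be sketched as follows. This assumes uniform bins over fixed per-dimension bounds; the bounds, the 7-DoF layout, and the function names are illustrative assumptions, and real pipelines may choose the bounds differently (for example from dataset statistics).

```python
import numpy as np

NUM_BINS = 256

def tokenize_actions(actions, low, high, num_bins=NUM_BINS):
    """Map continuous actions to integer bin indices in [0, num_bins - 1], per dimension."""
    clipped = np.clip(actions, low, high)
    normalized = (clipped - low) / (high - low)            # in [0, 1]
    tokens = np.floor(normalized * num_bins).astype(np.int64)
    return np.minimum(tokens, num_bins - 1)                # value == high falls in the last bin

def detokenize_actions(tokens, low, high, num_bins=NUM_BINS):
    """Inverse bin mapping: return the center of each bin as the continuous action."""
    return low + (tokens + 0.5) / num_bins * (high - low)

# Hypothetical 7-DoF action bounds (6 end-effector deltas + gripper).
low = np.array([-1.0] * 6 + [0.0])
high = np.array([1.0] * 6 + [1.0])
action = np.array([0.25, -0.8, 0.0, 1.0, -1.0, 0.5, 0.9])

tokens = tokenize_actions(action, low, high)
recovered = detokenize_actions(tokens, low, high)
max_error = np.max(np.abs(recovered - action))
```

The round-trip error is bounded by half a bin width per dimension, which is the resolution cost of emitting actions as discrete tokens.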

12.4 Compute budget for VLA training

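| Model | Params | Pretraining data | Compute | Wall-clock |
|---|---|---|---|---|
| RT-1 | 35M | 130k demos | 8 TPU-v3 | days |
| RT-2 5B | 5B | OXE + internet | 64 TPU-v4 | weeks |
| OpenVLA 7B | 7B | OXE | 64 A100 | 14 days |
| Octo Base | 93M | OXE | 8 H100 | days |
| π0 | 3B | ~1M demos × 7 platforms | 256 H100 | weeks |
| GR00T N1 | ~2B | mixed sim + real + MoCap | 1024 H100 | weeks |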

Fine-tuning is dramatically cheaper. OpenVLA LoRA fine-tune: 50-200 demos, 4-24 hours on a single RTX 4090.

13. Practical training tips and reward-design patterns

13.1 Standard locomotion reward template

# Velocity tracking
r_v   = w_v * exp(-||v_cmd - v_actual||^2 / sigma_v)
# Yaw tracking
r_yaw = w_yaw * exp(-||yaw_cmd - yaw_actual||^2 / sigma_yaw)
# Energy / smoothness penalties
r_t   = -w_t * ||tau||^2
r_a   = -w_a * ||a_dot||^2
r_jerk= -w_jerk * ||a_ddot||^2
r_q   = -w_q * ||q - q_default||^2  # nominal posture
r_orient = -w_o * ||base_roll_pitch||^2
# Foot contact / clearance
r_foot = w_foot * sum(foot_clearance > threshold during swing)
# Survival bonus
r_alive = w_alive  # per step alive
# Terminal cost on falls
r_term = -w_fall if base_height < threshold or base_orient_angle > limit

Tune weights iteratively by inspecting policy behavior in sim. Common pitfall: too-high velocity weight → policy ignores torque cost → fast but jittery.
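
A runnable sketch of a subset of the template (velocity tracking, yaw tracking, torque and action-rate penalties, alive bonus, fall penalty). All weights, scales, and the height threshold are illustrative assumptions, and the action rate is approximated by a finite difference of consecutive actions without dividing by the timestep.

```python
import numpy as np

# Illustrative weights and scales; real values are tuned per robot.
WEIGHTS = dict(v=1.0, yaw=0.5, tau=1e-4, a=1e-2, alive=0.2, fall=10.0)
SIGMA_V, SIGMA_YAW = 0.25, 0.25

def locomotion_reward(v_cmd, v_actual, yaw_cmd, yaw_actual, tau,
                      action, prev_action, base_height, min_height=0.25):
    """Return (total_reward, terms) for one step using a subset of the template terms."""
    terms = {
        "r_v": WEIGHTS["v"] * np.exp(-np.sum((v_cmd - v_actual) ** 2) / SIGMA_V),
        "r_yaw": WEIGHTS["yaw"] * np.exp(-((yaw_cmd - yaw_actual) ** 2) / SIGMA_YAW),
        "r_t": -WEIGHTS["tau"] * np.sum(tau ** 2),
        "r_a": -WEIGHTS["a"] * np.sum((action - prev_action) ** 2),  # finite-difference action rate
        "r_alive": WEIGHTS["alive"],
        "r_term": -WEIGHTS["fall"] if base_height < min_height else 0.0,
    }
    return float(sum(terms.values())), terms

# Perfect tracking with zero torque and a constant action.
zeros = np.zeros(12)
total, terms = locomotion_reward(
    v_cmd=np.array([1.0, 0.0]), v_actual=np.array([1.0, 0.0]),
    yaw_cmd=0.0, yaw_actual=0.0, tau=zeros,
    action=zeros, prev_action=zeros, base_height=0.4,
)
```

Returning the individual terms alongside the total makes it easier to log each component and spot the imbalance described above, where one weight dominates the others.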

13.2 Standard manipulation reward template

# Reach phase
r_reach = -||p_ee - p_goal||
# Grasp phase
r_grasp = +w_grasp if contact and object_lifted
# Place phase
r_place = -||p_obj - p_goal||
# Action smoothness
r_smooth = -w_smooth * ||a_dot||^2
# Collision penalty
r_collision = -w_col if collision_external
# Sparse success
r_success = +R_terminal if task_complete
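
One way the phases above could be combined into a single per-step function is sketched below. It assumes the reach target is the object position, gates the place term on a boolean `grasped` flag, and omits the smoothness term; the weights and the function name are hypothetical.

```python
import numpy as np

def manipulation_reward(p_ee, p_obj, p_goal, grasped, collided, success,
                        w_grasp=1.0, w_col=5.0, r_terminal=10.0):
    """Staged reward: reach toward the object, then carry it toward the goal."""
    if not grasped:
        reward = -np.linalg.norm(p_ee - p_obj)             # reach phase
    else:
        reward = w_grasp - np.linalg.norm(p_obj - p_goal)  # grasp bonus + place phase
    if collided:
        reward -= w_col                                    # collision penalty
    if success:
        reward += r_terminal                               # sparse success bonus
    return float(reward)

# Reaching: end effector is 0.5 m from the object, nothing grasped yet.
reach_reward = manipulation_reward(
    p_ee=np.zeros(3), p_obj=np.array([0.3, 0.4, 0.0]), p_goal=np.array([1.0, 0.0, 0.0]),
    grasped=False, collided=False, success=False,
)
# Object held at the goal and task complete.
final_reward = manipulation_reward(
    p_ee=np.array([1.0, 0.0, 0.0]), p_obj=np.array([1.0, 0.0, 0.0]),
    p_goal=np.array([1.0, 0.0, 0.0]),
    grasped=True, collided=False, success=True,
)
```

Adding the grasp bonus to the place phase keeps the reward from dropping at the moment of grasping, which would otherwise discourage the policy from ever closing the gripper.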

13.3 PPO hyperparameter defaults (legged locomotion)

learning_rate: 3e-4 (linear decay to 1e-5)
discount gamma: 0.99
GAE lambda: 0.95
clip range: 0.2
value loss coef: 1.0
entropy coef: 0.005
minibatch size: 24576 (4096 envs × 6 steps per env)
training epochs: 5
KL target (early-stop): 0.01
network: 3 hidden layers, [512, 256, 128] units, ELU activation
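
For reference, the `discount gamma` and `GAE lambda` settings enter PPO through the advantage estimator. A minimal pure-Python sketch of the standard backward recursion follows; the function name and the toy rollout are illustrative, and production code would vectorize this over environments.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (backward recursion)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD residual; bootstrapping is cut at episode ends.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]   # value-function targets
    return advantages, returns

# Toy 3-step episode with unit rewards and an untrained (zero) critic.
advantages, returns = compute_gae(
    rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0],
    last_value=0.0, dones=[False, False, True],
)
```

Lowering lambda shortens the effective horizon of the estimator (less variance, more bias from the critic); lambda = 1 recovers the discounted Monte Carlo return minus the value baseline.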

13.4 SAC hyperparameter defaults (manipulation)

learning_rate: 3e-4
batch size: 256
buffer size: 1e6
gamma: 0.99
tau (target update): 0.005
auto entropy tuning: on
target entropy: -dim(action)
warmup steps: 10000
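
Two of these settings are simple enough to show directly: the `tau` target update (Polyak averaging) and the `-dim(action)` target-entropy heuristic. The sketch below uses plain NumPy arrays as stand-ins for network parameters; the function names are illustrative.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online, per tensor."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

def default_target_entropy(action_dim):
    """Common heuristic for automatic entropy tuning: -dim(action)."""
    return -float(action_dim)

# Stand-in "networks": the target starts at 0 and tracks an online network fixed at 1.
online = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]
for _ in range(100):
    target = soft_update(target, online)

target_entropy = default_target_entropy(action_dim=7)
```

With tau = 0.005 the target has covered only about 39% of the gap after 100 updates, which illustrates how slowly the target critic moves and why it stabilizes the bootstrapped Q-targets.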

14. Tooling ecosystem

14.1 Open-source training libraries

| Library | Maintainer | Use |
|---|---|---|
| Stable-Baselines3 | community | Reliable PPO/SAC/TD3 for CPU/GPU |
| RLlib (Ray) | Anyscale | Production-grade distributed RL |
| CleanRL | Costa Huang community | Single-file readable implementations |
| Tianshou | TsinghuaML | Modular PyTorch RL |
| Acme | DeepMind | Research-focused |
| Sample Factory | community | Throughput-optimized async PPO |
| RSL_RL | Robotic Systems Lab ETH | PPO for legged; standard in Isaac Lab |
| RL Games | NVIDIA | Used in Isaac Gym / Isaac Lab |
| skrl | community | Isaac Lab-compatible |

14.2 Foundation-model toolkits

| Library | Provider | Use |
|---|---|---|
| LeRobot | Hugging Face | End-to-end IL pipeline (ACT, Diffusion Policy, π0) |
| openpi | Physical Intelligence | π0 / π0.5 inference + fine-tune |
| openvla | Stanford / TRI | OpenVLA inference + LoRA |
| octo | UC Berkeley | Octo training + checkpoints |
| robosuite | Stanford | Manipulation benchmarks on MuJoCo |
| Diffusion Policy ref | Columbia | Reference Diffusion Policy code |
| TRL | Hugging Face | RLHF library (LLM RL) |

14.3 Deployment / inference

| Tool | Use |
|---|---|
| ONNX Runtime | Cross-platform inference |
| TensorRT | NVIDIA-optimized inference |
| TorchScript | PyTorch deployment |
| OpenVINO | Intel CPU/GPU inference |
| TFLite / Coral | Edge inference (small models) |
| Triton Inference Server | NVIDIA server-side serving |

14.4 Logging and experiment tracking

| Tool | Use |
|---|---|
| Weights & Biases | Cloud experiment tracking |
| TensorBoard | Local visualization |
| MLflow | Open-source experiment tracking |
| Aim | Open-source ML tracking |
| Neptune | Cloud tracking with collaboration |

15. Open challenges (2026)

  • Sample efficiency in real world. Sim-to-real works but the sim isn’t always good enough. Real-world RL on legged + manipulators is still slow.
  • Safe exploration. No widely-deployed solution for safe random exploration on expensive humanoids.
  • Long-horizon multi-task. Tasks > 60 s with multiple skills still require hierarchy + scripting; pure RL fails to assemble the skills into a complete sequence.
  • Generalization across embodiments. OXE shows positive transfer but a Franka policy still doesn’t directly work on a UR5 without fine-tuning.
  • Language conditioning for compositional behaviors. VLAs understand simple commands; compositional (“first do X, then if Y, do Z”) remains brittle.
  • Reward specification. Hand-designed rewards are brittle; preference learning + LLM-generated rewards (Eureka, Ma et al. 2023; Language to Rewards, Yu et al. 2023) are early-stage.
  • Real-time inference on edge hardware. Large VLAs need server GPUs. Distillation + dual-system (Helix S1/S2) are partial answers; smaller native models are an open frontier.
  • Robustness to adversarial perturbations. Policies can be fooled by trivial visual changes. Adversarial training is research-grade, not production-ready.

16. Glossary

  • A3C / A2C — Asynchronous (or Synchronous) Advantage Actor-Critic (Mnih 2016).
  • Action chunking — predicting + executing multi-step action sequences (ACT, Diffusion Policy).
  • ACT — Action Chunking with Transformers — Zhao 2023 ALOHA standard.
  • ADR — Automatic Domain Randomization — adaptive DR scheduling (OpenAI Dactyl 2019).
  • BC — Behavior Cloning — supervised learning from demonstrations.
  • CMDP — Constrained MDP — adds budget constraints to standard MDP.
  • CPO — Constrained Policy Optimization — Achiam 2017.
  • CTDE — Centralized Training, Decentralized Execution — MARL paradigm.
  • CQL — Conservative Q-Learning — offline RL (Kumar 2020).
  • DAgger — Dataset Aggregation — interactive IL (Ross 2011).
  • DDPG — Deep Deterministic Policy Gradient — Lillicrap 2015.
  • Decision Transformer — sequence-modeling offline RL (Chen 2021).
  • Diffusion Policy — generative-model action policy (Chi 2023).
  • DR — Domain Randomization — vary sim parameters during training.
  • Dreamer — model-based latent-imagination RL (Hafner 2019-2023).
  • GAE — Generalized Advantage Estimation — variance-reduced advantage estimator.
  • GAIL — Generative Adversarial Imitation Learning — Ho-Ermon 2016.
  • GPS — Guided Policy Search — Levine-Koltun 2013; extended to end-to-end visuomotor policies by Levine et al. 2016.
  • HER — Hindsight Experience Replay — Andrychowicz 2017.
  • HRL — Hierarchical RL — multi-level policy decomposition.
  • IL — Imitation Learning — learning from demonstrations.
  • IQL — Implicit Q-Learning — offline RL (Kostrikov 2021).
  • IRL — Inverse Reinforcement Learning — infer reward from demonstrations.
  • LoRA — Low-Rank Adaptation — efficient fine-tuning (Hu 2021).
  • MDP — Markov Decision Process — standard RL formulation.
  • MPC — Model Predictive Control — receding-horizon optimization.
  • MuZero — learned-dynamics + MCTS (Schrittwieser 2020).
  • OXE — Open-X-Embodiment — cross-embodiment robot dataset (2023).
  • PETS — Probabilistic Ensembles + Trajectory Sampling — model-based RL (Chua 2018).
  • PILCO — Gaussian-Process model-based RL (Deisenroth 2011).
  • PPO — Proximal Policy Optimization — Schulman 2017; the workhorse.
  • QT-Opt — large-scale real-world Q-learning (Kalashnikov 2018).
  • RL — Reinforcement Learning — sequential decision-making via reward.
  • RMA — Rapid Motor Adaptation — Kumar 2021 sim-to-real bridge.
  • RT-1 / RT-2 / RT-X — Robotics Transformer foundation models (Google 2022-2023).
  • SAC — Soft Actor-Critic — max-entropy off-policy (Haarnoja 2018).
  • TD3 — Twin Delayed DDPG — Fujimoto 2018.
  • TRPO — Trust Region Policy Optimization — Schulman 2015.
  • VLA — Vision-Language-Action model — multimodal generalist policy.
  • VLM — Vision-Language Model — image + text foundation model.

Further reading

  • Sutton R.S., Barto A.G. (2018) Reinforcement Learning: An Introduction, 2nd ed. MIT Press.
  • Mnih V. et al. (2015) “Human-level control through deep reinforcement learning.” Nature 518.
  • Schulman J., Wolski F., Dhariwal P., Radford A., Klimov O. (2017) “Proximal Policy Optimization Algorithms.” arXiv:1707.06347.
  • Schulman J., Levine S., Moritz P., Jordan M.I., Abbeel P. (2015) “Trust Region Policy Optimization.” ICML.
  • Haarnoja T., Zhou A., Abbeel P., Levine S. (2018) “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” ICML.
  • Fujimoto S., van Hoof H., Meger D. (2018) “Addressing Function Approximation Error in Actor-Critic Methods” (TD3). ICML.
  • Lillicrap T.P., Hunt J.J., Pritzel A., Heess N., Erez T., Tassa Y., Silver D., Wierstra D. (2016) “Continuous control with deep reinforcement learning” (DDPG). ICLR.
  • Andrychowicz M., Wolski F., Ray A., et al. (2017) “Hindsight Experience Replay.” NeurIPS.
  • Deisenroth M.P., Rasmussen C.E. (2011) “PILCO: A Model-Based and Data-Efficient Approach to Policy Search.” ICML.
  • Chua K., Calandra R., McAllister R., Levine S. (2018) “Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models” (PETS). NeurIPS.
  • Hafner D., Lillicrap T., Norouzi M., Ba J. (2021) “Mastering Atari with Discrete World Models” (Dreamer V2). ICLR.
  • Hafner D., Pasukonis J., Ba J., Lillicrap T. (2023) “Mastering Diverse Domains through World Models” (Dreamer V3). arXiv:2301.04104.
  • Schrittwieser J. et al. (2020) “Mastering Atari, Go, chess and shogi by planning with a learned model” (MuZero). Nature 588.
  • Tan J., Zhang T., Coumans E., Iscen A., Bai Y., Hafner D., et al. (2018) “Sim-to-Real: Learning Agile Locomotion For Quadruped Robots.” RSS.
  • Hwangbo J., Lee J., Dosovitskiy A., et al. (2019) “Learning agile and dynamic motor skills for legged robots.” Science Robotics 4.
  • Lee J., Hwangbo J., Wellhausen L., Koltun V., Hutter M. (2020) “Learning quadrupedal locomotion over challenging terrain.” Science Robotics 5.
  • Kumar A., Fu Z., Pathak D., Malik J. (2021) “RMA: Rapid Motor Adaptation for Legged Robots.” RSS.
  • Margolis G.B., Yang G., Paigwar K., Chen T., Agrawal P. (2024) “Rapid Locomotion via Reinforcement Learning.” IJRR.
  • Cheng X., Shi K., Agarwal A., Pathak D. (2024) “Extreme Parkour with Legged Robots.” ICRA.
  • OpenAI, Akkaya I., Andrychowicz M., et al. (2019) “Solving Rubik’s Cube with a Robot Hand.” arXiv:1910.07113.
  • Kalashnikov D., Irpan A., Pastor P., et al. (2018) “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation.” CoRL.
  • Brohan A. et al. (2022) “RT-1: Robotics Transformer for Real-World Control at Scale.” arXiv:2212.06817.
  • Brohan A. et al. (2023) “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv:2307.15818.
  • Padalkar A. et al. (2023) “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” arXiv:2310.08864.
  • Kim M.J. et al. (2024) “OpenVLA: An Open-Source Vision-Language-Action Model.” arXiv:2406.09246.
  • Octo Model Team (2024) “Octo: An Open-Source Generalist Robot Policy.” arXiv:2405.12213.
  • Physical Intelligence (2024) “π0: A Vision-Language-Action Flow Model for General Robot Control.”
  • Chen L., Lu K., Rajeswaran A., et al. (2021) “Decision Transformer: Reinforcement Learning via Sequence Modeling.” NeurIPS.
  • Janner M., Li Q., Levine S. (2021) “Offline Reinforcement Learning as One Big Sequence Modeling Problem.” NeurIPS.
  • Kumar A., Zhou A., Tucker G., Levine S. (2020) “Conservative Q-Learning for Offline Reinforcement Learning” (CQL). NeurIPS.
  • Kostrikov I., Nair A., Levine S. (2021) “Offline Reinforcement Learning with Implicit Q-Learning” (IQL). arXiv:2110.06169.
  • Levine S., Finn C., Darrell T., Abbeel P. (2016) “End-to-End Training of Deep Visuomotor Policies.” JMLR 17.
  • Sferrazza C., Huang D.M., Lin X., Lee Y., Abbeel P. (2024) “HumanoidBench.” arXiv:2403.10506.

Adjacent