Imitation Learning — LfD, Diffusion Policies, VLA Models for Robotics
Training robots from demonstration rather than reward: behavior cloning (BC), DAgger, inverse RL, generative-model policies (Diffusion Policy, RDT-1B, ACT), and the 2023–2026 generation of vision-language-action (VLA) models — RT-1, RT-2, RT-X, OpenVLA, π0, π0.5, Octo, Helix. These now constitute the dominant paradigm for general-purpose manipulation: a single network trained on tens of thousands of robot demonstrations (Open-X-Embodiment, DROID, BridgeData V2) acts across embodiments and tasks with text or image goals.
See also
- rl-for-control
- sim-to-real
- teleoperation-haptics
- manipulator-design
- mobile-manipulation
- computer-vision-robotics
- end-effectors
- bin-picking
1. At a glance
Imitation learning (IL) trains a policy π(a | o) to mimic expert demonstrations D = {(o_t, a_t)} without reward signal. The simplest form, behavior cloning (BC), is supervised learning: minimize cross-entropy / MSE between policy and expert action. Despite being the oldest method (Pomerleau 1989, ALVINN), BC has come back as the workhorse of 2023–2026 robot learning, on the back of three changes: (1) datasets large enough that compounding distribution shift no longer dominates, (2) generative-model policies (diffusion, transformer) that capture multimodal expert behavior, and (3) shared cross-embodiment internet-scale pretraining (vision-language models).
The 2024–2026 robot foundation-model wave is the consequence:
- RT-1 (Brohan 2022): 35M params, trained on 130k demos spanning 700+ tasks, collected on a fleet of 13 robots; a robot transformer for Everyday Robots’ mobile manipulator.
- RT-2 (Brohan 2023): co-finetune a PaLM-E or PaLI-X VLM on robot trajectories; output text-tokenized actions. 5B–55B params.
- Open-X-Embodiment / RT-X (2023): 22 embodiments, 527 skills, ~1M trajectories aggregated across 21 institutions; established the cross-embodiment dataset.
- OpenVLA (Kim 2024): 7B param open VLA released by Stanford / TRI / UC Berkeley / Google.
- π0 + π0.5 (Physical Intelligence 2024–2025): 3B param flow-matching VLA, runs at 50 Hz; π0.5 adds generalization beyond training distribution.
- Octo (Berkeley 2024): 27M / 93M open-source generalist diffusion transformer.
- Helix (Figure 2025): humanoid VLA running on Figure 02; dual-system design whose fast control layer runs at 200 Hz.
- GR00T N1 / N1.5 (NVIDIA 2025): humanoid foundation models.
This shift means a robot built today inherits prior demonstrations from 100+ other robots. The architectural question for a new lab is no longer “what’s our policy” but “do we fine-tune OpenVLA or train from scratch?”
Where it sits. Manipulator / mobile base is the embodiment; teleop rigs (ALOHA, GELLO, leader-follower, VR) collect demonstrations; vision encoders (DINOv2, CLIP, SigLIP, ViT) ground the policy; sim augments scarce real demos via co-training; RL fine-tunes the IL policy or replaces it for explore-heavy tasks.
First ask.
- Are demonstrations cheap or expensive? Cheap (kinesthetic / leader-follower teleop, < $10/demo) → collecting task-specific demos for BC is practical. Expensive (scarce expert) → leverage VLA pretraining.
- Is the action distribution multimodal? Pick-and-place from arbitrary clutter → yes; switch to diffusion / discrete-token policies. Unimodal (line following, single-mode dynamics) → BC with MLP / GMM head suffices.
- Single task or family? Single → BC fine-tune. Family with language conditioning → VLA.
- Is open-loop execution acceptable? If the task has tight tolerance (insertion, ironing), pair BC with force feedback or residual RL.
2. First principles
2.1 Behavior cloning (BC)
Given expert dataset D = {(o_t, a_t)} drawn from expert distribution d^π_E, fit π_θ to minimize:
L_BC(θ) = E_{(o,a)~D} [−log π_θ(a | o)] (discrete actions)
L_BC(θ) = E_{(o,a)~D} [||π_θ(o) − a||²] (continuous, deterministic)
Limitations:
- Distribution shift: π_θ may visit states o’ not in D, and there π_θ has no signal. Errors compound as O(T²) over horizon T (Ross 2010).
- Multi-modal experts: if experts choose left or right with equal frequency, MSE-fit policy outputs the mean — straight ahead, often catastrophic.
- Causal confusion (de Haan 2019): policy keys on spurious correlates rather than the true cause, fails OOD.
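The multi-modal limitation can be checked with a few lines of arithmetic. The sketch below uses a made-up dataset for a single state in which the expert steers left (-1) or right (+1) equally often; the function names are illustrative only. It shows that the MSE-optimal action is the mean, an action no expert ever took.

```python
def mse(prediction, targets):
    """Mean squared error of one constant prediction against expert actions."""
    return sum((prediction - a) ** 2 for a in targets) / len(targets)


def best_constant_action(targets):
    """The constant that minimizes MSE is the arithmetic mean of the targets."""
    return sum(targets) / len(targets)


# Hypothetical expert data at one state: left (-1) or right (+1), 50/50.
expert_actions = [-1.0] * 50 + [1.0] * 50

a_bc = best_constant_action(expert_actions)
print(a_bc)                       # 0.0 -> "straight ahead"
print(mse(a_bc, expert_actions))  # 1.0
print(mse(1.0, expert_actions))   # 2.0 -> committing to a real mode scores worse
```

Under MSE, committing to either demonstrated mode is penalized more than averaging them, which is why a unimodal regression head collapses to the mean.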
2.2 DAgger (Ross 2011)
Dataset Aggregation. Iterate:
- Train π_i on aggregated dataset D_i.
- Roll out π_i, query expert for actions at visited states.
- Add (state, expert-action) pairs to D_{i+1} = D_i ∪ {new}.
- Repeat.
Recovers O(T) regret instead of O(T²). Requires online expert queries — costly in the real world but feasible with simulators or with cheap recovery (kinesthetic correction). DAgger and its variants (HG-DAgger, SafeDAgger) are still standard for high-precision tasks where BC fails.
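The loop above can be sketched on a toy problem. Everything here is an assumption chosen for brevity: a 1-D state, a hypothetical proportional-controller expert, and a linear learner fitted by least squares. Because the learner can represent the expert exactly, the example shows the structure of the loop rather than a performance gain.

```python
import random


def expert_action(x):
    """Hypothetical expert: proportional controller that pushes x toward 0."""
    return -0.5 * x


def fit_linear(dataset):
    """Least-squares fit of a = w * x (no bias) to (state, action) pairs."""
    sxx = sum(x * x for x, _ in dataset)
    sxa = sum(x * a for x, a in dataset)
    return sxa / sxx if sxx > 0.0 else 0.0


def rollout_states(w, x0, horizon, rng, noise=0.05):
    """Run the policy a = w * x in a noisy 1-D integrator; return visited states."""
    states, x = [], x0
    for _ in range(horizon):
        states.append(x)
        x = x + w * x + rng.gauss(0.0, noise)
    return states


def dagger(iterations=5, horizon=20, seed=0):
    rng = random.Random(seed)
    # D_0: states visited by the expert itself (w = -0.5 reproduces the expert).
    dataset = [(x, expert_action(x))
               for x in rollout_states(-0.5, 1.0, horizon, rng)]
    w = fit_linear(dataset)
    for _ in range(iterations):
        # Roll out the current learner, then ask the expert to label its states.
        visited = rollout_states(w, rng.uniform(-2.0, 2.0), horizon, rng)
        dataset += [(x, expert_action(x)) for x in visited]
        w = fit_linear(dataset)  # retrain on the aggregated dataset
    return w, len(dataset)


w, n = dagger()
print(w, n)
```

The key design point is that labels come from the expert while states come from the learner, so the training distribution tracks the states the learner actually reaches.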
2.3 Inverse reinforcement learning (IRL)
Rather than copy actions, infer the reward the expert is optimizing, then RL against that reward. Approaches:
- Maximum entropy IRL (Ziebart 2008): assume the expert picks trajectories with probability proportional to exp(reward), so higher-reward trajectories are exponentially more likely.
- Generative adversarial imitation learning (GAIL, Ho 2016): train policy + discriminator simultaneously; discriminator distinguishes expert from policy, policy maximizes “fooling” the discriminator.
- Adversarial Inverse RL (AIRL, Fu 2018): extract a reward function from GAIL discriminator.
IRL is ill-posed (the reward is underdetermined: many rewards fit the same demonstrations) and usually needs an RL inner loop; it has been largely supplanted by direct policy methods since 2020.
2.4 Multi-modal policies: GMM, MDN, diffusion, VQ
When the expert action distribution at a given state is multimodal (e.g., “go around obstacle left or right”), unimodal policies fail. Approaches:
- Gaussian Mixture Model (GMM) head: output K means + covariances + mixing weights. Used in robomimic's BC-RNN-GMM baseline.
- Mixture Density Network (MDN, Bishop 1994): similar; neural-net outputs of GMM params.
- Binned discrete actions (RT-1, RT-2): uniformly discretize each action dimension into N=256 bins; cross-entropy loss; naturally multimodal. Learned vector-quantized (VQ) codebooks are a related variant.
- Diffusion policy (Chi 2023): the policy is a denoising network that iteratively refines noise into action; trained by score-matching. Captures arbitrary action distributions.
- Flow matching (Lipman 2023): a continuous-time variant of diffusion; π0 uses flow matching for speed.
- Categorical autoregressive (RT-2, OpenVLA): tokenize actions; predict token-by-token like a language model.
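The 256-bin discretization used by the tokenized policies above amounts to a pair of encode/decode functions. This is a minimal sketch assuming uniform bins over a known per-dimension range; the function names and the [-1, 1] range are illustrative, not taken from any of the cited codebases.

```python
def discretize(value, low, high, n_bins=256):
    """Map a continuous action value to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # value == high falls into the last bin


def undiscretize(index, low, high, n_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


token = discretize(0.3, -1.0, 1.0)
recovered = undiscretize(token, -1.0, 1.0)
print(token, recovered)
```

The round-trip error is at most half a bin width, (high - low) / (2 * n_bins), which is the resolution the policy gives up in exchange for a categorical (and therefore multimodal) output.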
2.5 Diffusion policy (Chi et al. 2023)
The seminal paper. Action a is treated as a noisy variable; learn ε_θ(a^k, o, k) that predicts the noise added at step k. At inference, start from Gaussian noise a^K, iteratively denoise to a^0 = clean action:
a^{k-1} = a^k - ε_θ(a^k, o, k) * step_size + noise(k)
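To make the denoising loop concrete, here is a small sketch of DDPM-style ancestral sampling for a 1-D action. It is not the Diffusion Policy implementation: the learned network ε_θ is replaced by a closed-form oracle that is exact when the expert's actions are equally likely point masses at a few known modes, there is no observation conditioning, and the linear noise schedule is an arbitrary choice.

```python
import math
import random


def make_schedule(n_steps=50, beta_min=1e-4, beta_max=0.2):
    """Linear beta schedule and the cumulative products alpha_bar_k."""
    betas = [beta_min + (beta_max - beta_min) * i / (n_steps - 1)
             for i in range(n_steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars


def oracle_eps(a_k, alpha_bar, modes):
    """Exact noise prediction for equally likely point-mass modes.

    Stands in for the learned eps_theta(a^k, o, k) in this toy example.
    """
    scale, var = math.sqrt(alpha_bar), 1.0 - alpha_bar
    logits = [-(a_k - scale * m) ** 2 / (2.0 * var) for m in modes]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    a0_hat = sum(w * m for w, m in zip(weights, modes)) / sum(weights)
    return (a_k - scale * a0_hat) / math.sqrt(var)


def sample_action(modes, rng, n_steps=50):
    """Start from Gaussian noise and denoise step by step to a clean action."""
    betas, alpha_bars = make_schedule(n_steps)
    a = rng.gauss(0.0, 1.0)
    for k in range(n_steps - 1, -1, -1):
        eps = oracle_eps(a, alpha_bars[k], modes)
        mean = (a - betas[k] / math.sqrt(1.0 - alpha_bars[k]) * eps) \
            / math.sqrt(1.0 - betas[k])
        noise = rng.gauss(0.0, 1.0) if k > 0 else 0.0  # last step adds no noise
        a = mean + math.sqrt(betas[k]) * noise
    return a


rng = random.Random(0)
samples = [sample_action([-1.0, 1.0], rng) for _ in range(200)]
print(sum(s > 0 for s in samples), "right,", sum(s < 0 for s in samples), "left")
```

Each sample commits to one of the two modes instead of averaging them, which is the property that makes this family of policies attractive for multimodal demonstrations.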
Properties:
- Captures multimodal distributions inherently.
- Conditions on observation o (image + proprioception).
- Outputs an action chunk H steps long, executed open-loop. Receding horizon: predict H actions, execute only the first few, then predict again.
- 100 denoising steps at training; 16 at inference (DDIM); 4–8 with consistency models or flow-matching.
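Performance: the paper reports an average success-rate improvement of 46.9% over prior state-of-the-art methods across its 12 evaluation tasks, Robomimic among them; widely used for ALOHA-style two-arm tasks since 2024.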
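Receding-horizon execution of action chunks can be sketched as a plain control loop. The callables `predict_chunk` and `step_env` are hypothetical placeholders for a trained policy and a robot or simulator interface; the toy lambdas at the bottom exist only to make the example runnable.

```python
def run_receding_horizon(predict_chunk, step_env, obs, total_steps, n_execute):
    """Predict a chunk, execute its first n_execute actions, then re-plan.

    predict_chunk(obs) -> list of actions (length H >= n_execute)
    step_env(obs, action) -> next observation
    Returns the executed actions and the number of policy calls.
    """
    executed, n_calls = [], 0
    while len(executed) < total_steps:
        chunk = predict_chunk(obs)
        n_calls += 1
        for action in chunk[:n_execute]:
            obs = step_env(obs, action)
            executed.append(action)
            if len(executed) >= total_steps:
                break
    return executed, n_calls


# Toy stand-ins: the "policy" repeats one proportional action for H = 16 steps,
# and the "environment" is a 1-D integrator.
toy_policy = lambda obs: [-0.1 * obs] * 16
toy_env = lambda obs, action: obs + action

actions, calls = run_receding_horizon(toy_policy, toy_env, obs=1.0,
                                      total_steps=40, n_execute=8)
print(len(actions), calls)  # 40 5
```

Executing fewer actions per chunk makes the policy more reactive but requires more inference calls; executing more amortizes inference cost at the price of longer open-loop stretches.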
2.6 Action chunking (ACT, Zhao 2023)
ALOHA team’s transformer-based BC variant. Key choices:
- Conditional VAE (CVAE) decoder over action chunks (H = 100 time-steps at 50 Hz).
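- Encoder (used only during training) sees the demonstrated action chunk plus joint positions; decoder sees current observations + style latent (set to the prior mean, zero, at test time).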
- Temporal ensembling: at runtime, average predictions from multiple recent chunks at each timestep — smooths execution.
Cheaper and faster to train than diffusion policy; competitive for single-trajectory tasks. Standard for bimanual ALOHA / Aloha-Mobile / TidyBot-style work.
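Temporal ensembling can be illustrated with scalar actions. The sketch assumes overlapping chunks are kept in a dict keyed by the timestep at which each was predicted, and uses exponential weights exp(-m * i) with i = 0 for the oldest prediction; the value of m and the data layout are assumptions for illustration, not the ACT codebase's structures.

```python
import math


def temporal_ensemble(chunks, t, m=0.01):
    """Weighted average of every stored prediction for timestep t.

    chunks: {start_step: [a_start, a_start+1, ...]} of predicted action chunks.
    Predictions are ordered oldest first; the i-th gets weight exp(-m * i).
    """
    preds = [chunk[t - start]
             for start, chunk in sorted(chunks.items())
             if 0 <= t - start < len(chunk)]
    if not preds:
        raise ValueError(f"no stored chunk covers timestep {t}")
    weights = [math.exp(-m * i) for i in range(len(preds))]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)


# Two overlapping chunks that disagree about timestep 1.
chunks = {0: [1.0, 1.0, 1.0], 1: [3.0, 3.0, 3.0]}
print(temporal_ensemble(chunks, t=0))         # 1.0, only one chunk covers t=0
print(temporal_ensemble(chunks, t=1, m=0.0))  # 2.0, plain average when m = 0
```

With m > 0 the older predictions dominate, so the executed trajectory changes gradually when a fresh chunk disagrees with earlier ones.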
2.7 Vision-Language-Action models (VLAs)
Architecture (RT-2 / OpenVLA / π0 lineage):
- Vision encoder (e.g., DINOv2, SigLIP, CLIP ViT-L) → image tokens.
- Optional language tokenization (T5, LLaMA tokenizer) → text tokens.
- Cross-attention transformer / LLM backbone (PaLI / LLaMA-2 / Mistral) → fused tokens.
- Action head: discrete tokens (RT-2: 256-bin per dim) OR flow-matching head (π0) OR diffusion head (Octo).
Training: co-finetune on (a) internet image-text pairs, (b) robot trajectories with language instructions. The internet data provides semantic grounding (“the red block”, “the blue mug”); robot data provides action grounding.
Inference latency:
- RT-2 55B: 1-3 Hz (slow; needs server GPU).
- OpenVLA 7B: 6 Hz on A100.
- Octo-Base 93M: 50 Hz on consumer GPU.
- π0 3B: 50 Hz (custom flow-matching).
- Helix (Figure): split System 1 (200 Hz reactive) + System 2 (5-10 Hz reasoning) on humanoid onboard compute.
2.8 Cross-embodiment learning
Open-X-Embodiment (Padalkar 2023) showed that pooling demonstrations across 22 robot embodiments (Franka, UR5, xArm, ALOHA, Spot, Stretch, Fetch, etc.) gives positive transfer — training on the mixture improves per-embodiment performance over training only on that embodiment’s data. This was non-obvious in 2022; “robot data” was thought to be too embodiment-specific to share. Cross-embodiment is the foundation of every modern VLA.
3. Practical math — dataset sizes, training compute, inference latency
3.1 Dataset-scaling intuition
| Demonstrations per task | Result |
|---|---|
| 1–10 | Single-shot / few-shot learning; only viable with VLA fine-tune |
| 50–500 | BC works for single-task on narrow embodiment |
| 1k–10k | BC + GMM/diffusion robust to environment variation |
| 10k–100k | Cross-task generalization within embodiment |
| 100k–1M | RT-1 / RT-2 / OpenVLA territory; cross-task + cross-embodiment |
| 1M–10M | π0 / RT-X / GR00T scale; broad generalization |
3.2 Worked example: data budget for a tabletop sorting task
Goal: sort 5 object types from clutter using a 6-DoF arm + parallel jaw.
Approach A: train BC from scratch.
- 500 demos × 30 s each × 50 Hz = ~750k state-action pairs.
- Collected via leader-follower teleop at ~$5/demo (operator + setup).
- Cost: ~$2500 in data; ~1 hour training on RTX 4090.
Approach B: fine-tune OpenVLA.
- 50 demos for the specific task.
- Cost: ~$250 in data; ~4 hours training on RTX 4090 with LoRA (rank 32).
- Result generally better, thanks to the pretrained vision-language backbone and the Open-X-Embodiment robot data OpenVLA was trained on.
Approach C: prompt π0 zero-shot.
- 0 demos; just text instruction “sort the objects by color into the bins.”
- Cost: inference compute only (π0 weights are open).
- Result: works for common objects; degrades on novel objects, which π0.5 is designed to address.
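The arithmetic behind the data budgets is easy to re-run for other settings. In this sketch the helper name is made up, and the 30 s demo length applied to Approach B is an assumption (the text only states it for Approach A).

```python
def data_budget(n_demos, seconds_per_demo, control_hz, dollars_per_demo):
    """Return (state-action pairs, data cost in dollars) for a demo campaign."""
    pairs = n_demos * seconds_per_demo * control_hz
    cost = n_demos * dollars_per_demo
    return pairs, cost


# Approach A: BC from scratch.
print(data_budget(500, 30, 50, 5))  # (750000, 2500)
# Approach B: OpenVLA fine-tune (30 s per demo is assumed here).
print(data_budget(50, 30, 50, 5))   # (75000, 250)
```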
3.3 Compounding error bound (Ross 2010)
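For BC over an episode horizon T, with per-step costs in [0, 1] and a policy that deviates from the expert with probability ε per step on the expert's state distribution:
J(π_θ) ≤ J(π_E) + T² ε
J is the expected total cost, so T² ε bounds the extra cost relative to the expert. At T = 200 (4 s @ 50 Hz) and ε = 0.05 the bound is 200² × 0.05 = 2000, which exceeds the largest possible total cost (T = 200): BC offers no useful guarantee at this horizon. With DAgger the bound is linear in T: 200 × 0.05 = 10. Hence DAgger’s appeal for long-horizon tasks.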
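One way to see where the quadratic dependence comes from is a deliberately pessimistic toy model, assumed here for illustration only: at each step the policy errs with probability ε, a single error takes it off-distribution for the rest of the episode, and every off-distribution step costs 1. The expected cost then has a closed form that can be compared with the T²ε and Tε bounds.

```python
def expected_offdist_cost(horizon, eps):
    """Expected number of off-distribution steps in the toy model.

    The policy is off-distribution at step t with probability 1 - (1 - eps)^t.
    """
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, horizon + 1))


def bc_bound(horizon, eps):
    """Quadratic behavior-cloning bound on excess cost."""
    return horizon * horizon * eps


def dagger_bound(horizon, eps):
    """Linear-in-horizon bound of the DAgger type."""
    return horizon * eps


for horizon, eps in [(20, 0.001), (200, 0.05)]:
    print(horizon, eps,
          round(expected_offdist_cost(horizon, eps), 3),
          bc_bound(horizon, eps),
          dagger_bound(horizon, eps))
```

For small εT the toy cost is close to εT²/2, so it grows quadratically with horizon; once εT is large it saturates near T, which is why the quadratic bound stops being informative for long episodes.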
3.4 Diffusion policy denoising-step budget
Standard DDPM: 1000 training steps, 100 inference. DDIM: 50 inference. Distilled diffusion (e.g., Consistency Models): 2–4 inference. Flow matching: 4–8 ODE solver steps.
At 50 Hz control rate with 4 denoising steps × 4 ms each: 16 ms total — feasible on consumer GPU. RT-2 at 1–3 Hz cannot be the inner loop; needs a fast layer (Helix’s System 1) below it.
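The latency argument reduces to comparing inference time against the time covered by the actions each inference produces. The helper names below are illustrative; the 17-action chunk is simply the smallest chunk that lets a 3 Hz model keep up with a 50 Hz loop under this simplified accounting, which ignores communication and pre-processing overhead.

```python
def inference_latency_ms(n_denoise_steps, ms_per_step):
    """Total policy latency when each denoising step costs ms_per_step."""
    return n_denoise_steps * ms_per_step


def fits_control_loop(latency_ms, control_hz, actions_per_inference=1):
    """True if one inference finishes within the time its actions cover."""
    budget_ms = 1000.0 / control_hz * actions_per_inference
    return latency_ms <= budget_ms


diffusion_ms = inference_latency_ms(4, 4.0)                           # 16 ms
print(fits_control_loop(diffusion_ms, control_hz=50))                 # True
print(fits_control_loop(1000.0 / 3.0, control_hz=50))                 # False
print(fits_control_loop(1000.0 / 3.0, 50, actions_per_inference=17))  # True
```

Long chunks buy time for a slow model but lengthen the open-loop stretch, which is the trade-off a fast reactive layer underneath is meant to remove.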
3.5 LoRA fine-tuning footprint
LoRA (Hu 2021): freeze base model, learn rank-r updates to attention matrices. For OpenVLA 7B base:
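- Full fine-tune: 14 GB of bf16 weights, plus gradients and two Adam moment buffers of the same size: roughly 4 × 14 = 56 GB of VRAM before activations, and more if optimizer state is kept in fp32.
- LoRA rank 32: ~150 MB trainable params; only these adapters carry gradients and optimizer state, so VRAM is dominated by the frozen 14 GB base plus activations. Fitting a 24 GB RTX 4090 typically requires a small batch size or a quantized base model.
This is why VLA adoption has been so fast: labs can fine-tune on their own tasks without an H100 cluster.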
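The adapter size can be estimated from layer shapes alone: a rank-r adapter on a d_in × d_out matrix adds r × (d_in + d_out) parameters. The sketch assumes the published Llama-2-7B language-backbone shapes (32 layers, hidden size 4096, MLP size 11008) and ignores the vision encoder and projector, so it is a lower bound on a full VLA adapter rather than a figure from the OpenVLA paper.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed Llama-2-7B backbone shapes (language model only).
N_LAYERS, D_MODEL, D_FF = 32, 4096, 11008
ATTENTION = [(D_MODEL, D_MODEL)] * 4  # q, k, v, o projections
MLP = [(D_MODEL, D_FF), (D_MODEL, D_FF), (D_FF, D_MODEL)]  # gate, up, down


def total_lora_params(shapes, rank, n_layers=N_LAYERS):
    """Adapter parameters when every listed matrix in every layer gets LoRA."""
    return n_layers * sum(lora_params(i, o, rank) for i, o in shapes)


def megabytes(n_params, bytes_per_param=2):
    """Storage in MB, assuming bf16 (2 bytes per parameter) by default."""
    return n_params * bytes_per_param / 1e6


attn_only = total_lora_params(ATTENTION, rank=32)
all_linear = total_lora_params(ATTENTION + MLP, rank=32)
print(attn_only, round(megabytes(attn_only), 1))
print(all_linear, round(megabytes(all_linear), 1))
```

At rank 32 this gives about 34M parameters for attention-only adapters and about 80M when the MLP matrices are adapted too, both around 1% or less of the 7B base.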
4. Design heuristics
- Don’t BC from raw images alone. Provide proprioception (joint state, gripper state) explicitly. Vision should capture the scene; proprioception should capture the robot.
- Image conditioning needs diversity. If you train with one lighting / background, the policy memorizes pixel patterns. Diversify lighting, distractor objects, table textures.
- Use multi-camera input. Wrist camera + 1–2 external. Wrist gives close-up at the gripper; external gives spatial context. ALOHA: 4 cameras; Stretch: wrist + head + base; UR5 + Realsense: wrist + side standard.
- Action chunking is essential for tasks with periodic patterns (pouring, wiping). Predict 1 s ahead, execute, refresh.
- Use absolute actions or end-effector pose targets, not joint velocities. Position-style targets are generally easier for the policy to learn and do not accumulate integration drift.
- Filter teleop demos. Bad demos teach bad policies. Reject episodes with timeouts, collisions, abnormal force readings.
- Co-train across tasks when possible. Even 10% of related tasks in the mix improves convergence on the target task.
- Reference frame matters. End-effector-relative actions transfer better across embodiments than world-frame actions.
- Predicting next-image is overkill. Diffusion-Policy + ACT predict actions only; full visual world models (Dreamer V3, GAIA) are research, not deployment.
- Don’t expect BC to recover from failure. If teleop never showed how to recover from a dropped object, the policy won’t know. Either include recovery demos or layer reactive primitives below.
- Use language instructions as conditioning even if not strictly needed. It often improves performance through implicit task-decomposition.
- Benchmark against the same teleop expert. Operator-specific style produces operator-specific success rates; controlled comparison requires the same operator on each method.
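The demo-filtering heuristic can be expressed as a small rejection rule. The `Episode` fields, the 30 N force limit, and the "within 2× of the median length" test are all hypothetical choices for illustration; real thresholds depend on the robot and task.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Episode:
    n_steps: int
    timed_out: bool
    collided: bool
    max_force_n: float  # peak measured force in newtons


def filter_demos(episodes, max_force_n=30.0, length_ratio=2.0):
    """Drop episodes with timeouts, collisions, force spikes, or odd lengths."""
    if not episodes:
        return []
    typical = median(e.n_steps for e in episodes)
    kept = []
    for e in episodes:
        abnormal_length = (e.n_steps > length_ratio * typical
                           or e.n_steps < typical / length_ratio)
        if e.timed_out or e.collided or e.max_force_n > max_force_n \
                or abnormal_length:
            continue
        kept.append(e)
    return kept


demos = [
    Episode(100, False, False, 5.0),
    Episode(110, False, False, 6.0),
    Episode(90, False, False, 4.0),
    Episode(100, True, False, 5.0),   # timeout
    Episode(500, False, False, 5.0),  # abnormally long
]
print(len(filter_demos(demos)))  # 3
```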
5. Components & sourcing — teleop rigs and data engines
5.1 Teleoperation rigs
| Rig | Embodiment | Approx cost | Notes |
|---|---|---|---|
| ALOHA / ALOHA 2 (Stanford) | Dual-arm ViperX 300 | ~$30k DIY | Open-source; leader-follower; standard for bimanual research |
| Mobile ALOHA (Stanford) | ALOHA + Tracer base | ~$32k DIY | Adds mobile base for in-home tasks |
| GELLO (UC Berkeley) | Any 6-DoF arm | ~$200 leader rig | Open-source 3D-printable; works with Franka, UR5, xArm |
| AnyTeleop (Qin 2023) | Various | Software | Vision-based hand tracking; free |
| HandyMOVE / OpenTeach (Iyer 2024) | VR-based | $400 (Quest 3) | Meta Quest teleop with retargeting |
| Wandelbots TracePen | Industrial arm | Commercial | Stylus-based programming |
| Boston Dynamics Spot Tablet | Spot quadruped | Bundled | Game-style controller |
5.2 Datasets
| Dataset | Trajectories | Embodiments | Tasks | License |
|---|---|---|---|---|
| Open-X-Embodiment (2023) | ~1M | 22 | 527 | Apache 2.0 |
| DROID (2024, Stanford+TRI) | 76k | Franka Panda | 564 scenes, 86 tasks | MIT |
| BridgeData V2 (Walke 2023) | 60k | WidowX 250 | 14 task families | MIT |
| RT-1 dataset (Brohan 2022) | 130k | Everyday Robots | 700+ tasks | Restricted |
| RH20T (Fang 2023) | 110k | Franka, UR5 | 147 | MIT |
| MimicGen (Mandlekar 2023) | 10k–100k | Various sim | 10+ | MIT |
| Stanford EgoMimic (2024) | 2.8k | Aloha 2 | Manipulation | MIT |
5.3 Open-source policy implementations
| Implementation | Reference paper / org | Use |
|---|---|---|
| LeRobot (Hugging Face 2024) | Various | Production-style training/eval pipeline; supports ACT, Diffusion Policy, π0 |
| diffusion_policy (Chi 2023) | Columbia | Reference Diffusion Policy implementation |
| openpi (Physical Intelligence 2025) | PI | Inference code for π0 / π0.5; open weights |
| octo (Berkeley 2024) | UC Berkeley | Open generalist; small (27M) / base (93M) checkpoints |
| openvla (Stanford TRI 2024) | Stanford | Open VLA 7B; LoRA fine-tuning recipe |
| ACT / aloha (Stanford 2023) | Stanford | ACT reference for ALOHA |
| robomimic (Stanford 2021) | Stanford ARISE | Benchmark suite + baselines |
5.4 Foundation model APIs / checkpoints
| Model | Provider | Access | Notes |
|---|---|---|---|
| π0, π0.5 | Physical Intelligence | Open weights | Flow-matching VLA |
| Octo | UC Berkeley | Open weights | 27M / 93M params |
| OpenVLA-7B | Stanford TRI | Open weights | Llama2-7B + DINOv2/SigLIP |
| RDT-1B | Tsinghua | Open weights | Bimanual diffusion transformer |
| GR00T N1 / N1.5 | NVIDIA | Open weights (research) | Humanoid foundation model |
| RT-2 | Google DeepMind | Closed | Internal only |
| Helix | Figure AI | Closed | Onboard Figure 02 humanoid |
| Gemini Robotics | Google DeepMind | Closed (2025 announcement) | Multimodal Gemini-derived |
6. Reference data
6.1 Method comparison
| Method | Multi-modal | Long-horizon | Compute | Notes |
|---|---|---|---|---|
| BC + MLP | No | Bad | Cheap | Toy baseline |
| BC + GMM | Yes (K modes) | OK | Cheap | Standard pre-2023 |
| DAgger | No | Good | Expert queries | Limited by expert availability |
| ACT (CVAE + transformer) | Yes | Good | Moderate | ALOHA standard |
| Diffusion Policy | Yes (any) | Excellent | Moderate (16 denoising) | 2024+ default |
| Flow Matching | Yes | Excellent | Cheap (4-8 steps) | π0 |
| RT-2 / OpenVLA | Yes (tokenized) | Excellent + language | Heavy (7B+ params) | VLA standard |
| GAIL / AIRL | Yes (implicitly) | Variable | Heavy (RL inner loop) | Mostly research |
6.2 Notable IL milestones
| Year | Project | Demos | Key idea |
|---|---|---|---|
| 1989 | ALVINN (Pomerleau) | 1k frames | First BC on a vehicle (CMU NAVLAB) |
| 1999 | Schaal et al. DMPs | 10s | Dynamic Movement Primitives — parameterized motions |
| 2011 | DAgger (Ross) | online | Linear-in-T regret bound |
| 2016 | GAIL (Ho, Ermon) | mixed | Adversarial IL |
| 2017 | One-shot IL (Duan, Finn) | 1 | Meta-IL framework |
| 2018 | DexPilot (NVIDIA) | teleop | Hand teleop → dexterous policies |
| 2020 | LSTM-GMM Robomimic baseline | 200 | Strong BC baseline |
| 2022 | RT-1 (DeepMind) | 130k | First scaled robot transformer |
| 2023 | ACT (Stanford) | 50/task | Bimanual ALOHA |
| 2023 | Diffusion Policy (Chi et al.) | 100s | Multimodal action distributions |
| 2023 | RT-2 / RT-X / Open-X-Embodiment | 1M | VLA + cross-embodiment |
| 2024 | OpenVLA, Octo, π0 | mixed | Open foundation models |
| 2025 | Helix (Figure) | proprietary | Humanoid VLA, dual-system 200 Hz |
| 2025 | π0.5, GR00T N1.5 | proprietary | Generalization-focused next-gen |
6.3 Practical training recipes
ACT for ALOHA bimanual:
- 50 demos per task
- 50 Hz control rate
- Action chunk H = 100 (2 s lookahead)
- Vision: ResNet-18 backbone per camera, 4 cameras
- Style latent z: dim 32, KL weight 10
- Train: 8000 epochs, lr 1e-5, batch 8
- Hardware: single RTX 4090, ~1 day
Diffusion Policy:
- 200 demos per task
- 10 Hz observation, predict 16-step chunk at 10 Hz
- Vision: ResNet-18 (separate per camera)
- 1D-UNet over actions, 100 train denoise steps
- DDIM 16 inference steps
- Train: 3000 epochs, lr 1e-4, EMA
- Hardware: single RTX 4090, ~1 day
OpenVLA LoRA fine-tune:
- 50–200 demos
- LoRA rank 32, alpha 16
- lr 5e-4 for LoRA, frozen base
- Image: DINOv2 + SigLIP encoders with fused features
- Action head: discrete 256-bin per dim, 7-dim action
- Train: 30k–100k steps, batch 16
- Hardware: single A100 / 4090 with bf16
7. Failure modes & debugging
- Mean-mode collapse — unimodal BC outputs the average of bimodal expert. Detect: action looks like always-straight in obstacle scenarios. Fix: switch to GMM, diffusion, or VQ-discrete head.
- Latency-induced drift — policy trained at 50 Hz, deployed at 30 Hz. Detect: errors grow with horizon. Fix: match rates or include latency in training.
- Causal confusion — policy keys on hand-pose-in-image rather than object-in-world. Detect: hide the robot from the image — does policy still work? Fix: more demos with varied robot poses, image augmentation removing robot.
- Distribution shift on novel objects — works on training objects, fails on a slightly different mug. Fix: more object variety; or VLA pretraining; or test-time augmentation.
- Failure to recover — drops object, freezes. Fix: include recovery demos; layer with classical reach-to-grasp primitives; or train reactive RL on top.
- Teleop artifact replay — policy learned to pause at certain points because the demonstrator paused. Fix: filter out idle frames; normalize by velocity.
- Bimodal modality flicker — Diffusion / VLA policy oscillates between two valid actions. Fix: temporal smoothing / receding horizon execution with chunk size > 1; ensemble across recent chunks.
- Reward-hacking equivalent in IL: the policy mimics surface statistics, not intent. E.g., a “put the cup on the saucer” policy that hovers over the saucer without releasing. Fix: better demonstrations + force signal + explicit success conditions.
- OOD lighting — works under demo lighting, fails under indoor LED. Fix: domain randomization in image augmentation; ImageNet-pretrained vision encoder; or VLA priors.
- Compute-budget mismatch — VLA trained at server, deployed on robot edge — too slow. Fix: distill to smaller policy; or use dual-system (slow VLA generates plans, fast policy executes).
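The idle-frame fix for teleop artifact replay can be sketched as a simple filter. It assumes actions are absolute position targets stored as lists of floats, and the threshold `min_delta` is a hypothetical value; note that dropping frames changes the effective time step, so this suits position-target policies better than velocity-based ones.

```python
import math


def drop_idle_frames(observations, actions, min_delta=1e-3):
    """Keep frame t only if its action moved at least min_delta (Euclidean
    norm) away from the last kept action. The first frame is always kept."""
    if not actions:
        return [], []
    kept_obs, kept_act = [observations[0]], [actions[0]]
    for obs, act in zip(observations[1:], actions[1:]):
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(act, kept_act[-1])))
        if delta >= min_delta:
            kept_obs.append(obs)
            kept_act.append(act)
    return kept_obs, kept_act


# The demonstrator pauses at the start and again mid-trajectory.
acts = [[0.0], [0.0], [0.0], [0.1], [0.1], [0.2]]
obs = list(range(len(acts)))  # frame indices stand in for observations
kept_obs, kept_act = drop_idle_frames(obs, acts)
print(kept_obs)  # [0, 3, 5]
```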
7.1 Quick-debug checklist for new BC policy
1. Inspect demo data:
- Are demos consistent (same operator style)?
- Action distribution per state — is it multimodal?
- Outlier filter: drop episodes with timeouts, force-violations, abnormal episode length.
2. Sanity-train on a small subset (100 transitions):
- If it overfits easily → architecture is fine; data may be insufficient.
- If it can't overfit → bug in data pipeline or model.
3. Open-loop replay:
- Predict actions from observations in demo; compare to recorded actions.
- Median error should be < 5% of action range.
4. Closed-loop sim eval:
- Run policy in sim with demo's initial conditions; record success.
- Should match teleop demonstrator (within 10%).
5. Closed-loop real:
- Try in a controlled setup matching training distribution.
- Note systematic biases (always lifts too high, always misses left).
- Diagnose: data, model architecture, or loss/objective.
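The open-loop replay check in step 3 can be computed per action dimension. The function name and the list-of-lists data layout are assumptions; the 5% threshold is the one quoted in the checklist.

```python
from statistics import median


def open_loop_error_pct(predicted, recorded, action_low, action_high):
    """Median absolute error per action dimension, as a % of that dimension's range.

    predicted, recorded: lists of equal-length action vectors (lists of floats).
    action_low, action_high: per-dimension action limits.
    """
    result = []
    for d in range(len(action_low)):
        errors = [abs(p[d] - r[d]) for p, r in zip(predicted, recorded)]
        span = action_high[d] - action_low[d]
        result.append(100.0 * median(errors) / span)
    return result


predicted = [[0.0], [0.1], [0.2]]
recorded = [[0.0], [0.0], [0.0]]
pct = open_loop_error_pct(predicted, recorded, [-1.0], [1.0])
print(pct)                         # about [5.0]
print(all(e < 5.0 for e in pct))   # False: right at the threshold, not below it
```

A low open-loop error is necessary but not sufficient: it says nothing about how the policy behaves once its own actions move it off the demonstrated states.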
7.2 Comparative metrics
Success rate:
- Most common headline metric.
- Subject to evaluator bias; specify protocol (number of trials, end-condition).
Task completion time:
- Vs demonstrator average. Often higher for IL (more conservative).
Smoothness (action jerk):
- Lower jerk = smoother; high jerk indicates oscillation.
Distance to demo trajectory:
- L2 distance from policy trajectory to nearest demo trajectory in pose space.
Generalization gap:
- Train tasks vs held-out test tasks.
Robustness:
- Performance under disturbance (push, occlusion, lighting change).
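Two of these metrics are easy to compute consistently across methods. The sketch below is a suggestion rather than part of any cited evaluation protocol: it reports success rate with a Wilson score interval (so the number of trials is reflected in the headline figure) and measures smoothness as the mean absolute third finite difference of a 1-D position trace.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def mean_abs_jerk(positions, dt):
    """Mean absolute jerk from third finite differences of a 1-D trajectory."""
    if len(positions) < 4:
        raise ValueError("need at least 4 samples")
    jerks = [(positions[i + 3] - 3 * positions[i + 2]
              + 3 * positions[i + 1] - positions[i]) / dt ** 3
             for i in range(len(positions) - 3)]
    return sum(abs(j) for j in jerks) / len(jerks)


low, high = wilson_interval(8, 10)
print(round(low, 2), round(high, 2))       # 0.49 0.94
print(mean_abs_jerk([0, 1, 8, 27, 64], 1)) # 6.0 for x = t^3
```

With only 10 trials, an 8/10 result is compatible with true success rates from about one half to over 90%, which is why the trial count belongs next to any reported success rate.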
8. Case studies
8.1 ALOHA folding shirts (Zhao 2023, Stanford)
Setup: dual ViperX 300 6-DoF arms in leader-follower; 4 cameras (2 wrist, 1 chest, 1 environment). Task: pick up shirt from table, fold, place. ~50 demos per shirt configuration. Method: ACT with H=100 action chunk at 50 Hz, ResNet-18 vision, CVAE decoder with z-latent. Result: ~90% success on training shirts, ~70% on novel shirts (different size/color). Public demo at NeurIPS 2023 attracted significant attention.
8.2 Mobile ALOHA cooking (Fu 2024, Stanford)
Mobile ALOHA: ALOHA + Tracer base. Bimanual + driving. Task: cook shrimp, deliver food to person, wash dishes. ~50 demos per skill, then language-conditioned switching. Method: ACT + co-training with static ALOHA dataset. Result: end-to-end 30-minute task completion in 8/10 attempts at Stanford lab. Cited as proof that mobile bimanual long-horizon manipulation is feasible with current IL stack.
8.3 Diffusion Policy on robomimic (Chi 2023)
Benchmark: the robomimic tasks Lift, Can, Square, Transport, and Tool Hang, with image observations. Compared: BC-RNN-GMM (prior SOTA) vs Diffusion Policy. Result: consistent success-rate gains over the baselines; across the paper's full evaluation (12 tasks from 4 benchmarks) the reported average improvement is 46.9%. Canonical 2023 result establishing diffusion as the new default.
8.4 π0 zero-shot generalization (Physical Intelligence 2024)
Model: π0 with 3B params, flow-matching VLA. Training data: ~1M demonstrations across 7 different robot platforms (Franka, UR5, ALOHA, bimanual, mobile manipulator). Inference: 50 Hz on single H100. Public demos: laundry folding, table bussing, packing groceries, on robots that contributed zero training data (e.g., 1X EVE after model already trained on Franka data). Significance: first commercial-quality VLA released open-weights; spawned wave of integrators in 2024–2025.
8.5 Figure Helix on Figure 02 (Feb 2025)
Architecture: dual-system. System 1 (S1) — fast (200 Hz), reactive, low-level motor control via a small ViT-style network. System 2 (S2) — slow (5-10 Hz), reasoning, large VLM that issues sub-goals to S1. Hardware: Figure 02 humanoid, 35 DoF, onboard NVIDIA AGX Orin. Demo task: two humanoids cooperate to put groceries away in a kitchen. Continuous voice instruction; one humanoid hands items to the other. Result: first public demo of a humanoid VLA running fully onboard at production rates. Sparked the “humanoid foundation model” deployment wave at Figure, 1X, Apptronik, NVIDIA GR00T, and the Tesla Optimus claims.
8.6 NVIDIA GR00T (2025)
GR00T N1 + N1.5 (Generalist Robot 00 Technology) — NVIDIA’s humanoid foundation model lineage.
- Architecture: cross-embodiment foundation model trained on a large mixture of human-video data, mocap, sim trajectories, and real-robot demos.
- Embodiments: Unitree H1/G1, Apptronik Apollo, 1X NEO, Fourier GR-1, Boston Dynamics electric Atlas (partner integrations).
- Distribution: open weights for research, commercial license for OEMs.
- Inference: optimized for NVIDIA Jetson AGX Thor (next-gen, ships 2025).
GR00T’s role: NVIDIA aims to be the “Android of humanoids” — sell the compute (Thor / Orin) + foundation model + Isaac Lab simulator stack to humanoid OEMs who lack the in-house RL/IL talent.
8.7 1X NEO Gamma at homes (2024–2025)
1X Technologies, Norwegian humanoid maker. NEO Gamma (2024): cable-driven (compliant) soft drivetrain for human-safe forces. Currently shipping to in-home pilot customers (Bay Area early adopters) with autonomy + remote operator backup. Tasks demonstrated: opening doors, picking up items, fetching from refrigerator, conversational assistance. Combines: VLA backbone (1X in-house) + diffusion-style action head + remote-operator handoff for failures. Significant for being the first humanoid pilot with paid end-user deployment in homes (vs. industrial).
8.8 Open-X-Embodiment / RT-X (2023)
21 institutions pooled their robot datasets into Open-X-Embodiment: ~1M trajectories, 22 embodiments, 527 skills. RT-1-X is RT-1 trained on the union; RT-2-X does the same with an RT-2 backbone. Result: RT-1-X improved over each lab’s per-lab RT-1 by ~50% on average. Established cross-embodiment training as the standard recipe, which OpenVLA, Octo, and π0 then adopted.
8.9 LeRobot — Hugging Face’s open IL stack (2024)
LeRobot (Hugging Face, late 2024) released an end-to-end open-source pipeline:
- Common dataset format (HuggingFace Hub-hosted).
- Reference implementations of ACT, Diffusion Policy, π0, π0-FAST.
- Training + eval scripts.
- Affordable Koch arm (~1500 follower) for hands-on labs.
- 100 community datasets posted within first 6 months.
Significance: lowered the entry barrier for university labs + hobbyists. Most academic IL papers in 2025+ benchmark against LeRobot baselines.
8.10 Diffusion Policy on bin picking (Mahler / Berkeley AUTOLAB 2024)
Combined Dex-Net 4.0 grasp-selection prior with Diffusion Policy refinement:
- Dex-Net proposes top-K grasp candidates (depth-based).
- Diffusion Policy refines the approach trajectory conditional on the candidate grasp.
- Per-bin success rate ~93% on cluttered real objects (~10% improvement over Dex-Net alone).
- Trained on 50k synthetic demos + 500 real refinement demos.
Demonstrates the “model-based prior + learned refinement” pattern that’s becoming standard for IL deployment in tight-tolerance industrial tasks.
Adjacent
- realtime-embedded
- digital-control
- adaptive-control
- microcontrollers
- signal-processing-dsp
- fpga-design
- ergonomics-human-factors
- quality-systems-iso9001
Citations
- Pomerleau D.A. (1989) “ALVINN: An Autonomous Land Vehicle In a Neural Network.” NIPS ‘88.
- Ross S., Bagnell D. (2010) “Efficient Reductions for Imitation Learning.” AISTATS.
- Ross S., Gordon G., Bagnell D. (2011) “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning” (DAgger). AISTATS.
- Ziebart B.D., Maas A., Bagnell J.A., Dey A.K. (2008) “Maximum Entropy Inverse Reinforcement Learning.” AAAI.
- Ho J., Ermon S. (2016) “Generative Adversarial Imitation Learning” (GAIL). NeurIPS.
- Fu J., Luo K., Levine S. (2018) “Learning Robust Rewards with Adversarial Inverse Reinforcement Learning” (AIRL). ICLR.
- de Haan P., Jayaraman D., Levine S. (2019) “Causal Confusion in Imitation Learning.” NeurIPS.
- Brohan A. et al. (2022) “RT-1: Robotics Transformer for Real-World Control at Scale.” arXiv:2212.06817.
- Brohan A. et al. (2023) “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv:2307.15818.
- Padalkar A. et al. (2023) “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” arXiv:2310.08864.
- Kim M.J. et al. (2024) “OpenVLA: An Open-Source Vision-Language-Action Model.” arXiv:2406.09246.
- Physical Intelligence (2024) “π0: A Vision-Language-Action Flow Model for General Robot Control.” physicalintelligence.company.
- Octo Team (2024) “Octo: An Open-Source Generalist Robot Policy.” arXiv:2405.12213.
- Chi C., Feng S., Du Y., et al. (2023) “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS.
- Zhao T., Kumar V., Levine S., Finn C. (2023) “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware” (ACT). RSS.
- Fu Z. et al. (2024) “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation.” arXiv:2401.02117.
- Lipman Y. et al. (2023) “Flow Matching for Generative Modeling.” ICLR.
- Hu E.J. et al. (2021) “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv:2106.09685.
- Walke H. et al. (2023) “BridgeData V2: A Dataset for Robot Learning at Scale.” CoRL.
- Khazatsky A. et al. (2024) “DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset.” arXiv:2403.12945.
- NVIDIA (2025) “GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.” nvidia.com.
- Figure AI (2025) “Helix: A Vision-Language-Action Model for Humanoid Robots.” figure.ai blog.