Imitation Learning — LfD, Diffusion Policies, VLA Models for Robotics

Training robots from demonstration rather than reward: behavior cloning (BC), DAgger, inverse RL, generative-model policies (Diffusion Policy, RDT-1B, ACT), and the 2022–2026 generation of vision-language-action (VLA) models (RT-1, RT-2, RT-X, OpenVLA, π0, π0.5, Octo, Helix). These now constitute the dominant paradigm for general-purpose manipulation: a single network trained on tens of thousands to roughly a million robot demonstrations (Open-X-Embodiment, DROID, BridgeData V2) acts across embodiments and tasks with text or image goals.

1. At a glance

Imitation learning (IL) trains a policy π(a | o) to mimic expert demonstrations D = {(o_t, a_t)} without reward signal. The simplest form, behavior cloning (BC), is supervised learning: minimize cross-entropy / MSE between policy and expert action. Despite being the oldest method (Pomerleau 1989, ALVINN), BC has come back as the workhorse of 2023–2026 robot learning, on the back of three changes: (1) datasets large enough that compounding distribution shift no longer dominates, (2) generative-model policies (diffusion, transformer) that capture multimodal expert behavior, and (3) shared cross-embodiment internet-scale pretraining (vision-language models).

The 2024–2026 robot foundation-model wave is the consequence:

RT-1 (Brohan 2022): 35M params, robot transformer for Everyday Robots’ mobile manipulator, trained on ~130k demos covering 700+ tasks collected with a fleet of 13 robots.
RT-2 (Brohan 2023): co-finetune a PaLM-E or PaLI-X VLM on robot trajectories; output text-tokenized actions. 5B–55B params.
Open-X-Embodiment / RT-X (2023): 22 embodiments, 527 skills, ~1M trajectories aggregated across 21 institutions; established the cross-embodiment dataset.
OpenVLA (Kim 2024): 7B param open VLA released by Stanford / TRI / UC Berkeley / Google.
π0 + π0.5 (Physical Intelligence 2024–2025): 3B param flow-matching VLA, runs at 50 Hz; π0.5 adds generalization beyond training distribution.
Octo (Berkeley 2024): 27M / 93M open-source generalist diffusion transformer.
Helix (Figure 2025): humanoid VLA running on Figure 02; dual-system design whose reactive layer runs at 200 Hz.
GR00T N1 / N1.5 (NVIDIA 2025): humanoid foundation models.

This shift means a robot built today inherits prior demonstrations from 100+ other robots. The architectural question for a new lab is no longer “what’s our policy” but “do we fine-tune OpenVLA or train from scratch?”

Where it sits. Manipulator / mobile base is the embodiment; teleop rigs (ALOHA, GELLO, leader-follower, VR) collect demonstrations; vision encoders (DINOv2, CLIP, SigLIP, ViT) ground the policy; sim augments scarce real demos via co-training; RL fine-tunes the IL policy or replaces it for explore-heavy tasks.

First ask. Are demonstrations cheap or expensive? Cheap (kinesthetic / leader-follower teleop, < $1/demo) → BC scales. Expensive (haptic teleop at $10/demo, scarce expert) → leverage VLA pretraining. Is the action distribution multimodal? Pick-and-place from arbitrary clutter → yes; switch to diffusion / VQ-discrete-token policies. Unimodal (line following, single-mode dynamics) → BC with MLP / GMM head suffices. Single task or family? Single → BC fine-tune. Family with language conditioning → VLA. Open-loop fine? If the task has tight tolerance (insertion, ironing), pair BC with force feedback or residual RL.

2. First principles

2.1 Behavior cloning (BC)

Given expert dataset D = {(o_t, a_t)} drawn from expert distribution d^π_E, fit π_θ to minimize:

L_BC(θ) = E_{(o,a)~D} [−log π_θ(a | o)]   (discrete actions)
L_BC(θ) = E_{(o,a)~D} [||π_θ(o) − a||²]   (continuous, deterministic)

Limitations:

Distribution shift: π_θ may visit states o’ not in D, and there π_θ has no signal. Errors compound as O(T²) over horizon T (Ross 2010).
Multi-modal experts: if experts choose left or right with equal frequency, MSE-fit policy outputs the mean — straight ahead, often catastrophic.
Causal confusion (de Haan 2019): policy keys on spurious correlates rather than the true cause, fails OOD.
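
The multi-modal limitation is easy to check numerically. The sketch below assumes a hypothetical 1-D steering action where, at one fixed observation, the expert goes left (-1) or right (+1) equally often. The constant that minimizes MSE is the sample mean, which lands near 0, an action the expert never takes.

```python
import random

rng = random.Random(0)
# Hypothetical expert data: same observation, two equally likely actions.
expert_actions = [rng.choice([-1.0, 1.0]) for _ in range(1000)]

def mse(prediction, actions):
    """Mean squared error of a constant prediction against expert actions."""
    return sum((prediction - a) ** 2 for a in actions) / len(actions)

# For a fixed observation, the MSE-optimal deterministic output is the mean.
mse_optimal = sum(expert_actions) / len(expert_actions)

print(f"MSE-optimal action:      {mse_optimal:+.3f}")
print(f"MSE at the mean:         {mse(mse_optimal, expert_actions):.3f}")
print(f"MSE committing to +1:    {mse(1.0, expert_actions):.3f}")
print(f"MSE committing to -1:    {mse(-1.0, expert_actions):.3f}")
```

The loss prefers the averaged action over either valid mode, which is why the later sections move to heads that represent a distribution rather than a point estimate.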

2.2 DAgger (Ross 2011)

Dataset Aggregation. Iterate:

Train π_i on aggregated dataset D_i.
Roll out π_i, query expert for actions at visited states.
Add (state, expert-action) pairs to D_{i+1} = D_i ∪ {new}.
Repeat.

Recovers O(T) regret instead of O(T²). Requires online expert queries — costly in the real world but feasible with simulators or with cheap recovery (kinesthetic correction). DAgger and its variants (HG-DAgger, SafeDAgger) are still standard for high-precision tasks where BC fails.
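
The loop above can be sketched on a toy problem. Everything here is an illustrative assumption rather than a reference implementation: a 1-D integer world with goal state 0, noisy dynamics, a scripted expert, and a lookup-table learner whose "training" is a majority vote per state.

```python
import random

def expert_action(state):
    """Expert steps toward goal state 0: +1 left of it, -1 right of it, 0 at goal."""
    return (state < 0) - (state > 0)

def train(dataset):
    """Fit a lookup-table policy: majority expert label per visited state."""
    labels = {}
    for state, action in dataset:
        labels.setdefault(state, []).append(action)
    return {s: max(set(acts), key=acts.count) for s, acts in labels.items()}

def rollout(policy_fn, start, horizon, rng):
    """Run a policy under noisy dynamics and return the visited states."""
    visited, state = [], start
    for _ in range(horizon):
        visited.append(state)
        state += policy_fn(state) + rng.choice([-1, 0, 1])  # random disturbance
        state = max(-10, min(10, state))
    return visited

def dagger(iterations=5, horizon=30, start=3, seed=0):
    rng = random.Random(seed)
    # D_0: label one expert rollout.
    dataset = [(s, expert_action(s)) for s in rollout(expert_action, start, horizon, rng)]
    policy = train(dataset)
    for _ in range(iterations):
        # Roll out the current learner; states it has never seen fall back to action 0.
        visited = rollout(lambda s: policy.get(s, 0), start, horizon, rng)
        # Query the expert at the states the learner actually reached, then aggregate.
        dataset += [(s, expert_action(s)) for s in visited]
        policy = train(dataset)
    return policy, dataset

policy, dataset = dagger()
print(f"{len(dataset)} labelled pairs covering {len(policy)} distinct states")
```

The key difference from plain BC is where the labels come from: the states are drawn from the learner's own rollouts, so the dataset grows to cover exactly the places the learner drifts to.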

2.3 Inverse reinforcement learning (IRL)

Rather than copy actions, infer the reward the expert is optimizing, then RL against that reward. Approaches:

Maximum entropy IRL (Ziebart 2008): assume expert is exponentially-biased toward higher-reward trajectories.
Generative adversarial imitation learning (GAIL, Ho 2016): train policy + discriminator simultaneously; discriminator distinguishes expert from policy, policy maximizes “fooling” the discriminator.
Adversarial Inverse RL (AIRL, Fu 2018): extract a reward function from GAIL discriminator.

IRL is ill-posed (the reward is underdetermined; many rewards fit any set of demonstrations) and usually needs an RL loop inside the reward-learning loop. Since about 2020 it has largely been supplanted by direct policy methods.

2.4 Multi-modal policies: GMM, MDN, diffusion, VQ

When the expert action distribution at a given state is multimodal (e.g., “go around obstacle left or right”), unimodal policies fail. Approaches:

Gaussian Mixture Model (GMM) head: output K means + covariances + mixing weights. Used in robomimic’s BC-RNN-GMM baseline.
Mixture Density Network (MDN, Bishop 1994): similar; neural-net outputs of GMM params.
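Discretized actions (RT-1, RT-2): uniformly discretize each action dimension into N=256 bins and train with cross-entropy; the categorical output is naturally multimodal. This is plain binning, not a learned VQ-VAE codebook; VQ-based action tokenizers are a separate line of work.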
Diffusion policy (Chi 2023): the policy is a denoising network that iteratively refines noise into action; trained by score-matching. Captures arbitrary action distributions.
Flow matching (Lipman 2023): a continuous-time variant of diffusion; π0 uses flow matching for speed.
Categorical autoregressive (RT-2, OpenVLA): tokenize actions; predict token-by-token like a language model.
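
The 256-bin discretization mentioned above can be sketched as a pair of encode/decode functions. This is a minimal illustration assuming uniform bins over fixed per-dimension bounds; the bounds and function names are hypothetical, and real systems pick the bounds from dataset statistics.

```python
def encode(value, low, high, n_bins=256):
    """Map a continuous action value to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)          # out-of-range values saturate
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)                 # value == high falls in the last bin

def decode(index, low, high, n_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width

# Round-trip a 7-dim action (hypothetical values, bounds of [-1, 1] per dim).
action = [0.10, -0.35, 0.80, 0.00, -1.00, 1.00, 0.42]
tokens = [encode(a, -1.0, 1.0) for a in action]
recovered = [decode(t, -1.0, 1.0) for t in tokens]
print(tokens)
print([round(r, 4) for r in recovered])
```

The round-trip error is bounded by half a bin width, (high - low) / (2 * n_bins), which is the resolution cost paid for turning regression into classification.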

2.5 Diffusion policy (Chi et al. 2023)

The seminal paper. Action a is treated as a noisy variable; learn ε_θ(a^k, o, k) that predicts the noise added at step k. At inference, start from Gaussian noise a^K, iteratively denoise to a^0 = clean action:

a^{k-1} = a^k - ε_θ(a^k, o, k) * step_size + noise(k)

Properties:

Captures multimodal distributions inherently.
Conditions on observation o (image + proprioception).
Outputs an action chunk H steps long, executed open-loop. Receding horizon: predict, execute K < H, predict again.
100 denoising steps at training; 16 at inference (DDIM); 4–8 with consistency models or flow-matching.

Performance: the paper reports an average 46.9% improvement in success rate over prior methods across its benchmarks, with LSTM-GMM on Robomimic among the baselines; standard for ALOHA-style two-arm tasks since 2024.
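
To make the denoising loop concrete, here is a toy 1-D sampler using the standard DDPM ancestral update, one specific instance of the schematic update rule above. Several things are assumptions for illustration: the observation o is omitted, the beta schedule values are arbitrary, and the learned network ε_θ is replaced by an analytic noise predictor for an expert that picks -1 or +1 with equal probability.

```python
import math
import random

def make_schedule(num_steps=100, beta_start=1e-4, beta_end=0.1):
    """Linear beta schedule plus the cumulative products alpha_bar."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars

def oracle_eps(a_k, alpha_bar):
    """Analytic noise predictor for clean actions in {-1, +1} (stand-in for eps_theta)."""
    s, v = math.sqrt(alpha_bar), 1.0 - alpha_bar
    a0_hat = math.tanh(a_k * s / v)               # posterior mean of the clean action
    return (a_k - s * a0_hat) / math.sqrt(v)

def sample_action(rng, betas, alpha_bars):
    a = rng.gauss(0.0, 1.0)                       # a^K: start from Gaussian noise
    for k in reversed(range(len(betas))):
        beta, alpha_bar = betas[k], alpha_bars[k]
        eps = oracle_eps(a, alpha_bar)
        mean = (a - beta / math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(1.0 - beta)
        noise = rng.gauss(0.0, 1.0) if k > 0 else 0.0   # no noise on the final step
        a = mean + math.sqrt(beta) * noise
    return a

rng = random.Random(0)
betas, alpha_bars = make_schedule()
samples = [sample_action(rng, betas, alpha_bars) for _ in range(200)]
right = sum(1 for a in samples if a > 0)
print(f"{right} samples near +1, {len(samples) - right} near -1")
```

Each sample commits to one of the two modes instead of averaging them, which is the property that makes this family of policies suitable for multimodal demonstrations.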

2.6 Action chunking (ACT, Zhao 2023)

ALOHA team’s transformer-based BC variant. Key choices:

Conditional VAE (CVAE) decoder over action chunks (H = 100 time-steps at 50 Hz).
Encoder sees full episode; decoder sees current observations + style latent.
Temporal ensembling: at runtime, average predictions from multiple recent chunks at each timestep — smooths execution.

Cheaper and faster to train than diffusion policy; competitive for single-trajectory tasks. Standard for bimanual ALOHA / Aloha-Mobile / TidyBot-style work.
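
Temporal ensembling can be sketched as a weighted average over every chunk that covers the current timestep. The version below follows the exponential weighting exp(-m * i) with i = 0 for the oldest prediction; scalar actions, the dictionary layout, and the value of m are simplifying assumptions.

```python
import math

def temporal_ensemble(chunks, t, m=0.1):
    """Blend all chunk predictions that cover timestep t.

    chunks maps the timestep at which a chunk was predicted to its list of
    actions for timesteps start, start + 1, ...
    """
    predictions = [actions[t - start]
                   for start, actions in sorted(chunks.items())   # oldest first
                   if 0 <= t - start < len(actions)]
    if not predictions:
        raise ValueError(f"no chunk covers timestep {t}")
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)

# Three overlapping chunks (hypothetical values), each predicting 4 steps ahead.
chunks = {
    0: [0.00, 0.10, 0.20, 0.30],
    1: [0.12, 0.22, 0.32, 0.42],
    2: [0.18, 0.28, 0.38, 0.48],
}
print(round(temporal_ensemble(chunks, t=2), 4))
```

A larger m leans harder on the oldest prediction and gives smoother but less reactive motion; m = 0 reduces to a plain average.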

2.7 Vision-Language-Action models (VLAs)

Architecture (RT-2 / OpenVLA / π0 lineage):

Vision encoder (e.g., DINOv2, SigLIP, CLIP ViT-L) → image tokens.
Optional language tokenization (T5, LLaMA tokenizer) → text tokens.
Cross-attention transformer / LLM backbone (PaLI / LLaMA-2 / Mistral) → fused tokens.
Action head: discrete tokens (RT-2: 256-bin per dim) OR flow-matching head (π0) OR diffusion head (Octo).

Training: co-finetune on (a) internet image-text pairs, (b) robot trajectories with language instructions. The internet data provides semantic grounding (“the red block”, “the blue mug”); robot data provides action grounding.

Inference latency:

RT-2 55B: 1-3 Hz (slow; needs server GPU).
OpenVLA 7B: 6 Hz on A100.
Octo-Base 93M: 50 Hz on consumer GPU.
π0 3B: 50 Hz (custom flow-matching).
Helix (Figure): split System 1 (200 Hz reactive) + System 2 (5-10 Hz reasoning) on humanoid onboard compute.

2.8 Cross-embodiment learning

Open-X-Embodiment (Padalkar 2023) showed that pooling demonstrations across 22 robot embodiments (Franka, UR5, xArm, ALOHA, Spot, Stretch, Fetch, etc.) gives positive transfer — training on the mixture improves per-embodiment performance over training only on that embodiment’s data. This was non-obvious in 2022; “robot data” was thought to be too embodiment-specific to share. Cross-embodiment is the foundation of every modern VLA.

3. Practical math — dataset sizes, training compute, inference latency

3.1 Dataset-scaling intuition

Demonstrations per task | Result
1–10                    | Single-shot / few-shot learning; only viable with VLA fine-tune
50–500                  | BC works for single-task on narrow embodiment
1k–10k                  | BC + GMM/diffusion robust to environment variation
10k–100k                | Cross-task generalization within embodiment
100k–1M                 | RT-1 / RT-2 / OpenVLA territory; cross-task + cross-embodiment
1M–10M                  | π0 / RT-X / GR00T scale; broad generalization

3.2 Worked example: data budget for a tabletop sorting task

Goal: sort 5 object types from clutter using a 6-DoF arm + parallel jaw.

Approach A: train BC from scratch.

500 demos × 30 s each × 50 Hz = ~750k state-action pairs.
Collected via leader-follower teleop at ~$5/demo (operator + setup).
Cost: ~$2500 in data; ~1 hour training on RTX 4090.

Approach B: fine-tune OpenVLA.

50 demos for the specific task.
Cost: ~$250 in data; ~4 hours training on RTX 4090 with LoRA (rank 32).
Result generally better due to OpenVLA’s pretraining priors (a web-pretrained vision-language backbone fine-tuned on Open-X-Embodiment robot data).

Approach C: prompt π0 zero-shot.

0 demos; just text instruction “sort the objects by color into the bins.”
Cost: API call.
Result: works for common objects in 2024+; degrades on novel objects (the gap π0.5 targets).
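
The budget arithmetic for Approaches A and B can be reproduced in a few lines. The $5/demo rate for Approach B is an assumption carried over from Approach A, which is what the ~$250 figure implies.

```python
def transitions(num_demos, seconds_per_demo, control_hz):
    """Number of (observation, action) pairs in a demo set."""
    return num_demos * seconds_per_demo * control_hz

def data_cost(num_demos, dollars_per_demo):
    """Collection cost in dollars."""
    return num_demos * dollars_per_demo

scratch_pairs = transitions(num_demos=500, seconds_per_demo=30, control_hz=50)
scratch_cost = data_cost(num_demos=500, dollars_per_demo=5)
finetune_cost = data_cost(num_demos=50, dollars_per_demo=5)

print(f"Approach A: {scratch_pairs:,} pairs, ${scratch_cost:,} in data")
print(f"Approach B: ${finetune_cost:,} in data ({scratch_cost // finetune_cost}x cheaper)")
```

The 10x saving in collection cost is what the extra fine-tuning hours buy; compute cost is not included here.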

3.3 Compounding error bound (Ross 2010)

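For BC with episode horizon T and per-step error rate ε (the probability that the policy deviates from the expert on states drawn from the expert’s distribution), with per-step cost bounded in [0, 1]:

E[cost(π_θ)] ≤ E[cost(π_E)] + T² ε

The excess cost over the expert can therefore grow quadratically in the horizon. At T = 200 (4 s @ 50 Hz) and ε = 0.05, the bound on excess cost is T² ε = 2000. With DAgger (linear in T) the corresponding figure is T ε = 10. Hence DAgger’s appeal for long-horizon tasks.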
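
The arithmetic above uses the leading T² ε term for BC and T ε for DAgger. A small sketch makes it easy to see how the gap widens with horizon; the function names are illustrative.

```python
def bc_excess_cost_bound(horizon, eps):
    """Quadratic-in-horizon bound for behavior cloning: T^2 * eps."""
    return horizon ** 2 * eps

def dagger_excess_cost_bound(horizon, eps):
    """Linear-in-horizon bound for DAgger: T * eps."""
    return horizon * eps

for horizon in (50, 200, 1000):
    bc = bc_excess_cost_bound(horizon, eps=0.05)
    dg = dagger_excess_cost_bound(horizon, eps=0.05)
    print(f"T={horizon:5d}  BC bound={bc:9.1f}  DAgger bound={dg:6.1f}  ratio={bc / dg:.0f}")
```

The ratio between the two bounds is exactly T, so the advantage of on-policy correction grows with episode length.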

3.4 Diffusion policy denoising-step budget

Standard DDPM: 1000 training steps, 100 inference. DDIM: 50 inference. Distilled diffusion (e.g., Consistency Models): 2–4 inference. Flow matching: 4–8 ODE solver steps.

At 50 Hz control rate with 4 denoising steps × 4 ms each: 16 ms total — feasible on consumer GPU. RT-2 at 1–3 Hz cannot be the inner loop; needs a fast layer (Helix’s System 1) below it.
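
The latency check above is a one-line comparison between inference time and the control period. A minimal sketch, where the per-step time and the optional overhead term are assumed inputs you would measure on your own hardware:

```python
def fits_control_loop(denoise_steps, ms_per_step, control_hz, overhead_ms=0.0):
    """Return (latency_ms, period_ms, fits) for per-tick policy inference."""
    period_ms = 1000.0 / control_hz
    latency_ms = denoise_steps * ms_per_step + overhead_ms
    return latency_ms, period_ms, latency_ms <= period_ms

for steps in (4, 8, 16):
    latency, period, ok = fits_control_loop(steps, ms_per_step=4.0, control_hz=50)
    print(f"{steps:2d} steps: {latency:5.1f} ms vs {period:.1f} ms period -> {'ok' if ok else 'too slow'}")
```

When inference does not fit in one control period, action chunking relaxes the constraint: the policy only has to finish before the currently executing chunk runs out.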

3.5 LoRA fine-tuning footprint

LoRA (Hu 2021): freeze base model, learn rank-r updates to attention matrices. For OpenVLA 7B base:

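Full fine-tune: 14 GB of bf16 weights, plus gradients and two Adam moment buffers of the same size, is roughly 4 × 14 = 56 GB before activations (more if optimizer state is kept in fp32).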
LoRA rank 32: ~150 MB trainable params; ~16 GB VRAM total; trains on single RTX 4090.

This is why VLA adoption has been so fast: labs can fine-tune a 7B model for their own tasks without an H100 cluster.
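
The trainable-parameter figure can be sanity-checked by counting LoRA parameters directly: a rank-r adapter on a d_in × d_out matrix adds r × (d_in + d_out) parameters. The sketch below uses the Llama-2-7B language-model dimensions (hidden size 4096, MLP size 11008, 32 layers); which matrices receive adapters is an assumption, and the vision encoder is ignored.

```python
def lora_params(shapes, rank):
    """Trainable parameters when each (d_in, d_out) matrix gets a rank-r A and B."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)

HIDDEN, MLP, LAYERS, RANK = 4096, 11008, 32, 32
attention = [(HIDDEN, HIDDEN)] * 4                      # q, k, v, o projections
mlp = [(HIDDEN, MLP), (HIDDEN, MLP), (MLP, HIDDEN)]     # gate, up, down projections

attn_only = LAYERS * lora_params(attention, RANK)
all_linear = LAYERS * lora_params(attention + mlp, RANK)

for name, count in (("attention only", attn_only), ("all linear layers", all_linear)):
    print(f"{name}: {count / 1e6:.1f}M params, {count * 2 / 1e6:.0f} MB in bf16")
```

Adapting every linear layer in the language model gives a figure close to the ~150 MB quoted above, while attention-only adapters are less than half that size.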

4. Design heuristics

Don’t BC from raw images alone. Provide proprioception (joint state, gripper state) explicitly. Vision should capture the scene; proprioception should capture the robot.
Image conditioning needs diversity. If you train with one lighting / background, the policy memorizes pixel patterns. Diversify lighting, distractor objects, table textures.
Use multi-camera input. Wrist camera + 1–2 external. Wrist gives close-up at the gripper; external gives spatial context. ALOHA: 4 cameras; Stretch: wrist + head + base; UR5 + Realsense: wrist + side standard.
Action chunking is essential for tasks with periodic patterns (pouring, wiping). Predict 1 s ahead, execute, refresh.
Use absolute actions or end-effector pose targets, not joint velocities. Robot proprioception is more reliably learnable from absolute targets.
Filter teleop demos. Bad demos teach bad policies. Reject episodes with timeouts, collisions, abnormal force readings.
Co-train across tasks when possible. Even 10% of related tasks in the mix improves convergence on the target task.
Reference frame matters. End-effector-relative actions transfer better across embodiments than world-frame actions.
Predicting next-image is overkill. Diffusion-Policy + ACT predict actions only; full visual world models (Dreamer V3, GAIA) are research, not deployment.
Don’t expect BC to recover from failure. If teleop never showed how to recover from a dropped object, the policy won’t know. Either include recovery demos or layer reactive primitives below.
Use language instructions as conditioning even if not strictly needed. It often improves performance through implicit task-decomposition.
Benchmark against the same teleop expert. Operator-specific style produces operator-specific success rates; controlled comparison requires the same operator on each method.
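
The "filter teleop demos" heuristic can be sketched as a simple predicate over per-episode summaries. The Episode fields and the thresholds are hypothetical; substitute whatever your logging pipeline records.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """Hypothetical per-episode summary produced by a teleop logger."""
    name: str
    duration_s: float
    collided: bool
    peak_force_n: float

def keep(episode, max_duration_s=60.0, max_force_n=30.0):
    """Reject timeouts, collisions, and abnormal force readings."""
    return (episode.duration_s <= max_duration_s
            and not episode.collided
            and episode.peak_force_n <= max_force_n)

def filter_episodes(episodes, **thresholds):
    return [ep for ep in episodes if keep(ep, **thresholds)]

episodes = [
    Episode("clean", duration_s=28.0, collided=False, peak_force_n=12.0),
    Episode("timeout", duration_s=75.0, collided=False, peak_force_n=10.0),
    Episode("bumped_table", duration_s=30.0, collided=True, peak_force_n=18.0),
    Episode("jammed", duration_s=31.0, collided=False, peak_force_n=55.0),
]
kept = filter_episodes(episodes)
print([ep.name for ep in kept])
```

Logging why each episode was rejected is worth the extra line in practice, since a high rejection rate for one reason usually points at a rig or operator problem.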

5. Components & sourcing — teleop rigs and data engines

5.1 Teleoperation rigs

Rig	Embodiment	Approx cost	Notes
ALOHA / ALOHA 2 (Stanford)	Dual-arm ViperX 300	~$30k DIY	Open-source; leader-follower; standard for bimanual research
Mobile ALOHA (Stanford)	ALOHA + Tracer base	~$32k DIY	Adds mobile base for in-home tasks
GELLO (Stanford)	Any 6-DoF arm	~$200 leader rig	Open-source 3D-printable; works with Franka, UR5, xArm
AnyTeleop (Wang 2023)	Various	Software	Vision-based hand tracking; free
HandyMOVE / OpenTeach (Iyer 2024)	VR-based	$400 (Quest 3)	Meta Quest teleop with retargeting
Wandelbots TracePen	Industrial arm	Commercial	Stylus-based programming
Boston Dynamics Spot Tablet	Spot quadruped	Bundled	Game-style controller

5.2 Datasets

Dataset	Trajectories	Embodiments	Tasks	License
Open-X-Embodiment (2023)	~1M	22	527	Apache 2.0
DROID (2024, Stanford+TRI)	76k	Franka Panda	86 tasks (564 scenes)	MIT
BridgeData V2 (Walke 2023)	60k	WidowX 250	14 task families	MIT
RT-1 dataset (Brohan 2022)	130k	Everyday Robots	700+ tasks	Restricted
RH20T (Fang 2023)	110k	Franka, UR5	147	MIT
MimicGen (Mandlekar 2023)	10k–100k	Various sim	10+	MIT
Stanford EgoMimic (2024)	2.8k	Aloha 2	Manipulation	MIT

5.3 Open-source policy implementations

Implementation	Reference paper / org	Use
LeRobot (Hugging Face 2024)	Various	Production-style training/eval pipeline; supports ACT, Diffusion Policy, π0
diffusion_policy (Chi 2023)	Columbia	Reference Diffusion Policy implementation
openpi (Physical Intelligence 2024)	PI	Inference code for π0 / π0.5; open weights
octo (Berkeley 2024)	UC Berkeley	Open generalist; small/base/large checkpoints
openvla (Stanford TRI 2024)	Stanford	Open VLA 7B; LoRA fine-tuning recipe
ACT / aloha (Stanford 2023)	Stanford	ACT reference for ALOHA
robomimic (Stanford 2021)	Stanford ARISE	Benchmark suite + baselines

5.4 Foundation model APIs / checkpoints

Model	Provider	Access	Notes
π0, π0.5	Physical Intelligence	Open weights	Flow-matching VLA
Octo	UC Berkeley	Open weights	27M / 93M params
OpenVLA-7B	Stanford TRI	Open weights	Llama2-7B + DINOv2/SigLIP
RDT-1B	Tsinghua	Open weights	Bimanual diffusion transformer
GR00T N1 / N1.5	NVIDIA	Open weights (research)	Humanoid foundation model
RT-2	Google DeepMind	Closed	Internal only
Helix	Figure AI	Closed	Onboard Figure 02 humanoid
Gemini Robotics	Google DeepMind	Closed (2025 announcement)	Multimodal Gemini-derived

6. Reference data

6.1 Method comparison

Method	Multi-modal	Long-horizon	Compute	Notes
BC + MLP	No	Bad	Cheap	Toy baseline
BC + GMM	Yes (K modes)	OK	Cheap	Standard pre-2023
DAgger	No	Good	Expert queries	Limited by expert availability
ACT (CVAE + transformer)	Yes	Good	Moderate	ALOHA standard
Diffusion Policy	Yes (any)	Excellent	Moderate (16 denoising)	2024+ default
Flow Matching	Yes	Excellent	Cheap (4-8 steps)	π0
RT-2 / OpenVLA	Yes (tokenized)	Excellent + language	Heavy (7B+ params)	VLA standard
GAIL / AIRL	Yes (implicitly)	Variable	Heavy (RL inner loop)	Mostly research

6.2 Notable IL milestones

Year	Project	Demos	Key idea
1989	ALVINN (Pomerleau)	1k frames	First BC on a vehicle (CMU NAVLAB)
1999	Schaal et al. DMPs	10s	Dynamic Movement Primitives — parameterized motions
2011	DAgger (Ross)	online	Linear-in-T regret bound
2016	GAIL (Ho, Ermon)	mixed	Adversarial IL
2017	One-shot IL (Duan, Finn)	1	Meta-IL framework
2018	DexPilot (NVIDIA)	teleop	Hand teleop → dexterous policies
2020	LSTM-GMM Robomimic baseline	200	Strong BC baseline
2022	RT-1 (DeepMind)	130k	First scaled robot transformer
2023	ACT (Stanford)	50/task	Bimanual ALOHA
2023	Diffusion Policy (Chi et al.)	100s	Multimodal action distributions
2023	RT-2 / RT-X / Open-X-Embodiment	1M	VLA + cross-embodiment
2024	OpenVLA, Octo, π0	mixed	Open foundation models
2025	Helix (Figure)	proprietary	Humanoid VLA, dual-system 200 Hz
2025	π0.5, GR00T N1.5	proprietary	Generalization-focused next-gen

6.3 Practical training recipes

ACT for ALOHA bimanual:
  - 50 demos per task
  - 50 Hz control rate
  - Action chunk H = 100 (2 s lookahead)
  - Vision: ResNet-18 backbone per camera, 4 cameras, features concatenated
  - Style latent z: dim 32, KL weight 10
  - Train: 8000 epochs, lr 1e-5, batch 8
  - Hardware: single RTX 4090, ~1 day

Diffusion Policy:
  - 200 demos per task
  - 10 Hz observation, predict 16-step chunk at 10 Hz
  - Vision: ResNet-18 (separate per camera)
  - 1D-UNet over actions, 100 train denoise steps
  - DDIM 16 inference steps
  - Train: 3000 epochs, lr 1e-4, EMA
  - Hardware: single RTX 4090, ~1 day

OpenVLA LoRA fine-tune:
  - 50–200 demos
  - LoRA rank 32, alpha 16
  - lr 5e-4 for LoRA, frozen base
  - Image: DINOv2 + SigLIP fused at ViT-L
  - Action head: discrete 256-bin per dim, 7-dim action
  - Train: 30k–100k steps, batch 16
  - Hardware: single A100 / 4090 with bf16

7. Failure modes & debugging

Mean-mode collapse — unimodal BC outputs the average of bimodal expert. Detect: action looks like always-straight in obstacle scenarios. Fix: switch to GMM, diffusion, or VQ-discrete head.
Latency-induced drift — policy trained at 50 Hz, deployed at 30 Hz. Detect: errors grow with horizon. Fix: match rates or include latency in training.
Causal confusion — policy keys on hand-pose-in-image rather than object-in-world. Detect: hide the robot from the image — does policy still work? Fix: more demos with varied robot poses, image augmentation removing robot.
Distribution shift on novel objects — works on training objects, fails on a slightly different mug. Fix: more object variety; or VLA pretraining; or test-time augmentation.
Failure to recover — drops object, freezes. Fix: include recovery demos; layer with classical reach-to-grasp primitives; or train reactive RL on top.
Teleop artifact replay — policy learned to pause at certain points because the demonstrator paused. Fix: filter out idle frames; normalize by velocity.
Bimodal modality flicker — Diffusion / VLA policy oscillates between two valid actions. Fix: temporal smoothing / receding horizon execution with chunk size > 1; ensemble across recent chunks.
Reward-hacking equivalent in IL: the policy mimics surface statistics, not intent. E.g., a “put the cup on the saucer” policy that hovers over the saucer without releasing. Fix: better demonstrations + force signal + explicit success conditions.
OOD lighting — works under demo lighting, fails under indoor LED. Fix: domain randomization in image augmentation; ImageNet-pretrained vision encoder; or VLA priors.
Compute-budget mismatch — VLA trained at server, deployed on robot edge — too slow. Fix: distill to smaller policy; or use dual-system (slow VLA generates plans, fast policy executes).
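
The idle-frame fix for teleop artifact replay can be sketched as a pass that drops frames whose action barely differs from the last kept frame. This assumes absolute position-style actions stored as lists of floats; the threshold is illustrative and should be set relative to your action scale.

```python
def drop_idle_frames(frames, min_delta=1e-3):
    """Keep a frame only if some action dimension moved more than min_delta.

    frames is a list of (observation, action) pairs; the comparison is against
    the most recently kept frame, so slow drift is still captured.
    """
    kept = []
    for observation, action in frames:
        if kept:
            previous = kept[-1][1]
            if max(abs(a - b) for a, b in zip(action, previous)) <= min_delta:
                continue                      # demonstrator was pausing
        kept.append((observation, action))
    return kept

# Hypothetical demo: the operator pauses for two frames at 0.1.
frames = [("obs0", [0.0]), ("obs1", [0.1]), ("obs2", [0.1]),
          ("obs3", [0.1]), ("obs4", [0.2])]
print([obs for obs, _ in drop_idle_frames(frames)])
```

Dropping frames changes the effective timing of the demo, and some pauses are intentional (for example waiting for an object to settle), so it is worth reviewing what gets removed.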

7.1 Quick-debug checklist for new BC policy

1. Inspect demo data:
   - Are demos consistent (same operator style)?
   - Action distribution per state — is it multimodal?
   - Outlier filter: drop episodes with timeouts, force-violations, abnormal episode length.

2. Sanity-train on a small subset (100 transitions):
   - If it overfits easily → architecture is fine; remaining failures point to insufficient data or coverage.
   - If it can't overfit → bug in data pipeline or model.

3. Open-loop replay:
   - Predict actions from observations in demo; compare to recorded actions.
   - Median error should be < 5% of action range.

4. Closed-loop sim eval:
   - Run policy in sim with demo's initial conditions; record success.
   - Should match teleop demonstrator (within 10%).

5. Closed-loop real:
   - Try in a controlled setup matching training distribution.
   - Note systematic biases (always lifts too high, always misses left).
   - Diagnose: data, model architecture, or reward/objective.
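
Step 3 of the checklist (open-loop replay) can be sketched as a per-dimension median error normalized by the recorded action range. The data below is made up to show one dimension passing and one failing the 5% threshold.

```python
import statistics

def open_loop_error(predicted, recorded):
    """Per-dimension median |predicted - recorded| as a fraction of the recorded range."""
    fractions = []
    for dim in range(len(recorded[0])):
        rec = [action[dim] for action in recorded]
        errors = [abs(pred[dim] - r) for pred, r in zip(predicted, rec)]
        span = max(rec) - min(rec)
        if span == 0:
            raise ValueError(f"action dimension {dim} never changes in the demo")
        fractions.append(statistics.median(errors) / span)
    return fractions

# Hypothetical recorded actions and policy predictions on the same observations.
recorded = [[0.0, 10.0], [1.0, 12.0], [2.0, 14.0], [3.0, 16.0]]
predicted = [[r[0] + 0.06, r[1] - 0.5] for r in recorded]

fractions = open_loop_error(predicted, recorded)
passed = all(f < 0.05 for f in fractions)
print([f"{f:.1%}" for f in fractions], "pass" if passed else "fail")
```

Reporting the error per dimension rather than as one number makes it easier to spot a single badly scaled channel, such as the gripper command.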

7.2 Comparative metrics

Success rate:
  - Most common headline metric.
  - Subject to evaluator bias; specify protocol (number of trials, end-condition).

Task completion time:
  - Vs demonstrator average. Often higher for IL (more conservative).

Smoothness (action jerk):
  - Lower jerk = smoother; high jerk indicates oscillation.

Distance to demo trajectory:
  - L2 distance from policy trajectory to nearest demo trajectory in pose space.

Generalization gap:
  - Train tasks vs held-out test tasks.

Robustness:
  - Performance under disturbance (push, occlusion, lighting change).
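
The smoothness metric can be computed from logged positions with a third finite difference. A minimal sketch for a single 1-D channel sampled at a fixed interval dt; extending it to several joints means applying it per channel.

```python
def mean_abs_jerk(positions, dt):
    """Average magnitude of the third finite difference of a position trace."""
    if len(positions) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return sum(abs(j) for j in jerks) / len(jerks)

smooth = [0.0, 1.0, 4.0, 9.0, 16.0]        # constant acceleration, zero jerk
oscillating = [0.0, 1.0, 0.0, 1.0, 0.0]    # flicker between two targets

print(mean_abs_jerk(smooth, dt=1.0))
print(mean_abs_jerk(oscillating, dt=1.0))
```

Because the estimate divides by dt cubed, jerk values are only comparable between runs logged at the same rate.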

8. Case studies

8.1 ALOHA folding shirts (Zhao 2023, Stanford)

Setup: dual ViperX 300 6-DoF arms in leader-follower; 4 cameras (2 wrist, 1 chest, 1 environment). Task: pick up shirt from table, fold, place. ~50 demos per shirt configuration. Method: ACT with H=100 action chunk at 50 Hz, ResNet-18 vision backbones, CVAE decoder with z-latent. Result: ~90% success on training shirts, ~70% on novel shirts (different size/color). Public demo at NeurIPS 2023 attracted significant attention.

8.2 Mobile ALOHA cooking (Fu 2024, Stanford)

Mobile ALOHA: ALOHA + Tracer base. Bimanual + driving. Task: cook shrimp, deliver food to person, wash dishes. ~50 demos per skill, then language-conditioned switching. Method: ACT + co-training with static ALOHA dataset. Result: end-to-end 30-minute task completion in 8/10 attempts at Stanford lab. Cited as proof that mobile bimanual long-horizon manipulation is feasible with current IL stack.

8.3 Diffusion Policy on robomimic (Chi 2023)

Benchmark: Robomimic tasks Lift, Can, Square, and Tool Hang (single-arm) plus Transport (two-arm), with image observations. Compared: BC-RNN-GMM (prior SOTA) vs Diffusion Policy. Result: 25–45% improvement in success rate across tasks; canonical 2023 result establishing diffusion as the new default.

8.4 π0 zero-shot generalization (Physical Intelligence 2024)

Model: π0 with 3B params, flow-matching VLA. Training data: ~1M demonstrations across 7 different robot platforms (Franka, UR5, ALOHA, bimanual, mobile manipulator). Inference: 50 Hz on single H100. Public demos: laundry folding, table bussing, packing groceries, on robots that contributed zero training data (e.g., 1X EVE after model already trained on Franka data). Significance: first commercial-quality VLA released open-weights; spawned wave of integrators in 2024–2025.

8.5 Figure Helix on Figure 02 (Feb 2025)

Architecture: dual-system. System 1 (S1) — fast (200 Hz), reactive, low-level motor control via a small ViT-style network. System 2 (S2) — slow (5-10 Hz), reasoning, large VLM that issues sub-goals to S1. Hardware: Figure 02 humanoid, 35 DoF, onboard NVIDIA AGX Orin. Demo task: two humanoids cooperate to put groceries away in a kitchen. Continuous voice instruction; one humanoid hands items to the other. Result: first public demo of a humanoid VLA running fully onboard at production rates. Sparked the “humanoid foundation model” deployment wave at Figure, 1X, Apptronik, NVIDIA GR00T, and the Tesla Optimus claims.

8.6 NVIDIA GR00T (2025)

GR00T N1 + N1.5 (Generalist Robot 00 Technology) — NVIDIA’s humanoid foundation model lineage.

Architecture: cross-embodiment foundation model trained on a large mixture of human-video data, mocap, sim trajectories, and real-robot demos.
Embodiments: Unitree H1/G1, Apptronik Apollo, 1X NEO, Fourier GR-1, Boston Dynamics electric Atlas (partner integrations).
Distribution: open weights for research, commercial license for OEMs.
Inference: optimized for NVIDIA Jetson AGX Thor (next-gen, ships 2025).

GR00T’s role: NVIDIA aims to be the “Android of humanoids” — sell the compute (Thor / Orin) + foundation model + Isaac Lab simulator stack to humanoid OEMs who lack the in-house RL/IL talent.

8.7 1X NEO Gamma at homes (2024–2025)

1X Technologies, Norwegian humanoid maker. NEO Gamma (2024): cable-driven (compliant) soft drivetrain for human-safe forces. Currently shipping to in-home pilot customers (Bay Area early adopters) with autonomy + remote operator backup. Tasks demonstrated: opening doors, picking up items, fetching from refrigerator, conversational assistance. Combines: VLA backbone (1X in-house) + diffusion-style action head + remote-operator handoff for failures. Significant for being the first humanoid pilot with paid end-user deployment in homes (vs. industrial).

8.8 Open-X-Embodiment / RT-X (2023)

21 institutions pooled their robot datasets (Open-X-Embodiment): ~1M trajectories, 22 embodiments, 527 skills. RT-1-X is RT-1 trained on the union; RT-2-X is the same with an RT-2 backbone. Result: RT-1-X improved over each lab’s per-lab RT-1 by ~50% on average. This established cross-embodiment training as the standard recipe and made RT-X the baseline recipe for OpenVLA, Octo, and π0.

8.9 LeRobot — Hugging Face’s open IL stack (2024)

LeRobot (Hugging Face, late 2024) released an end-to-end open-source pipeline:

Common dataset format (HuggingFace Hub-hosted).
Reference implementations of ACT, Diffusion Policy, π0, π0-FAST.
Training + eval scripts.
Affordable Koch arm (~$200 leader rig, $1500 follower) for hands-on labs.
100 community datasets posted within first 6 months.

Significance: lowered the entry barrier for university labs + hobbyists. Most academic IL papers in 2025+ benchmark against LeRobot baselines.

8.10 Diffusion Policy on bin picking (Mahler / Berkeley AUTOLAB 2024)

Combined Dex-Net 4.0 grasp-selection prior with Diffusion Policy refinement:

Dex-Net proposes top-K grasp candidates (depth-based).
Diffusion Policy refines the approach trajectory conditional on the candidate grasp.
Per-bin success rate ~93% on cluttered real objects (~10% improvement over Dex-Net alone).
Trained on 50k synthetic demos + 500 real refinement demos.

Demonstrates the “model-based prior + learned refinement” pattern that’s becoming standard for IL deployment in tight-tolerance industrial tasks.

Citations

Pomerleau D.A. (1989) “ALVINN: An Autonomous Land Vehicle In a Neural Network.” NIPS ‘88.
Ross S., Bagnell D. (2010) “Efficient Reductions for Imitation Learning.” AISTATS.
Ross S., Gordon G., Bagnell D. (2011) “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning” (DAgger). AISTATS.
Ziebart B.D., Maas A., Bagnell J.A., Dey A.K. (2008) “Maximum Entropy Inverse Reinforcement Learning.” AAAI.
Ho J., Ermon S. (2016) “Generative Adversarial Imitation Learning” (GAIL). NeurIPS.
Fu J., Luo K., Levine S. (2018) “Learning Robust Rewards with Adversarial Inverse Reinforcement Learning” (AIRL). ICLR.
de Haan P., Jayaraman D., Levine S. (2019) “Causal Confusion in Imitation Learning.” NeurIPS.
Brohan A. et al. (2022) “RT-1: Robotics Transformer for Real-World Control at Scale.” arXiv:2212.06817.
Brohan A. et al. (2023) “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv:2307.15818.
Padalkar A. et al. (2023) “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” arXiv:2310.08864.
Kim M.J. et al. (2024) “OpenVLA: An Open-Source Vision-Language-Action Model.” arXiv:2406.09246.
Physical Intelligence (2024) “π0: A Vision-Language-Action Flow Model for General Robot Control.” physicalintelligence.company.
Octo Team (2024) “Octo: An Open-Source Generalist Robot Policy.” arXiv:2405.12213.
Chi C., Feng S., Du Y., et al. (2023) “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS.
Zhao T., Kumar V., Levine S., Finn C. (2023) “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware” (ACT). RSS.
Fu Z. et al. (2024) “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation.” arXiv:2401.02117.
Lipman Y. et al. (2023) “Flow Matching for Generative Modeling.” ICLR.
Hu E.J. et al. (2021) “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv:2106.09685.
Walke H. et al. (2023) “BridgeData V2: A Dataset for Robot Learning at Scale.” CoRL.
Khazatsky A. et al. (2024) “DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset.” arXiv:2403.12945.
NVIDIA (2025) “GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.” nvidia.com.
Figure AI (2025) “Helix: A Vision-Language-Action Model for Humanoid Robots.” figure.ai blog.
