Bin Picking — Dex-Net, FoundationPose, Grasp Planning
1. At a glance
Random bin picking is the task of grasping and removing parts from a jumbled, unstructured bin — long held as the “holy grail” of warehouse and factory automation because it combines perception under heavy clutter, grasp synthesis on unseen object instances, and reliable execution at industrial cycle times. The modern (2026) wave is dominated by:
- Deep-learning grasp planners — Dex-Net 2.0/2.1 (Mahler 2017), GraspNet-1Billion (Fang 2020), AnyGrasp (Fang 2023), Contact-GraspNet (Sundermeyer 2021).
- Foundation pose models — FoundationPose (Wen 2024, NVIDIA), MegaPose (Labbé 2022), NVIDIA cuGripper + Isaac Manipulator stack (2024).
- End-effector diversity — vacuum suction, parallel-jaw, multi-finger soft grippers, hybrid tool-changers.
- Major deployments — Amazon Robotics (Sparrow, Robin, Cardinal), Symbotic (gantry AS/RS), Berkshire Grey, AutoStore, Locus Robotics, Covariant (founders hired and models licensed by Amazon, 2024), RightHand Robotics, MUJIN.
2. Why it matters
E-commerce fulfillment, manufacturing kitting, and parcel induction are the dominant economic drivers. Amazon disclosed picking over 10 billion items/year robotically in its 2024 operations summary. Per-pick economics are driven by three coupled metrics:
- Speed — target 10–20 picks/min for industrial cells; Sparrow operates near 15 PPM at peak.
- Reliability — pick success ≥ 99% in tote-to-tote, ≥ 95% in mixed-SKU clutter.
- Damage rate — typically < 0.1% to stay competitive with manual labor.
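Speed and reliability are coupled because a failed pick still consumes a cycle. The sketch below is a minimal illustration, assuming each failed attempt wastes exactly one full cycle with no extra recovery time; that assumption and the function name `effective_picks_per_hour` are illustrative, not taken from any vendor specification.

```python
def effective_picks_per_hour(picks_per_minute: float, success_rate: float) -> float:
    """Successful picks per hour, assuming a failed attempt costs one full cycle."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be in [0, 1]")
    return picks_per_minute * 60.0 * success_rate


if __name__ == "__main__":
    # Same 15 picks/min cell at the two reliability levels quoted above.
    for rate in (0.95, 0.99):
        print(rate, effective_picks_per_hour(15, rate))
```

Under this simple model, moving from 95 % to 99 % success at a fixed cycle time recovers roughly 36 picks per hour, before counting the time a real cell spends recovering from a miss.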
3. First principles
3.1 Grasp quality metrics
| Metric | Definition | Use |
|---|---|---|
| Force-closure | Grasp can resist any external wrench in arbitrary direction | Multi-finger feasibility |
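| Antipodal | Two contacts whose connecting line lies inside both friction cones; in the frictionless limit the normals are collinear and opposed | Parallel-jaw scoring |
| Wrench-resistance ratio (ε) | ε = min ‖w‖ over the boundary of the grasp wrench space: the largest disturbance wrench the grasp resists in every direction | Robust grasp ranking |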
| Manipulability | √det(JJᵀ) at grasp pose | Arm-configuration filter |
| Reachability | Pose is inside arm workspace, no joint-limit violation | Hard feasibility |
| Q-value (Dex-Net) | P(success) under sensor + dynamics noise | Learned ranking |
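As an illustration of the antipodal row, the following sketch tests whether the line between two contacts lies inside both friction cones. It assumes point contacts with Coulomb friction and inward-pointing surface normals; the function name `is_antipodal` and the sample geometry are hypothetical.

```python
import math

Vec3 = tuple[float, float, float]


def _unit(v: Vec3) -> Vec3:
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return (v[0] / n, v[1] / n, v[2] / n)


def _angle(a: Vec3, b: Vec3) -> float:
    """Angle in radians between two unit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))


def is_antipodal(p1: Vec3, n1: Vec3, p2: Vec3, n2: Vec3, mu: float) -> bool:
    """True if the grasp axis p1->p2 lies inside both friction cones.

    n1 and n2 are surface normals pointing INTO the object at p1 and p2;
    mu is the Coulomb friction coefficient (cone half-angle = atan(mu)).
    """
    axis = _unit((p2[0] - p1[0], p2[1] - p1[1], p2[2] - p1[2]))
    reverse = (-axis[0], -axis[1], -axis[2])
    half_angle = math.atan(mu)
    return (_angle(axis, _unit(n1)) <= half_angle
            and _angle(reverse, _unit(n2)) <= half_angle)


if __name__ == "__main__":
    # Opposite faces of a 40 mm box: ideal antipodal pair.
    print(is_antipodal((-0.02, 0, 0), (1, 0, 0), (0.02, 0, 0), (-1, 0, 0), mu=0.5))
```

A planner would typically run a check like this as a cheap geometric filter before scoring the surviving candidates with a learned or analytic quality metric.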
3.2 Suction physics
Suction-cup lifting force on a sealed pad:
F = (P_atm − P_vac) × A_seal
with P_atm ≈ 101.3 kPa, P_vac the absolute pressure inside the cup, and A_seal the effective sealed area. A “70 % vacuum” cup pulls P_vac ≈ 30 kPa absolute, giving ΔP ≈ 71 kPa.
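The relation above can be written as a small helper. This is a sketch that assumes a perfectly sealed cup and a static load; the function name `suction_force_n` and the default atmospheric pressure constant are illustrative choices.

```python
import math

P_ATM_PA = 101_300.0  # sea-level atmospheric pressure used in the text, in Pa


def suction_force_n(p_vac_abs_pa: float, seal_area_m2: float,
                    p_atm_pa: float = P_ATM_PA) -> float:
    """Ideal holding force F = (P_atm - P_vac) * A_seal for a sealed cup."""
    if p_vac_abs_pa < 0 or p_vac_abs_pa > p_atm_pa:
        raise ValueError("absolute cup pressure must lie between 0 and P_atm")
    return (p_atm_pa - p_vac_abs_pa) * seal_area_m2


if __name__ == "__main__":
    area = math.pi * 0.030 ** 2  # 60 mm diameter sealed area
    print(round(suction_force_n(30_000.0, area), 1), "N")
```

Leakage on rough or porous surfaces raises the pressure inside the cup, so the value returned here is an upper bound on what a real cup delivers.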
3.3 Vision pipeline (canonical)
RGB-D capture → instance segmentation → 6-DoF pose estimation
→ grasp planner → motion plan (MoveIt 2 / cuMotion)
→ execute → in-hand verification → place
3.4 Open-vocabulary grasping
Language-conditioned pick (“pick up the red mug”) is enabled by combining:
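- SAM (Meta 2023) / SAM 2 (Meta 2024) for promptable segmentation.
- CLIP (Radford 2021) for vision-language matching.
- LLM planner (e.g. GPT-4o, Claude) to parse the natural-language request into a grasp target; vision-language-action models such as RT-2 instead map the instruction directly to robot actions.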
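The language-grounding step reduces to a nearest-neighbor match between a text embedding and one image embedding per segmented mask. The sketch below shows only that matching logic on toy vectors. It assumes the embeddings were already produced by a CLIP-style encoder; `select_target` and the `min_score` threshold are hypothetical names and values.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def select_target(text_emb: Sequence[float],
                  crop_embs: Sequence[Sequence[float]],
                  min_score: float = 0.2) -> Optional[int]:
    """Index of the mask crop that best matches the prompt, or None."""
    if not crop_embs:
        return None
    scores = [cosine(text_emb, e) for e in crop_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= min_score else None


if __name__ == "__main__":
    prompt = [1.0, 0.0, 0.0]                       # toy "red mug" embedding
    crops = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
    print(select_target(prompt, crops))
```

Returning `None` when no crop clears the threshold lets the cell ask for clarification or re-image the bin instead of grasping the least-bad match.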
3.5 Sim-to-real
Two foundational papers drive industrial sim-to-real for grasping:
- Tobin 2017 IROS — domain randomization (textures, lighting, camera pose) to train detectors in simulation that transfer to real cameras with no real-data fine-tuning.
- Mahler 2017 RSS Dex-Net 2.0 — synthetic depth images of 6.7 M grasps over 1500 3D models; GQ-CNN trained purely in simulation and deployed without real-data fine-tuning, with a reported 93 % success rate on eight known objects with adversarial geometry.
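Domain randomization amounts to sampling nuisance parameters independently for every rendered training scene. A minimal sketch follows; the parameter names and ranges are illustrative assumptions, not values from either paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    light_intensity: float   # relative to nominal lighting
    camera_height_m: float   # camera height above the bin floor
    camera_tilt_deg: float   # deviation from a straight-down view
    texture_id: int          # index into a pool of random textures


def sample_scene(rng: random.Random, n_textures: int = 1000) -> SceneParams:
    """Draw one randomized scene configuration (ranges are illustrative)."""
    return SceneParams(
        light_intensity=rng.uniform(0.3, 1.5),
        camera_height_m=rng.uniform(0.6, 0.9),
        camera_tilt_deg=rng.uniform(-5.0, 5.0),
        texture_id=rng.randrange(n_textures),
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a dataset can be regenerated
    for _ in range(3):
        print(sample_scene(rng))
```

Seeding the generator keeps a synthetic dataset reproducible, which helps when a regression has to be traced back to a particular scene.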
4. Worked examples
Example A — Suction-cup sizing for a 4 kg parcel
Given: parcel mass m = 4 kg, target vacuum 70 % (P_vac ≈ 30 kPa abs), bellows cup outer Ø 60 mm.
A_seal ≈ π × (0.030 m)² = 2.83 × 10⁻³ m²
ΔP = 101.3 kPa − 30 kPa = 71.3 kPa = 7.13 × 10⁴ Pa
F_lift = ΔP × A_seal = 7.13 × 10⁴ × 2.83 × 10⁻³ ≈ 202 N
Apply a 2× safety factor → F_eff ≈ 101 N. Force needed to hold the static weight: W = m·g = 4 × 9.81 = 39.2 N. F_eff is about 2.6 × W, so the cup is adequately sized, with margin for acceleration during the lift trajectory (typical 0.3 g lateral). The outer diameter is an upper bound on the sealed diameter, so the real margin is somewhat smaller.
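The same sizing can be run in reverse to find the smallest cup that carries a given parcel. The sketch assumes a single cup, a perfect seal, and vertical acceleration only (lateral loads depend on friction and are not modeled); `min_cup_diameter_m` is a hypothetical helper name.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def min_cup_diameter_m(mass_kg: float, delta_p_pa: float,
                       safety_factor: float = 2.0,
                       accel_up_ms2: float = 0.0) -> float:
    """Smallest sealed diameter such that delta_p * A >= SF * m * (g + a)."""
    if delta_p_pa <= 0:
        raise ValueError("delta_p_pa must be positive")
    required_force_n = safety_factor * mass_kg * (G + accel_up_ms2)
    area_m2 = required_force_n / delta_p_pa
    return 2.0 * math.sqrt(area_m2 / math.pi)


if __name__ == "__main__":
    d = min_cup_diameter_m(mass_kg=4.0, delta_p_pa=71_300.0)
    print(f"minimum sealed diameter: {d * 1000:.1f} mm")
```

For the 4 kg parcel at ΔP = 71.3 kPa and a 2× safety factor this gives a minimum sealed diameter of about 37 mm, consistent with the 60 mm cup having margin.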
Example B — Dex-Net 2.0 pipeline on a Jetson Orin
- Input: 640×480 depth image (Photoneo PhoXi or RealSense D455).
- Process: GQ-CNN ranks ~1000 antipodal candidates and returns the top k = 10.
- Latency: ≈ 200 ms end-to-end on a Jetson AGX Orin (up to 275 TOPS INT8).
- Outcome: top-1 grasp success ≈ 90 % on the Berkeley Adversarial Objects benchmark; ≈ 80 % on long-tail SKU mixes when augmented with Dex-Net 3.0 suction predictions.
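The ranking step can be sketched independently of the network. The code below assumes each candidate already carries a predicted success probability `q`; the `Grasp` record and `top_k_grasps` are hypothetical names and do not reproduce the GQ-CNN package API.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Grasp(NamedTuple):
    u: int             # pixel column of the grasp center
    v: int             # pixel row of the grasp center
    angle_rad: float   # gripper rotation in the image plane
    q: float           # predicted probability of success


def top_k_grasps(candidates: Iterable[Grasp], k: int = 10,
                 q_min: float = 0.0) -> List[Grasp]:
    """Return the k highest-scoring grasps with q >= q_min, best first."""
    feasible = (g for g in candidates if g.q >= q_min)
    return heapq.nlargest(k, feasible, key=lambda g: g.q)


if __name__ == "__main__":
    candidates = [Grasp(u=i % 640, v=i % 480, angle_rad=0.0, q=i / 1000)
                  for i in range(1000)]
    for g in top_k_grasps(candidates, k=3):
        print(g)
```

Keeping several ranked candidates instead of only the best one gives the motion planner fallbacks when the top grasp turns out to be unreachable or in collision.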
Example C — FoundationPose 6-DoF on unseen geometry
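- Inputs: RGB-D frame + object CAD mesh (or a few reference views in model-free mode).
- Outputs: 6-DoF pose (R, t) ∈ SE(3) plus confidence score.
- Benchmark: FoundationPose topped the BOP leaderboard for model-based pose estimation of unseen objects when it was published in 2024; the core BOP datasets include YCB-V, LM-O, and T-LESS.
- Latency: ≈ 30 ms per frame on an RTX 4090 and ≈ 120 ms on Jetson Orin in tracking mode; the initial pose estimate for a new object is considerably slower.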
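Once the object pose is known, a grasp defined in the object's CAD frame is moved into the camera frame by composing homogeneous transforms. A minimal NumPy sketch follows; the frame names (`T_cam_obj`, `T_obj_grasp`) and numeric values are illustrative.

```python
import numpy as np


def make_pose(R: np.ndarray, t) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def grasp_in_camera(T_cam_obj: np.ndarray, T_obj_grasp: np.ndarray) -> np.ndarray:
    """Compose object pose (camera frame) with a grasp pose (object frame)."""
    return T_cam_obj @ T_obj_grasp


if __name__ == "__main__":
    # Object rotated 90 degrees about the camera z-axis, 0.5 m away.
    Rz90 = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
    T_cam_obj = make_pose(Rz90, [0.1, 0.0, 0.5])
    # Grasp 5 cm along the object's x-axis, same orientation as the object.
    T_obj_grasp = make_pose(np.eye(3), [0.05, 0.0, 0.0])
    print(grasp_in_camera(T_cam_obj, T_obj_grasp))
```

One further multiplication by the hand-eye calibration transform (robot base to camera) expresses the grasp in the robot base frame for motion planning.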
5. Vision pipelines
5.1 Classical pipeline (still common in MUJIN, Photoneo BPS)
- Structured-light or active-stereo depth capture.
- Statistical outlier removal (PCL).
- Cluster segmentation (Euclidean / region-growing).
- CAD-driven ICP for 6-DoF pose.
- Heuristic grasp generation on object frame.
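The core of each ICP iteration in the CAD-driven step is a closed-form rigid alignment of paired points. The sketch below implements that inner step (the Kabsch/SVD solution) with NumPy, assuming correspondences are already known; a full ICP loop would re-estimate nearest-neighbor correspondences and repeat, which production code usually delegates to PCL or a similar library.

```python
import numpy as np


def kabsch(source: np.ndarray, target: np.ndarray):
    """Rigid (R, t) minimizing sum ||R @ s_i + t - t_i||^2 for paired Nx3 points."""
    src_c = source.mean(axis=0)
    tgt_c = target.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (source - src_c).T @ (target - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = rng.normal(size=(50, 3))            # stand-in for CAD sample points
    a = np.deg2rad(30.0)
    R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                       [np.sin(a),  np.cos(a), 0.0],
                       [0.0,        0.0,       1.0]])
    t_true = np.array([0.10, -0.05, 0.40])
    scene = model @ R_true.T + t_true           # noiseless "observed" points
    R, t = kabsch(model, scene)
    print(np.allclose(R, R_true), np.allclose(t, t_true))
```

Because ICP converges only locally, the classical pipeline depends on the clustering stage to provide a reasonable initial pose for each part.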
5.2 Deep-learning 2D detection + ICP refine
- Mask R-CNN (He 2017) or YOLOv8-Seg for instance masks.
- SAM 2 (Ravi 2024) for promptable refinement.
- ICP from CAD to masked point cloud.
5.3 Deep-learning grasp synthesis
| Method | Year | Output | Notes |
|---|---|---|---|
| Dex-Net 2.0 (Mahler) | 2017 | Antipodal grasp + Q | Synthetic-only training, depth in |
| Dex-Net 3.0 (Mahler) | 2018 | Suction grasp + Q | Compliant suction-contact model |
| GraspNet baseline (Fang) | 2020 | 6-DoF grasp set | Trained on GraspNet-1Billion: 1B+ grasp labels on 97 k real RGB-D images |
| Contact-GraspNet (Sundermeyer) | 2021 | Per-point 6-DoF | NVIDIA Isaac |
| AnyGrasp (Fang) | 2023 | Dense 6-DoF in clutter | Real-time, SOTA on GraspNet bench |
5.4 6-DoF pose estimation
| Method | Year | Input | Notes |
|---|---|---|---|
| PoseCNN (Xiang) | 2018 | RGB | Early end-to-end 6-DoF CNN; introduced YCB-Video |
| DenseFusion (Wang) | 2019 | RGB-D | Iterative refinement |
| MegaPose (Labbé) | 2022 | RGB + CAD | Zero-shot on unseen objects |
| FoundationPose (Wen) | 2024 | RGB-D + CAD/ref views | NVIDIA; topped the BOP leaderboard at publication |
5.5 Promptable / open-vocabulary
- SAM 2 + CLIP — Mask + language grounding for pick targets.
- OWL-ViT (Google 2022) — Open-vocabulary detection.
- RT-2 (Brohan 2023) — Vision-language-action transformer that emits robot actions (end-effector motion and gripper commands) as text tokens.
6. Hardware and integrations
6.1 3D cameras
| Sensor | Modality | Range | Notes |
|---|---|---|---|
| Intel RealSense D435i | Active stereo + IMU | 0.3–3 m | Cheap, ubiquitous, noisy on dark/translucent |
| Intel RealSense D455 | Active stereo, wide FOV | 0.4–6 m | Better baseline (95 mm) |
| Intel RealSense D456 | IP65 industrial | 0.4–6 m | Sealed for factory floor |
| Photoneo PhoXi 3D Scanner | Structured light, projected | 0.5–2.2 m | Industrial bin-pick standard |
| Photoneo MotionCam-3D | Structured light, dynamic | 0.5–2 m | Captures while moving |
| Zivid 2 / 2+ | Structured light | 0.3–2 m | Sub-mm accuracy, slow |
| Stereolabs ZED 2i | Passive stereo + IMU | 0.3–20 m | Outdoor capable |
| Orbbec Femto Bolt | ToF | 0.25–5 m | Azure Kinect successor |
6.2 End-effectors
| Gripper | Type | Payload | Notes |
|---|---|---|---|
| Robotiq 2F-85 | 2-finger parallel | 5 kg | UR/Fanuc/KUKA plug-and-play |
| Robotiq 2F-140 | 2-finger parallel, wide | 2.5 kg | Larger envelope |
| Robotiq EPick | Vacuum, 1 cup | 7 kg | Built-in pump |
| OnRobot RG2 / RG6 | 2-finger parallel | 2 / 6 kg | Adjustable stroke |
| OnRobot VG10 | Vacuum, 2 independent channels (up to 16 cups) | 15 kg | Dual-channel for variable items |
| Schunk EGP / EGK | Industrial parallel | 0.5–10 kg | Long-cycle reliability |
| Soft Robotics mGrip | Compliant fingers | 1–3 kg | Food / fragile |
| SAKE EZGripper | Underactuated finger | 1 kg | Research |
| Festo / Piab vacuum bellows | Multi-cup arrays | 1–30 kg | Parcel induction |
6.3 Robot arms
| Arm | Reach | Payload | Notes |
|---|---|---|---|
| UR5e | 850 mm | 5 kg | Most common pick-cell arm |
| UR10e | 1300 mm | 12.5 kg | Larger totes |
| UR20 | 1750 mm | 20 kg | Heavy parcel induction |
| Fanuc CRX-10iA | 1249 mm | 10 kg | Collaborative |
| Fanuc M-20iD/35 | 1831 mm | 35 kg | Industrial pick stations |
| KUKA Agilus KR 6 R900 | 901 mm | 6 kg | Fast, high-acceleration |
| ABB IRB 1200 | 901 mm / 703 mm | 5 kg / 7 kg | Compact, IP67 option; two reach-payload variants |
| Yaskawa GP25 | 1730 mm | 25 kg | High-speed pick |
6.4 Software stacks
- Covariant Brain — foundation manipulation model; in 2024 Amazon hired Covariant's founders and took a non-exclusive license to its models.
- RightHand RightPick3 — turnkey piece-picking station.
- MUJIN Controller — motion planner + bin-pick stack from MUJIN, founded in Tokyo in 2011; its planning lineage traces to OpenRAVE, developed at Carnegie Mellon in Pittsburgh.
- Berkshire Grey AI — gantry pick stations.
- OSARO SightWorks — vision + grasp planner.
- Photoneo Bin Picking Studio — CAD-driven classical pipeline with deep-learning add-ons.
7. Industry deployments
| Operator | System | Role / customers | Year |
|---|---|---|---|
| Amazon Robotics | Sparrow | Mixed-SKU tote-to-tote | 2022 reveal, 2023+ deploy |
| Amazon Robotics | Robin | Parcel induction (UR10 + suction) | 2021+ |
| Amazon Robotics | Cardinal | Package sortation | 2022+ |
| Symbotic | Gantry AS/RS + pick | Walmart, Albertsons | 2020+ |
| Berkshire Grey | Gantry pick stations | FedEx, Walmart | 2018+ |
| AutoStore | Cube AS/RS + pick add-on | Multi-tenant | 2018+ |
| Locus Robotics | AMR + pick station | Third-party logistics | 2017+ |
| Covariant (Amazon 2024) | Brain foundation model | ABB / KNAPP / Pitney Bowes | 2017–2024 |
| RightHand Robotics | RightPick3 | Pharma + apparel kitting | 2019+ |
| MUJIN | Pick worker | Fast Retailing / Uniqlo | 2018+ |
8. Edge cases and failure modes
- Translucent / reflective objects — clear plastic clamshells and mirrored items defeat IR active-stereo and structured-light; mitigations include polarization (Photoneo), neural shape priors (NeRF-based), and tactile retry.
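- Heavy items (> 2 kg) — a cobot's rated payload assumes the load's center of mass sits close to the flange; a long tool holding a multi-kilogram item can exceed wrist torque limits even below the nominal payload, so tool design should keep the combined center of mass close to the wrist.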
- Cluttered + occluded targets — top-k grasp planners need clutter-aware re-ranking (AnyGrasp explicitly models occlusion).
- Bin walls + collision avoidance — cuMotion / TrajOpt / OMPL must include bin-wall meshes; a 1 cm safety inflation is typical.
- Slip during pick — gripper geometry mismatch with object surface; tactile sensing (GelSight, Contactile PapillArray) closes the loop.
- Damage to fragile items — over-force; impedance control + soft grippers (Soft Robotics mGrip).
- Cycle time — 3 s/pick = 20 PPM is a common industrial target; Sparrow runs ~15 PPM, Symbotic gantries can hit 30+ PPM on parcels.
- Multi-item pick — accidental double-pick; weight-sensing or in-hand vision verification needed for inventory accuracy.
- Empty-bin detection — easily handled with depth statistics; matters for cycle-end signaling.
- Lighting changes — mitigated by active illumination (Photoneo) or by domain randomization during training.
- Suction failure on porous / perforated items — burlap, mesh bags, perforated boxes; fall back to parallel-jaw or multi-finger.
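The empty-bin check in the list above can be sketched with simple depth statistics. The code assumes a top-down camera (so smaller depth means closer to the camera and therefore above the bin floor), a calibrated floor depth, and zero as the missing-data marker; the thresholds and the name `bin_is_empty` are illustrative.

```python
from typing import Iterable


def bin_is_empty(depths_m: Iterable[float], floor_depth_m: float,
                 tol_m: float = 0.005,
                 max_fraction_above: float = 0.01) -> bool:
    """True if almost no valid pixel in the bin ROI sits above the floor plane."""
    valid = [d for d in depths_m if d > 0.0]   # drop missing-depth pixels
    if not valid:
        return False                           # no data: never declare empty
    above = sum(1 for d in valid if d < floor_depth_m - tol_m)
    return above / len(valid) <= max_fraction_above


if __name__ == "__main__":
    empty_roi = [1.000] * 100
    one_part = [1.000] * 90 + [0.950] * 10     # a 5 cm tall part in view
    print(bin_is_empty(empty_roi, floor_depth_m=1.0))
    print(bin_is_empty(one_part, floor_depth_m=1.0))
```

Treating an ROI with no valid depth as "not empty" is a deliberately conservative choice: reflective or translucent parts often return no depth at all, and signaling end-of-cycle on them would strand inventory.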
9. Tools and software stacks
9.1 Open source
- GQ-CNN (Berkeley AUTOLab) — reference Dex-Net implementation.
- GraspNet-1Billion baseline (Fang 2020) — public PyTorch repo and dataset.
- Acronym dataset (Eppner NVIDIA 2021) — 17.7 M grasps across 8872 objects.
- Contact-GraspNet (NVIDIA 2021) — public.
- PCL (Point Cloud Library) — ICP, segmentation.
- MoveIt 2 + ROS 2 — motion planning; learned grasp detectors such as GraspNet are typically wrapped as ROS 2 nodes that feed grasp poses to the planner.
9.2 Commercial
- MUJIN Motion Planner + Pick Station — turnkey.
- Photoneo Bin Picking Studio — pairs with PhoXi scanners.
- Pickit 3D — Belgian Photoneo competitor.
- Solomon AccuPick 3D — Taiwan; deep-learning grasping.
- NVIDIA Isaac Manipulator (2024) — FoundationPose + cuMotion + cuGripper, all GPU-accelerated.
9.3 Datasets
| Dataset | Year | Size | Use |
|---|---|---|---|
| Dex-Net 2.0 | 2017 | 6.7 M synthetic grasps | Parallel-jaw training |
| YCB-Video | 2018 | 92 vids, 21 objects | 6-DoF pose |
| GraspNet-1Billion | 2020 | 1 B grasps, 97 k images | 6-DoF grasp |
| Acronym | 2021 | 17.7 M grasps, 8872 objects | Sim grasp |
| BOP Challenge | 2017–2024 | Multiple | Pose benchmark |
10. Case studies
10.1 Amazon Sparrow (2022 reveal, 2023+ deploys)
Vacuum and parallel-jaw multi-modal end-effector on a 6-DoF arm. Disclosed to handle ~65 % of Amazon’s inventory diversity at peak. Vision is RGB-D plus learned grasp planning; the system explicitly trades raw speed for SKU coverage relative to dedicated parcel-induction cells like Robin.
10.2 Berkshire Grey at FedEx
Gantry-based pick stations for large and small parcel handling. The gantry geometry sidesteps the wrist-payload limit of articulated arms and allows direct top-down picks at sustained 20–30 PPM. Integrated with AGVs for tote routing.
10.3 Covariant Brain → Amazon (RSS 2022 + acquisition 2024)
Covariant pioneered a foundation manipulation model trained across deployments with ABB, KNAPP, Pitney Bowes, and others. In the 2024 deal, Amazon took a non-exclusive license to the model and hired its lead researchers (including Pieter Abbeel, Peter Chen, and Rocky Duan) along with part of the team into Amazon Robotics. The architecture follows a common pattern: vision-language transformer pretraining followed by manipulation-specific fine-tuning on millions of bin-picks.
11. Cross-references
- end-effectors — suction, parallel-jaw, multi-finger gripper design.
- manipulator-design — arm kinematics, wrist payload, reach.
- computer-vision-robotics — 6-DoF pose estimation and segmentation.
- sensors-perception — RGB-D, structured light, ToF.
- path-planning — collision-free trajectory inside the bin.
- impedance-control — grasping force regulation, soft contact.
- rl-for-control — sim-to-real domain randomization for grasp policies.
- microfluidics — vacuum-cup geometry parallels low-Re flow.
12. Citations
- Mahler et al. 2017, “Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics,” RSS.
- Mahler et al. 2018, “Dex-Net 3.0: Computing Robust Robot Suction Grasp Targets,” ICRA.
- Fang et al. 2020, “GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping,” CVPR.
- Fang et al. 2023, “AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains,” IEEE T-RO.
- Sundermeyer et al. 2021, “Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes,” ICRA.
- Wen et al. 2024, “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects,” CVPR.
- Labbé et al. 2022, “MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare,” CoRL.
- Xiang et al. 2018, “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes,” RSS.
- Wang et al. 2019, “DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion,” CVPR.
- Goldberg, Mahler et al., bin-picking and Dex-Net series, IROS / ICRA / RSS 2015–2023.
- Mason 2001, “Mechanics of Robotic Manipulation,” MIT Press.
- Bicchi & Kumar 2000, “Robotic Grasping and Contact: A Review,” ICRA.
- Salisbury 1982, “Kinematic and Force Analysis of Articulated Hands,” PhD thesis, Stanford University.
- Brohan et al. 2023, “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” CoRL.
- Tobin et al. 2017, “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World,” IROS.
- Bohg, Morales, Asfour & Kragic 2014, “Data-Driven Grasp Synthesis — A Survey,” IEEE T-RO.
- Eppner, Mousavian & Fox 2021, “ACRONYM: A Large-Scale Grasp Dataset Based on Simulation,” ICRA.
- Ravi et al. 2024, “SAM 2: Segment Anything in Images and Videos,” Meta AI.
- Radford et al. 2021, “Learning Transferable Visual Models From Natural Language Supervision (CLIP),” ICML.
- He et al. 2017, “Mask R-CNN,” ICCV.