Bin Picking — Dex-Net, FoundationPose, Grasp Planning

1. At a glance

Random bin picking is the task of grasping and removing parts from a jumbled, unstructured bin — long held as the “holy grail” of warehouse and factory automation because it combines perception under heavy clutter, grasp synthesis on unseen object instances, and reliable execution at industrial cycle times. The modern (2026) wave is dominated by:

  • Deep-learning grasp planners — Dex-Net 2.0/2.1 (Mahler 2017), GraspNet-1Billion (Fang 2020), AnyGrasp (Fang 2023), Contact-GraspNet (Sundermeyer 2021).
  • Foundation pose models — FoundationPose (Wen 2024, NVIDIA), MegaPose (Labbé 2022), NVIDIA cuGripper + Isaac Manipulator stack (2024).
  • End-effector diversity — vacuum suction, parallel-jaw, multi-finger soft grippers, hybrid tool-changers.
  • Major deployments — Amazon Robotics (Sparrow, Robin, Cardinal), Symbotic (gantry AS/RS), Berkshire Grey, AutoStore, Locus Robotics, Covariant (founders hired and models licensed by Amazon in 2024), RightHand Robotics, MUJIN.

2. Why it matters

E-commerce fulfillment, manufacturing kitting, and parcel induction are the dominant economic drivers. Amazon disclosed picking over 10 billion items/year robotically in its 2024 operations summary. Per-pick economics are driven by three coupled metrics:

  • Speed — target 10–20 picks/min for industrial cells; Sparrow operates near 15 PPM at peak.
  • Reliability — pick success ≥ 99% in tote-to-tote, ≥ 95% in mixed-SKU clutter.
  • Damage rate — typically < 0.1% to stay competitive with manual labor.

3. First principles

3.1 Grasp quality metrics

Metric | Definition | Use
Force-closure | Grasp can resist any external wrench in arbitrary direction | Multi-finger feasibility
Antipodal | Two parallel-jaw contacts whose normals are colinear and opposed | Parallel-jaw scoring
Wrench-resistance ratio | ε = min‖w‖ over disturbance set | Robust grasp ranking
Manipulability | √det(JJᵀ) at grasp pose | Arm-configuration filter
Reachability | Pose is inside arm workspace, no joint-limit violation | Hard feasibility
Q-value (Dex-Net) | P(success) under sensor + dynamics noise | Learned ranking
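
As a rough illustration of the antipodal row, the sketch below tests whether two contacts form an antipodal pair under a Coulomb friction cone: the line joining the contacts must lie inside both cones. The function name, the default friction coefficient `mu`, and the convention that normals point into the object are assumptions made for this example; they are not taken from Dex-Net or any other library.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _unit(v):
    n = math.sqrt(_dot(v, v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(x / n for x in v)

def _angle(a, b):
    # Angle between two unit vectors, clamped against rounding error.
    return math.acos(max(-1.0, min(1.0, _dot(a, b))))

def is_antipodal(p1, n1, p2, n2, mu=0.5):
    """Return True if contacts (p1, p2) form an antipodal pair.

    p1, p2: contact points (x, y, z) in metres.
    n1, n2: surface normals pointing INTO the object (assumed convention).
    mu:     Coulomb friction coefficient (illustrative default).
    """
    axis = _unit(_sub(p2, p1))       # grasp axis from contact 1 to contact 2
    half_angle = math.atan(mu)       # friction-cone half angle
    a1 = _angle(axis, _unit(n1))     # axis vs. inward normal at contact 1
    a2 = _angle(tuple(-c for c in axis), _unit(n2))  # reversed axis at contact 2
    return a1 <= half_angle and a2 <= half_angle
```

Two opposite faces of a box pass the test; tilting one normal by 45° fails it for `mu = 0.5`, because the cone half angle is only about 26.6°.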

3.2 Suction physics

Suction-cup lifting force on a sealed pad:

F = (P_atm − P_vac) × A_seal

with P_atm ≈ 101.3 kPa, P_vac the absolute pressure inside the cup, and A_seal the effective sealed area. A “70 % vacuum” cup pulls P_vac ≈ 30 kPa absolute, giving ΔP ≈ 71 kPa.

3.3 Vision pipeline (canonical)

RGB-D capture → instance segmentation → 6-DoF pose estimation
              → grasp planner → motion plan (MoveIt 2 / cuMotion)
              → execute → in-hand verification → place
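
The chain above can be sketched as an ordered list of stages with an early exit when any stage fails. Everything here is hypothetical scaffolding: the `PickResult` type, the `run_pick_cycle` function, and the convention that a stage returns `None` on failure are assumptions for illustration, and real stages would wrap a segmentation model, a pose estimator, a grasp planner, and a motion planner.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

# A stage is a (name, function) pair; the function returns None on failure.
Stage = Tuple[str, Callable[[Any], Any]]

@dataclass
class PickResult:
    success: bool
    failed_stage: str = ""
    trace: List[str] = field(default_factory=list)  # stages attempted, in order

def run_pick_cycle(frame: Any, stages: List[Stage]) -> PickResult:
    """Run the stages in order, stopping at the first one that returns None."""
    result = PickResult(success=True)
    data = frame
    for name, fn in stages:
        data = fn(data)
        result.trace.append(name)
        if data is None:
            result.success = False
            result.failed_stage = name
            break
    return result
```

Recording which stage failed makes it possible to separate perception failures from planning or execution failures when computing pick-success statistics.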

3.4 Open-vocabulary grasping

Language-conditioned pick (“pick up the red mug”) is enabled by combining:

  • SAM (Meta 2023) / SAM 2 (Meta 2024) for promptable segmentation.
  • CLIP (Radford 2021) for vision-language matching.
  • LLM or VLM planner (GPT-4o, Claude) to parse the natural-language request into a grasp target; vision-language-action models such as RT-2 fold this step into the policy itself.
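
The matching step can be sketched as a cosine-similarity search between one text embedding and one image embedding per segmented mask. This is a minimal sketch under stated assumptions: the embeddings are plain lists of floats standing in for the output of a CLIP-style encoder, and `select_target` and its `min_score` threshold are hypothetical names and values, not part of any CLIP or SAM API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def select_target(text_emb, mask_embs, min_score=0.2):
    """Pick the mask whose embedding best matches the text prompt.

    Returns (mask_index, score), or None when no mask clears min_score
    (illustrative threshold; tune on real data).
    """
    if not mask_embs:
        return None
    scores = [cosine(text_emb, m) for m in mask_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < min_score:
        return None
    return best, scores[best]
```

Returning `None` instead of the best weak match lets the cell ask for clarification or rescan rather than pick the wrong object.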

3.5 Sim-to-real

Two foundational papers drive industrial sim-to-real for grasping:

  • Tobin 2017 IROS — domain randomization (textures, lighting, camera pose) to train detectors in simulation that transfer to real cameras with no real-data fine-tuning.
  • Mahler 2017 RSS Dex-Net 2.0 — synthetic depth images of 6.7 M grasps over 1500 3D models; GQ-CNN trained purely in simulation, deployed without real fine-tune at ≈ 90 % success on Adversarial Objects.
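
Domain randomization amounts to drawing a fresh set of nuisance parameters for every rendered training scene. The sketch below shows only that sampling step; the `SceneParams` fields and every numeric range are illustrative assumptions, not values from Tobin et al. or Dex-Net, and a real setup would pass the sampled values to a renderer.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SceneParams:
    light_intensity: float    # relative to nominal lighting
    texture_id: int           # index into a pool of random textures
    cam_height_m: float       # camera height above the bin floor
    cam_tilt_deg: float       # camera tilt away from straight down
    depth_noise_std_m: float  # Gaussian depth-noise level for this scene

def sample_scene(rng: random.Random, n_textures: int = 1000) -> SceneParams:
    """Draw one randomized scene configuration (all ranges are assumptions)."""
    return SceneParams(
        light_intensity=rng.uniform(0.3, 1.5),
        texture_id=rng.randrange(n_textures),
        cam_height_m=rng.uniform(0.6, 0.9),
        cam_tilt_deg=rng.uniform(-10.0, 10.0),
        depth_noise_std_m=rng.uniform(0.0005, 0.003),
    )
```

Passing an explicit `random.Random` instance keeps dataset generation reproducible from a seed.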

4. Worked examples

Example A — Suction-cup sizing for a 4 kg parcel

Given: parcel mass m = 4 kg, target vacuum 70 % (P_vac ≈ 30 kPa abs), bellows cup outer Ø 60 mm.

A_seal ≈ π × (0.030 m)² = 2.83 × 10⁻³ m²
ΔP = 101.3 kPa − 30 kPa = 71.3 kPa = 7.13 × 10⁴ Pa
F_lift = ΔP × A_seal = 7.13 × 10⁴ × 2.83 × 10⁻³ ≈ 202 N

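Applying a 2× safety factor gives F_eff ≈ 202 / 2 ≈ 101 N. The parcel weight is W = m·g = 4 × 9.81 = 39.2 N, so F_eff ≈ 2.6 × W: the cup is adequately sized, with margin for acceleration during the lift trajectory (typical 0.3 g lateral). Because the calculation uses the cup's outer diameter, the effective sealed area, and hence the real margin, is somewhat smaller.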
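
The same sizing check can be written as a small calculator. This is a sketch under the example's own assumptions (seal area taken from the cup diameter, a fixed safety factor); the function names are hypothetical, and only vertical acceleration is modeled, since lateral acceleration loads the cup in shear and would need a friction model.

```python
import math

P_ATM_PA = 101_300.0  # standard atmospheric pressure, Pa
G = 9.81              # gravitational acceleration, m/s^2

def suction_force_n(cup_diameter_m, p_vac_abs_pa, p_atm_pa=P_ATM_PA):
    """F = (P_atm - P_vac) * A_seal, with A_seal from the given cup diameter."""
    area = math.pi * (cup_diameter_m / 2.0) ** 2
    return (p_atm_pa - p_vac_abs_pa) * area

def required_force_n(mass_kg, vertical_accel_g=0.0):
    """Holding force for the payload, including upward acceleration in g."""
    return mass_kg * G * (1.0 + vertical_accel_g)

def cup_is_adequate(cup_diameter_m, p_vac_abs_pa, mass_kg,
                    safety_factor=2.0, vertical_accel_g=0.0):
    """True if the derated suction force covers the required holding force."""
    usable = suction_force_n(cup_diameter_m, p_vac_abs_pa) / safety_factor
    return usable >= required_force_n(mass_kg, vertical_accel_g)
```

For the 60 mm cup at 30 kPa absolute, `suction_force_n(0.060, 30_000)` reproduces the ≈ 202 N figure, and `cup_is_adequate(0.060, 30_000, 4.0)` holds with the 2× safety factor.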

Example B — Dex-Net 2.0 pipeline on a Jetson Orin

Input: 640×480 depth image (Photoneo PhoXi or RealSense D455).
Process: GQ-CNN ranks ~1000 antipodal candidates and returns the top k = 10.
Latency: ≈ 200 ms end-to-end on a Jetson AGX Orin (up to 275 TOPS INT8).
Outcome: top-1 grasp success ≈ 90 % on the Berkeley Adversarial Objects benchmark; ≈ 80 % on long-tail SKU mixes when augmented with Dex-Net 3.0 suction predictions.
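
The ranking step reduces to filtering scored candidates and keeping the k best. The sketch below assumes the quality scores have already been produced by a network such as GQ-CNN; the `GraspCandidate` fields, the `q_min` cutoff, and the `is_feasible` hook are hypothetical and do not mirror the GQ-CNN package's API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class GraspCandidate:
    u: int            # pixel column of the grasp centre
    v: int            # pixel row of the grasp centre
    depth_m: float    # gripper depth along the camera axis
    angle_rad: float  # in-plane gripper rotation
    q: float          # predicted success probability in [0, 1]

def top_k_grasps(candidates: List[GraspCandidate], k: int = 10, q_min: float = 0.5,
                 is_feasible: Callable[[GraspCandidate], bool] = lambda g: True
                 ) -> List[GraspCandidate]:
    """Drop low-quality or infeasible candidates, then return the k best by q."""
    kept = [g for g in candidates if g.q >= q_min and is_feasible(g)]
    kept.sort(key=lambda g: g.q, reverse=True)
    return kept[:k]
```

Returning several grasps rather than only the best one gives the motion planner fallbacks when the top candidate turns out to be unreachable or in collision.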

Example C — FoundationPose 6-DoF on unseen geometry

Inputs: RGB-D frame + object CAD mesh (or a few reference views in model-free mode).
Outputs: 6-DoF pose (R, t) ∈ SE(3) plus a confidence score.
Benchmark: the FoundationPose paper reports state-of-the-art results in both model-based and model-free setups and the top spot on the BOP leaderboard for model-based novel-object pose estimation (as of March 2024); evaluated datasets include YCB-V, LM-O, and T-LESS.
Inference: ≈ 30 ms on RTX 4090, ≈ 120 ms on Jetson Orin.
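
A pose estimate becomes useful for grasping once it is chained with the camera calibration and a grasp defined in the object frame: T_base_grasp = T_base_cam · T_cam_obj · T_obj_grasp. The sketch below shows that chaining with plain 4×4 homogeneous matrices; the helper names and frame labels are assumptions for illustration and are unrelated to the FoundationPose code base.

```python
def make_pose(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation R and translation t."""
    return [list(map(float, R[i])) + [float(t[i])] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def compose(A, B):
    """Matrix product A @ B: apply B first, then A."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert(T):
    """Inverse of a rigid transform: rotation R^T, translation -R^T t."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return make_pose(Rt, t_inv)

def grasp_in_base(T_base_cam, T_cam_obj, T_obj_grasp):
    """Chain calibration, estimated object pose, and object-frame grasp."""
    return compose(compose(T_base_cam, T_cam_obj), T_obj_grasp)
```

Storing grasps in the object frame means they can be annotated once per CAD model and reused for every new pose estimate.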

5. Vision pipelines

5.1 Classical pipeline (still common in MUJIN, Photoneo BPS)

  1. Structured-light or active-stereo depth capture.
  2. Statistical outlier removal (PCL).
  3. Cluster segmentation (Euclidean / region-growing).
  4. CAD-driven ICP for 6-DoF pose.
  5. Heuristic grasp generation on object frame.

5.2 Deep-learning 2D detection + ICP refine

  • Mask R-CNN (He 2017) or YOLOv8-Seg for instance masks.
  • SAM 2 (Ravi 2024) for promptable refinement.
  • ICP from CAD to masked point cloud.
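
Before ICP can run, the instance mask has to be turned into a point cloud. Under a pinhole camera model with focal lengths (fx, fy) and principal point (cx, cy), a pixel (u, v) with depth z back-projects to x = (u − cx)·z / fx, y = (v − cy)·z / fy. The sketch below assumes depth in metres, nested-list images of equal shape, and an undistorted camera; the function name is hypothetical.

```python
def deproject_masked_depth(depth_m, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to camera-frame (x, y, z) points.

    depth_m: rows of depth values in metres (0 or negative means invalid).
    mask:    rows of truthy/falsy flags with the same shape as depth_m.
    """
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth_m, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if keep and z > 0.0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

Skipping non-positive depths matters in practice, because active-stereo sensors report holes as zero and those would otherwise collapse onto the camera origin.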

5.3 Deep-learning grasp synthesis

Method | Year | Output | Notes
Dex-Net 2.0 (Mahler) | 2017 | Antipodal grasp + Q | Synthetic-only training, depth in
Dex-Net 3.0 (Mahler) | 2018 | Suction grasp + Q | Compliant suction contact model
GraspNet baseline (Fang) | 2020 | 6-DoF grasp set | Trained on GraspNet-1Billion (1 B+ grasp annotations)
Contact-GraspNet (Sundermeyer) | 2021 | Per-point 6-DoF | NVIDIA Isaac
AnyGrasp (Fang) | 2023 | Dense 6-DoF in clutter | Real-time, SOTA on GraspNet bench

5.4 6-DoF pose estimation

Method | Year | Input | Notes
PoseCNN (Xiang) | 2018 | RGB | First end-to-end 6-DoF CNN
DenseFusion (Wang) | 2019 | RGB-D | Iterative refinement
MegaPose (Labbé) | 2022 | RGB + CAD | Zero-shot on unseen objects
FoundationPose (Wen) | 2024 | RGB-D + CAD/ref views | NVIDIA; topped the BOP leaderboard (March 2024)

5.5 Promptable / open-vocabulary

  • SAM 2 + CLIP — Mask + language grounding for pick targets.
  • OWL-ViT (Google 2022) — Open-vocabulary detection.
  • RT-2 (Brohan 2023) — Vision-language-action transformer that emits gripper-pose tokens.

6. Hardware and integrations

6.1 3D cameras

Sensor | Modality | Range | Notes
Intel RealSense D435i | Active stereo + IMU | 0.3–3 m | Cheap, ubiquitous, noisy on dark/translucent
Intel RealSense D455 | Active stereo, wide FOV | 0.4–6 m | Better baseline (95 mm)
Intel RealSense D456 | IP65 industrial | 0.4–6 m | Sealed for factory floor
Photoneo PhoXi 3D Scanner | Structured light, projected | 0.5–2.2 m | Industrial bin-pick standard
Photoneo MotionCam-3D | Structured light, dynamic | 0.5–2 m | Captures while moving
Zivid 2 / 2+ | Structured light | 0.3–2 m | Sub-mm accuracy, slow
Stereolabs ZED 2i | Passive stereo + IMU | 0.3–20 m | Outdoor capable
Orbbec Femto Bolt | ToF | 0.25–5 m | Azure Kinect successor

6.2 End-effectors

Gripper | Type | Payload | Notes
Robotiq 2F-85 | 2-finger parallel | 5 kg | UR/Fanuc/KUKA plug-and-play
Robotiq 2F-140 | 2-finger parallel, wide | 2.5 kg | Larger envelope
Robotiq EPick | Vacuum, 1 cup | 7 kg | Built-in pump
OnRobot RG2 / RG6 | 2-finger parallel | 2 / 6 kg | Adjustable stroke
OnRobot VG10 | Vacuum, 10 zones | 15 kg | Multi-zone for variable items
Schunk EGP / EGK | Industrial parallel | 0.5–10 kg | Long-cycle reliability
Soft Robotics mGrip | Compliant fingers | 1–3 kg | Food / fragile
SAKE EZGripper | Underactuated finger | 1 kg | Research
Festo / Piab vacuum bellows | Multi-cup arrays | 1–30 kg | Parcel induction

6.3 Robot arms

Arm | Reach | Payload | Notes
UR5e | 850 mm | 5 kg | Most common pick-cell arm
UR10e | 1300 mm | 12.5 kg | Larger totes
UR20 | 1750 mm | 20 kg | Heavy parcel induction
Fanuc CRX-10iA | 1249 mm | 10 kg | Collaborative
Fanuc M-20iD/35 | 1831 mm | 35 kg | Industrial pick stations
KUKA Agilus KR 6 R900 | 901 mm | 6 kg | Fast, high-acceleration
ABB IRB 1200 | 901 mm | 5 kg (7 kg on the 703 mm variant) | Compact, IP67 option
Yaskawa GP25 | 1730 mm | 25 kg | High-speed pick

6.4 Software stacks

  • Covariant Brain (Covariant's founders joined Amazon in 2024, and Amazon licensed the models) — foundation manipulation model.
  • RightHand RightPick3 — turnkey piece-picking station.
  • MUJIN Controller — motion planner + bin-pick stack, originated in Pittsburgh / Tokyo.
  • Berkshire Grey AI — gantry pick stations.
  • OSARO SightWorks — vision + grasp planner.
  • Photoneo Bin Picking Studio — CAD-driven classical pipeline with deep-learning add-ons.

7. Industry deployments

Operator | System | Role / customers | Year
Amazon Robotics | Sparrow | Mixed-SKU tote-to-tote | 2022 reveal, 2023+ deploy
Amazon Robotics | Robin | Parcel induction (UR10 + suction) | 2021+
Amazon Robotics | Cardinal | Package sortation | 2022+
Symbotic | Gantry AS/RS + pick | Walmart, Albertsons | 2020+
Berkshire Grey | Gantry pick stations | FedEx, Walmart | 2018+
AutoStore | Cube AS/RS + pick add-on | Multi-tenant | 2018+
Locus Robotics | AMR + pick station | Third-party logistics | 2017+
Covariant (Amazon 2024) | Brain foundation model | ABB / KNAPP / Pitney Bowes | 2017–2024
RightHand Robotics | RightPick3 | Pharma + apparel kitting | 2019+
MUJIN | Pick worker | Fast Retailing / Uniqlo | 2018+

8. Edge cases and failure modes

  • Translucent / reflective objects — clear plastic clamshells and mirrored items defeat IR active-stereo and structured-light; mitigations include polarization (Photoneo), neural shape priors (NeRF-based), and tactile retry.
  • Heavy items (> 2 kg) — can exceed the payload and wrist-torque budget of small cobot arms once gripper mass is included; the tool should keep the combined center of mass close to the wrist flange.
  • Cluttered + occluded targets — top-k grasp planners need clutter-aware re-ranking (AnyGrasp explicitly models occlusion).
  • Bin walls + collision avoidance — cuMotion / TrajOpt / OMPL must include bin-wall meshes; a 1 cm safety inflation is typical.
  • Slip during pick — gripper geometry mismatch with object surface; tactile sensing (GelSight, Contactile PapillArray) closes the loop.
  • Damage to fragile items — over-force; impedance control + soft grippers (Soft Robotics mGrip).
  • Cycle time — 3 s/pick = 20 PPM is a common industrial target; Sparrow runs ~15 PPM, Symbotic gantries can hit 30+ PPM on parcels.
  • Multi-item pick — accidental double-pick; weight-sensing or in-hand vision verification needed for inventory accuracy.
  • Empty-bin detection — easily handled with depth statistics; matters for cycle-end signaling.
  • Lighting changes — defeated by active illumination (Photoneo) or domain randomization in training.
  • Suction failure on porous / perforated items — burlap, mesh bags, perforated boxes; fall back to parallel-jaw or multi-finger.
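
Two of the checks above are simple enough to sketch directly: keeping a tool point clear of the bin walls with a safety inflation, and flagging a likely double-pick from a weight reading. Both functions are hypothetical illustrations. The bin is modeled as an axis-aligned, open-top box and only a single point is tested, whereas a real planner would check the full gripper mesh; the 25 % weight tolerance is an assumed value.

```python
def clear_of_bin_walls(point_xyz, bin_min_xyz, bin_max_xyz, inflation_m=0.01):
    """True if the point stays at least inflation_m from the side walls and floor.

    The bin is an axis-aligned box in the robot base frame with an open top,
    so there is no upper limit on z.
    """
    x, y, z = point_xyz
    x0, y0, z0 = bin_min_xyz
    x1, y1, _ = bin_max_xyz
    return (x0 + inflation_m <= x <= x1 - inflation_m
            and y0 + inflation_m <= y <= y1 - inflation_m
            and z >= z0 + inflation_m)

def is_double_pick(measured_kg, expected_kg, tolerance=0.25):
    """Flag a pick whose measured weight exceeds the expected SKU weight
    by more than the given fractional tolerance (assumed default)."""
    return measured_kg > expected_kg * (1.0 + tolerance)
```

A weight check of this kind only works when SKU weights are known and well separated from sensor noise; for very light items, in-hand vision is the more reliable verifier.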

9. Tools and software stacks

9.1 Open source

  • GQ-CNN (Berkeley AUTOLab) — reference Dex-Net implementation.
  • GraspNet-1Billion baseline (Fang 2020) — public PyTorch repo and dataset.
  • Acronym dataset (Eppner NVIDIA 2021) — 17 M grasps across 8872 objects.
  • Contact-GraspNet (NVIDIA 2021) — public.
  • PCL (Point Cloud Library) — ICP, segmentation.
  • MoveIt 2 + ROS 2 — motion planning with GraspNet plugin.

9.2 Commercial

  • MUJIN Motion Planner + Pick Station — turnkey.
  • Photoneo Bin Picking Studio — pairs with PhoXi scanners.
  • Pickit 3D — Belgian Photoneo competitor.
  • Solomon AccuPick 3D — Taiwan; deep-learning grasping.
  • NVIDIA Isaac Manipulator (2024) — FoundationPose + cuMotion + cuGripper, all GPU-accelerated.

9.3 Datasets

Dataset | Year | Size | Use
Dex-Net 2.0 | 2017 | 6.7 M synthetic grasps | Parallel-jaw training
YCB-Video | 2018 | 92 videos, 21 objects | 6-DoF pose
GraspNet-1Billion | 2020 | 1 B grasps, 97 k images | 6-DoF grasp
Acronym | 2021 | 17 M grasps, 8872 objects | Sim grasp
BOP Challenge | 2017–2024 | Multiple | Pose benchmark

10. Case studies

10.1 Amazon Sparrow (2022 reveal, 2023+ deploys)

Vacuum and parallel-jaw multi-modal end-effector on a 6-DoF arm. Disclosed to handle ~65 % of Amazon’s inventory diversity at peak. Vision is RGB-D plus learned grasp planning; the system explicitly trades raw speed for SKU coverage relative to dedicated parcel-induction cells like Robin.

10.2 Berkshire Grey at FedEx

Gantry-based pick stations for large and small parcel handling. The gantry geometry sidesteps the wrist-payload limit of articulated arms and allows direct top-down picks at sustained 20–30 PPM. Integrated with AGVs for tote routing.

10.3 Covariant Brain → Amazon (RSS 2022 + talent and licensing deal 2024)

Covariant pioneered a foundation manipulation model trained across deployments with ABB, KNAPP, Pitney Bowes, and others. The 2024 deal was a hire-and-license arrangement rather than an outright acquisition: Amazon took a non-exclusive license to the model and brought its lead researchers (Pieter Abbeel, Peter Chen, Rocky Duan, Tianhao Zhang) into Amazon Robotics. The architecture follows the pattern: vision-language transformer pretraining + manipulation-specific fine-tuning on millions of bin-picks.

11. Cross-references

12. Citations

  • Mahler et al. 2017, “Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics,” RSS.
  • Mahler et al. 2018, “Dex-Net 3.0: Computing Robust Robot Suction Grasp Targets,” ICRA.
  • Fang et al. 2020, “GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping,” CVPR.
  • Fang et al. 2023, “AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains,” IEEE T-RO.
  • Sundermeyer et al. 2021, “Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes,” ICRA.
  • Wen et al. 2024, “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects,” CVPR.
  • Labbé et al. 2022, “MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare,” CoRL.
  • Xiang et al. 2018, “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes,” RSS.
  • Wang et al. 2019, “DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion,” CVPR.
  • Goldberg, Mahler et al., bin-picking and Dex-Net series, IROS / ICRA / RSS 2015–2023.
  • Mason 2001, “Mechanics of Robotic Manipulation,” MIT Press.
  • Bicchi & Kumar 2000, “Robotic Grasping and Contact: A Review,” ICRA.
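  • Salisbury 1982, “Kinematic and Force Analysis of Articulated Hands,” PhD thesis, Stanford University (reprinted in Mason & Salisbury 1985, “Robot Hands and the Mechanics of Manipulation,” MIT Press).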
  • Brohan et al. 2023, “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” CoRL.
  • Tobin et al. 2017, “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World,” IROS.
  • Bohg, Morales, Asfour & Kragic 2014, “Data-Driven Grasp Synthesis — A Survey,” IEEE T-RO.
  • Eppner, Mousavian & Fox 2021, “ACRONYM: A Large-Scale Grasp Dataset Based on Simulation,” ICRA.
  • Ravi et al. 2024, “SAM 2: Segment Anything in Images and Videos,” Meta AI.
  • Radford et al. 2021, “Learning Transferable Visual Models From Natural Language Supervision (CLIP),” ICML.
  • He et al. 2017, “Mask R-CNN,” ICCV.