Bin Picking — Dex-Net, FoundationPose, Grasp Planning

1. At a glance

Random bin picking is the task of grasping and removing parts from a jumbled, unstructured bin — long held as the “holy grail” of warehouse and factory automation because it combines perception under heavy clutter, grasp synthesis on unseen object instances, and reliable execution at industrial cycle times. The modern (2026) wave is dominated by:

Deep-learning grasp planners — Dex-Net 2.0/2.1 (Mahler 2017), GraspNet-1Billion (Fang 2020), AnyGrasp (Fang 2023), Contact-GraspNet (Sundermeyer 2021).
Foundation pose models — FoundationPose (Wen 2024, NVIDIA), MegaPose (Labbé 2022), NVIDIA cuGripper + Isaac Manipulator stack (2024).
End-effector diversity — vacuum suction, parallel-jaw, multi-finger soft grippers, hybrid tool-changers.
Major deployments: Amazon Robotics (Sparrow, Robin, Cardinal), Symbotic (gantry AS/RS), Berkshire Grey, AutoStore, Locus Robotics, Covariant (founders hired and models licensed by Amazon in 2024), RightHand Robotics, MUJIN.

2. Why it matters

E-commerce fulfillment, manufacturing kitting, and parcel induction are the dominant economic drivers. Amazon disclosed picking over 10 billion items/year robotically in its 2024 operations summary. Per-pick economics are driven by three coupled metrics:

Speed — target 10–20 picks/min for industrial cells; Sparrow operates near 15 PPM at peak.
Reliability — pick success ≥ 99% in tote-to-tote, ≥ 95% in mixed-SKU clutter.
Damage rate — typically < 0.1% to stay competitive with manual labor.

3. First principles

3.1 Grasp quality metrics

Metric	Definition	Use
Force-closure	Grasp can resist any external wrench in arbitrary direction	Multi-finger feasibility
Antipodal	Two parallel-jaw contacts whose normals are colinear and opposed	Parallel-jaw scoring
Wrench-resistance ratio (Ferrari-Canny ε)	ε = min‖w‖ over the boundary of the grasp wrench space	Robust grasp ranking
Manipulability	√det(JJᵀ) at grasp pose	Arm-configuration filter
Reachability	Pose is inside arm workspace, no joint-limit violation	Hard feasibility
Q-value (Dex-Net)	P(success) under sensor + dynamics noise	Learned ranking
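
Two of the metrics in the table are simple enough to compute directly. The sketch below is illustrative only: `is_antipodal` and `manipulability` are hypothetical helper names, the contact normals are assumed to be outward unit vectors, and the friction coefficient `mu = 0.5` is an assumed default rather than a value from any cited planner.

```python
import numpy as np


def is_antipodal(p1, n1, p2, n2, mu=0.5):
    """Friction-cone antipodal test for two point contacts.

    p1, p2 are contact positions; n1, n2 are outward unit surface normals.
    The grasp passes when the line joining the contacts lies inside both
    friction cones, whose half-angle is atan(mu).
    """
    p1, n1, p2, n2 = (np.asarray(v, dtype=float) for v in (p1, n1, p2, n2))
    axis = p2 - p1
    axis = axis / np.linalg.norm(axis)
    half_angle = np.arctan(mu)
    # The finger at p1 pushes along +axis, i.e. against the outward normal n1.
    a1 = np.arccos(np.clip(np.dot(-n1, axis), -1.0, 1.0))
    # The finger at p2 pushes along -axis, against the outward normal n2.
    a2 = np.arccos(np.clip(np.dot(-n2, -axis), -1.0, 1.0))
    return bool(a1 <= half_angle and a2 <= half_angle)


def manipulability(J):
    """Yoshikawa manipulability sqrt(det(J J^T)) for a Jacobian J."""
    J = np.asarray(J, dtype=float)
    # Clamp tiny negative determinants caused by floating-point round-off.
    return float(np.sqrt(max(np.linalg.det(J @ J.T), 0.0)))
```

A planner would typically use the first function to filter candidate contact pairs and the second to discard arm configurations that are close to singular at the grasp pose.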

3.2 Suction physics

Suction-cup lifting force on a sealed pad:

F = (P_atm − P_vac) × A_seal

with P_atm ≈ 101.3 kPa, P_vac the absolute pressure inside the cup, and A_seal the effective sealed area. A “70 % vacuum” cup pulls P_vac ≈ 30 kPa absolute, giving ΔP ≈ 71 kPa.
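
The relation above can be evaluated directly. This is a minimal sketch: `suction_force` and its parameter names are illustrative, and the seal is assumed to be a full circle of the given diameter with no leakage.

```python
import math


def suction_force(p_vac_abs_pa, seal_diameter_m, p_atm_pa=101_300.0):
    """Ideal holding force F = (P_atm - P_vac) * A_seal, in newtons.

    p_vac_abs_pa is the absolute pressure inside the cup, in pascals.
    seal_diameter_m is the diameter of the sealed circular area, in metres.
    """
    area = math.pi * (seal_diameter_m / 2.0) ** 2
    return (p_atm_pa - p_vac_abs_pa) * area


# "70 % vacuum" (about 30 kPa absolute) on a 60 mm seal.
force_n = suction_force(30_000.0, 0.060)
```

With these inputs the result is roughly 200 N, before any safety factor or leakage allowance is applied.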

3.3 Vision pipeline (canonical)

RGB-D capture → instance segmentation → 6-DoF pose estimation
              → grasp planner → motion plan (MoveIt 2 / cuMotion)
              → execute → in-hand verification → place
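
The stage ordering above can be sketched as a small orchestration loop. Everything here is hypothetical scaffolding: the `PickCycle` class, its field names, and the returned status strings are assumptions for illustration, and each stage is passed in as a plain callable so that any detector, pose estimator, or planner could be substituted.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PickCycle:
    capture: Callable[[], object]                      # RGB-D capture
    segment: Callable[[object], Sequence[object]]      # instance segmentation
    estimate_pose: Callable[[object, object], object]  # 6-DoF pose estimation
    plan_grasp: Callable[[object], Optional[object]]   # grasp planner
    plan_motion: Callable[[object], Optional[object]]  # motion planner
    execute: Callable[[object], bool]
    verify_in_hand: Callable[[], bool]
    place: Callable[[], None]

    def run_once(self) -> str:
        """Run one pick attempt and return a status string."""
        frame = self.capture()
        for instance in self.segment(frame):
            pose = self.estimate_pose(frame, instance)
            grasp = self.plan_grasp(pose)
            if grasp is None:
                continue  # no feasible grasp on this instance, try the next
            trajectory = self.plan_motion(grasp)
            if trajectory is None:
                continue  # grasp is unreachable or in collision
            if not self.execute(trajectory):
                return "execution_failed"
            if not self.verify_in_hand():
                return "verification_failed"
            self.place()
            return "placed"
        return "no_feasible_grasp"
```

Falling through to the next segmented instance when grasp or motion planning fails is one possible policy; a production cell might instead re-capture the scene after every failed attempt.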

3.4 Open-vocabulary grasping

Language-conditioned pick (“pick up the red mug”) is enabled by combining:

SAM (Meta 2023) / SAM 2 (Meta 2024) for promptable segmentation.
CLIP (Radford 2021) for vision-language matching.
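LLM or VLM planner (e.g., GPT-4o, Claude) to parse the natural-language request into a grasp target; vision-language-action models such as RT-2 instead map the instruction directly to robot actions.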
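
Once segmentation has produced candidate masks and a vision-language model has embedded both the mask crops and the text request, target selection reduces to a similarity search. The sketch below assumes the embeddings are already available as plain arrays; `select_target` and the `min_score` threshold are hypothetical, and no real SAM or CLIP API is called.

```python
import numpy as np


def select_target(mask_embeddings, text_embedding, min_score=0.2):
    """Pick the mask whose embedding best matches the text embedding.

    Returns (index, scores); index is None when no mask reaches min_score.
    Scores are cosine similarities, so inputs need not be pre-normalised.
    """
    m = np.asarray(mask_embeddings, dtype=float)
    t = np.asarray(text_embedding, dtype=float)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    t = t / np.linalg.norm(t)
    scores = m @ t
    best = int(np.argmax(scores))
    if scores[best] < min_score:
        return None, scores
    return best, scores
```

Returning `None` when nothing matches lets the caller ask for clarification instead of picking an arbitrary object.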

3.5 Sim-to-real

Two foundational papers drive industrial sim-to-real for grasping:

Tobin 2017 IROS — domain randomization (textures, lighting, camera pose) to train detectors in simulation that transfer to real cameras with no real-data fine-tuning.
Mahler 2017 RSS Dex-Net 2.0 — synthetic depth images of 6.7 M grasps over 1500 3D models; GQ-CNN trained purely in simulation, deployed without real fine-tune at ≈ 90 % success on Adversarial Objects.
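
Domain randomization amounts to sampling a fresh set of nuisance parameters for every rendered training scene. The sketch below is illustrative: `SceneParams`, `sample_scene`, and all numeric ranges are assumptions chosen for the example, not values from Tobin 2017 or Dex-Net 2.0.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    texture_id: int            # index into a pool of random textures
    light_intensity: float     # relative to a nominal lamp setting
    camera_offset_m: tuple     # xyz perturbation of the camera pose
    depth_noise_std_m: float   # Gaussian depth noise added after rendering


def sample_scene(rng, n_textures=1000):
    """Draw one randomized scene configuration from assumed ranges."""
    return SceneParams(
        texture_id=rng.randrange(n_textures),
        light_intensity=rng.uniform(0.3, 1.5),
        camera_offset_m=tuple(rng.uniform(-0.05, 0.05) for _ in range(3)),
        depth_noise_std_m=rng.uniform(0.0, 0.002),
    )


# A seeded generator makes the synthetic dataset reproducible.
rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(4)]
```

Each sampled configuration would be handed to a renderer or simulator; widening the ranges trades per-scene realism for robustness to the real camera and lighting.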

4. Worked examples

Example A — Suction-cup sizing for a 4 kg parcel

Given: parcel mass m = 4 kg, target vacuum 70 % (P_vac ≈ 30 kPa abs), bellows cup outer Ø 60 mm.

A_seal ≈ π × (0.030 m)² = 2.83 × 10⁻³ m²
ΔP = 101.3 kPa − 30 kPa = 71.3 kPa = 7.13 × 10⁴ Pa
F_lift = ΔP × A_seal = 7.13 × 10⁴ × 2.83 × 10⁻³ ≈ 202 N

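Applying a 2× safety factor gives F_eff ≈ 100 N. The parcel weight is W = m·g = 4 × 9.81 = 39.2 N. F_eff is about 2.5 × W, so the cup is sized correctly, with margin for acceleration during the lift trajectory (typically about 0.3 g lateral). Because A_seal is computed from the outer diameter and the effective sealing diameter of a bellows cup is somewhat smaller, treat 202 N as an upper bound.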
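
The lateral-acceleration margin can be checked with a sizing rule of the kind quoted in vacuum-gripper vendor guides for a horizontal cup face moved sideways, F = m × (g + a/μ) × S. This is a hedged sketch: `required_holding_force` is a hypothetical name, and the friction coefficient μ = 0.5 between cup and parcel is an assumed value that varies widely with surface material.

```python
G = 9.81  # gravitational acceleration, m/s^2


def required_holding_force(mass_kg, lateral_accel, mu, safety_factor=1.0):
    """Holding force needed for a top pick that is moved sideways.

    The normal force must carry the weight and also generate enough
    friction (mu * normal force) to resist the lateral inertial load.
    """
    return mass_kg * (G + lateral_accel / mu) * safety_factor


# Example A: 4 kg parcel, 0.3 g lateral acceleration, assumed mu = 0.5.
needed_n = required_holding_force(4.0, 0.3 * G, 0.5)
needed_with_margin_n = required_holding_force(4.0, 0.3 * G, 0.5, safety_factor=2.0)
```

Under these assumptions the requirement is about 63 N without a safety factor and about 126 N with a 2× factor, both below the ideal 202 N lift force computed above.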

Example B — Dex-Net 2.0 pipeline on a Jetson Orin

Input: 640×480 depth image (Photoneo PhoXi or RealSense D455).
Process: GQ-CNN ranks ~1000 antipodal candidates and returns the top k = 10.
Latency: ≈ 200 ms end-to-end on a Jetson AGX Orin (rated at up to 275 TOPS INT8).
Outcome: top-1 grasp success ≈ 90 % on the Berkeley Adversarial Objects benchmark; ≈ 80 % on long-tail SKU mixes when augmented with Dex-Net 3.0 suction predictions.
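
The final ranking step is independent of the network that produced the scores. A minimal sketch, assuming the per-candidate quality values are already available as an array; `top_k_grasps` is a hypothetical helper, not part of the GQ-CNN package.

```python
import numpy as np


def top_k_grasps(q_values, k=10):
    """Return indices of the k highest-scoring candidates, best first."""
    q = np.asarray(q_values, dtype=float)
    k = min(k, q.size)
    if k == 0:
        return np.array([], dtype=int)
    # argpartition finds the k best in O(n); only those k are then sorted.
    idx = np.argpartition(-q, k - 1)[:k]
    return idx[np.argsort(-q[idx])]
```

For roughly a thousand candidates the difference from a full sort is negligible, but the partial selection scales better if the candidate set grows.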

Example C — FoundationPose 6-DoF on unseen geometry

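Inputs: RGB-D frame + object CAD mesh (or a few reference views in model-free mode).
Outputs: 6-DoF pose (R, t) ∈ SE(3) plus confidence score.
Benchmark: FoundationPose reports state-of-the-art results in both model-based and model-free settings and ranked first on the BOP leaderboard for novel-object pose estimation around its 2024 publication, on datasets including YCB-V, LM-O, and T-LESS.
Latency: ≈ 30 ms on RTX 4090, ≈ 120 ms on Jetson Orin; figures of this order apply to per-frame tracking and refinement, while the initial pose estimate with full hypothesis scoring is slower.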
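
A pose estimate becomes useful for picking once it is chained with a grasp defined in the object frame and with the camera-to-robot calibration. The sketch below uses 4×4 homogeneous transforms; the function names and the `T_a_b` naming convention (pose of frame b expressed in frame a) are assumptions for illustration.

```python
import numpy as np


def make_transform(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(R, dtype=float)
    T[:3, 3] = np.asarray(t, dtype=float)
    return T


def grasp_in_base(T_base_cam, T_cam_obj, T_obj_grasp):
    """Chain calibration, estimated object pose, and object-frame grasp."""
    return T_base_cam @ T_cam_obj @ T_obj_grasp


# Object rotated 90 degrees about z and located 0.8 m in front of the camera.
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T_cam_obj = make_transform(Rz90, [0.5, 0.0, 0.8])
# Grasp offset 0.1 m along the object's x axis, same orientation as the object.
T_obj_grasp = make_transform(np.eye(3), [0.1, 0.0, 0.0])
T_base_grasp = grasp_in_base(np.eye(4), T_cam_obj, T_obj_grasp)
```

Here the camera frame is taken to coincide with the robot base (`np.eye(4)`) purely to keep the example short; a real cell would insert its hand-eye calibration result.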

5. Vision pipelines

5.1 Classical pipeline (still common in MUJIN, Photoneo BPS)

Structured-light or active-stereo depth capture.
Statistical outlier removal (PCL).
Cluster segmentation (Euclidean / region-growing).
CAD-driven ICP for 6-DoF pose.
Heuristic grasp generation on object frame.
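
The CAD-driven ICP step can be sketched in a few lines of NumPy. This is a minimal point-to-point variant with brute-force nearest-neighbour matching, written for clarity rather than speed; production pipelines would normally use PCL or a comparable library with k-d trees, outlier rejection, and a good initial guess. The function names are illustrative.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src_i + t ≈ dst_i (Kabsch / SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1) rather than a reflection.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, dst_c - R @ src_c


def icp(model, scene, iterations=20):
    """Align model points to scene points; returns the accumulated R, t."""
    model = np.asarray(model, dtype=float)
    scene = np.asarray(scene, dtype=float)
    R_total, t_total = np.eye(3), np.zeros(3)
    current = model.copy()
    for _ in range(iterations):
        # Match every model point to its nearest scene point (O(N*M) memory).
        d2 = ((current[:, None, :] - scene[None, :, :]) ** 2).sum(axis=2)
        matched = scene[d2.argmin(axis=1)]
        R, t = best_rigid_transform(current, matched)
        current = current @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

ICP only converges to the correct pose when the starting misalignment is small relative to the object's feature spacing, which is why classical pipelines seed it from a coarse cluster or template match.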

5.2 Deep-learning 2D detection + ICP refine

Mask R-CNN (He 2017) or YOLOv8-Seg for instance masks.
SAM 2 (Ravi 2024) for promptable refinement.
ICP from CAD to masked point cloud.
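
Before ICP can run, the 2D instance mask has to be lifted into a 3D point cloud. A minimal sketch using the pinhole camera model; `masked_depth_to_points` is a hypothetical helper, and the intrinsics `fx, fy, cx, cy` are assumed to come from the camera's calibration with depth already aligned to the mask's image.

```python
import numpy as np


def masked_depth_to_points(depth_m, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to an (N, 3) camera-frame cloud.

    depth_m is an (H, W) array in metres; zero depth marks invalid pixels.
    mask is an (H, W) boolean instance mask aligned with the depth image.
    """
    depth_m = np.asarray(depth_m, dtype=float)
    valid = np.asarray(mask, dtype=bool) & (depth_m > 0)
    v, u = np.nonzero(valid)  # row index is v, column index is u
    z = depth_m[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

The resulting cloud contains only the pixels of one instance, which keeps the subsequent CAD alignment from latching onto neighbouring parts.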

5.3 Deep-learning grasp synthesis

Method	Year	Output	Notes
Dex-Net 2.0 (Mahler)	2017	Antipodal grasp + Q	Synthetic-only training, depth in
Dex-Net 3.0 (Mahler)	2018	Suction grasp + Q	Compliant suction-contact model
GraspNet baseline (Fang)	2020	6-DoF grasp set	1B-grasp dataset on real RGB-D scenes
Contact-GraspNet (Sundermeyer)	2021	Per-point 6-DoF	NVIDIA Isaac
AnyGrasp (Fang)	2023	Dense 6-DoF in clutter	Real-time, SOTA on GraspNet bench

5.4 6-DoF pose estimation

Method	Year	Input	Notes
PoseCNN (Xiang)	2018	RGB	First end-to-end 6-DoF CNN
DenseFusion (Wang)	2019	RGB-D	Iterative refinement
MegaPose (Labbé)	2022	RGB + CAD	Zero-shot on unseen objects
FoundationPose (Wen)	2024	RGB-D + CAD/ref views	NVIDIA; top of BOP leaderboard at publication

5.5 Promptable / open-vocabulary

SAM 2 + CLIP — Mask + language grounding for pick targets.
OWL-ViT (Google 2022) — Open-vocabulary detection.
RT-2 (Brohan 2023) — Vision-language-action transformer that emits gripper-pose tokens.

6. Hardware and integrations

6.1 3D cameras

Sensor	Modality	Range	Notes
Intel RealSense D435i	Active stereo + IMU	0.3–3 m	Cheap, ubiquitous, noisy on dark/translucent
Intel RealSense D455	Active stereo, wide FOV	0.4–6 m	Better baseline (95 mm)
Intel RealSense D456	IP65 industrial	0.4–6 m	Sealed for factory floor
Photoneo PhoXi 3D Scanner	Structured light, projected	0.5–2.2 m	Industrial bin-pick standard
Photoneo MotionCam-3D	Structured light, dynamic	0.5–2 m	Captures while moving
Zivid 2 / 2+	Structured light	0.3–2 m	Sub-mm accuracy, slow
Stereolabs ZED 2i	Passive stereo + IMU	0.3–20 m	Outdoor capable
Orbbec Femto Bolt	ToF	0.25–5 m	Azure Kinect successor

6.2 End-effectors

Gripper	Type	Payload	Notes
Robotiq 2F-85	2-finger parallel	5 kg	UR/Fanuc/KUKA plug-and-play
Robotiq 2F-140	2-finger parallel, wide	2.5 kg	Larger envelope
Robotiq EPick	Vacuum, 1 cup	7 kg	Built-in pump
OnRobot RG2 / RG6	2-finger parallel	2 / 6 kg	Adjustable stroke
OnRobot VG10	Vacuum, 2 independent channels	15 kg	Dual-zone gripping for variable items
Schunk EGP / EGK	Industrial parallel	0.5–10 kg	Long-cycle reliability
Soft Robotics mGrip	Compliant fingers	1–3 kg	Food / fragile
SAKE EZGripper	Underactuated finger	1 kg	Research
Festo / Piab vacuum bellows	Multi-cup arrays	1–30 kg	Parcel induction

6.3 Robot arms

Arm	Reach	Payload	Notes
UR5e	850 mm	5 kg	Most common pick-cell arm
UR10e	1300 mm	12.5 kg	Larger totes
UR20	1750 mm	20 kg	Heavy parcel induction
Fanuc CRX-10iA	1249 mm	10 kg	Collaborative
Fanuc M-20iD/35	1831 mm	35 kg	Industrial pick stations
KUKA Agilus KR 6 R900	901 mm	6 kg	Fast, high-acceleration
ABB IRB 1200	703 / 901 mm	7 / 5 kg	Compact, IP67 option
Yaskawa GP25	1730 mm	25 kg	High-speed pick

6.4 Software stacks

Covariant Brain (licensed to Amazon since 2024, when Amazon also hired Covariant’s founders): foundation manipulation model.
RightHand RightPick3 — turnkey piece-picking station.
MUJIN MUJIN Controller — motion planner + bin-pick stack, originated in Pittsburgh / Tokyo.
Berkshire Grey AI — gantry pick stations.
OSARO SightWorks — vision + grasp planner.
Photoneo Bin Picking Studio — CAD-driven classical pipeline with deep-learning add-ons.

7. Industry deployments

Operator	System	Role	Year
Amazon Robotics	Sparrow	Mixed-SKU tote-to-tote	2022 reveal, 2023+ deploy
Amazon Robotics	Robin	Parcel induction (UR10 + suction)	2021+
Amazon Robotics	Cardinal	Package sortation	2022+
Symbotic	Gantry AS/RS + pick	Walmart, Albertsons	2020+
Berkshire Grey	Gantry pick stations	FedEx, Walmart	2018+
AutoStore	Cube AS/RS + pick add-on	Multi-tenant	2018+
Locus Robotics	AMR + pick station	Third-party logistics	2017+
Covariant (Amazon 2024)	Brain foundation model	ABB / KNAPP / Pitney Bowes	2017–2024
RightHand Robotics	RightPick3	Pharma + apparel kitting	2019+
MUJIN	Pick worker	Fast Retailing / Uniqlo	2018+

8. Edge cases and failure modes

Translucent / reflective objects — clear plastic clamshells and mirrored items defeat IR active-stereo and structured-light; mitigations include polarization (Photoneo), neural shape priors (NeRF-based), and tactile retry.
Heavy items (> 2 kg): can exceed the wrist torque budget of small cobot arms once gripper mass and moment arm are included; tool design should keep the combined tool-plus-payload center of mass close to the wrist flange.
Cluttered + occluded targets — top-k grasp planners need clutter-aware re-ranking (AnyGrasp explicitly models occlusion).
Bin walls + collision avoidance — cuMotion / TrajOpt / OMPL must include bin-wall meshes; a 1 cm safety inflation is typical.
Slip during pick — gripper geometry mismatch with object surface; tactile sensing (GelSight, Contactile PapillArray) closes the loop.
Damage to fragile items — over-force; impedance control + soft grippers (Soft Robotics mGrip).
Cycle time — 3 s/pick = 20 PPM is a common industrial target; Sparrow runs ~15 PPM, Symbotic gantries can hit 30+ PPM on parcels.
Multi-item pick — accidental double-pick; weight-sensing or in-hand vision verification needed for inventory accuracy.
Empty-bin detection — easily handled with depth statistics; matters for cycle-end signaling.
Lighting changes — defeated by active illumination (Photoneo) or domain randomization in training.
Suction failure on porous / perforated items — burlap, mesh bags, perforated boxes; fall back to parallel-jaw or multi-finger.
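
The bin-wall item above can be illustrated with a simple geometric pre-check. This sketch treats the bin as an axis-aligned box in its own frame with an open top and applies the 1 cm inflation mentioned in the list; `clears_bin_walls` is a hypothetical helper and does not replace mesh-based collision checking inside the motion planner.

```python
import numpy as np


def clears_bin_walls(points, bin_min, bin_max, inflation_m=0.01):
    """Check that tool points keep inflation_m clearance from walls and floor.

    points is (N, 3) in the bin frame; bin_min and bin_max are the inner
    corners of the bin. Points above the inflated rim are unconstrained
    because the top of the bin is open.
    """
    pts = np.atleast_2d(np.asarray(points, dtype=float))
    bin_min = np.asarray(bin_min, dtype=float)
    bin_max = np.asarray(bin_max, dtype=float)
    below_rim = pts[:, 2] < bin_max[2] + inflation_m
    xy_ok = np.all(
        (pts[:, :2] >= bin_min[:2] + inflation_m)
        & (pts[:, :2] <= bin_max[:2] - inflation_m),
        axis=1,
    )
    z_ok = pts[:, 2] >= bin_min[2] + inflation_m
    return bool(np.all(~below_rim | (xy_ok & z_ok)))
```

A cheap test like this can reject grasp candidates near the walls before they are sent to the planner, which saves planning time on attempts that would fail anyway.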

9. Tools and software stacks

9.1 Open source

GQ-CNN (Berkeley AUTOLab) — reference Dex-Net implementation.
GraspNet-1Billion baseline (Fang 2020) — public PyTorch repo and dataset.
Acronym dataset (Eppner NVIDIA 2021) — 17 M grasps across 8872 objects.
Contact-GraspNet (NVIDIA 2021) — public.
PCL (Point Cloud Library) — ICP, segmentation.
MoveIt 2 + ROS 2: motion planning; learned grasp planners such as GraspNet are typically wired in through a custom node or plugin.

9.2 Commercial

MUJIN Motion Planner + Pick Station — turnkey.
Photoneo Bin Picking Studio — pairs with PhoXi scanners.
Pickit 3D — Belgian Photoneo competitor.
Solomon AccuPick 3D — Taiwan; deep-learning grasping.
NVIDIA Isaac Manipulator (2024) — FoundationPose + cuMotion + cuGripper, all GPU-accelerated.

9.3 Datasets

Dataset	Year	Size	Use
Dex-Net 2.0	2017	6.7 M synthetic grasps	Parallel-jaw training
YCB-Video	2018	92 vids, 21 objects	6-DoF pose
GraspNet-1Billion	2020	1 B grasps, 97 k images	6-DoF grasp
Acronym	2021	17 M grasps, 8872 objects	Sim grasp
BOP Challenge	2017–2024	Multiple	Pose benchmark

10. Case studies

10.1 Amazon Sparrow (2022 reveal, 2023+ deploys)

Vacuum and parallel-jaw multi-modal end-effector on a 6-DoF arm. Disclosed to handle ~65 % of Amazon’s inventory diversity at peak. Vision is RGB-D plus learned grasp planning; the system explicitly trades raw speed for SKU coverage relative to dedicated parcel-induction cells like Robin.

10.2 Berkshire Grey at FedEx

Gantry-based pick stations for large and small parcel handling. The gantry geometry sidesteps the wrist-payload limit of articulated arms and allows direct top-down picks at sustained 20–30 PPM. Integrated with AGVs for tote routing.

10.3 Covariant Brain → Amazon (RSS 2022 + 2024 license deal)

Covariant pioneered a foundation manipulation model trained across deployments with ABB, KNAPP, Pitney Bowes, and others. In 2024 Amazon hired Covariant’s founders (Pieter Abbeel, Peter Chen, Rocky Duan) along with part of the research team and took a non-exclusive license to the models, rather than acquiring the company outright. The architecture follows the pattern: vision-language transformer pretraining + manipulation-specific fine-tuning on millions of bin-picks.

11. Cross-references

end-effectors — suction, parallel-jaw, multi-finger gripper design.
manipulator-design — arm kinematics, wrist payload, reach.
computer-vision-robotics — 6-DoF pose estimation and segmentation.
sensors-perception — RGB-D, structured light, ToF.
path-planning — collision-free trajectory inside the bin.
impedance-control — grasping force regulation, soft contact.
rl-for-control — sim-to-real domain randomization for grasp policies.
microfluidics — vacuum-cup geometry parallels low-Re flow.

12. Citations

Mahler et al. 2017, “Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics,” RSS.
Mahler et al. 2018, “Dex-Net 3.0: Computing Robust Robot Suction Grasp Targets,” ICRA.
Fang et al. 2020, “GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping,” CVPR.
Fang et al. 2023, “AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains,” IEEE T-RO.
Sundermeyer et al. 2021, “Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes,” ICRA.
Wen et al. 2024, “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects,” CVPR.
Labbé et al. 2022, “MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare,” CoRL.
Xiang et al. 2018, “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes,” RSS.
Wang et al. 2019, “DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion,” CVPR.
Goldberg, Mahler et al., bin-picking and Dex-Net series, IROS / ICRA / RSS 2015–2023.
Mason 2001, “Mechanics of Robotic Manipulation,” MIT Press.
Bicchi & Kumar 2000, “Robotic Grasping and Contact: A Review,” ICRA.
Salisbury 1982, “Kinematic and Force Analysis of Articulated Hands,” Stanford University PhD thesis.
Brohan et al. 2023, “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” CoRL.
Tobin et al. 2017, “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World,” IROS.
Bohg, Morales, Asfour & Kragic 2014, “Data-Driven Grasp Synthesis — A Survey,” IEEE T-RO.
Eppner, Mousavian & Fox 2021, “ACRONYM: A Large-Scale Grasp Dataset Based on Simulation,” ICRA.
Ravi et al. 2024, “SAM 2: Segment Anything in Images and Videos,” Meta AI.
Radford et al. 2021, “Learning Transferable Visual Models From Natural Language Supervision (CLIP),” ICML.
He et al. 2017, “Mask R-CNN,” ICCV.
