Computer Vision for Robotics

1. At a glance

Where [[Robotics/sensors-perception]] answers “what photons hit the sensor?”, this note answers “what does that pixel array mean?”. Computer vision is the algorithmic layer that converts images (and depth maps, point clouds, event streams) into the symbolic objects a planner can act on: an object class, a 6-DoF pose, a depth map, a motion vector, a free-space mask, an action label.

Modern robotic CV in 2026 divides into three coexisting paradigms:

Classical CV — geometry-first, hand-engineered features. Camera calibration, epipolar geometry, SIFT/ORB features, RANSAC, ICP, stereo block-matching, Lucas-Kanade flow. Still the backbone of every SLAM pipeline ([[Robotics/slam]]), every metric structure-from-motion solve, every hand-eye calibration, and every reprojection-error tracker. Deterministic, interpretable, no training data.
Deep learning CV — CNNs and Vision Transformers trained on labelled datasets. YOLO family for detection, U-Net / DeepLab / Mask2Former for segmentation, MiDaS / Depth Anything for monocular depth, FoundationPose / Megapose for 6-DoF pose, RAFT for optical flow. Dominant for any task with a perception-only output (detect, segment, classify, regress). Requires training data; opaque; statistically reliable on the training distribution.
VLM / foundation models — CLIP, SAM, GPT-4V, Gemini-Pro-Vision, LLaVA. Open-vocabulary: “find the red mug” / “describe what’s in front of the robot” / “is this object stackable on that one?” Zero or few-shot capable; high latency; expensive; the right answer when the object set is unbounded or the task is semantic.

Production robotic stacks blend all three. A bin-picker uses deep detection (YOLOv8 → “bottles here”) + classical 6-DoF refinement (ICP between segmented point cloud and CAD) + classical hand-eye calibration (Tsai-Lenz solve once at deploy). An AMR uses deep segmentation (free-space mask) + classical visual odometry (ORB + RANSAC) + VLM for occasional “what is this unknown object blocking the aisle?” queries.

Questions to ask before architecting any robotic CV pipeline:

Closed-set vs open-set? If the object list is fixed (the 47 SKUs in this warehouse), train a closed-set detector — 100× cheaper at inference than a VLM. If anything could show up, accept VLM cost.
Edge or cloud compute? Edge constraints (Jetson Orin Nano, 8 W) bound model size to ~10-50 MB; cloud has no such limit but adds 100-500 ms round-trip and depends on connectivity.
How wrong is “wrong”? A misclassified object in a sorting line is a returned package; a misclassified pedestrian under an AV is a death. Drives confidence calibration, redundant sensors, fallback policies.
Do you have real-world training data? If not, plan sim-to-real (Isaac Replicator, BlenderProc) from day one — synthetic-only models without domain randomisation fail catastrophically on real robots.
What’s the latency budget? A cobot collision predictor at 30 Hz tolerates ~33 ms inference; an autonomous-racing perception system at 200 Hz tolerates 5 ms. Pick model size to match.

2. First principles

Image formation and the pinhole model

A photon emanating from a 3D point P = (X, Y, Z) in the world frame passes through the lens and lands on the sensor at pixel p = (u, v). Under the pinhole model, in homogeneous coordinates and up to scale:

p̃ = K · [R | t] · P̃

with K the camera intrinsic matrix:

K =  [fx  0  cx]
     [ 0 fy  cy]
     [ 0  0   1]

f_x, f_y are focal lengths in pixels, (c_x, c_y) is the principal point, and [R | t] is the extrinsic SE(3) transform from world to camera. The full chain is lens → Bayer-filter CMOS → ISP → image: photons demosaiced into RGB, white-balanced, gamma-corrected, possibly YUV-encoded for transport. The ISP is photometrically destructive; CV pipelines that care about radiometry (HDR, photometric SLAM) capture RAW Bayer upstream.
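As a minimal sketch of the projection chain above, the function below applies the extrinsics, the perspective divide, and then the intrinsics. The intrinsic values (600 px focal length, 640×480 principal point) and the helper name `project_point` are illustrative assumptions, and zero skew is assumed as in the K shown above.

```python
def project_point(K, R, t, P_world):
    """Project a 3D world point to pixel coordinates with the pinhole model."""
    # World -> camera frame: P_cam = R * P_world + t
    P_cam = [sum(R[i][j] * P_world[j] for j in range(3)) + t[i] for i in range(3)]
    X, Y, Z = P_cam
    if Z <= 0:
        raise ValueError("point is behind the camera")
    fx, cx = K[0][0], K[0][2]
    fy, cy = K[1][1], K[1][2]
    # Perspective divide, then scale and shift by the intrinsics (zero skew).
    return fx * X / Z + cx, fy * Y / Z + cy


# Illustrative intrinsics for a 640x480 camera (assumed values).
K = [[600.0, 0.0, 320.0],
     [0.0, 600.0, 240.0],
     [0.0, 0.0, 1.0]]
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_zero = [0.0, 0.0, 0.0]

u, v = project_point(K, R_identity, t_zero, [0.1, -0.05, 2.0])
print(u, v)  # close to (350, 225)
```

With identity extrinsics the world and camera frames coincide, so the result reduces to u = fx·X/Z + cx and v = fy·Y/Z + cy.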

Real lenses deviate from pinhole. Brown-Conrady radial-tangential distortion is the default model in OpenCV:

x' = x·(1 + k1·r² + k2·r⁴ + k3·r⁶) + 2·p1·x·y + p2·(r² + 2·x²)
y' = y·(1 + k1·r² + k2·r⁴ + k3·r⁶) + p1·(r² + 2·y²) + 2·p2·x·y

Fisheye and wide-angle (FoV > 100°) require Kannala-Brandt (4 coeffs), MEI omnidirectional, or the double-sphere model (Usenko 2018).
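The Brown-Conrady equations above translate almost line for line into code. This sketch operates on normalised image coordinates (x = X/Z, y = Y/Z, before K is applied); the function name and the coefficient values in the example are assumptions chosen for illustration.

```python
def distort_normalised(x, y, k1, k2, k3, p1, p2):
    """Apply radial-tangential distortion to normalised image coordinates."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d


# Negative k1 (barrel distortion) pulls points toward the image centre.
x_d, y_d = distort_normalised(0.5, 0.0, k1=-0.2, k2=0.0, k3=0.0, p1=0.0, p2=0.0)
print(x_d, y_d)  # x shrinks from 0.5 to about 0.475
```

Undistortion has no closed form for this model, so libraries typically invert it iteratively.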

Epipolar geometry — two views of a scene

Given two pinhole cameras observing the same 3D point, corresponding image points x_1 and x_2 satisfy the epipolar constraint:

x_2^T · F · x_1 = 0

with F the 3×3 fundamental matrix (rank 2). If both intrinsics are known, normalise points and use the essential matrix:

E = K_2^T · F · K_1     and     E = [t]_× · R

Decomposing E via SVD gives four (R, t) candidates; disambiguate by chirality (the reconstructed 3D points must lie in front of both cameras). This is the foundation of stereo, structure-from-motion, and the front-end of visual SLAM. The 5-point relative-pose solver (Nistér 2004) is the minimal solver for the calibrated case and the usual choice inside RANSAC; the 7-point and 8-point algorithms estimate F in the uncalibrated case.
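A small numerical check can make the epipolar constraint concrete. The sketch below builds E = [t]_× · R for a synthetic camera pair and confirms that a true correspondence gives a residual near zero while a perturbed one does not. It assumes the convention X_2 = R · X_1 + t and works on normalised coordinates (intrinsics already removed); the pose values and helper names are made up for the example.

```python
import math


def skew(t):
    """Cross-product matrix [t]_x such that skew(t) * v = t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def epipolar_residual(E, x1, x2):
    """Return x2^T * E * x1 for homogeneous normalised points."""
    Ex1 = matvec(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))


# Synthetic relative pose: 10 degree yaw plus a mostly sideways baseline (assumed values).
theta = math.radians(10.0)
R = [[math.cos(theta), 0.0, math.sin(theta)],
     [0.0, 1.0, 0.0],
     [-math.sin(theta), 0.0, math.cos(theta)]]
t = [-0.2, 0.0, 0.02]
E = matmul(skew(t), R)

X1 = [0.3, -0.1, 2.5]                                   # point in camera-1 frame
X2 = [a + b for a, b in zip(matvec(R, X1), t)]          # same point in camera-2 frame
x1 = [c / X1[2] for c in X1]                            # normalised image points
x2 = [c / X2[2] for c in X2]

residual = epipolar_residual(E, x1, x2)
bad_residual = epipolar_residual(E, x1, [x2[0], x2[1] + 0.05, 1.0])
print(residual, bad_residual)
```

In a RANSAC loop this residual (or the Sampson distance derived from it) is what separates inliers from outliers.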

Classical feature detectors and descriptors

A feature is a local image patch distinctive enough to match across views.

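Harris corners (1988) — detect points where image intensity changes strongly in both directions, i.e. where both eigenvalues of the local structure tensor M are large; the corner score det(M) − k·trace(M)² avoids computing the eigenvalues explicitly.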
SIFT (Lowe 2004) — DoG scale-space extrema + 128-D gradient-histogram descriptor. Scale, rotation, and partially illumination invariant. Patent expired 2020; now the geometric gold standard.
SURF (Bay 2006) — Hessian-based, faster SIFT. Patent still relevant in commercial deployments.
ORB (Rublee 2011, OpenCV’s default) — FAST keypoints + rotated BRIEF binary descriptor. ~10× faster than SIFT, fixed-length 256-bit binary descriptor matchable with Hamming distance. ORB-SLAM family is built on this.
AKAZE (Alcantarilla 2013) — nonlinear scale-space, M-LDB descriptor. Better invariance than ORB at modest extra cost.

Matching: nearest-neighbour in descriptor space, with Lowe’s ratio test (accept if d_1 / d_2 < 0.7-0.8) to reject ambiguous matches, then RANSAC to filter geometric outliers (typically fitting F, E, or a homography).
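The nearest-neighbour step with Lowe's ratio test can be sketched for binary descriptors as follows. Descriptors are represented as Python integers and compared with Hamming distance; the toy values are 8 bits wide for readability (ORB uses 256), and the function names and the 0.75 threshold are illustrative choices within the 0.7-0.8 range quoted above.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def ratio_test_matches(query, train, ratio=0.75):
    """Return (query_idx, train_idx, distance) for matches that pass the ratio test."""
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((hamming(q, d), ti) for ti, d in enumerate(train))
        if len(dists) < 2:
            continue  # the test needs a second-best candidate
        (d1, best), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:  # best match must be clearly better than the runner-up
            matches.append((qi, best, d1))
    return matches


query = [0b10110010, 0b01010101]
train = [0b10110011, 0b01001100, 0b11100111]
print(ratio_test_matches(query, train))  # [(0, 0, 1)]
```

The second query descriptor is rejected because its best and second-best distances (3 and 4) are too close, which is exactly the ambiguity the test is meant to filter before RANSAC.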

Modern (learned) features

A 2018-onward shift: replace hand-crafted SIFT/ORB with networks trained for repeatability and matchability.

SuperPoint (DeTone 2018) — self-supervised joint detector + descriptor, runs at 70 fps on a 2060.
R2D2 (Revaud 2019) — adds explicit reliability/repeatability scoring.
DISK (Tyszkiewicz 2020) — reinforcement-learning trained.
SuperGlue (Sarlin 2020) — graph-neural-network matcher over SuperPoint keypoints; massive accuracy gain over nearest-neighbour matching.
LightGlue (Lindenberger 2023) — same idea, ~5-10× faster; production-deployable on Jetson.
LoFTR (Sun 2021) — detector-free, dense transformer matching; great on textureless surfaces.

Classical features still win on raw speed and on radically out-of-domain imagery (medical, thermal, microscopy) where there’s no training data.

3. Practical math / worked examples

Worked example A — Hand-eye calibration for arm + wrist camera

The setup: a 6-DoF industrial arm with a camera bolted to the wrist. The camera sees a checkerboard fixed in the world. We need the constant SE(3) transform X = camera-to-gripper, so that knowing gripper pose lets us project camera observations into the world.

Procedure (Tsai-Lenz 1989):

Move the arm to 20 distinct poses, each with the checkerboard fully visible. Vary all 6 DoF; in particular include rotations about all three axes — pure translations leave X under-determined.
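At each pose, capture an image and solve PnP (cv2.solvePnP with the SOLVEPNP_IPPE or SOLVEPNP_SQPNP solver; SOLVEPNP_IPPE_SQUARE is restricted to the four corners of a single square marker) to get the checkerboard-to-camera transform T_cb_i, i.e. the board pose expressed in the camera frame.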
Record the robot’s reported gripper-to-base transform T_gb_i from the controller.

Form pairs of relative motions:

A_i = T_gb_{i+1}^{-1} · T_gb_i        (gripper motion between poses)
B_i = T_cb_{i+1}     · T_cb_i^{-1}     (camera motion between poses)

Solve A_i · X = X · B_i for the unknown X. Tsai-Lenz decouples rotation and translation: first solve R_X via Rodrigues vectors (linear system over all i), then t_X (linear).
Verify by computing residuals on held-out poses.
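To see why the relative motions are paired this way, the sketch below generates synthetic data from a known X and checks that A_i · X = X · B_i holds. It takes T_cb_i to be the pose solvePnP returns (board frame to camera frame) and T_gb_i to be gripper-to-base, with the board fixed in the base frame. All pose values and helper names are invented for the example; it only verifies the constraint and does not implement the Tsai-Lenz solve itself.

```python
import math


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def inv_se3(T):
    """Invert a rigid transform: [R | t]^-1 = [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    ti = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[0] + [ti[0]], Rt[1] + [ti[1]], Rt[2] + [ti[2]], [0.0, 0.0, 0.0, 1.0]]


def pose(axis, angle_deg, t):
    """Homogeneous transform: rotation about one principal axis plus translation."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    R = {"x": [[1, 0, 0], [0, c, -s], [0, s, c]],
         "y": [[c, 0, s], [0, 1, 0], [-s, 0, c]],
         "z": [[c, -s, 0], [s, c, 0], [0, 0, 1]]}[axis]
    rows = [[float(R[i][0]), float(R[i][1]), float(R[i][2]), float(t[i])] for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]


def relative_motions(T_gb, T_cb):
    """Build A_i (gripper motion) and B_i (camera motion) between consecutive poses."""
    A = [matmul(inv_se3(T_gb[i + 1]), T_gb[i]) for i in range(len(T_gb) - 1)]
    B = [matmul(T_cb[i + 1], inv_se3(T_cb[i])) for i in range(len(T_cb) - 1)]
    return A, B


def max_abs_diff(M, N):
    return max(abs(M[i][j] - N[i][j]) for i in range(4) for j in range(4))


# Synthetic ground truth (assumed values): X is camera-to-gripper.
X_true = pose("z", 30.0, [0.03, 0.0, 0.08])
T_base_board = pose("x", 180.0, [0.5, 0.0, 0.0])
T_gb = [pose("x", 20.0, [0.4, 0.1, 0.5]),
        pose("y", -25.0, [0.45, -0.1, 0.55]),
        pose("z", 40.0, [0.5, 0.0, 0.6])]
# The board is fixed, so T_base_board = T_gb_i * X * T_cb_i for every pose i.
T_cb = [matmul(inv_se3(X_true), matmul(inv_se3(T), T_base_board)) for T in T_gb]

A, B = relative_motions(T_gb, T_cb)
residuals = [max_abs_diff(matmul(a, X_true), matmul(X_true, b)) for a, b in zip(A, B)]
print(residuals)
```

On real data the same residual, evaluated with the estimated X on held-out pose pairs, is a quick sanity check before trusting the calibration.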

Typical residuals:

Robot class	Translational residual	Angular residual
Industrial arm (UR5e, KUKA KR6, Fanuc LR Mate)	0.5–2 mm	0.05–0.2°
Cobot (Franka Panda, UR10e)	1–3 mm	0.1–0.3°
Educational (UFactory xArm)	2–5 mm	0.2–0.5°
Mobile manipulator (Spot arm)	5–10 mm	0.3–0.8°

Alternative solvers: Park-Martin 1994 (Lie-algebra formulation), Daniilidis 1999 (dual quaternions, jointly solves R and t; best when rotation magnitudes are small). All three are in OpenCV’s cv2.calibrateHandEye, selected with the method flags CALIB_HAND_EYE_TSAI, CALIB_HAND_EYE_PARK, and CALIB_HAND_EYE_DANIILIDIS.

Failure modes: all 20 poses share a near-common axis of rotation (degenerate); checkerboard not rigidly fixed; gripper pose reports stale value (low-latency capture sync matters).

Worked example B — YOLOv8 inference on Jetson

YOLOv8 (Ultralytics, 2023) is the de-facto detector on edge robots in 2024-2026. Five model sizes — nano, small, medium, large, x-large — span from 3.2 M to 68 M parameters.

Workflow:

Train (or download) FP32 PyTorch weights (.pt).
Export to ONNX: yolo export model=yolov8m.pt format=onnx imgsz=640 simplify=True.
Build TensorRT engine: trtexec --onnx=yolov8m.onnx --saveEngine=yolov8m.engine --fp16 (or --int8 --calib=calib.txt for INT8, where calib.txt is an existing INT8 calibration cache, not a list of images).
Run inference via TensorRT Python API or Ultralytics model.predict() with the engine.

Measured throughput (640×640 input, 1 image batch, COCO 80 classes):

Platform	Precision	YOLOv8n	YOLOv8s	YOLOv8m	YOLOv8l
Jetson Orin Nano (40 TOPS INT8)	INT8	~130 fps	~70 fps	~30 fps	~15 fps
Jetson Orin NX 16 GB (100 TOPS)	INT8	~280 fps	~150 fps	~70 fps	~35 fps
Jetson AGX Orin 64 GB (275 TOPS)	INT8	~600 fps	~330 fps	~150 fps	~80 fps
Jetson AGX Orin 64 GB	FP16	~280 fps	~160 fps	~80 fps	~40 fps
RTX 4090	FP16	~2400 fps	~1500 fps	~800 fps	~400 fps

COCO accuracy (mAP@50-95): YOLOv8n 37.3, YOLOv8s 44.9, YOLOv8m 50.2, YOLOv8l 52.9, YOLOv8x 53.9. INT8 quantisation typically costs 0.5-2 mAP points relative to FP32 — usually acceptable.

Decision rule: start with YOLOv8s INT8 for any edge robot; move to YOLOv8m only if accuracy budget demands. Anything above YOLOv8l on an edge device is wasted unless you’re running it at 5-10 Hz for an analytics use case rather than reactive control.
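The decision rule can be written as a small lookup: choose the most accurate model size that still meets the control rate with some margin. The throughput and mAP figures below are copied from the INT8 rows and COCO numbers above; the function name and the 1.5× headroom factor are assumptions, not measured requirements.

```python
# INT8 throughput in fps, copied from the table above.
INT8_FPS = {
    "Orin Nano": {"n": 130, "s": 70, "m": 30, "l": 15},
    "Orin NX": {"n": 280, "s": 150, "m": 70, "l": 35},
    "AGX Orin": {"n": 600, "s": 330, "m": 150, "l": 80},
}
# COCO mAP@50-95 per model size, from the accuracy figures above.
COCO_MAP = {"n": 37.3, "s": 44.9, "m": 50.2, "l": 52.9}


def pick_model(platform, required_hz, headroom=1.5):
    """Return the most accurate size whose fps >= required_hz * headroom, else None."""
    candidates = [size for size, fps in INT8_FPS[platform].items()
                  if fps >= required_hz * headroom]
    return max(candidates, key=COCO_MAP.get) if candidates else None


print(pick_model("Orin Nano", 30))   # s
print(pick_model("AGX Orin", 30))    # l
print(pick_model("Orin Nano", 200))  # None
```

The headroom matters because the detector rarely has the accelerator to itself; tracking, depth, and pre/post-processing share the same budget.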

Worked example C — ICP pose refinement for grasping

A grasp planner needs the SE(3) pose of a bottle on the table to ±2 mm. A neural pose estimator (FoundationPose, Megapose) returns a coarse initial guess T_0 with ~1 cm / 5° error. Refine with ICP against the CAD model.

Algorithm (point-to-plane ICP, Chen-Medioni 1992 / Low 2004):

Segment the observed point cloud P_obs (cluster around the detector’s bounding box, remove plane, remove outliers via statistical filter).
Transform the CAD model’s sampled point cloud P_cad by T_0.
For each transformed CAD point p_i, find the nearest neighbour q_i in P_obs (KD-tree, ~O(log N)).
Solve the linearised least-squares problem:
```
argmin_{ΔT}  Σ_i  ((ΔT · p_i − q_i) · n_i)²
```
where n_i is the surface normal at q_i. Point-to-plane converges faster than point-to-point because it doesn’t penalise tangential sliding.
Compose T_0 ← ΔT · T_0; repeat until convergence (residual < threshold, e.g. 0.5 mm RMS).
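The claim that point-to-plane does not penalise tangential sliding can be checked directly on a flat patch. This sketch computes both cost functions for already-established correspondences; it is only the residual evaluation, not a full ICP loop, and the patch geometry and function names are assumptions for illustration.

```python
import math


def point_to_plane_rms(src, dst, normals):
    """RMS of signed distances from src points to the tangent planes at dst points."""
    sq = [sum((p[k] - q[k]) * n[k] for k in range(3)) ** 2
          for p, q, n in zip(src, dst, normals)]
    return math.sqrt(sum(sq) / len(sq))


def point_to_point_rms(src, dst):
    """RMS Euclidean distance between corresponding points."""
    sq = [sum((p[k] - q[k]) ** 2 for k in range(3)) for p, q in zip(src, dst)]
    return math.sqrt(sum(sq) / len(sq))


# Observed patch: a 5x5 grid on the plane z = 0 with normals along +z (1 cm spacing).
dst = [(ix * 0.01, iy * 0.01, 0.0) for ix in range(5) for iy in range(5)]
normals = [(0.0, 0.0, 1.0)] * len(dst)

slid = [(x + 0.01, y, z) for x, y, z in dst]      # 1 cm slide within the plane
lifted = [(x, y, z + 0.002) for x, y, z in dst]   # 2 mm offset along the normal

print(point_to_plane_rms(slid, dst, normals), point_to_point_rms(slid, dst))
print(point_to_plane_rms(lifted, dst, normals))
```

The in-plane slide costs nothing under point-to-plane but 1 cm under point-to-point, while the offset along the normal is penalised by both.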

Typical numbers: 30 iterations, ~0.5 mm RMS residual, converges in 20–80 ms on a CPU for ~5 000 points.

Robust variants:

Trimmed ICP (Chetverikov 2002) — discard the worst N% of correspondences per iteration. Robust to occlusion.
Generalised ICP (GICP) (Segal 2009) — plane-to-plane with covariance Mahalanobis distance. Excellent for noisy LiDAR.
Color ICP (Park 2017) — adds photometric cost; great for RGB-D registration of textured surfaces.
TEASER++ (Yang 2020) — certifiably-robust global registration before fine ICP; handles 99 % outlier rates.
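The trimming step of Trimmed ICP is simple enough to sketch on its own: rank correspondences by residual and keep only the best fraction before solving for the update. The function name, the 80 % keep fraction, and the toy data are assumptions; the surrounding ICP iteration is omitted.

```python
def trim_correspondences(pairs, residuals, keep_fraction=0.8):
    """Keep the keep_fraction of correspondences with the smallest absolute residual."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    order = sorted(range(len(pairs)), key=lambda i: abs(residuals[i]))
    n_keep = max(1, int(len(pairs) * keep_fraction))
    return [pairs[i] for i in order[:n_keep]]


pairs = ["a", "b", "c", "d", "e"]           # stand-ins for (p_i, q_i) tuples
residuals = [0.1, 5.0, 0.2, 0.05, 0.3]      # "b" is an occlusion-style outlier
print(trim_correspondences(pairs, residuals))  # ['d', 'a', 'c', 'e']
```

Because the discarded set is recomputed every iteration, points that were outliers early on can re-enter once the alignment improves.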

When ICP fails: initial guess too far (> half the object’s characteristic length), symmetric objects (the rotation about the symmetry axis is unobservable — return a distribution over poses, not a single one), severe occlusion (< 30 % of object visible).

4. Design heuristics

Pick the model class by task

Task	Recommended primary	Why
Bin-picking known SKUs	FoundationPose / Megapose + ICP refine	Pose-to-grasp, robust to clutter
Bin-picking unknown objects	GraspNet baseline (Fang 2020), GIGA (Jiang 2021), Contact-GraspNet	Direct grasp-pose regression from point cloud
Pick-and-place on conveyor	YOLOv8 detection + tracker + classical 2D pose	Fast, predictable
Open-vocabulary “pick the X”	SAM (segment) + CLIP (classify) + grasp net	Zero-shot; latency 200–500 ms
Free-space / drivable mask	Mask2Former or BEVFormer	Semantic segmentation at scene scale
Drone obstacle avoidance	Monocular MiDaS / Depth Anything + classical VIO	60–120 Hz reactive
Object tracking across frames	BoT-SORT, ByteTrack, OC-SORT on top of YOLO	Identity persistence
Visual question answering	LLaVA-1.6, GPT-4V, Gemini-1.5 Pro Vision	“What’s broken about this part?”
Defect inspection	PaDiM, PatchCore, SimpleNet (anomaly detection)	One-class, only good samples needed
Surgical instrument tracking	Sage / DETR variants on stereo video	Sub-mm in-frame accuracy

Compute budget mapping

Budget	Sensible architecture
≤ 5 W (battery drone, prosthetic)	YOLOv8n / MobileNetV3, INT8 on Hailo-8 / Coral / Snapdragon NPU
5–25 W (Jetson Orin Nano/NX)	YOLOv8s/m INT8, Depth Anything Small FP16, ORB-SLAM3
25–60 W (Jetson AGX Orin)	YOLOv8l-x FP16, FoundationPose, full-res Mask2Former
100–300 W (workstation, RTX 4080/4090)	Full Detectron2 pipelines, NeRF / 3DGS training, ViT-Large pose
Cloud / data centre (A100, H100)	VLM (GPT-4V, Gemini), foundation models, training

Synthetic data and sim-to-real

Hand-labelling 50 000 robotic-grasp poses takes a year. Synthetic generation takes a weekend.

NVIDIA Isaac Sim + Replicator — physically-based path-traced rendering at scale, programmatic randomisation, USD asset library. Most production-grade.
BlenderProc (Denninger 2019) — Python-scripted Blender renders; ubiquitous in academia.
Unity Perception — game-engine renders, easy randomisation, weaker physics than Isaac.
NVIDIA Omniverse Replicator — distributed Isaac Replicator; large-scale dataset farms.

Sim-to-real techniques:

Domain randomisation (Tobin 2017) — randomise lighting, textures, camera intrinsics, distractor objects. Force the model to ignore everything except the task-relevant geometry.
Photometric realism — path tracing, true HDR environment maps, PBR materials, realistic ISP simulation.
Domain adaptation — CycleGAN-style sim→real translation; less common in 2024+ as randomisation has gotten cheap and effective.
Self-supervised real-world fine-tune — pseudo-labels from a sim-trained teacher; modest amount of real data closes the gap.
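Domain randomisation, in its simplest form, is a sampler that draws fresh scene parameters for every rendered frame. The sketch below is renderer-agnostic: the parameter names and ranges are hypothetical placeholders, and in practice each sampled dictionary would be handed to whichever tool (Isaac Replicator, BlenderProc, Unity Perception) builds the scene.

```python
import random


def sample_scene_params(rng):
    """Draw one randomised scene configuration (all ranges are illustrative)."""
    return {
        "light_intensity_lux": rng.uniform(200.0, 2000.0),
        "light_azimuth_deg": rng.uniform(0.0, 360.0),
        "texture_id": rng.randrange(100),           # index into a texture bank
        "focal_length_px": rng.gauss(600.0, 15.0),  # jitter around nominal intrinsics
        "num_distractors": rng.randint(0, 8),
    }


rng = random.Random(0)  # seeded so a dataset can be regenerated exactly
scenes = [sample_scene_params(rng) for _ in range(1000)]
print(scenes[0])
```

Seeding the generator is a deliberate choice: when a model fails on a particular synthetic frame, the exact scene can be re-rendered for debugging.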

Active perception

Don’t ask “what’s in this image?” — ask “where should I look next?” Next-best-view planning maximises information gain per camera move; for occluded bin-picking it can halve cycle time. Classical formulations (Connolly 1985, Pito 1999) computed visibility frustums; modern (2023+) work uses NeRF/3DGS posterior-uncertainty for view selection.

Data hygiene

Label noise > 5 % quietly destroys supervised learning. Find bad labels by training a model and surfacing high-loss training examples (Northcutt 2021 “Confident Learning”).
Class imbalance > 100:1 needs focal loss (Lin 2017) or under/over-sampling.
Train/test distribution shift is the silent killer of robotic deployment — check that test set sensors, lighting, and operator behaviour match production.
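For the class-imbalance item, the focal loss of Lin 2017 can be written in a few lines for the binary case: FL = −α_t · (1 − p_t)^γ · log(p_t). This is a per-sample sketch in plain Python rather than a training-ready tensor implementation; the defaults γ = 2 and α = 0.25 are the values commonly quoted from the paper.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one sample: p is the predicted P(class=1), y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident and correct
hard = binary_focal_loss(0.1, 1)  # confident and wrong
print(easy, hard)
```

The (1 − p_t)^γ factor shrinks the contribution of easy examples by orders of magnitude, so the abundant background class no longer dominates the gradient.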

Distillation and quantisation

Knowledge distillation (Hinton 2015) — train YOLOv8m as teacher, use soft logits to train YOLOv8n student. Closes ~50 % of the n-vs-m accuracy gap.
INT8 post-training quantisation — typically loses 0.5-2 mAP on COCO. Run the quantiser on a calibration set drawn from production (not COCO val).
QAT (quantisation-aware training) — fine-tune with fake-quant nodes; recovers most of the INT8 gap.
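The soft-logit term in knowledge distillation can be sketched as a temperature-softened KL divergence between teacher and student class distributions, scaled by T² as in Hinton 2015. This is the classification form only; distilling a detector such as YOLOv8 also involves box and objectness terms that are not shown, and the temperature of 4 is an assumed value.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0.0)
    return temperature ** 2 * kl


teacher = [6.0, 2.0, -1.0]
print(distillation_loss([6.0, 2.0, -1.0], teacher))  # 0.0 when the student matches
print(distillation_loss([1.0, 1.0, 1.0], teacher))   # positive when it does not
```

In training this term is usually mixed with the ordinary hard-label loss through a weighting coefficient.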

Camera placement beats model choice

Five extra centimetres of camera height eliminate self-occlusion. A 5° downward tilt collapses depth ambiguity. Glare on a windshield camera ruins every model. Fix the physics first.

5. Components & sourcing — frameworks, models, datasets

CV libraries (general)

Library	Strengths	Notes
OpenCV 4.10+	Classical CV (calib3d, features2d, imgproc), some DNN inference	C++ / Python / Java; ubiquitous
PyTorch 2.4+	Training, research, eager execution	Default in 2024+
TorchVision	Detection (Faster/Mask R-CNN, RetinaNet), classification backbones	Maintained by Meta
Ultralytics	YOLOv5/v8/v9/v10/v11 wrapper	Trivial CLI, ONNX/TensorRT export
Detectron2	Mask R-CNN, DensePose, PointRend, panoptic seg	Meta, research-grade
MMDetection / MMSegmentation	Modular OpenMMLab stack	200+ model configs; Chinese academic standard
HuggingFace Transformers	DETR, SAM, ViTPose, depth networks, VLMs	One-line model load
MediaPipe	On-device hand/face/body pose	Google, real-time on mobile
ONNX Runtime	Cross-platform inference	CPU, CUDA, TensorRT, CoreML, DirectML EPs
TensorRT	NVIDIA-optimised inference	Best Jetson / GPU throughput
OpenVINO	Intel-optimised inference	NUC, embedded x86

Robotics-specific frameworks

Framework	Role
ROS 2 + image_pipeline + vision_msgs	Camera transport, detection topic schemas (see [[Languages/Tier3/robotics-control]])
NVIDIA Isaac SDK + Isaac Manipulator	Reference pipelines (FoundationPose, cuMotion)
NVIDIA DeepStream	Multi-camera production video analytics
Open3D	Point-cloud ops, ICP, RANSAC, normals, Poisson reconstruction
PCL (legacy)	Original point-cloud library; still used in older ROS stacks
GTSAM / g2o	Pose-graph and bundle adjustment back-ends
Ceres Solver	General nonlinear least squares; used by COLMAP and Cartographer (ORB-SLAM3 uses g2o)
COLMAP / OpenSfM	Offline structure-from-motion
Kornia	Differentiable CV ops in PyTorch; learnable homographies

6-DoF pose models and frameworks

Model	Year	Approach	Notes
PoseCNN (Xiang 2018)	2018	RGB → Hough-voted object centre + depth for translation, regressed quaternion for rotation	YCB-V benchmark
DenseFusion (Wang 2019)	2019	Per-pixel fusion of RGB + depth features	Strong on cluttered scenes
PVN3D (He 2020)	2020	3D keypoints in point cloud	Cuts symmetric ambiguity
GDR-Net (Wang 2021)	2021	Geometry-guided regression on RGB	LineMod SOTA in 2021
Megapose (Labbé 2022)	2022	Render-and-compare for unseen objects	CAD at test time only
FoundationPose (Wen 2024, CVPR)	2024	One model, any CAD, RGB-D	Production-ready zero-shot
GraspNet baseline (Fang 2020)	2020	Direct grasp pose from point cloud	GraspNet-1Billion benchmark (88 objects, >1 billion grasp poses)
GIGA (Jiang 2021)	2021	Implicit grasp affordance field	TSDF input
Contact-GraspNet (Sundermeyer 2021)	2021	Predicts contact grasp on scene cloud	Used in many cobot stacks

Foundation models (vision)

Model	Vendor	Use
SAM / SAM 2 (Kirillov 2023 / Ravi 2024)	Meta	Promptable segmentation; image + video
CLIP (Radford 2021)	OpenAI	Image ↔ text embedding; open-vocab classification
DINOv2 (Oquab 2023)	Meta	Self-supervised features; great for retrieval
Depth Anything V1/V2 (Yang 2024)	TikTok / HKU	Zero-shot monocular relative depth; metric depth via fine-tuned variants
GPT-4V / GPT-4o Vision	OpenAI	VLM; cloud API
Claude Vision (Claude 3 family onward)	Anthropic	VLM; cloud API
Gemini 1.5/2.0 Pro Vision	Google	VLM; cloud API
LLaVA-1.5 / 1.6 / NeXT (Liu 2023+)	OSS	Local VLM; runs on RTX 4090
BLIP-2 (Li 2023)	Salesforce	Image captioning, VQA
Florence-2 (Microsoft 2024)	Microsoft	Unified detection/segmentation/captioning

Embedded inference accelerators

Part	TOPS	Power	Notes
NVIDIA Jetson Orin Nano 8 GB	40 (INT8)	7-15 W	Hobby + prosumer
NVIDIA Jetson Orin NX 16 GB	100 (INT8)	10-25 W	Drone / AMR sweet-spot
NVIDIA Jetson AGX Orin 64 GB	275 (INT8)	15-60 W	Heavy perception
NVIDIA Thor (Drive)	~1000 (INT8)	~50-130 W	Production AV, 2025+
Hailo-8	26 (INT8)	2.5 W	M.2 / PCIe addon
Hailo-10H	40 (INT8)	~3.5 W	2024 release
Google Coral Edge TPU	4 (INT8)	2 W	Quantised int8 only, model size capped
AMD/Xilinx Kria K26 / K24	~1-3 (DPU)	5-15 W	FPGA-based, deterministic latency
Intel Movidius Myriad X / OpenVINO	~1 (INT8)	2 W	NUC + USB stick
Qualcomm RB5 / RB6	15 / 35+ (INT8)	5-15 W	Snapdragon Flight drone reference
Apple M2/M4 Neural Engine	16-38 (INT8)	shared	iPad / Mac robots, ARKit pipeline
TI TDA4VM / TDA4AEN	8 (INT8) + DSP	5-20 W	Automotive ECU

Datasets

Dataset	Domain	Notes
COCO (Lin 2014)	80-class detection, segmentation, keypoints	330 k images; the universal benchmark
LVIS (Gupta 2019)	1203-class long-tail detection	COCO superset for rare-object eval
Open Images V7 (Kuznetsova 2018+)	600 classes, 9 M images	Largest free-use detection dataset
ImageNet-21k	21 k-class classification	Pretraining backbone
YCB-Video (Xiang 2017)	21 household objects, 6-DoF pose	Bin-picking standard
T-LESS (Hodaň 2017)	30 texture-less industrial objects	Hard pose benchmark
LineMod / LineMod-Occluded (Hinterstoisser 2012)	15 texture-less objects + occlusion	Original 6-DoF benchmark
BOP Challenge	11 datasets unified	Annual 6-DoF pose competition
KITTI (Geiger 2012)	Driving stereo + Velodyne + GPS	First AV benchmark
nuScenes (Caesar 2020)	6-cam + 1-LiDAR + 5-radar	Modern AV dataset
Waymo Open Dataset (Sun 2020)	5-LiDAR + 5-cam	Highest-fidelity AV
DeepDrive / BDD100K	100 k driving videos	Diverse weather
Habitat / iGibson / Replica	Photoreal indoor sim	Embodied AI benchmarks
RoboNet (Dasari 2019)	15 M robot manipulation frames	Cross-robot learning
Berkeley Open Manipulation	Manipulation video	Skill learning
Cityscapes (Cordts 2016)	Urban street segmentation	Driving semantic seg
ADE20K (Zhou 2017)	150-class scene segmentation	General segmentation

6. Reference data

Classical CV vs deep CV — when to pick which

Aspect	Classical	Deep learning
Geometric ground truth (calibration, pose, SLAM front-end)	✓	rarely
Semantic understanding (object class, free space)	✗	✓
Out-of-domain (novel sensor, no training data)	✓	✗
Determinism / certification	✓	hard
Sub-pixel accuracy	✓ (with refinement)	typically not
Open-vocabulary	✗	✓ (VLM)
Compute cost	low	medium-high
Sensitivity to lighting	high	trained-distribution-dependent
Symmetry / scale ambiguity handling	by design	learned, sometimes wrong

Detector benchmark (COCO 2017 val, single-scale)

Detector	Backbone	mAP@50-95	FPS (V100/A100)	Year
Faster R-CNN	ResNet-50 FPN	40.2	26	2015
RetinaNet	ResNet-50 FPN	36.5	30	2017
Mask R-CNN	ResNet-50 FPN	37.9 box / 34.8 mask	23	2017
YOLOv4	CSPDarknet53	43.5	65	2020
YOLOv5l	CSPDarknet	49.0	99	2020
YOLOX-L	Modified CSPDarknet	50.0	69	2021
DETR	ResNet-50	42.0	28	2020
Deformable DETR	ResNet-50	46.2	19	2021
DINO-DETR	Swin-L	63.3	12	2022
RT-DETR-L	HGNetv2	53.0	114	2024
YOLOv8n / s / m / l / x	—	37.3 / 44.9 / 50.2 / 52.9 / 53.9	very fast (see worked ex B)	2023
YOLOv9-C / E	—	53.0 / 55.6	similar	2024
YOLOv10-N / S / M / L / X	—	38.5 / 46.3 / 51.1 / 53.2 / 54.4	+ 15-30% over v8	2024
YOLOv11-N / S / M / L / X	—	39.5 / 47.0 / 51.5 / 53.4 / 54.7	similar to v10	2024

Segmentation benchmark (COCO, Cityscapes)

Model	Task	Score	Notes
U-Net (Ronneberger 2015)	Binary biomedical seg	task-dep	small datasets
DeepLabV3+ (Chen 2018)	Cityscapes mIoU	82.1	strong outdoor
Mask2Former Swin-L (Cheng 2022)	COCO panoptic PQ	57.8	unified det + seg
SAM ViT-H (Kirillov 2023)	promptable, no class	n/a	1.1 B masks pretrain
SAM 2 (Ravi 2024)	promptable image + video	n/a	video segmentation SOTA
OneFormer (Jain 2023)	universal	top-tier	single model 3 seg tasks

Depth estimation benchmark

Model	Type	KITTI / NYUv2	Notes
MiDaS v3.1 (Ranftl 2020+)	Mono relative	strong	DPT backbone, robust
ZoeDepth (Bhat 2023)	Mono metric	0.075 AbsRel NYU	combines MiDaS + metric heads
Metric3D V2 (Hu 2024)	Mono metric	top-tier	unified intrinsics handling
Depth Anything V2 (Yang 2024)	Mono relative + metric	SOTA zero-shot	trained 62 M images
RAFT-Stereo (Lipson 2021)	Stereo	KITTI top	iterative GRU
PSMNet (Chang 2018)	Stereo	KITTI ~2 % bad-3	pyramid cost volume
FoundationStereo (2024)	Stereo zero-shot	new SOTA	foundation-model approach
NeRF (Mildenhall 2020)	Multi-view implicit	scene-specific	training per scene
3D Gaussian Splatting (Kerbl 2023)	Multi-view explicit	real-time render	takes minutes to train, ms to render

6-DoF pose benchmark (BOP Challenge 2024 — average recall)

Method	LineMod-O	YCB-V	T-LESS	Notes
PVN3D	0.72	0.85	0.67	depth-keypoints
GDR-Net	0.71	0.80	0.68	RGB only
Megapose	0.79	0.85	0.71	unseen objects
FoundationPose	0.91	0.93	0.89	2024 SOTA, RGB-D
FoundationPose RGB-only	0.83	0.86	0.78	no depth required

Accelerator inference throughput (additional, vs Sec 5)

Model	Jetson Orin Nano INT8	Orin NX INT8	AGX Orin INT8	RTX 4090 FP16
YOLOv8m	30 fps	70 fps	150 fps	800 fps
Mask2Former Swin-S	n/a (too big)	8 fps	25 fps	90 fps
Depth Anything V2 Small	15 fps	35 fps	80 fps	400 fps
FoundationPose	4 fps	12 fps	35 fps	120 fps
SAM ViT-B	n/a	3 fps	8 fps	35 fps
LightGlue (1024 kpts)	30 fps	80 fps	180 fps	700 fps
RAFT optical flow (1080p)	4 fps	12 fps	35 fps	120 fps

Coordinate-frame conventions

System	+X	+Y	+Z
OpenCV camera	right	down	forward
ROS / REP-103	forward	left	up
OpenGL / glTF	right	up	toward viewer
Unreal	forward	right	up
Blender	right	forward	up
Unity	right	up	forward (left-handed)
KITTI camera	right	down	forward (matches OpenCV)
KITTI velodyne	forward	left	up (matches ROS)

Confusing these has wasted more engineer-hours than any other CV bug.
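As one concrete case from the table, converting a point between the OpenCV camera convention (x right, y down, z forward) and the ROS REP-103 body convention (x forward, y left, z up) is a fixed axis permutation with sign flips. The function names are illustrative; ROS stacks usually express the same rotation as a static transform between a camera frame and its `_optical` counterpart.

```python
def opencv_to_ros(p_cv):
    """Map (right, down, forward) to (forward, left, up)."""
    x, y, z = p_cv
    return (z, -x, -y)


def ros_to_opencv(p_ros):
    """Map (forward, left, up) back to (right, down, forward)."""
    x, y, z = p_ros
    return (-y, -z, x)


# A point 3 m ahead, 1 m to the right, 2 m below the optical axis.
print(opencv_to_ros((1.0, 2.0, 3.0)))  # (3.0, -1.0, -2.0)
```

Both conventions are right-handed, so the mapping is a pure rotation; converting to or from a left-handed system such as Unity additionally requires mirroring one axis.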

7. Failure modes & debugging

Domain gap (sim → real) — model trained in Isaac fails on real RGB. Diagnose: side-by-side renders vs photographs; check sensor noise model, ISP gamma, lighting. Fix: domain randomisation, photometric tuning, modest fine-tune on real.
Catastrophic forgetting during fine-tune — model loses pre-training capabilities after a few epochs on a small target dataset. Fix: lower LR, freeze backbone, mix in original-domain data (replay), or use LoRA / adapter layers.
Model confidence ≠ accuracy — a softmax score of 0.95 means “well-trained on data like this”, not “95 % probability correct”. Calibrate with temperature scaling (Guo 2017) or, better, conformal prediction sets.
Spurious correlation / shortcut learning — model classifies tanks by photo time of day. Diagnose with Grad-CAM, integrated gradients, or counterfactual augmentation. Fix: targeted augmentation, mask out background, balance dataset.
Adversarial / patch attacks — a printed sticker on a stop sign defeats a detector. Robotics-safety mitigation: ensemble across modalities (LiDAR + radar + camera), out-of-distribution rejection, latency-bounded sanity checks.
Slow per-pixel ops in Python — Looping over pixels in a NumPy array runs at 1 fps. Vectorise via cv2.* calls, or move to GPU (CuPy, Torch tensors, CUDA streams).
Lens dirt — a fingerprint, dust, or water drop on the lens silently corrupts every downstream prediction. Sanity-check: histogram of central image patch vs corners; a per-camera “dirty-lens classifier” running at 1 Hz catches most cases.
HDR / saturation in extreme dynamic range — sunlight through a tunnel exit saturates the sensor for 100+ ms. Mitigation: HDR sensors (Sony IMX490, 120+ dB), bracketed exposures, sensor-level tone mapping.
Latency spikes from CPU contention — perception jitter caused by other processes. Pin CPU affinities; use isolcpus kernel parameter; route inference exclusively to GPU/NPU.
GPU / TensorRT memory leak — long-running pipeline OOMs after hours. Pre-allocate all tensors; reuse contexts; monitor with nvidia-smi --query-gpu=memory.used --format=csv --loop=10 (on Jetson, where nvidia-smi support is limited, use tegrastats).
Network bandwidth saturation when streaming 4K — see [[Robotics/sensors-perception]] worked example D. Compress on-camera (H.265, JPEG-XS), stream ROIs only, or do CV at the edge.
Coordinate-frame mistakes — see Sec 6 conventions table. Verify by projecting a known marker (AprilTag, ChArUco) and confirming the projected pixels match observation.
Pose ambiguity for symmetric objects — a bottle is SO(2)-symmetric about its axis, so the rotation about that axis is unobservable. Don’t return a single pose; return a distribution (one rotation per symmetry class) and let the grasp planner pick.
Label bias / class skew — COCO has 100× more “person” annotations than “hair drier”; a model trained on COCO is biased toward common classes. For robotic deployment, train on dataset matched to deployment distribution.
Dataset leakage — train images contain the same scene under slightly different lighting as the test set. Measured accuracy is wildly optimistic. Strict scene-level splits, not image-level.
Photometric drift — sun angle changes; descriptor matching degrades over the day. Robust features (SuperPoint, ORB) > photometric VO (DSO) in outdoor settings.
Face / licence-plate visibility — privacy regulation (GDPR, CCPA) forbids storing identifiable images. Blur faces and plates on-device (MediaPipe, YOLO-face) before any persistence.
Stereo disparity collapse on textureless surfaces — blank walls, white floors, glass. Fix: active stereo (RealSense projector), LiDAR fusion, or learned matchers (RAFT-Stereo, CREStereo) that hallucinate plausible disparity.
Pose flip on 180° symmetric objects — a cylinder with a tiny asymmetry near one end may be matched flipped 50 % of the time. Augment training with the symmetric pose and use the symmetry-aware loss (Wang 2019).
AprilTag / ArUco false positives — fast-moving cameras motion-blur tags into false detections. Set marker decision-margin minimums; reject if reprojection error > 1 px.
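Several of the checks above (coordinate-frame verification with a known marker, rejecting fiducial detections above 1 px) come down to the same reprojection-error gate. The sketch below compares detected corner pixels with the corners re-projected from the estimated pose; the function names and the sample corner coordinates are assumptions, and the projection itself is taken as given.

```python
import math


def rms_reprojection_error(observed_px, projected_px):
    """RMS pixel distance between detected and re-projected corners."""
    sq = [(u - up) ** 2 + (v - vp) ** 2
          for (u, v), (up, vp) in zip(observed_px, projected_px)]
    return math.sqrt(sum(sq) / len(sq))


def accept_detection(observed_px, projected_px, max_error_px=1.0):
    """Accept a marker detection only if its reprojection error is within the gate."""
    return rms_reprojection_error(observed_px, projected_px) <= max_error_px


corners = [(100.0, 100.0), (200.0, 100.0), (200.0, 200.0), (100.0, 200.0)]
good = [(u + 0.3, v + 0.4) for u, v in corners]   # 0.5 px off: plausible detection
bad = [(u + 3.0, v + 4.0) for u, v in corners]    # 5 px off: likely blur or a wrong frame

print(accept_detection(corners, good), accept_detection(corners, bad))  # True False
```

A consistently large error across all markers usually points at calibration or frame conventions rather than at individual detections.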

8. Case studies

Boston Dynamics Stretch — warehouse box manipulation

The Stretch robot (commercial release 2022) is purpose-built for unloading 23 kg boxes from trailers at ~800 boxes/hr. Its perception stack is a Detectron2-derived custom CNN trained on millions of synthetic + real labelled box images, running on an onboard x86 compute node (no Jetson — the platform has ample power budget). The detector segments box top-faces and front-faces from a wrist-mounted RGB-D camera (Intel RealSense-class); a downstream grasp planner picks the closest visible box, then the 7-DoF “smart gripper” arm executes the grasp.

Sim-to-real comes from Boston Dynamics’ internal simulator (an Isaac Sim derivative) plus a few hundred thousand real-trailer images collected from customer deployments. The system tolerates 30+% box overlap, missing labels, banged-up cardboard, and “weird stacks” (Brazilian-pallet diagonal patterns). The vision stack runs at ~10 Hz; the bottleneck is mechanical motion, not perception.

Skydio X10 — vision-only autonomous drone

The Skydio X10 (2023) deploys six Sony IMX-class CMOS sensors in a hemispherical hex configuration, fused into a 360° depth + tracking system on an onboard NVIDIA Jetson Orin NX. The perception software is proprietary (Skydio Autonomy Engine 3+) but reportedly combines stereo from each adjacent camera pair, a learned monocular depth network for far-field, and a fused 3D occupancy grid at 30 Hz.

The signature behaviour — follow-me through dense forest at 10 m/s — relies on a person-tracker (multi-camera person re-ID + Kalman filter) plus a real-time path planner that operates on the occupancy grid. Skydio claims sub-second-latency obstacle avoidance, no LiDAR, no radar. The architectural bet — that vision plus enough cameras can replace LiDAR — has proven out for small drones (different conclusion from Boston Dynamics Spot or Waymo, where mass/power budgets favour LiDAR). Production deployments include US DoT bridge inspection, military reconnaissance (Skydio X2D), and forest firefighting.

Covariant Brain — manipulation foundation model

Covariant (whose founders and part of its team joined Amazon in 2024 under a technology-licensing deal, rather than a full acquisition) operates one of the largest robotic-manipulation foundation models in the field. Their Covariant Brain (published RSS 2022, updated 2023-2024) is a transformer-based vision-action model trained on millions of real bin-picks captured across customer warehouses (Knapp, GAP, USPS, RadialSpring). Input: RGB-D from a fixed overhead camera plus a wrist camera. Output: a grasp pose, an approach trajectory, and a confidence score.

The system claims human-level reliability on 10 000+ SKU bins with no per-SKU configuration, a sharp contrast with traditional industrial vision where every part requires CAD + manual setup. Covariant attributes this to scale: a single model trained on the union of all customer data generalises across customers, while a per-customer model trained on a 10 k-pick dataset would not. The platform shipped on RB-style cobots (Yaskawa, ABB, Universal Robots) before the 2024 Amazon deal broadened it onto Amazon’s internal fleet.

9. Cross-references

[[Robotics/sensors-perception]] — camera, LiDAR, depth-sensor hardware that feeds this layer; the bandwidth, sync, and calibration concerns are upstream of every model in this note.
[[Robotics/slam]] — companion note; SLAM is where this note’s feature-matching, depth estimation, and pose estimation get fused into a coherent map + trajectory.
[[Robotics/bayesian-estimation]] — planned; the Kalman / EKF / particle-filter mathematics for fusing detection observations over time into trackers (BoT-SORT and ByteTrack run a per-track Kalman filter but associate detections heuristically; production safety systems usually want an explicit EKF).
[[Robotics/end-effectors]] — planned; grasp planners consume 6-DoF poses from the FoundationPose / GraspNet outputs covered here.
[[Robotics/manipulator-design]] — planned; payload, reach, and wrist DoF constrain how a vision-derived grasp can be executed.
[[Robotics/path-planning]] — planned; the free-space mask and occupancy grid from segmentation feed the planner.
[[Robotics/kinematics-dh]] — extrinsic camera-to-base calibration uses the same SE(3) chain math.
[[Robotics/impedance-control]] — visually-servoed contact tasks blend the pose estimates here with force feedback there.
[[Engineering/microcontrollers]] — embedded inference targets (NXP S32, STM32MP, Ambarella).
[[Engineering/fpga-design]] — Xilinx Kria / Lattice CrossLink fixed-latency vision pipelines.
[[Languages/Tier3/3d-scene]] — point-cloud / mesh formats (PLY, USD, glTF, E57) used to store CAD references for ICP and renders for synthetic data.
[[Languages/Tier3/robotics-control]] — vision_msgs, sensor_msgs/Image, sensor_msgs/PointCloud2 schemas; ROS 2 conventions for perception topics.
[[Languages/Tier3/genai-llm-runtime]] — planned; VLM serving stacks (vLLM, TensorRT-LLM, llama.cpp) for local LLaVA / BLIP-2 inference on robots.

10. Citations

Hartley, R. & Zisserman, A. (2003). Multiple View Geometry in Computer Vision (2nd ed.). Cambridge University Press. The canonical reference for projective geometry, F/E matrices, calibration.
Szeliski, R. (2022). Computer Vision: Algorithms and Applications (2nd ed.). Springer. Free PDF at szeliski.org/Book. Definitive practitioner reference.
Forsyth, D. A. & Ponce, J. (2011). Computer Vision: A Modern Approach (2nd ed.). Pearson.
Lowe, D. (2004). “Distinctive Image Features from Scale-Invariant Keypoints.” IJCV 60(2), 91–110. DOI 10.1023/B:VISI.0000029664.99615.94 (SIFT).
Bay, H., Tuytelaars, T. & Van Gool, L. (2006). “SURF: Speeded Up Robust Features.” ECCV.
Rublee, E., Rabaud, V., Konolige, K. & Bradski, G. (2011). “ORB: An efficient alternative to SIFT or SURF.” ICCV.
Hirschmüller, H. (2008). “Stereo Processing by Semiglobal Matching and Mutual Information.” IEEE TPAMI 30(2), 328–341.
Nistér, D. (2004). “An Efficient Solution to the Five-Point Relative Pose Problem.” IEEE TPAMI 26(6), 756–770.
Gao, X. S., Hou, X. R., Tang, J. & Cheng, H. F. (2003). “Complete solution classification for the perspective-three-point problem.” IEEE TPAMI 25(8), 930–943 (P3P).
Tsai, R. & Lenz, R. (1989). “A New Technique for Fully Autonomous and Efficient 3D Robotics Hand/Eye Calibration.” IEEE Trans. Robotics & Automation 5(3), 345–358.
Park, F. C. & Martin, B. J. (1994). “Robot Sensor Calibration: Solving AX = XB on the Euclidean Group.” IEEE Trans. RA 10(5), 717–721.
Daniilidis, K. (1999). “Hand-Eye Calibration Using Dual Quaternions.” IJRR 18(3), 286–298.
Besl, P. J. & McKay, N. D. (1992). “A Method for Registration of 3-D Shapes.” IEEE TPAMI 14(2), 239–256 (ICP).
Chen, Y. & Medioni, G. (1992). “Object modelling by registration of multiple range images.” Image and Vision Computing 10(3), 145–155 (point-to-plane ICP).
Segal, A., Haehnel, D. & Thrun, S. (2009). “Generalized-ICP.” RSS (GICP).
Yang, H., Shi, J. & Carlone, L. (2020). “TEASER: Fast and Certifiable Point Cloud Registration.” IEEE T-RO 37(2), 314–333.
DeTone, D., Malisiewicz, T. & Rabinovich, A. (2018). “SuperPoint: Self-Supervised Interest Point Detection and Description.” CVPRW.
Sarlin, P.-E., DeTone, D., Malisiewicz, T. & Rabinovich, A. (2020). “SuperGlue: Learning Feature Matching with Graph Neural Networks.” CVPR.
Lindenberger, P., Sarlin, P.-E. & Pollefeys, M. (2023). “LightGlue: Local Feature Matching at Light Speed.” ICCV.
Sun, J. et al. (2021). “LoFTR: Detector-Free Local Feature Matching with Transformers.” CVPR.
Ronneberger, O., Fischer, P. & Brox, T. (2015). “U-Net: Convolutional Networks for Biomedical Image Segmentation.” MICCAI.
He, K., Gkioxari, G., Dollár, P. & Girshick, R. (2017). “Mask R-CNN.” ICCV.
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). “Focal Loss for Dense Object Detection.” ICCV (RetinaNet).
Carion, N. et al. (2020). “End-to-End Object Detection with Transformers.” ECCV (DETR).
Zhao, Y. et al. (2024). “DETRs Beat YOLOs on Real-time Object Detection.” CVPR (RT-DETR).
Cheng, B. et al. (2022). “Masked-attention Mask Transformer for Universal Image Segmentation.” CVPR (Mask2Former).
Kirillov, A. et al. (2023). “Segment Anything.” ICCV (SAM).
Ravi, N. et al. (2024). “SAM 2: Segment Anything in Images and Videos.” arXiv:2408.00714.
Radford, A. et al. (2021). “Learning Transferable Visual Models from Natural Language Supervision.” ICML (CLIP).
Oquab, M. et al. (2023). “DINOv2: Learning Robust Visual Features without Supervision.” arXiv:2304.07193.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K. & Koltun, V. (2020). “Towards Robust Monocular Depth Estimation.” IEEE TPAMI (MiDaS).
Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J. & Zhao, H. (2024). “Depth Anything V2.” NeurIPS.
Mildenhall, B. et al. (2020). “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.” ECCV.
Kerbl, B., Kopanas, G., Leimkühler, T. & Drettakis, G. (2023). “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” SIGGRAPH.
Teed, Z. & Deng, J. (2020). “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow.” ECCV.
Lipson, L., Teed, Z. & Deng, J. (2021). “RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching.” 3DV.
Chang, J.-R. & Chen, Y.-S. (2018). “Pyramid Stereo Matching Network.” CVPR (PSMNet).
Xiang, Y., Schmidt, T., Narayanan, V. & Fox, D. (2018). “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes.” RSS.
Wang, C. et al. (2019). “DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion.” CVPR.
Labbé, Y., Manuelli, L., Mousavian, A., Tyree, S., Birchfield, S., Tremblay, J., Carpentier, J., Aubry, M., Fox, D. & Sivic, J. (2022). “MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare.” CoRL.
Wen, B., Yang, W., Kautz, J. & Birchfield, S. (2024). “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects.” CVPR.
Fang, H.-S., Wang, C., Gou, M. & Lu, C. (2020). “GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping.” CVPR.
Jiang, Z., Zhu, Y., Svetlik, M., Fang, K. & Zhu, Y. (2021). “Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via Implicit Representations.” RSS (GIGA).
Sundermeyer, M., Mousavian, A., Triebel, R. & Fox, D. (2021). “Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes.” ICRA.
Tobin, J. et al. (2017). “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World.” IROS.
Denninger, M. et al. (2019). “BlenderProc.” arXiv:1911.01911.
Hinton, G., Vinyals, O. & Dean, J. (2015). “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531.
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). “On Calibration of Modern Neural Networks.” ICML.
Northcutt, C., Jiang, L. & Chuang, I. (2021). “Confident Learning: Estimating Uncertainty in Dataset Labels.” JAIR.
Liu, H., Li, C., Wu, Q. & Lee, Y. J. (2023). “Visual Instruction Tuning.” NeurIPS (LLaVA).
Li, J., Li, D., Savarese, S. & Hoi, S. (2023). “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs.” ICML.
Lin, T.-Y. et al. (2014). “Microsoft COCO: Common Objects in Context.” ECCV.
Geiger, A., Lenz, P. & Urtasun, R. (2012). “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite.” CVPR.
Caesar, H. et al. (2020). “nuScenes: A Multimodal Dataset for Autonomous Driving.” CVPR.
Sun, P. et al. (2020). “Scalability in Perception for Autonomous Driving: Waymo Open Dataset.” CVPR.
Xiang, Y., Schmidt, T., Narayanan, V. & Fox, D. (2017). “YCB-Video Dataset.” (used in PoseCNN paper).
Hodaň, T. et al. (2017). “T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects.” WACV.
Hinterstoisser, S. et al. (2012). “Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes.” ACCV (LineMod).
Hodaň, T. et al. (2024). “BOP Challenge 2023 / 2024 Reports.” (annual).
OpenCV documentation, opencv.org (4.10+).
PyTorch / TorchVision documentation, pytorch.org.
Ultralytics YOLOv8 / v9 / v10 / v11 documentation, docs.ultralytics.com.
Detectron2 documentation, github.com/facebookresearch/detectron2.
NVIDIA Isaac Sim / Replicator documentation (2024).
NVIDIA TensorRT 10.x Developer Guide.
Open3D documentation, www.open3d.org.
Boston Dynamics (2024). Stretch Technical Brief.
Skydio (2023). Skydio X10 Product Brief.
Covariant (2024). “RFM-1: A Foundation Model for Robotics.” Company announcement, March 2024.
