Computer Vision for Robotics
1. At a glance
Where [[Robotics/sensors-perception]] answers “what photons hit the sensor?”, this note answers “what does that pixel array mean?”. Computer vision is the algorithmic layer that converts images (and depth maps, point clouds, event streams) into the symbolic objects a planner can act on: an object class, a 6-DoF pose, a depth map, a motion vector, a free-space mask, an action label.
Modern robotic CV in 2026 divides into three coexisting paradigms:
- Classical CV — geometry-first, hand-engineered features. Camera calibration, epipolar geometry, SIFT/ORB features, RANSAC, ICP, stereo block-matching, Lucas-Kanade flow. Still the backbone of every SLAM pipeline ([[Robotics/slam]]), every metric structure-from-motion solve, every hand-eye calibration, and every reprojection-error tracker. Deterministic, interpretable, no training data.
- Deep learning CV — CNNs and Vision Transformers trained on labelled datasets. YOLO family for detection, U-Net / DeepLab / Mask2Former for segmentation, MiDaS / Depth Anything for monocular depth, FoundationPose / Megapose for 6-DoF pose, RAFT for optical flow. Dominant for any task with a perception-only output (detect, segment, classify, regress). Requires training data; opaque; statistically reliable on the training distribution.
- VLM / foundation models — CLIP, SAM, GPT-4V, Gemini-Pro-Vision, LLaVA. Open-vocabulary: “find the red mug” / “describe what’s in front of the robot” / “is this object stackable on that one?” Zero or few-shot capable; high latency; expensive; the right answer when the object set is unbounded or the task is semantic.
Production robotic stacks blend all three. A bin-picker uses deep detection (YOLOv8 → “bottles here”) + classical 6-DoF refinement (ICP between segmented point cloud and CAD) + classical hand-eye calibration (Tsai-Lenz solve once at deploy). An AMR uses deep segmentation (free-space mask) + classical visual odometry (ORB + RANSAC) + VLM for occasional “what is this unknown object blocking the aisle?” queries.
First ask before architecting any robotic CV pipeline:
- Closed-set vs open-set? If the object list is fixed (the 47 SKUs in this warehouse), train a closed-set detector — 100× cheaper at inference than a VLM. If anything could show up, accept VLM cost.
- Edge or cloud compute? Edge constraints (Jetson Orin Nano, 8 W) bound model size to ~10-50 MB; cloud has no such limit but adds 100-500 ms round-trip and depends on connectivity.
- How wrong is “wrong”? A misclassified object in a sorting line is a returned package; a misclassified pedestrian under an AV is a death. Drives confidence calibration, redundant sensors, fallback policies.
- Do you have real-world training data? If not, plan sim-to-real (Isaac Replicator, BlenderProc) from day one — synthetic-only models without domain randomisation fail catastrophically on real robots.
- What’s the latency budget? A cobot collision predictor at 30 Hz tolerates ~33 ms inference; an autonomous-racing perception system at 200 Hz tolerates 5 ms. Pick model size to match.
2. First principles
Image formation and the pinhole model
A photon emanating from a 3D point P = (X, Y, Z) in the world frame passes through the lens and lands on the sensor at pixel p = (u, v). Under the pinhole model, with p̃ and P̃ the homogeneous forms of p and P and equality holding up to scale:
p̃ = K · [R | t] · P̃
with K the camera intrinsic matrix:
K = [fx 0 cx]
[ 0 fy cy]
[ 0 0 1]
f_x, f_y are focal lengths in pixels, (c_x, c_y) is the principal point, and [R | t] is the extrinsic SE(3) transform from world to camera. The full chain is lens → Bayer-filter CMOS → ISP → image: photons demosaiced into RGB, white-balanced, gamma-corrected, possibly YUV-encoded for transport. The ISP is photometrically destructive; CV pipelines that care about radiometry (HDR, photometric SLAM) capture RAW Bayer upstream.
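The projection equation can be sketched in a few lines of dependency-free Python. The intrinsics (600 px focal length, principal point at 320, 240), the identity extrinsics, and the sample point are illustrative assumptions, and skew is taken as zero as in the K shown above.

```python
def project_point(K, R, t, P_world):
    """Project a 3D world point to pixel coordinates (u, v) with the pinhole model."""
    # World -> camera frame: P_cam = R * P_world + t
    P_cam = [sum(R[i][j] * P_world[j] for j in range(3)) + t[i] for i in range(3)]
    if P_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Apply K (zero skew), then divide by depth to leave homogeneous coordinates
    u = (K[0][0] * P_cam[0] + K[0][2] * P_cam[2]) / P_cam[2]
    v = (K[1][1] * P_cam[1] + K[1][2] * P_cam[2]) / P_cam[2]
    return u, v


# Illustrative values only (assumed, not from any real calibration)
K = [[600.0, 0.0, 320.0],
     [0.0, 600.0, 240.0],
     [0.0, 0.0, 1.0]]
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_zero = [0.0, 0.0, 0.0]

u, v = project_point(K, R_identity, t_zero, [0.1, -0.05, 2.0])
print(u, v)
```

A point 10 cm to the right of the optical axis at 2 m depth moves 30 px right of the principal point, which is a quick way to sanity-check a calibration by hand.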
Real lenses deviate from pinhole. Brown-Conrady radial-tangential distortion is the default model in OpenCV:
x' = x·(1 + k1·r² + k2·r⁴ + k3·r⁶) + 2·p1·x·y + p2·(r² + 2·x²)
y' = y·(1 + k1·r² + k2·r⁴ + k3·r⁶) + p1·(r² + 2·y²) + 2·p2·x·y
Fisheye and wide-angle (FoV > 100°) require Kannala-Brandt (4 coeffs), MEI omnidirectional, or the double-sphere model (Usenko 2018).
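As a hedged illustration of the Brown-Conrady equations above, the sketch below applies them to a normalised image point (x, y), that is, a point already divided by depth and not yet multiplied by K. The coefficient values are made up for the example.

```python
def distort(x, y, k1, k2, k3, p1, p2):
    """Apply radial-tangential (Brown-Conrady) distortion to a normalised point."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d


# Mild barrel distortion (assumed coefficient), point on the x axis
x_d, y_d = distort(0.5, 0.0, k1=-0.1, k2=0.0, k3=0.0, p1=0.0, p2=0.0)
print(x_d, y_d)
```

With negative k1 the point is pulled toward the image centre, which is the barrel distortion typical of short focal lengths.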
Epipolar geometry — two views of a scene
Given two pinhole cameras observing the same 3D point, corresponding image points x_1 and x_2 satisfy the epipolar constraint:
x_2^T · F · x_1 = 0
with F the 3×3 fundamental matrix (rank 2). If both intrinsics are known, normalise points and use the essential matrix:
E = K_2^T · F · K_1 and E = [t]_× · R
Decomposing E via SVD gives four (R, t) candidates; disambiguate by chirality (the reconstructed 3D points must be in front of both cameras). This is the foundation of stereo, structure-from-motion, and the front-end of visual SLAM. The 5-point relative-pose solver (Nistér 2004) is the standard minimal solver for the calibrated case; the 7-point and 8-point algorithms handle the uncalibrated case by estimating F directly.
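A small synthetic check can make the epipolar constraint concrete. The sketch below builds E = [t]_× · R for an assumed relative pose (10° yaw, 20 cm baseline), projects one made-up 3D point into both normalised cameras, and evaluates x_2^T · E · x_1. The pose, the point, and the helper names are all illustrative assumptions.

```python
import math


def skew(t):
    """Cross-product matrix [t]x, so that skew(t) * v equals t x v."""
    return [[0.0, -t[2], t[1]],
            [t[2], 0.0, -t[0]],
            [-t[1], t[0], 0.0]]


def matmul3(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec3(A, v):
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]


def rot_y(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def epipolar_residual(E, x1, x2):
    """Evaluate x2^T * E * x1 for normalised homogeneous image points."""
    Ex1 = matvec3(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))


# Assumed relative pose: P2 = R * P1 + t
R = rot_y(math.radians(10.0))
t = [-0.2, 0.0, 0.0]
E = matmul3(skew(t), R)

# One made-up 3D point, expressed in camera 1 and camera 2
P1 = [0.3, -0.1, 2.5]
P2 = [r + ti for r, ti in zip(matvec3(R, P1), t)]
x1 = [P1[0] / P1[2], P1[1] / P1[2], 1.0]
x2 = [P2[0] / P2[2], P2[1] / P2[2], 1.0]

residual = epipolar_residual(E, x1, x2)             # numerically zero
in_front_of_both = P1[2] > 0 and P2[2] > 0          # chirality check
bad_residual = epipolar_residual(E, x1, [x2[0], x2[1] + 0.05, 1.0])  # wrong match
print(residual, in_front_of_both, bad_residual)
```

The true correspondence gives a residual at floating-point noise level, while shifting the second point off its epipolar line gives a clearly non-zero value; RANSAC inlier tests threshold a distance derived from exactly this quantity.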
Classical feature detectors and descriptors
A feature is a local image patch distinctive enough to match across views.
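- Harris corners (1988) — detect points where the local structure tensor M has two large eigenvalues, i.e. strong intensity change in both image directions. The corner score det(M) − k·trace(M)² avoids computing the eigenvalues explicitly.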
- SIFT (Lowe 2004) — DoG scale-space extrema + 128-D gradient-histogram descriptor. Scale, rotation, and partially illumination invariant. Patent expired 2020; now the geometric gold standard.
- SURF (Bay 2006) — Hessian-based, faster SIFT. Patent still relevant in commercial deployments.
- ORB (Rublee 2011, OpenCV’s default) — FAST keypoints + rotated BRIEF binary descriptor. ~10× faster than SIFT, fixed-length 256-bit binary descriptor matchable with Hamming distance. ORB-SLAM family is built on this.
- AKAZE (Alcantarilla 2013) — nonlinear scale-space, M-LDB descriptor. Better invariance than ORB at modest extra cost.
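For the Harris detector in the list above, the usual corner score is det(M) − k·trace(M)², where M is the 2×2 structure tensor of summed gradient products. The sketch below evaluates that score for three hand-picked tensors; the tensor entries and k = 0.04 (a commonly used empirical value) are assumptions for illustration.

```python
def harris_response(ixx, ixy, iyy, k=0.04):
    """Harris score from structure-tensor entries [[ixx, ixy], [ixy, iyy]]."""
    det = ixx * iyy - ixy * ixy
    trace = ixx + iyy
    return det - k * trace * trace


corner = harris_response(10.0, 0.0, 10.0)   # strong gradients in both directions
edge = harris_response(10.0, 0.0, 0.1)      # strong gradient in one direction only
flat = harris_response(0.1, 0.0, 0.1)       # almost no gradient
print(corner, edge, flat)
```

Corners score strongly positive, edges negative, and flat regions near zero, so a detector keeps local maxima above a positive threshold.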
Matching: nearest-neighbour in descriptor space, with Lowe’s ratio test (accept if d_1 / d_2 < 0.7-0.8) to reject ambiguous matches, then RANSAC to filter geometric outliers (typically fitting F, E, or a homography).
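A minimal sketch of the ratio test on binary descriptors, using Hamming distance as for ORB. The descriptors here are toy 8-bit integers rather than real 256-bit ORB output, and the 0.75 threshold is one value from the range quoted above.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded binary descriptors."""
    return bin(a ^ b).count("1")


def ratio_test_matches(query, train, ratio=0.75):
    """Return (query_idx, train_idx, distance) for matches passing Lowe's ratio test."""
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((hamming(q, d), ti) for ti, d in enumerate(train))
        if len(dists) < 2:
            continue
        (d1, best), (d2, _) = dists[0], dists[1]
        # d1 < ratio * d2 is the ratio test without a division by zero
        if d1 < ratio * d2:
            matches.append((qi, best, d1))
    return matches


query = [0b10110010, 0b00001111]
train = [0b10110011, 0b00001110, 0b00001101]

matches = ratio_test_matches(query, train)
print(matches)
```

The first query has one clearly closest neighbour and is kept; the second is equally close to two candidates and is rejected as ambiguous. The surviving matches would then go to RANSAC.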
Modern (learned) features
A 2018-onward shift: replace hand-crafted SIFT/ORB with networks trained for repeatability and matchability.
- SuperPoint (DeTone 2018) — self-supervised joint detector + descriptor; the paper reports about 70 fps on 480×640 images on a Titan X GPU.
- R2D2 (Revaud 2019) — adds explicit reliability/repeatability scoring.
- DISK (Tyszkiewicz 2020) — reinforcement-learning trained.
- SuperGlue (Sarlin 2020) — graph-neural-network matcher over SuperPoint keypoints; massive accuracy gain over nearest-neighbour matching.
- LightGlue (Lindenberger 2023) — same idea, ~5-10× faster; production-deployable on Jetson.
- LoFTR (Sun 2021) — detector-free, dense transformer matching; great on textureless surfaces.
Classical features still win on raw speed and on radically out-of-domain imagery (medical, thermal, microscopy) where there’s no training data.
3. Practical math / worked examples
Worked example A — Hand-eye calibration for arm + wrist camera
The setup: a 6-DoF industrial arm with a camera bolted to the wrist. The camera sees a checkerboard fixed in the world. We need the constant SE(3) transform X = camera-to-gripper, so that knowing gripper pose lets us project camera observations into the world.
Procedure (Tsai-Lenz 1989):
- Move the arm to 20 distinct poses, each with the checkerboard fully visible. Vary all 6 DoF; in particular include rotations about all three axes — pure translations leave X under-determined.
- At each pose, capture an image and solve PnP (`cv2.solvePnP` with the `SOLVEPNP_IPPE` or `SOLVEPNP_SQPNP` solver; `SOLVEPNP_IPPE_SQUARE` accepts only the four corners of a single square marker) to get the checkerboard pose in the camera frame, T_cb_i.
- Record the robot’s reported gripper-to-base transform T_gb_i from the controller.
- Form pairs of relative motions:
  A_i = T_gb_{i+1}^{-1} · T_gb_i (gripper motion between poses)
  B_i = T_cb_{i+1} · T_cb_i^{-1} (camera motion between poses)
- Solve A_i · X = X · B_i for the unknown X. Tsai-Lenz decouples rotation and translation: first solve R_X via Rodrigues vectors (linear system over all i), then t_X (linear).
- Verify by computing residuals on held-out poses.
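Most hand-eye bugs are frame-convention bugs, so it can help to confirm the A_i · X = X · B_i bookkeeping on synthetic data before touching a real solver. The sketch below assumes a made-up ground-truth X, a made-up fixed board pose, and four made-up gripper poses, then checks that the relative motions defined in the procedure satisfy the equation. It does not solve for X; every numeric value and helper name is an assumption.

```python
import math


def matmul4(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def inv_rigid(T):
    """Invert a rigid transform: the inverse of [R t; 0 1] is [R^T, -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [-sum(Rt[i][j] * T[j][3] for j in range(3)) for i in range(3)]
    return [Rt[0] + [t[0]], Rt[1] + [t[1]], Rt[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]


def pose(rx, ry, rz, tx, ty, tz):
    """4x4 transform with rotation Rz(rz) * Ry(ry) * Rx(rx) and translation (tx, ty, tz)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx, tx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx, ty],
        [-sy, cy * sx, cy * cx, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def max_abs_diff(P, Q):
    return max(abs(P[r][c] - Q[r][c]) for r in range(4) for c in range(4))


# Synthetic ground truth (all values made up for this check)
X_true = pose(0.02, -0.03, 1.57, 0.01, 0.00, 0.08)      # camera -> gripper
T_board_in_base = pose(0.0, 0.0, 0.0, 0.6, 0.0, 0.0)    # fixed checkerboard

T_gb = [  # gripper -> base at each robot pose, rotating about all three axes
    pose(0.1, 0.0, 0.0, 0.4, 0.0, 0.5),
    pose(0.0, 0.3, 0.1, 0.5, 0.1, 0.4),
    pose(-0.2, 0.1, 0.4, 0.3, -0.1, 0.6),
    pose(0.3, -0.2, -0.3, 0.45, 0.05, 0.55),
]

# What PnP would return at each pose: the board pose in the camera frame
T_cb = [matmul4(inv_rigid(X_true), matmul4(inv_rigid(Tg), T_board_in_base)) for Tg in T_gb]

residuals = []
for i in range(len(T_gb) - 1):
    A = matmul4(inv_rigid(T_gb[i + 1]), T_gb[i])   # gripper motion between poses
    B = matmul4(T_cb[i + 1], inv_rigid(T_cb[i]))   # camera motion between poses
    residuals.append(max_abs_diff(matmul4(A, X_true), matmul4(X_true, B)))

worst_residual = max(residuals)
print(worst_residual)
```

If the residual is not at floating-point noise level with your own conventions substituted in, one of the transforms is inverted relative to what the solver expects.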
Typical residuals:
| Robot class | Translational residual | Angular residual |
|---|---|---|
| Industrial arm (UR5e, KUKA KR6, Fanuc LR Mate) | 0.5–2 mm | 0.05–0.2° |
| Cobot (Franka Panda, UR10e) | 1–3 mm | 0.1–0.3° |
| Educational (UFactory xArm) | 2–5 mm | 0.2–0.5° |
| Mobile manipulator (Spot arm) | 5–10 mm | 0.3–0.8° |
Alternative solvers: Park-Martin 1994 (Lie-algebra formulation), Daniilidis 1999 (dual quaternions, jointly solves R and t — best when rotation magnitudes are small). All three are in OpenCV’s cv2.calibrateHandEye (method flags CALIB_HAND_EYE_TSAI, CALIB_HAND_EYE_PARK, CALIB_HAND_EYE_DANIILIDIS).
Failure modes: all 20 poses share a near-common axis of rotation (degenerate); checkerboard not rigidly fixed; gripper pose reports stale value (low-latency capture sync matters).
Worked example B — YOLOv8 inference on Jetson
YOLOv8 (Ultralytics, 2023) is the de-facto detector on edge robots in 2024-2026. Five model sizes — nano, small, medium, large, x-large — span from 3.2 M to 68 M parameters.
Workflow:
- Train (or download) FP32 PyTorch weights (`.pt`).
- Export to ONNX: `yolo export model=yolov8m.pt format=onnx imgsz=640 simplify=True`.
- Build TensorRT engine: `trtexec --onnx=yolov8m.onnx --saveEngine=yolov8m.engine --fp16` (or `--int8 --calib=calib.txt` for INT8).
- Run inference via TensorRT Python API or Ultralytics `model.predict()` with the engine.
Measured throughput (640×640 input, 1 image batch, COCO 80 classes):
| Platform | Precision | YOLOv8n | YOLOv8s | YOLOv8m | YOLOv8l |
|---|---|---|---|---|---|
| Jetson Orin Nano (40 TOPS INT8) | INT8 | ~130 fps | ~70 fps | ~30 fps | ~15 fps |
| Jetson Orin NX 16 GB (100 TOPS) | INT8 | ~280 fps | ~150 fps | ~70 fps | ~35 fps |
| Jetson AGX Orin 64 GB (275 TOPS) | INT8 | ~600 fps | ~330 fps | ~150 fps | ~80 fps |
| Jetson AGX Orin 64 GB | FP16 | ~280 fps | ~160 fps | ~80 fps | ~40 fps |
| RTX 4090 | FP16 | ~2400 fps | ~1500 fps | ~800 fps | ~400 fps |
COCO accuracy (mAP@50-95): YOLOv8n 37.3, YOLOv8s 44.9, YOLOv8m 50.2, YOLOv8l 52.9, YOLOv8x 53.9. INT8 quantisation typically costs 0.5-2 mAP points relative to FP32 — usually acceptable.
Decision rule: start with YOLOv8s INT8 for any edge robot; move to YOLOv8m only if accuracy budget demands. Anything above YOLOv8l on an edge device is wasted unless you’re running it at 5-10 Hz for an analytics use case rather than reactive control.
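The decision rule can be turned into a small lookup. The sketch below copies the Jetson Orin Nano INT8 throughput row and the COCO mAP figures quoted above, then picks the most accurate model whose inference fits within a fraction of the control period. The `headroom` parameter (half the period reserved for capture, pre-processing, and post-processing) and the function name are assumptions, not part of any toolkit.

```python
# Figures copied from the throughput table and mAP list above
ORIN_NANO_INT8_FPS = {"yolov8n": 130, "yolov8s": 70, "yolov8m": 30, "yolov8l": 15}
COCO_MAP = {"yolov8n": 37.3, "yolov8s": 44.9, "yolov8m": 50.2, "yolov8l": 52.9}


def pick_model(fps_table, accuracy_table, control_rate_hz, headroom=0.5):
    """Most accurate model whose inference fits in `headroom` of the frame period."""
    required_fps = control_rate_hz / headroom
    feasible = [name for name, fps in fps_table.items() if fps >= required_fps]
    if not feasible:
        return None  # nothing fits: lower the rate, shrink the input, or change hardware
    return max(feasible, key=lambda name: accuracy_table[name])


choice_30hz = pick_model(ORIN_NANO_INT8_FPS, COCO_MAP, control_rate_hz=30)
choice_10hz = pick_model(ORIN_NANO_INT8_FPS, COCO_MAP, control_rate_hz=10)
choice_100hz = pick_model(ORIN_NANO_INT8_FPS, COCO_MAP, control_rate_hz=100)
print(choice_30hz, choice_10hz, choice_100hz)
```

Under these assumptions a 30 Hz loop on the Orin Nano lands on YOLOv8s, which matches the starting point recommended above.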
Worked example C — ICP for 6-DoF object pose refinement
A grasp planner needs the SE(3) pose of a bottle on the table to ±2 mm. A neural pose estimator (FoundationPose, Megapose) returns a coarse initial guess T_0 with ~1 cm / 5° error. Refine with ICP against the CAD model.
Algorithm (point-to-plane ICP, Chen-Medioni 1992 / Low 2004):
- Segment the observed point cloud P_obs (cluster around the detector’s bounding box, remove plane, remove outliers via statistical filter).
- Transform the CAD model’s sampled point cloud P_cad by T_0.
- For each transformed CAD point p_i, find the nearest neighbour q_i in P_obs (KD-tree, ~O(log N) per query).
- Solve the linearised least-squares problem:
  argmin_{ΔT} Σ_i ((ΔT · p_i − q_i) · n_i)²
  where n_i is the surface normal at q_i. Point-to-plane converges faster than point-to-point because it doesn’t penalise tangential sliding.
- Compose T_0 ← ΔT · T_0; repeat until convergence (residual < threshold, e.g. 0.5 mm RMS).
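The claim that point-to-plane does not penalise tangential sliding is easy to see numerically. The sketch below compares the two residuals for a single correspondence on an assumed flat surface (the plane z = 0); the points and offsets are made up, and this shows only the cost term, not a full ICP loop.

```python
def sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def point_to_point_sq(p, q):
    """Squared Euclidean distance between model point p and observed point q."""
    d = sub(p, q)
    return dot(d, d)


def point_to_plane_sq(p, q, n):
    """Squared distance from p to the tangent plane at q with unit normal n."""
    return dot(sub(p, q), n) ** 2


# Observed surface sample on the plane z = 0, normal pointing up
q = [0.0, 0.0, 0.0]
n = [0.0, 0.0, 1.0]

p_slid = [0.010, 0.0, 0.0]     # model point slid 10 mm along the surface
p_lifted = [0.0, 0.0, 0.002]   # model point 2 mm above the surface

slid_pp = point_to_point_sq(p_slid, q)
slid_pl = point_to_plane_sq(p_slid, q, n)
lifted_pp = point_to_point_sq(p_lifted, q)
lifted_pl = point_to_plane_sq(p_lifted, q, n)
print(slid_pp, slid_pl, lifted_pp, lifted_pl)
```

The slid point costs nothing under point-to-plane, so the optimiser is free to move along flat regions while it closes the gap normal to the surface, whereas point-to-point pulls toward a nearest neighbour that is probably the wrong correspondence anyway.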
Typical numbers: 30 iterations, ~0.5 mm RMS residual, converges in 20–80 ms on a CPU for ~5 000 points.
Robust variants:
- Trimmed ICP (Chetverikov 2002) — discard the worst N% of correspondences per iteration. Robust to occlusion.
- Generalised ICP (GICP) (Segal 2009) — plane-to-plane with covariance Mahalanobis distance. Excellent for noisy LiDAR.
- Color ICP (Park 2017) — adds photometric cost; great for RGB-D registration of textured surfaces.
- TEASER++ (Yang 2020) — certifiably-robust global registration before fine ICP; handles 99 % outlier rates.
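The trimming step of Trimmed ICP is small enough to show directly. In this sketch the correspondences are (residual, source index, target index) tuples with made-up residuals in millimetres, and `keep_percent` is an assumed parameter name; a real implementation would re-run this selection at every iteration before solving for ΔT.

```python
def trim_correspondences(pairs, keep_percent=80):
    """Keep the best `keep_percent` % of correspondences, ranked by residual."""
    ranked = sorted(pairs, key=lambda c: c[0])
    n_keep = max(1, len(ranked) * keep_percent // 100)
    return ranked[:n_keep]


# (residual_mm, source_idx, target_idx); the 12 mm pair comes from an occluder
pairs = [(0.4, 0, 7), (0.6, 1, 3), (0.5, 2, 9), (12.0, 3, 1), (0.3, 4, 4)]

kept = trim_correspondences(pairs, keep_percent=80)
kept_sources = [src for _, src, _ in kept]
print(kept_sources)
```

The outlier correspondence is dropped before it can drag the least-squares solution toward the occluding surface.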
When ICP fails: initial guess too far (> half the object’s characteristic length), symmetric objects (the rotation about the symmetry axis is unobservable — return a distribution over poses, not a single one), severe occlusion (< 30 % of object visible).
4. Design heuristics
Pick the model class by task
| Task | Recommended primary | Why |
|---|---|---|
| Bin-picking known SKUs | FoundationPose / Megapose + ICP refine | Pose-to-grasp, robust to clutter |
| Bin-picking unknown objects | GraspNet baseline (Fang 2020), GIGA (Jiang 2021), Contact-GraspNet | Direct grasp-pose regression from point cloud |
| Pick-and-place on conveyor | YOLOv8 detection + tracker + classical 2D pose | Fast, predictable |
| Open-vocabulary “pick the X” | SAM (segment) + CLIP (classify) + grasp net | Zero-shot; latency 200–500 ms |
| Free-space / drivable mask | Mask2Former or BEVFormer | Semantic segmentation at scene scale |
| Drone obstacle avoidance | Monocular MiDaS / Depth Anything + classical VIO | 60–120 Hz reactive |
| Object tracking across frames | BoT-SORT, ByteTrack, OC-SORT on top of YOLO | Identity persistence |
| Visual question answering | LLaVA-1.6, GPT-4V, Gemini-1.5 Pro Vision | “What’s broken about this part?” |
| Defect inspection | PaDiM, PatchCore, SimpleNet (anomaly detection) | One-class, only good samples needed |
| Surgical instrument tracking | Sage / DETR variants on stereo video | Sub-mm in-frame accuracy |
Compute budget mapping
| Budget | Sensible architecture |
|---|---|
| ≤ 5 W (battery drone, prosthetic) | YOLOv8n / MobileNetV3, INT8 on Hailo-8 / Coral / Snapdragon NPU |
| 5–25 W (Jetson Orin Nano/NX) | YOLOv8s/m INT8, Depth Anything Small FP16, ORB-SLAM3 |
| 25–60 W (Jetson AGX Orin) | YOLOv8l-x FP16, FoundationPose, full-res Mask2Former |
| 100–300 W (workstation, RTX 4080/4090) | Full Detectron2 pipelines, NeRF / 3DGS training, ViT-Large pose |
| Cloud / data centre (A100, H100) | VLM (GPT-4V, Gemini), foundation models, training |
Synthetic data and sim-to-real
Hand-labelling 50 000 robotic-grasp poses takes a year. Synthetic generation takes a weekend.
- NVIDIA Isaac Sim + Replicator — physically-based path-traced rendering at scale, programmatic randomisation, USD asset library. Most production-grade.
- BlenderProc (Denninger 2019) — Python-scripted Blender renders; ubiquitous in academia.
- Unity Perception — game-engine renders, easy randomisation, weaker physics than Isaac.
- NVIDIA Omniverse Replicator — distributed Isaac Replicator; large-scale dataset farms.
Sim-to-real techniques:
- Domain randomisation (Tobin 2017) — randomise lighting, textures, camera intrinsics, distractor objects. Force the model to ignore everything except the task-relevant geometry.
- Photometric realism — path tracing, true HDR environment maps, PBR materials, realistic ISP simulation.
- Domain adaptation — CycleGAN-style sim→real translation; less common in 2024+ as randomisation has gotten cheap and effective.
- Self-supervised real-world fine-tune — pseudo-labels from a sim-trained teacher; modest amount of real data closes the gap.
Active perception
Don’t ask “what’s in this image?” — ask “where should I look next?” Next-best-view planning maximises information gain per camera move; for occluded bin-picking it can halve cycle time. Classical formulations (Connolly 1985, Pito 1999) computed visibility frustums; modern (2023+) work uses NeRF/3DGS posterior-uncertainty for view selection.
Data hygiene
- Label noise > 5 % quietly destroys supervised learning. Find bad labels by training a model and surfacing high-loss training examples (Northcutt 2021 “Confident Learning”).
- Class imbalance > 100:1 needs focal loss (Lin 2017) or under/over-sampling.
- Train/test distribution shift is the silent killer of robotic deployment — check that test set sensors, lighting, and operator behaviour match production.
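For the class-imbalance point, the binary focal loss of Lin 2017 is FL(p_t) = −α_t · (1 − p_t)^γ · log(p_t). The sketch below compares it with plain cross-entropy on one easy negative and one hard positive; γ = 2 and α = 0.25 are the defaults reported in that paper, and the probabilities are made up.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights examples the model already gets right."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative_ce = cross_entropy(0.1, 0)   # background, confidently rejected
easy_negative_fl = focal_loss(0.1, 0)
hard_positive_ce = cross_entropy(0.1, 1)   # object, badly missed
hard_positive_fl = focal_loss(0.1, 1)
print(easy_negative_fl / easy_negative_ce, hard_positive_fl / hard_positive_ce)
```

The easy negative keeps under 1 % of its cross-entropy weight while the hard positive keeps about 20 %, so thousands of trivial background examples no longer swamp the gradient from the rare class.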
Distillation and quantisation
- Knowledge distillation (Hinton 2015) — train YOLOv8m as teacher, use soft logits to train YOLOv8n student. Closes ~50 % of the n-vs-m accuracy gap.
- INT8 post-training quantisation — typically loses 0.5-2 mAP on COCO. Run the quantiser on a calibration set drawn from production (not COCO val).
- QAT (quantisation-aware training) — fine-tune with fake-quant nodes; recovers most of the INT8 gap.
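To see why the calibration set matters for INT8, consider the simplest possible scheme: symmetric per-tensor quantisation with the scale set by the largest absolute value seen during calibration. This is a sketch of that one scheme only; production toolchains generally use more elaborate calibrators, and the activation values below are made up.

```python
def calibrate_scale(samples):
    """Max-abs calibration: map the largest observed magnitude to 127."""
    return max(abs(v) for v in samples) / 127.0


def quantize(x, scale):
    """Round to the nearest INT8 step and saturate at the symmetric range."""
    return max(-127, min(127, round(x / scale)))


def dequantize(q, scale):
    return q * scale


x = 0.3  # a typical activation value at deployment

# Calibration data that looks like production
scale_clean = calibrate_scale([-1.0, 0.5, 0.25, 1.0])
err_clean = abs(dequantize(quantize(x, scale_clean), scale_clean) - x)

# Calibration data containing one unrepresentative outlier
scale_outlier = calibrate_scale([-1.0, 0.5, 0.25, 1.0, 40.0])
err_outlier = abs(dequantize(quantize(x, scale_outlier), scale_outlier) - x)

saturated = quantize(2.0, scale_clean)  # values beyond the calibrated range clip
print(err_clean, err_outlier, saturated)
```

One stray outlier in the calibration set stretches the scale and costs resolution on every ordinary value, while a calibration set that is too narrow clips real activations; both failure modes follow from calibrating on data that does not match deployment.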
Camera placement beats model choice
Five extra centimetres of camera height eliminate self-occlusion. A 5° downward tilt collapses depth ambiguity. Glare on a windshield camera ruins every model. Fix the physics first.
5. Components & sourcing — frameworks, models, datasets
CV libraries (general)
| Library | Strengths | Notes |
|---|---|---|
| OpenCV 4.10+ | Classical CV (calib3d, features2d, imgproc), some DNN inference | C++ / Python / Java; ubiquitous |
| PyTorch 2.4+ | Training, research, eager execution | Default in 2024+ |
| TorchVision | Detection (Faster/Mask R-CNN, RetinaNet), classification backbones | Maintained by Meta |
| Ultralytics | YOLOv5/v8/v9/v10/v11 wrapper | Trivial CLI, ONNX/TensorRT export |
| Detectron2 | Mask R-CNN, DensePose, PointRend, panoptic seg | Meta, research-grade |
| MMDetection / MMSegmentation | Modular OpenMMLab stack | 200+ model configs; Chinese academic standard |
| HuggingFace Transformers | DETR, SAM, ViTPose, depth networks, VLMs | One-line model load |
| MediaPipe | On-device hand/face/body pose | Google, real-time on mobile |
| ONNX Runtime | Cross-platform inference | CPU, CUDA, TensorRT, CoreML, DirectML EPs |
| TensorRT | NVIDIA-optimised inference | Best Jetson / GPU throughput |
| OpenVINO | Intel-optimised inference | NUC, embedded x86 |
Robotics-specific frameworks
| Framework | Role |
|---|---|
| ROS 2 + image_pipeline + vision_msgs | Camera transport, detection topic schemas (see [[Languages/Tier3/robotics-control]]) |
| NVIDIA Isaac SDK + Isaac Manipulator | Reference pipelines (FoundationPose, cuMotion) |
| NVIDIA DeepStream | Multi-camera production video analytics |
| Open3D | Point-cloud ops, ICP, RANSAC, normals, Poisson reconstruction |
| PCL (legacy) | Original point-cloud library; still used in older ROS stacks |
| GTSAM / g2o | Pose-graph and bundle adjustment back-ends |
| Ceres Solver | General nonlinear least squares; used by COLMAP and Cartographer (ORB-SLAM3 uses g2o) |
| COLMAP / OpenSfM | Offline structure-from-motion |
| Kornia | Differentiable CV ops in PyTorch; learnable homographies |
6-DoF pose models and frameworks
| Model | Year | Approach | Notes |
|---|---|---|---|
| PoseCNN (Xiang 2018) | 2018 | RGB → Hough-voted object centre + depth for translation, regressed quaternion for rotation | YCB-V benchmark |
| DenseFusion (Wang 2019) | 2019 | Per-pixel fusion of RGB + depth features | Strong on cluttered scenes |
| PVN3D (He 2020) | 2020 | 3D keypoints in point cloud | Cuts symmetric ambiguity |
| GDR-Net (Wang 2021) | 2021 | Geometry-guided regression on RGB | LineMod SOTA in 2021 |
| Megapose (Labbé 2022) | 2022 | Render-and-compare for unseen objects | CAD at test time only |
| FoundationPose (Wen 2024, CVPR) | 2024 | One model, any CAD, RGB-D | Production-ready zero-shot |
| GraspNet baseline (Fang 2020) | 2020 | Direct grasp pose from point cloud | GraspNet-1Billion benchmark (88 objects, over 1 billion grasp poses) |
| GIGA (Jiang 2021) | 2021 | Implicit grasp affordance field | TSDF input |
| Contact-GraspNet (Sundermeyer 2021) | 2021 | Predicts contact grasp on scene cloud | Used in many cobot stacks |
Foundation models (vision)
| Model | Vendor | Use |
|---|---|---|
| SAM / SAM 2 (Kirillov 2023 / Ravi 2024) | Meta | Promptable segmentation; image + video |
| CLIP (Radford 2021) | OpenAI | Image ↔ text embedding; open-vocab classification |
| DINOv2 (Oquab 2023) | Meta | Self-supervised features; great for retrieval |
| Depth Anything V1/V2 (Yang 2024) | TikTok / HKU | Zero-shot monocular relative depth; metric depth via fine-tuned variants |
| GPT-4V / GPT-4o Vision | OpenAI | VLM; cloud API |
| Claude Vision (Claude 3 family onward) | Anthropic | VLM; cloud API |
| Gemini 1.5/2.0 Pro Vision | Google | VLM; cloud API |
| LLaVA-1.5 / 1.6 / NeXT (Liu 2023+) | OSS | Local VLM; runs on RTX 4090 |
| BLIP-2 (Li 2023) | Salesforce | Image captioning, VQA |
| Florence-2 (Microsoft 2024) | Microsoft | Unified detection/segmentation/captioning |
Embedded inference accelerators
| Part | TOPS | Power | Notes |
|---|---|---|---|
| NVIDIA Jetson Orin Nano 8 GB | 40 (INT8) | 7-15 W | Hobby + prosumer |
| NVIDIA Jetson Orin NX 16 GB | 100 (INT8) | 10-25 W | Drone / AMR sweet-spot |
| NVIDIA Jetson AGX Orin 64 GB | 275 (INT8) | 15-60 W | Heavy perception |
| NVIDIA Thor (Drive) | ~1000 (INT8) | ~50-130 W | Production AV, 2025+ |
| Hailo-8 | 26 (INT8) | 2.5 W | M.2 / PCIe addon |
| Hailo-10H | 40 (INT8) | ~3.5 W | 2024 release |
| Google Coral Edge TPU | 4 (INT8) | 2 W | Quantised int8 only, model size capped |
| AMD/Xilinx Kria K26 / K24 | ~1-3 (DPU) | 5-15 W | FPGA-based, deterministic latency |
| Intel Movidius Myriad X / OpenVINO | ~1 (INT8) | 2 W | NUC + USB stick |
| Qualcomm RB5 / RB6 | 15 / 35+ (INT8) | 5-15 W | Snapdragon Flight drone reference |
| Apple M2/M4 Neural Engine | 16-38 (INT8) | shared | iPad / Mac robots, ARKit pipeline |
| TI TDA4VM / TDA4AEN | 8 (INT8) + DSP | 5-20 W | Automotive ECU |
Datasets
| Dataset | Domain | Notes |
|---|---|---|
| COCO (Lin 2014) | 80-class detection, segmentation, keypoints | 330 k images; the universal benchmark |
| LVIS (Gupta 2019) | 1203-class long-tail detection | COCO superset for rare-object eval |
| Open Images V7 (Kuznetsova 2018+) | 600 classes, 9 M images | Largest free-use detection dataset |
| ImageNet-21k | 21 k-class classification | Pretraining backbone |
| YCB-Video (Xiang 2017) | 21 household objects, 6-DoF pose | Bin-picking standard |
| T-LESS (Hodaň 2017) | 30 texture-less industrial objects | Hard pose benchmark |
| LineMod / LineMod-Occluded (Hinterstoisser 2012) | 15 texture-less objects + occlusion | Original 6-DoF benchmark |
| BOP Challenge | 11 datasets unified | Annual 6-DoF pose competition |
| KITTI (Geiger 2012) | Driving stereo + Velodyne + GPS | First AV benchmark |
| nuScenes (Caesar 2020) | 6-cam + 1-LiDAR + 5-radar | Modern AV dataset |
| Waymo Open Dataset (Sun 2020) | 5-LiDAR + 5-cam | Highest-fidelity AV |
| DeepDrive / BDD100K | 100 k driving videos | Diverse weather |
| Habitat / iGibson / Replica | Photoreal indoor sim | Embodied AI benchmarks |
| RoboNet (Dasari 2019) | 15 M robot manipulation frames | Cross-robot learning |
| Berkeley Open Manipulation | Manipulation video | Skill learning |
| Cityscapes (Cordts 2016) | Urban street segmentation | Driving semantic seg |
| ADE20K (Zhou 2017) | 150-class scene segmentation | General segmentation |
6. Reference data
Classical CV vs deep CV — when to pick which
| Aspect | Classical | Deep learning |
|---|---|---|
| Geometric ground truth (calibration, pose, SLAM front-end) | ✓ | rarely |
| Semantic understanding (object class, free space) | ✗ | ✓ |
| Out-of-domain (novel sensor, no training data) | ✓ | ✗ |
| Determinism / certification | ✓ | hard |
| Sub-pixel accuracy | ✓ (with refinement) | typically not |
| Open-vocabulary | ✗ | ✓ (VLM) |
| Compute cost | low | medium-high |
| Sensitivity to lighting | high | trained-distribution-dependent |
| Symmetry / scale ambiguity handling | by design | learned, sometimes wrong |
Detector benchmark (COCO 2017 val, single-scale)
| Detector | Backbone | mAP@50-95 | FPS (V100/A100) | Year |
|---|---|---|---|---|
| Faster R-CNN | ResNet-50 FPN | 40.2 | 26 | 2015 |
| RetinaNet | ResNet-50 FPN | 36.5 | 30 | 2017 |
| Mask R-CNN | ResNet-50 FPN | 37.9 box / 34.8 mask | 23 | 2017 |
| YOLOv4 | CSPDarknet53 | 43.5 | 65 | 2020 |
| YOLOv5l | CSPDarknet | 49.0 | 99 | 2020 |
| YOLOX-L | Modified CSPDarknet | 50.0 | 69 | 2021 |
| DETR | ResNet-50 | 42.0 | 28 | 2020 |
| Deformable DETR | ResNet-50 | 46.2 | 19 | 2021 |
| DINO-DETR | Swin-L | 63.3 | 12 | 2022 |
| RT-DETR-L | HGNetv2 | 53.0 | 114 | 2024 |
| YOLOv8n / s / m / l / x | — | 37.3 / 44.9 / 50.2 / 52.9 / 53.9 | very fast (see worked ex B) | 2023 |
| YOLOv9-C / E | — | 53.0 / 55.6 | similar | 2024 |
| YOLOv10-N / S / M / L / X | — | 38.5 / 46.3 / 51.1 / 53.2 / 54.4 | + 15-30% over v8 | 2024 |
| YOLOv11-N / S / M / L / X | — | 39.5 / 47.0 / 51.5 / 53.4 / 54.7 | similar to v10 | 2024 |
Segmentation benchmark (COCO, Cityscapes)
| Model | Task | Score | Notes |
|---|---|---|---|
| U-Net (Ronneberger 2015) | Binary biomedical seg | task-dep | small datasets |
| DeepLabV3+ (Chen 2018) | Cityscapes mIoU | 82.1 | strong outdoor |
| Mask2Former Swin-L (Cheng 2022) | COCO panoptic PQ | 57.8 | unified det + seg |
| SAM ViT-H (Kirillov 2023) | promptable, no class | n/a | 1.1 B masks pretrain |
| SAM 2 (Ravi 2024) | promptable image + video | n/a | video segmentation SOTA |
| OneFormer (Jain 2023) | universal | top-tier | single model 3 seg tasks |
Depth estimation benchmark
| Model | Type | KITTI / NYUv2 | Notes |
|---|---|---|---|
| MiDaS v3.1 (Ranftl 2020+) | Mono relative | strong | DPT backbone, robust |
| ZoeDepth (Bhat 2023) | Mono metric | 0.075 AbsRel NYU | combines MiDaS + metric heads |
| Metric3D V2 (Hu 2024) | Mono metric | top-tier | unified intrinsics handling |
| Depth Anything V2 (Yang 2024) | Mono relative + metric | SOTA zero-shot | trained 62 M images |
| RAFT-Stereo (Lipson 2021) | Stereo | KITTI top | iterative GRU |
| PSMNet (Chang 2018) | Stereo | KITTI ~2 % bad-3 | pyramid cost volume |
| FoundationStereo (2024) | Stereo zero-shot | new SOTA | foundation-model approach |
| NeRF (Mildenhall 2020) | Multi-view implicit | scene-specific | training per scene |
| 3D Gaussian Splatting (Kerbl 2023) | Multi-view explicit | real-time render | takes minutes to train, ms to render |
6-DoF pose benchmark (BOP Challenge 2024 — average recall)
| Method | LineMod-O | YCB-V | T-LESS | Notes |
|---|---|---|---|---|
| PVN3D | 0.72 | 0.85 | 0.67 | depth-keypoints |
| GDR-Net | 0.71 | 0.80 | 0.68 | RGB only |
| Megapose | 0.79 | 0.85 | 0.71 | unseen objects |
| FoundationPose | 0.91 | 0.93 | 0.89 | 2024 SOTA, RGB-D |
| FoundationPose RGB-only | 0.83 | 0.86 | 0.78 | no depth required |
Accelerator inference throughput (additional, vs Sec 5)
| Model | Jetson Orin Nano INT8 | Orin NX INT8 | AGX Orin INT8 | RTX 4090 FP16 |
|---|---|---|---|---|
| YOLOv8m | 30 fps | 70 fps | 150 fps | 800 fps |
| Mask2Former Swin-S | n/a (too big) | 8 fps | 25 fps | 90 fps |
| Depth Anything V2 Small | 15 fps | 35 fps | 80 fps | 400 fps |
| FoundationPose | 4 fps | 12 fps | 35 fps | 120 fps |
| SAM ViT-B | n/a | 3 fps | 8 fps | 35 fps |
| LightGlue (1024 kpts) | 30 fps | 80 fps | 180 fps | 700 fps |
| RAFT optical flow (1080p) | 4 fps | 12 fps | 35 fps | 120 fps |
Coordinate-frame conventions
| System | +X | +Y | +Z |
|---|---|---|---|
| OpenCV camera | right | down | forward |
| ROS / REP-103 | forward | left | up |
| OpenGL / glTF | right | up | toward viewer |
| Unreal | forward | right | up |
| Blender | right | forward | up |
| Unity | right | up | forward (left-handed) |
| KITTI camera | right | down | forward (matches OpenCV) |
| KITTI velodyne | forward | left | up (matches ROS) |
Confusing these has wasted more engineer-hours than any other CV bug.
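As one concrete instance of the table, converting a direction from OpenCV camera axes (x right, y down, z forward) to ROS REP-103 body axes (x forward, y left, z up) is a pure relabelling of components. The sketch assumes the two frames share an origin and that the camera looks along the body +X axis; the function names are illustrative.

```python
def opencv_to_ros(v):
    """(right, down, forward) -> (forward, left, up)."""
    x, y, z = v
    return (z, -x, -y)


def ros_to_opencv(v):
    """(forward, left, up) -> (right, down, forward)."""
    x, y, z = v
    return (-y, -z, x)


# A point 1 m right, 2 m below, 3 m ahead of the camera
p_cv = (1.0, 2.0, 3.0)
p_ros = opencv_to_ros(p_cv)
round_trip = ros_to_opencv(p_ros)
print(p_ros, round_trip)
```

Writing the conversion once as a named function, with a round-trip test next to it, is cheaper than rediscovering the sign convention in every node that consumes camera data.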
7. Failure modes & debugging
- Domain gap (sim → real) — model trained in Isaac fails on real RGB. Diagnose: side-by-side renders vs photographs; check sensor noise model, ISP gamma, lighting. Fix: domain randomisation, photometric tuning, modest fine-tune on real.
- Catastrophic forgetting during fine-tune — model loses pre-training capabilities after a few epochs on a small target dataset. Fix: lower LR, freeze backbone, mix in original-domain data (replay), or use LoRA / adapter layers.
- Model confidence ≠ accuracy — a softmax score of 0.95 means “well-trained on data like this”, not “95 % probability correct”. Calibrate with temperature scaling (Guo 2017) or, better, conformal prediction sets.
- Spurious correlation / shortcut learning — model classifies tanks by photo time of day. Diagnose with Grad-CAM, integrated gradients, or counterfactual augmentation. Fix: targeted augmentation, mask out background, balance dataset.
- Adversarial / patch attacks — a printed sticker on a stop sign defeats a detector. Robotics-safety mitigation: ensemble across modalities (LiDAR + radar + camera), out-of-distribution rejection, latency-bounded sanity checks.
- Slow per-pixel ops in Python — Looping over pixels in a NumPy array runs at roughly 1 fps. Vectorise via `cv2.*` calls, or move to GPU (CuPy, Torch tensors, CUDA streams).
- Lens dirt — a fingerprint, dust, or water drop on the lens silently corrupts every downstream prediction. Sanity-check: histogram of central image patch vs corners; a per-camera “dirty-lens classifier” running at 1 Hz catches most cases.
- HDR / saturation in extreme dynamic range — sunlight through a tunnel exit saturates the sensor for 100+ ms. Mitigation: HDR sensors (Sony IMX490, 120+ dB), bracketed exposures, sensor-level tone mapping.
- Latency spikes from CPU contention — perception jitter caused by other processes. Pin CPU affinities; use the `isolcpus` kernel parameter; route inference exclusively to GPU/NPU.
- GPU / TensorRT memory leak — long-running pipeline OOMs after hours. Pre-allocate all tensors; reuse contexts; monitor with `nvidia-smi --query-gpu=memory.used --format=csv --loop=10`.
- Network bandwidth saturation when streaming 4K — see [[Robotics/sensors-perception]] worked example D. Compress on-camera (H.265, JPEG-XS), stream ROIs only, or do CV at the edge.
- Coordinate-frame mistakes — see Sec 6 conventions table. Verify by projecting a known marker (AprilTag, ChArUco) and confirming the projected pixels match observation.
- Pose ambiguity for symmetric objects — a bottle is SO(2)-symmetric about its axis, so the rotation about that axis is unobservable. Don’t return a single pose; return a distribution (one rotation per symmetry class) and let the grasp planner pick.
- Label bias / class skew — COCO has over 1 000× more “person” annotations than “hair drier”; a model trained on COCO is biased toward common classes. For robotic deployment, train on a dataset matched to the deployment distribution.
- Dataset leakage — train images contain the same scene under slightly different lighting as the test set. Measured accuracy is wildly optimistic. Strict scene-level splits, not image-level.
- Photometric drift — sun angle changes; descriptor matching degrades over the day. Robust features (SuperPoint, ORB) > photometric VO (DSO) in outdoor settings.
- Face / licence-plate visibility — privacy regulation (GDPR, CCPA) forbids storing identifiable images. Blur faces and plates on-device (MediaPipe, YOLO-face) before any persistence.
- Stereo disparity collapse on textureless surfaces — blank walls, white floors, glass. Fix: active stereo (RealSense projector), LiDAR fusion, or learned matchers (RAFT-Stereo, CREStereo) that hallucinate plausible disparity.
- Pose flip on 180° symmetric objects — a cylinder with a tiny asymmetry near one end may be matched flipped 50 % of the time. Augment training with the symmetric pose and use the symmetry-aware loss (Wang 2019).
- AprilTag / ArUco false positives — fast-moving cameras motion-blur tags into false detections. Set marker decision-margin minimums; reject if reprojection error > 1 px.
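For the confidence-calibration item in the list above, temperature scaling divides the logits by a single scalar T fitted on held-out data, which changes confidence without changing the predicted class. The sketch below fits T by grid search on negative log-likelihood; the toy validation set (four identical over-confident predictions, one of them wrong) and the search grid are assumptions for illustration.

```python
import math


def softmax(logits, T=1.0):
    """Softmax of logits / T, computed stably by subtracting the max logit."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    losses = [-math.log(softmax(row, T)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda T: nll(logit_rows, labels, T))


# Toy held-out set: the model is always very sure of class 0, right 3 times in 4
val_logits = [[6.0, 0.0], [6.0, 0.0], [6.0, 0.0], [6.0, 0.0]]
val_labels = [0, 0, 0, 1]

grid = [0.5 * k for k in range(1, 21)]   # 0.5, 1.0, ..., 10.0
T_best = fit_temperature(val_logits, val_labels, grid)

conf_raw = max(softmax(val_logits[0], T=1.0))
conf_calibrated = max(softmax(val_logits[0], T=T_best))
print(T_best, conf_raw, conf_calibrated)
```

On this toy set the raw confidence is above 0.99 while the calibrated confidence should land close to the 75 % accuracy the model actually achieves, and the argmax is unchanged, so detection behaviour is unaffected.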
8. Case studies
Boston Dynamics Stretch — warehouse box manipulation
The Stretch robot (commercial release 2022) is purpose-built for unloading 23 kg boxes from trailers at ~800 boxes/hr. Its perception stack is a Detectron2-derived custom CNN trained on millions of synthetic + real labelled box images, running on an onboard x86 compute node (no Jetson — the platform has ample power budget). The detector segments box top-faces and front-faces from a wrist-mounted RGB-D camera (Intel RealSense-class); a downstream grasp planner picks the closest visible box, then the 7-DoF “smart gripper” arm executes the grasp.
Sim-to-real comes from Boston Dynamics’ internal simulator (an Isaac Sim derivative) plus a few hundred thousand real-trailer images collected from customer deployments. The system tolerates 30+% box overlap, missing labels, banged-up cardboard, and “weird stacks” (Brazilian-pallet diagonal patterns). The vision stack runs at ~10 Hz; the bottleneck is mechanical motion, not perception.
Skydio X10 — vision-only autonomous drone
The Skydio X10 (2023) deploys six Sony IMX-class CMOS sensors in a hemispherical hex configuration, fused into a 360° depth + tracking system on an onboard NVIDIA Jetson Orin NX. The perception software is proprietary (Skydio Autonomy Engine 3+) but reportedly combines stereo from each adjacent camera pair, a learned monocular depth network for far-field, and a fused 3D occupancy grid at 30 Hz.
The signature behaviour, follow-me through dense forest at 10 m/s, relies on a person-tracker (multi-camera person re-ID + Kalman filter) plus a real-time path planner that operates on the occupancy grid. Skydio claims sub-second-latency obstacle avoidance with no LiDAR and no radar. The architectural bet, that vision plus enough cameras can replace LiDAR, has proven out for small drones; Boston Dynamics Spot and Waymo reach a different conclusion because their mass and power budgets favour LiDAR. Production deployments include US DoT bridge inspection, military reconnaissance (Skydio X2D), and forest firefighting.
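Skydio's tracker is proprietary, so the following is only a generic sketch of the Kalman-filter half of a person-tracker of the kind described: a constant-velocity filter over 3D position that smooths detections and predicts through short occlusions. The class name and the noise values (`accel_std`, `meas_std`, initial covariance) are illustrative assumptions.

```python
import numpy as np


class ConstantVelocityKF:
    """Kalman filter with state [x, y, z, vx, vy, vz] and position-only measurements."""

    def __init__(self, position, dt, accel_std=2.0, meas_std=0.3):
        self.x = np.concatenate([np.asarray(position, dtype=float), np.zeros(3)])
        self.P = np.eye(6) * 10.0                    # loose initial uncertainty
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)              # position += velocity * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])
        G = np.vstack([0.5 * dt**2 * np.eye(3), dt * np.eye(3)])
        self.Q = G @ G.T * accel_std**2              # white-acceleration process noise
        self.R = np.eye(3) * meas_std**2

    def predict(self):
        """Propagate state and covariance one time step; returns predicted position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, measured_position):
        """Fuse one 3D position measurement (for example, a triangulated detection)."""
        residual = np.asarray(measured_position, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ residual
        self.P = (np.eye(6) - K @ self.H) @ self.P
```

In use, `predict()` runs every frame and `update()` runs only when re-ID associates a detection with the track; during an occlusion the filter coasts on its velocity estimate while `P` grows, which can widen the association gate for the next detection.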
Covariant Brain — manipulation foundation model
Covariant (whose founders joined Amazon in 2024 alongside a licence to its models) operates one of the largest robotic-manipulation foundation models in the field. Their Covariant Brain (published RSS 2022, updated 2023-2024) is a transformer-based vision-action model trained on millions of real bin-picks captured across customer warehouses (Knapp, GAP, USPS, RadialSpring). Input: RGB-D from a fixed overhead camera plus a wrist camera. Output: a grasp pose, an approach trajectory, and a confidence score.
The system claims human-level reliability on 10 000+ SKU bins with no per-SKU configuration, a sharp contrast with traditional industrial vision where every part requires CAD + manual setup. Covariant attributes this to scale: a single model trained on the union of all customer data generalises across customers, while a per-customer model trained on a 10 k-pick dataset would not. The platform shipped on RB-style cobots (Yaskawa, ABB, Universal Robots) before the 2024 Amazon deal broadened it onto Amazon’s internal fleet.
9. Cross-references
- [[Robotics/sensors-perception]]: camera, LiDAR, depth-sensor hardware that feeds this layer; the bandwidth, sync, and calibration concerns are upstream of every model in this note.
- [[Robotics/slam]]: companion note (same batch); SLAM is where this note’s feature-matching, depth estimation, and pose estimation get fused into a coherent map + trajectory.
- [[Robotics/bayesian-estimation]]: planned; the Kalman / EKF / particle-filter mathematics for fusing detection observations over time into trackers (BoT-SORT and ByteTrack wrap a simple Kalman motion model in heuristic association; production safety systems usually want an explicit EKF).
- [[Robotics/end-effectors]]: planned; grasp planners consume 6-DoF poses from the FoundationPose / GraspNet outputs covered here.
- [[Robotics/manipulator-design]]: planned; payload, reach, and wrist DoF constrain how a vision-derived grasp can be executed.
- [[Robotics/path-planning]]: planned; the free-space mask and occupancy grid from segmentation feed the planner.
- [[Robotics/kinematics-dh]]: extrinsic camera-to-base calibration uses the same SE(3) chain math.
- [[Robotics/impedance-control]]: visually-servoed contact tasks blend the pose estimates here with force feedback there.
- [[Engineering/microcontrollers]]: embedded inference targets (NXP S32, STM32MP, Ambarella).
- [[Engineering/fpga-design]]: Xilinx Kria / Lattice CrossLink fixed-latency vision pipelines.
- [[Languages/Tier3/3d-scene]]: point-cloud / mesh formats (PLY, USD, glTF, E57) used to store CAD references for ICP and renders for synthetic data.
- [[Languages/Tier3/robotics-control]]: vision_msgs, sensor_msgs/Image, sensor_msgs/PointCloud2 schemas; ROS 2 conventions for perception topics.
- [[Languages/Tier3/genai-llm-runtime]]: planned; VLM serving stacks (vLLM, TensorRT-LLM, llama.cpp) for local LLaVA / BLIP-2 inference on robots.
10. Citations
- Hartley, R. & Zisserman, A. (2003). Multiple View Geometry in Computer Vision (2nd ed.). Cambridge University Press. The canonical reference for projective geometry, F/E matrices, calibration.
- Szeliski, R. (2022). Computer Vision: Algorithms and Applications (2nd ed.). Springer. Free PDF at szeliski.org/Book. Definitive practitioner reference.
- Forsyth, D. A. & Ponce, J. (2011). Computer Vision: A Modern Approach (2nd ed.). Pearson.
- Lowe, D. (2004). “Distinctive Image Features from Scale-Invariant Keypoints.” IJCV 60(2), 91–110. DOI 10.1023/B:VISI.0000029664.99615.94 (SIFT).
- Bay, H., Tuytelaars, T. & Van Gool, L. (2006). “SURF: Speeded Up Robust Features.” ECCV.
- Rublee, E., Rabaud, V., Konolige, K. & Bradski, G. (2011). “ORB: An efficient alternative to SIFT or SURF.” ICCV.
- Hirschmüller, H. (2008). “Stereo Processing by Semiglobal Matching and Mutual Information.” IEEE TPAMI 30(2), 328–341.
- Nistér, D. (2004). “An Efficient Solution to the Five-Point Relative Pose Problem.” IEEE TPAMI 26(6), 756–770.
- Gao, X. S., Hou, X. R., Tang, J. & Cheng, H. F. (2003). “Complete solution classification for the perspective-three-point problem.” IEEE TPAMI 25(8), 930–943 (P3P).
- Tsai, R. & Lenz, R. (1989). “A New Technique for Fully Autonomous and Efficient 3D Robotics Hand/Eye Calibration.” IEEE Trans. Robotics & Automation 5(3), 345–358.
- Park, F. C. & Martin, B. J. (1994). “Robot Sensor Calibration: Solving AX = XB on the Euclidean Group.” IEEE Trans. RA 10(5), 717–721.
- Daniilidis, K. (1999). “Hand-Eye Calibration Using Dual Quaternions.” IJRR 18(3), 286–298.
- Besl, P. J. & McKay, N. D. (1992). “A Method for Registration of 3-D Shapes.” IEEE TPAMI 14(2), 239–256 (ICP).
- Chen, Y. & Medioni, G. (1992). “Object modelling by registration of multiple range images.” Image and Vision Computing 10(3), 145–155 (point-to-plane ICP).
- Segal, A., Haehnel, D. & Thrun, S. (2009). “Generalized-ICP.” RSS (GICP).
- Yang, H., Shi, J. & Carlone, L. (2020). “TEASER: Fast and Certifiable Point Cloud Registration.” IEEE T-RO 37(2), 314–333.
- DeTone, D., Malisiewicz, T. & Rabinovich, A. (2018). “SuperPoint: Self-Supervised Interest Point Detection and Description.” CVPRW.
- Sarlin, P.-E., DeTone, D., Malisiewicz, T. & Rabinovich, A. (2020). “SuperGlue: Learning Feature Matching with Graph Neural Networks.” CVPR.
- Lindenberger, P., Sarlin, P.-E. & Pollefeys, M. (2023). “LightGlue: Local Feature Matching at Light Speed.” ICCV.
- Sun, J. et al. (2021). “LoFTR: Detector-Free Local Feature Matching with Transformers.” CVPR.
- Ronneberger, O., Fischer, P. & Brox, T. (2015). “U-Net: Convolutional Networks for Biomedical Image Segmentation.” MICCAI.
- He, K., Gkioxari, G., Dollár, P. & Girshick, R. (2017). “Mask R-CNN.” ICCV.
- Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). “Focal Loss for Dense Object Detection.” ICCV (RetinaNet).
- Carion, N. et al. (2020). “End-to-End Object Detection with Transformers.” ECCV (DETR).
- Zhao, Y. et al. (2024). “DETRs Beat YOLOs on Real-time Object Detection.” CVPR (RT-DETR).
- Cheng, B. et al. (2022). “Masked-attention Mask Transformer for Universal Image Segmentation.” CVPR (Mask2Former).
- Kirillov, A. et al. (2023). “Segment Anything.” ICCV (SAM).
- Ravi, N. et al. (2024). “SAM 2: Segment Anything in Images and Videos.” arXiv:2408.00714.
- Radford, A. et al. (2021). “Learning Transferable Visual Models from Natural Language Supervision.” ICML (CLIP).
- Oquab, M. et al. (2023). “DINOv2: Learning Robust Visual Features without Supervision.” arXiv:2304.07193.
- Ranftl, R., Lasinger, K., Hafner, D., Schindler, K. & Koltun, V. (2020). “Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer.” IEEE TPAMI (MiDaS).
- Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J. & Zhao, H. (2024). “Depth Anything V2.” NeurIPS.
- Mildenhall, B. et al. (2020). “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.” ECCV.
- Kerbl, B., Kopanas, G., Leimkühler, T. & Drettakis, G. (2023). “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” SIGGRAPH.
- Teed, Z. & Deng, J. (2020). “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow.” ECCV.
- Lipson, L., Teed, Z. & Deng, J. (2021). “RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching.” 3DV.
- Chang, J.-R. & Chen, Y.-S. (2018). “Pyramid Stereo Matching Network.” CVPR (PSMNet).
- Xiang, Y., Schmidt, T., Narayanan, V. & Fox, D. (2018). “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes.” RSS.
- Wang, C. et al. (2019). “DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion.” CVPR.
- Labbé, Y., Manuelli, L., Mousavian, A., Tyree, S., Birchfield, S., Tremblay, J., Carpentier, J., Aubry, M., Fox, D. & Sivic, J. (2022). “MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare.” CoRL.
- Wen, B., Yang, W., Kautz, J. & Birchfield, S. (2024). “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects.” CVPR.
- Fang, H.-S., Wang, C., Gou, M. & Lu, C. (2020). “GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping.” CVPR.
- Jiang, Z., Zhu, Y., Svetlik, M., Fang, K. & Zhu, Y. (2021). “Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via Implicit Representations.” RSS (GIGA).
- Sundermeyer, M., Mousavian, A., Triebel, R. & Fox, D. (2021). “Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes.” ICRA.
- Tobin, J. et al. (2017). “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World.” IROS.
- Denninger, M. et al. (2019). “BlenderProc.” arXiv:1911.01911.
- Hinton, G., Vinyals, O. & Dean, J. (2015). “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531.
- Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). “On Calibration of Modern Neural Networks.” ICML.
- Northcutt, C., Jiang, L. & Chuang, I. (2021). “Confident Learning: Estimating Uncertainty in Dataset Labels.” JAIR.
- Liu, H., Li, C., Wu, Q. & Lee, Y. J. (2023). “Visual Instruction Tuning.” NeurIPS (LLaVA).
- Li, J., Li, D., Savarese, S. & Hoi, S. (2023). “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.” ICML.
- Lin, T.-Y. et al. (2014). “Microsoft COCO: Common Objects in Context.” ECCV.
- Geiger, A., Lenz, P. & Urtasun, R. (2012). “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite.” CVPR.
- Caesar, H. et al. (2020). “nuScenes: A Multimodal Dataset for Autonomous Driving.” CVPR.
- Sun, P. et al. (2020). “Scalability in Perception for Autonomous Driving: Waymo Open Dataset.” CVPR.
- Xiang, Y., Schmidt, T., Narayanan, V. & Fox, D. (2017). “YCB-Video Dataset.” (used in PoseCNN paper).
- Hodaň, T. et al. (2017). “T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects.” WACV.
- Hinterstoisser, S. et al. (2012). “Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes.” ACCV (LineMod).
- Hodaň, T. et al. (2024). “BOP Challenge 2023 / 2024 Reports.” (annual).
- OpenCV documentation, opencv.org (4.10+).
- PyTorch / TorchVision documentation, pytorch.org.
- Ultralytics YOLOv8 / v9 / v10 / v11 documentation, docs.ultralytics.com.
- Detectron2 documentation, github.com/facebookresearch/detectron2.
- NVIDIA Isaac Sim / Replicator documentation (2024).
- NVIDIA TensorRT 10.x Developer Guide.
- Open3D documentation, www.open3d.org.
- Boston Dynamics (2024). Stretch Technical Brief.
- Skydio (2023). Skydio X10 Product Brief.
- Covariant (2022). “RFM-1: A Foundation Model for Robotics.” Robotics Science and Systems Workshop on Robot Learning.