Computer Vision for Robotics

1. At a glance

Where [[Robotics/sensors-perception]] answers “what photons hit the sensor?”, this note answers “what does that pixel array mean?”. Computer vision is the algorithmic layer that converts images (and depth maps, point clouds, event streams) into the symbolic objects a planner can act on: an object class, a 6-DoF pose, a depth map, a motion vector, a free-space mask, an action label.

Modern robotic CV in 2026 divides into three coexisting paradigms:

  • Classical CV — geometry-first, hand-engineered features. Camera calibration, epipolar geometry, SIFT/ORB features, RANSAC, ICP, stereo block-matching, Lucas-Kanade flow. Still the backbone of every SLAM pipeline ([[Robotics/slam]]), every metric structure-from-motion solve, every hand-eye calibration, and every reprojection-error tracker. Deterministic, interpretable, no training data.
  • Deep learning CV — CNNs and Vision Transformers trained on labelled datasets. YOLO family for detection, U-Net / DeepLab / Mask2Former for segmentation, MiDaS / Depth Anything for monocular depth, FoundationPose / Megapose for 6-DoF pose, RAFT for optical flow. Dominant for any task with a perception-only output (detect, segment, classify, regress). Requires training data; opaque; statistically reliable on the training distribution.
  • VLM / foundation models — CLIP, SAM, GPT-4V, Gemini-Pro-Vision, LLaVA. Open-vocabulary: “find the red mug” / “describe what’s in front of the robot” / “is this object stackable on that one?” Zero or few-shot capable; high latency; expensive; the right answer when the object set is unbounded or the task is semantic.

Production robotic stacks blend all three. A bin-picker uses deep detection (YOLOv8 → “bottles here”) + classical 6-DoF refinement (ICP between segmented point cloud and CAD) + classical hand-eye calibration (Tsai-Lenz solve once at deploy). An AMR uses deep segmentation (free-space mask) + classical visual odometry (ORB + RANSAC) + VLM for occasional “what is this unknown object blocking the aisle?” queries.

Five questions to ask before architecting any robotic CV pipeline:

  1. Closed-set vs open-set? If the object list is fixed (the 47 SKUs in this warehouse), train a closed-set detector — 100× cheaper at inference than a VLM. If anything could show up, accept VLM cost.
  2. Edge or cloud compute? Edge constraints (Jetson Orin Nano, 8 W) bound model size to ~10-50 MB; cloud has no such limit but adds 100-500 ms round-trip and depends on connectivity.
  3. How wrong is “wrong”? A misclassified object in a sorting line is a returned package; a misclassified pedestrian under an AV is a death. Drives confidence calibration, redundant sensors, fallback policies.
  4. Do you have real-world training data? If not, plan sim-to-real (Isaac Replicator, BlenderProc) from day one — synthetic-only models without domain randomisation fail catastrophically on real robots.
  5. What’s the latency budget? A cobot collision predictor at 30 Hz tolerates ~33 ms inference; an autonomous-racing perception system at 200 Hz tolerates 5 ms. Pick model size to match.

2. First principles

Image formation and the pinhole model

A photon emanating from a 3D point P = (X, Y, Z) in the world frame passes through the lens and lands on the sensor at pixel p = (u, v). Under the pinhole model:

p̃ = K · [R | t] · P̃

with K the camera intrinsic matrix:

K =  [fx  0  cx]
     [ 0 fy  cy]
     [ 0  0   1]

f_x, f_y are focal lengths in pixels, (c_x, c_y) is the principal point, and [R | t] is the extrinsic SE(3) transform from world to camera. The full chain is lens → Bayer-filter CMOS → ISP → image: photons demosaiced into RGB, white-balanced, gamma-corrected, possibly YUV-encoded for transport. The ISP is photometrically destructive; CV pipelines that care about radiometry (HDR, photometric SLAM) capture RAW Bayer upstream.
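As a minimal sketch of the projection equation above, the NumPy snippet below maps world points to pixels. The function name `project_points` and the intrinsics (600 px focal length, 640×480 sensor) are illustrative assumptions, not values from any particular camera, and lens distortion is ignored.

```python
import numpy as np

def project_points(K, R, t, points_world):
    """Project Nx3 world points to Nx2 pixels with the ideal pinhole model."""
    points_world = np.asarray(points_world, dtype=float)
    # World -> camera frame: P_cam = R @ P_world + t (row-vector form)
    points_cam = points_world @ R.T + t
    # Apply intrinsics, then divide by depth (the homogeneous scale)
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

# Hypothetical intrinsics for a 640x480 sensor
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)      # camera axes aligned with the world frame
t = np.zeros(3)    # camera at the world origin

pixels = project_points(K, R, t, [[0.0, 0.0, 2.0], [0.5, -0.25, 2.0]])
print(pixels)
```

A point on the optical axis lands on the principal point (c_x, c_y); points behind the camera (Z ≤ 0) are not handled here and would need an explicit check.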

Real lenses deviate from pinhole. Brown-Conrady radial-tangential distortion is the default model in OpenCV:

x' = x·(1 + k1·r² + k2·r⁴ + k3·r⁶) + 2·p1·x·y + p2·(r² + 2·x²)
y' = y·(1 + k1·r² + k2·r⁴ + k3·r⁶) + p1·(r² + 2·y²) + 2·p2·x·y

Fisheye and wide-angle (FoV > 100°) require Kannala-Brandt (4 coeffs), MEI omnidirectional, or the double-sphere model (Usenko 2018).
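The Brown-Conrady equations above can be transcribed directly. This is a sketch only: it assumes x and y are normalised image coordinates (X/Z, Y/Z, before the intrinsic matrix is applied), which is the convention OpenCV uses, and the function name is hypothetical.

```python
def distort_brown_conrady(x, y, k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Apply radial (k1..k3) and tangential (p1, p2) distortion to a
    normalised image point (x, y). Returns the distorted point (x', y')."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d

# Mild barrel distortion (negative k1) pulls an off-axis point inward
x_d, y_d = distort_brown_conrady(0.5, 0.0, k1=-0.1)
print(x_d, y_d)
```

With all coefficients at zero the function is the identity, which is a quick sanity check when wiring calibration output into a pipeline.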

Epipolar geometry — two views of a scene

Given two pinhole cameras observing the same 3D point, corresponding image points x_1 and x_2 satisfy the epipolar constraint:

x_2^T · F · x_1 = 0

with F the 3×3 fundamental matrix (rank 2). If both intrinsics are known, normalise points and use the essential matrix:

E = K_2^T · F · K_1     and     E = [t]_× · R

decomposing E via SVD gives four (R, t) candidates — disambiguate by chirality (the reconstructed 3D points must be in front of both cameras). This is the foundation of stereo, structure-from-motion, and the front-end of visual SLAM. The 5-point relative-pose solver (Nistér 2004) is the fastest minimal solver; 7-point and 8-point algorithms exist for the uncalibrated case.
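To make the constraint concrete, the sketch below builds E = [t]_× · R from an assumed relative pose and checks that a synthetic correspondence satisfies x_2^T · E · x_1 = 0. The pose values and the 3D point are arbitrary illustrative numbers; the convention assumed is P_2 = R · P_1 + t.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_from_pose(R, t):
    """Essential matrix for the relative pose P2 = R @ P1 + t."""
    return skew(t) @ R

# Hypothetical relative pose: 10 degree yaw plus a small baseline
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.2, 0.0, 0.05])
E = essential_from_pose(R, t)

# One 3D point seen by both cameras, in normalised image coordinates
P1 = np.array([0.3, -0.1, 2.0])
P2 = R @ P1 + t
x1, x2 = P1 / P1[2], P2 / P2[2]

residual = float(x2 @ E @ x1)       # should vanish up to round-off
rank = np.linalg.matrix_rank(E)     # an essential matrix has rank 2
print(residual, rank)
```

In a real pipeline E is estimated from noisy matches, so the residual is small rather than zero and is typically used as the RANSAC inlier test.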

Classical feature detectors and descriptors

A feature is a local image patch distinctive enough to match across views.

  • Harris corners (1988) — detect points where the image intensity changes strongly in two directions, i.e. where both eigenvalues of the local structure tensor M are large (scored in practice by det(M) − k·trace(M)²).
  • SIFT (Lowe 2004) — DoG scale-space extrema + 128-D gradient-histogram descriptor. Scale, rotation, and partially illumination invariant. Patent expired 2020; now the geometric gold standard.
  • SURF (Bay 2006) — Hessian-based, faster SIFT. Patent still relevant in commercial deployments.
  • ORB (Rublee 2011, OpenCV’s default) — FAST keypoints + rotated BRIEF binary descriptor. ~10× faster than SIFT, fixed-length 256-bit binary descriptor matchable with Hamming distance. ORB-SLAM family is built on this.
  • AKAZE (Alcantarilla 2013) — nonlinear scale-space, M-LDB descriptor. Better invariance than ORB at modest extra cost.

Matching: nearest-neighbour in descriptor space, with Lowe’s ratio test (accept if d_1 / d_2 < 0.7-0.8) to reject ambiguous matches, then RANSAC to filter geometric outliers (typically fitting F, E, or a homography).
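The ratio test described above can be sketched with brute-force NumPy matching. This assumes float descriptors compared with Euclidean distance (as for SIFT) and at least two candidates in the second set; the function name and the toy 2-D descriptors are illustrative only, and the RANSAC stage is omitted.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Return (index_a, index_b) pairs that pass Lowe's ratio test."""
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(desc_a), len(desc_b))
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    matches = []
    for i in range(len(desc_a)):
        best, second = order[i, 0], order[i, 1]
        # Accept only if the best match is clearly better than the runner-up
        if dists[i, best] < ratio * dists[i, second]:
            matches.append((i, int(best)))
    return matches

# Toy descriptors: the first query has one clear match, the second is ambiguous
desc_a = [[0.0, 0.0], [10.0, 10.0]]
desc_b = [[0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
matches = ratio_test_matches(desc_a, desc_b)
print(matches)
```

The second query is rejected because its two nearest candidates are almost equally far away, which is exactly the ambiguity the test is meant to catch. Binary descriptors such as ORB would use Hamming distance instead.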

Modern (learned) features

A 2018-onward shift: replace hand-crafted SIFT/ORB with networks trained for repeatability and matchability.

  • SuperPoint (DeTone 2018) — self-supervised joint detector + descriptor, runs at 70 fps on a 2060.
  • R2D2 (Revaud 2019) — adds explicit reliability/repeatability scoring.
  • DISK (Tyszkiewicz 2020) — reinforcement-learning trained.
  • SuperGlue (Sarlin 2020) — graph-neural-network matcher over SuperPoint keypoints; massive accuracy gain over nearest-neighbour matching.
  • LightGlue (Lindenberger 2023) — same idea, ~5-10× faster; production-deployable on Jetson.
  • LoFTR (Sun 2021) — detector-free, dense transformer matching; great on textureless surfaces.

Classical features still win on raw speed and on radically out-of-domain imagery (medical, thermal, microscopy) where there’s no training data.

3. Practical math / worked examples

Worked example A — Hand-eye calibration for arm + wrist camera

The setup: a 6-DoF industrial arm with a camera bolted to the wrist. The camera sees a checkerboard fixed in the world. We need the constant SE(3) transform X = camera-to-gripper, so that knowing gripper pose lets us project camera observations into the world.

Procedure (Tsai-Lenz 1989):

  1. Move the arm to 20 distinct poses, each with the checkerboard fully visible. Vary all 6 DoF; in particular include rotations about all three axes — pure translations leave X under-determined.

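  2. At each pose, capture an image and solve PnP (cv2.solvePnP with the SOLVEPNP_IPPE or SOLVEPNP_SQPNP solver; SOLVEPNP_IPPE_SQUARE only accepts the four corners of a single square marker) to get the checkerboard pose in the camera frame, T_cb_i (the checkerboard-to-camera transform, which OpenCV calls target2cam).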

  3. Record the robot’s reported gripper-to-base transform T_gb_i from the controller.

  4. Form pairs of relative motions:

    A_i = T_gb_{i+1}^{-1} · T_gb_i        (gripper motion between poses)
    B_i = T_cb_{i+1}     · T_cb_i^{-1}     (camera motion between poses)
    
  5. Solve A_i · X = X · B_i for the unknown X. Tsai-Lenz decouples rotation and translation: first solve R_X via Rodrigues vectors (linear system over all i), then t_X (linear).

  6. Verify by computing residuals on held-out poses.
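Step 4 is where sign and inversion mistakes usually creep in, so a synthetic check is worth having. The sketch below assumes T_gb_i maps gripper coordinates to the base frame and T_cb_i maps checkerboard coordinates to the camera frame (what solvePnP returns); it generates noise-free poses from a made-up ground-truth X and confirms that A_i · X = X · B_i holds. All numeric values and helper names are hypothetical.

```python
import numpy as np

def se3(rotvec, trans):
    """4x4 rigid transform from a rotation vector (Rodrigues) and translation."""
    w = np.asarray(rotvec, dtype=float)
    theta = np.linalg.norm(w)
    R = np.eye(3)
    if theta > 1e-12:
        k = w / theta
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = trans
    return T

def relative_motions(T_gb, T_cb):
    """Build the A_i (gripper) and B_i (camera) motion pairs from step 4."""
    A = [np.linalg.inv(T_gb[i + 1]) @ T_gb[i] for i in range(len(T_gb) - 1)]
    B = [T_cb[i + 1] @ np.linalg.inv(T_cb[i]) for i in range(len(T_cb) - 1)]
    return A, B

rng = np.random.default_rng(0)
X_true = se3([0.1, -0.2, 0.3], [0.03, 0.0, 0.08])        # camera-to-gripper
T_base_board = se3([0.0, 0.0, 0.5], [0.6, 0.1, 0.0])     # board fixed in base
T_gb = [se3(rng.uniform(-1, 1, 3), rng.uniform(-0.3, 0.3, 3)) for _ in range(20)]
# What a perfect solvePnP would report at each pose (board -> camera)
T_cb = [np.linalg.inv(T @ X_true) @ T_base_board for T in T_gb]

A, B = relative_motions(T_gb, T_cb)
max_residual = max(np.abs(a @ X_true - X_true @ b).max() for a, b in zip(A, B))
print(max_residual)
```

With real data the same per-pose rotations and translations would be passed to cv2.calibrateHandEye, which expects gripper-to-base and target-to-camera inputs; running this kind of synthetic check first helps confirm the transforms are not accidentally inverted.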

Typical residuals:

| Robot class | Translational residual | Angular residual |
| --- | --- | --- |
| Industrial arm (UR5e, KUKA KR6, Fanuc LR Mate) | 0.5–2 mm | 0.05–0.2° |
| Cobot (Franka Panda, UR10e) | 1–3 mm | 0.1–0.3° |
| Educational (UFactory xArm) | 2–5 mm | 0.2–0.5° |
| Mobile manipulator (Spot arm) | 5–10 mm | 0.3–0.8° |

Alternative solvers: Park-Martin 1994 (Lie-algebra formulation), Daniilidis 1999 (dual quaternions, jointly solves R and t — best when rotation magnitudes are small). All three are in OpenCV’s cv2.calibrateHandEye (method flags CALIB_HAND_EYE_TSAI, CALIB_HAND_EYE_PARK, CALIB_HAND_EYE_DANIILIDIS).

Failure modes: all 20 poses share a near-common axis of rotation (degenerate); the checkerboard is not rigidly fixed; the controller reports a stale gripper pose (image capture and pose readout must be tightly synchronised).

Worked example B — YOLOv8 inference on Jetson

YOLOv8 (Ultralytics, 2023) is the de-facto detector on edge robots in 2024-2026. Five model sizes — nano, small, medium, large, x-large — span from 3.2 M to 68 M parameters.

Workflow:

  1. Train (or download) FP32 PyTorch weights (.pt).
  2. Export to ONNX: yolo export model=yolov8m.pt format=onnx imgsz=640 simplify=True.
  3. Build TensorRT engine: trtexec --onnx=yolov8m.onnx --saveEngine=yolov8m.engine --fp16 (or --int8 --calib=calib.txt for INT8).
  4. Run inference via TensorRT Python API or Ultralytics model.predict() with the engine.

Measured throughput (640×640 input, 1 image batch, COCO 80 classes):

| Platform | Precision | YOLOv8n | YOLOv8s | YOLOv8m | YOLOv8l |
| --- | --- | --- | --- | --- | --- |
| Jetson Orin Nano (40 TOPS INT8) | INT8 | ~130 fps | ~70 fps | ~30 fps | ~15 fps |
| Jetson Orin NX 16 GB (100 TOPS) | INT8 | ~280 fps | ~150 fps | ~70 fps | ~35 fps |
| Jetson AGX Orin 64 GB (275 TOPS) | INT8 | ~600 fps | ~330 fps | ~150 fps | ~80 fps |
| Jetson AGX Orin 64 GB | FP16 | ~280 fps | ~160 fps | ~80 fps | ~40 fps |
| RTX 4090 | FP16 | ~2400 fps | ~1500 fps | ~800 fps | ~400 fps |

COCO accuracy (mAP@50-95): YOLOv8n 37.3, YOLOv8s 44.9, YOLOv8m 50.2, YOLOv8l 52.9, YOLOv8x 53.9. INT8 quantisation typically costs 0.5-2 mAP points relative to FP32 — usually acceptable.

Decision rule: start with YOLOv8s INT8 for any edge robot; move to YOLOv8m only if accuracy budget demands. Anything above YOLOv8l on an edge device is wasted unless you’re running it at 5-10 Hz for an analytics use case rather than reactive control.
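One way to apply the throughput and accuracy numbers above is a small lookup that returns the most accurate model still fast enough for the control loop. This is an illustrative sketch: the fps and mAP values are copied from this section, while the function name and the 2× headroom factor (leaving half the frame time for pre/post-processing) are assumptions.

```python
# INT8 throughput in fps, copied from the table in this section
THROUGHPUT_FPS = {
    "Jetson Orin Nano": {"YOLOv8n": 130, "YOLOv8s": 70, "YOLOv8m": 30, "YOLOv8l": 15},
    "Jetson Orin NX": {"YOLOv8n": 280, "YOLOv8s": 150, "YOLOv8m": 70, "YOLOv8l": 35},
    "Jetson AGX Orin": {"YOLOv8n": 600, "YOLOv8s": 330, "YOLOv8m": 150, "YOLOv8l": 80},
}
# COCO mAP@50-95, also from this section
COCO_MAP = {"YOLOv8n": 37.3, "YOLOv8s": 44.9, "YOLOv8m": 50.2, "YOLOv8l": 52.9}

def pick_model(platform, control_rate_hz, headroom=2.0):
    """Most accurate model whose throughput is at least headroom x control rate.

    headroom is an assumed safety factor, not a measured requirement.
    Returns None when no model on that platform is fast enough.
    """
    required_fps = headroom * control_rate_hz
    candidates = [name for name, fps in THROUGHPUT_FPS[platform].items()
                  if fps >= required_fps]
    if not candidates:
        return None
    return max(candidates, key=COCO_MAP.get)

print(pick_model("Jetson Orin Nano", 30))   # 30 Hz loop on the smallest Orin
print(pick_model("Jetson AGX Orin", 30))
print(pick_model("Jetson Orin Nano", 100))
```

Benchmark throughput on the target device before committing; published fps figures rarely include camera capture, resizing, and NMS.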

Worked example C — ICP for 6-DoF object pose refinement

A grasp planner needs the SE(3) pose of a bottle on the table to ±2 mm. A neural pose estimator (FoundationPose, Megapose) returns a coarse initial guess T_0 with ~1 cm / 5° error. Refine with ICP against the CAD model.

Algorithm (point-to-plane ICP, Chen-Medioni 1992 / Low 2004):

  1. Segment the observed point cloud P_obs (cluster around the detector’s bounding box, remove plane, remove outliers via statistical filter).

  2. Transform the CAD model’s sampled point cloud P_cad by T_0.

  3. For each transformed CAD point p_i, find the nearest neighbour q_i in P_obs (KD-tree, ~O(log N)).

  4. Solve the linearised least-squares problem:

    argmin_{ΔT}  Σ_i  ((ΔT · p_i − q_i) · n_i)²
    

    where n_i is the surface normal at q_i. Point-to-plane converges faster than point-to-point because it doesn’t penalise tangential sliding.

  5. Compose T_0 ← ΔT · T_0; repeat until convergence (residual < threshold, e.g. 0.5 mm RMS).

Typical numbers: 30 iterations, ~0.5 mm RMS residual, converges in 20–80 ms on a CPU for ~5 000 points.
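The linearised solve in step 4 can be sketched in a few lines of NumPy. To keep the example short it assumes correspondences are already known (skipping the KD-tree search of step 3) and uses synthetic random normals rather than normals estimated from a real surface; the function names and test transform are hypothetical. For a small rotation vector w and translation t, the residual is (p_i − q_i)·n_i + w·(p_i × n_i) + t·n_i, which gives a 6-unknown least-squares problem.

```python
import numpy as np

def rotvec_to_matrix(w):
    """Rodrigues formula: rotation vector (axis * angle) to a 3x3 matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def point_to_plane_step(p, q, n):
    """One linearised point-to-plane update for matched points.

    p: (N, 3) model points under the current estimate
    q: (N, 3) matched observed points, n: (N, 3) unit normals at q
    Returns the 4x4 increment dT.
    """
    J = np.hstack([np.cross(p, n), n])            # (N, 6) Jacobian
    b = -np.einsum("ij,ij->i", p - q, n)          # negative signed distances
    x, *_ = np.linalg.lstsq(J, b, rcond=None)
    dT = np.eye(4)
    dT[:3, :3] = rotvec_to_matrix(x[:3])
    dT[:3, 3] = x[3:]
    return dT

rng = np.random.default_rng(0)
q = rng.uniform(-0.1, 0.1, size=(200, 3))         # "observed" points, metres
n = rng.normal(size=(200, 3))
n /= np.linalg.norm(n, axis=1, keepdims=True)     # synthetic unit normals
T_true = np.eye(4)
T_true[:3, :3] = rotvec_to_matrix(np.deg2rad([2.0, -1.0, 3.0]))
T_true[:3, 3] = [0.005, -0.003, 0.004]
p = (q - T_true[:3, 3]) @ T_true[:3, :3]          # so that T_true maps p onto q

T = np.eye(4)
for _ in range(10):
    p_cur = p @ T[:3, :3].T + T[:3, 3]
    T = point_to_plane_step(p_cur, q, n) @ T      # compose: T <- dT . T

p_final = p @ T[:3, :3].T + T[:3, 3]
rms = np.sqrt(np.mean(np.einsum("ij,ij->i", p_final - q, n) ** 2))
print(rms)
```

In practice the correspondences are re-estimated every iteration and an outlier rejection step is added; libraries such as Open3D provide tested implementations.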

Robust variants:

  • Trimmed ICP (Chetverikov 2002) — discard the worst N% of correspondences per iteration. Robust to occlusion.
  • Generalised ICP (GICP) (Segal 2009) — plane-to-plane with covariance Mahalanobis distance. Excellent for noisy LiDAR.
  • Color ICP (Park 2017) — adds photometric cost; great for RGB-D registration of textured surfaces.
  • TEASER++ (Yang 2020) — certifiably-robust global registration before fine ICP; handles 99 % outlier rates.

When ICP fails: initial guess too far (> half the object’s characteristic length), symmetric objects (the rotation about the symmetry axis is unobservable — return a distribution over poses, not a single one), severe occlusion (< 30 % of object visible).

4. Design heuristics

Pick the model class by task

| Task | Recommended primary | Why |
| --- | --- | --- |
| Bin-picking known SKUs | FoundationPose / Megapose + ICP refine | Pose-to-grasp, robust to clutter |
| Bin-picking unknown objects | GraspNet baseline (Fang 2020), GIGA (Jiang 2021), Contact-GraspNet | Direct grasp-pose regression from point cloud |
| Pick-and-place on conveyor | YOLOv8 detection + tracker + classical 2D pose | Fast, predictable |
| Open-vocabulary “pick the X” | SAM (segment) + CLIP (classify) + grasp net | Zero-shot; latency 200–500 ms |
| Free-space / drivable mask | Mask2Former or BEVFormer | Semantic segmentation at scene scale |
| Drone obstacle avoidance | Monocular MiDaS / Depth Anything + classical VIO | 60–120 Hz reactive |
| Object tracking across frames | BoT-SORT, ByteTrack, OC-SORT on top of YOLO | Identity persistence |
| Visual question answering | LLaVA-1.6, GPT-4V, Gemini-1.5 Pro Vision | “What’s broken about this part?” |
| Defect inspection | PaDiM, PatchCore, SimpleNet (anomaly detection) | One-class, only good samples needed |
| Surgical instrument tracking | Sage / DETR variants on stereo video | Sub-mm in-frame accuracy |

Compute budget mapping

| Budget | Sensible architecture |
| --- | --- |
| ≤ 5 W (battery drone, prosthetic) | YOLOv8n / MobileNetV3, INT8 on Hailo-8 / Coral / Snapdragon NPU |
| 5–25 W (Jetson Orin Nano/NX) | YOLOv8s/m INT8, Depth Anything Small FP16, ORB-SLAM3 |
| 25–60 W (Jetson AGX Orin) | YOLOv8l-x FP16, FoundationPose, full-res Mask2Former |
| 100–300 W (workstation, RTX 4080/4090) | Full Detectron2 pipelines, NeRF / 3DGS training, ViT-Large pose |
| Cloud / data centre (A100, H100) | VLM (GPT-4V, Gemini), foundation models, training |

Synthetic data and sim-to-real

Hand-labelling 50 000 robotic-grasp poses takes a year. Synthetic generation takes a weekend.

  • NVIDIA Isaac Sim + Replicator — physically-based path-traced rendering at scale, programmatic randomisation, USD asset library. Most production-grade.
  • BlenderProc (Denninger 2019) — Python-scripted Blender renders; ubiquitous in academia.
  • Unity Perception — game-engine renders, easy randomisation, weaker physics than Isaac.
  • NVIDIA Omniverse Replicator — distributed Isaac Replicator; large-scale dataset farms.

Sim-to-real techniques:

  1. Domain randomisation (Tobin 2017) — randomise lighting, textures, camera intrinsics, distractor objects. Force the model to ignore everything except the task-relevant geometry.
  2. Photometric realism — path tracing, true HDR environment maps, PBR materials, realistic ISP simulation.
  3. Domain adaptation — CycleGAN-style sim→real translation; less common in 2024+ as randomisation has gotten cheap and effective.
  4. Self-supervised real-world fine-tune — pseudo-labels from a sim-trained teacher; modest amount of real data closes the gap.

Active perception

Don’t ask “what’s in this image?” — ask “where should I look next?” Next-best-view planning maximises information gain per camera move; for occluded bin-picking it can halve cycle time. Classical formulations (Connolly 1985, Pito 1999) computed visibility frustums; modern (2023+) work uses NeRF/3DGS posterior-uncertainty for view selection.

Data hygiene

  • Label noise > 5 % quietly destroys supervised learning. Find bad labels by training a model and surfacing high-loss training examples (Northcutt 2021 “Confident Learning”).
  • Class imbalance > 100:1 needs focal loss (Lin 2017) or under/over-sampling.
  • Train/test distribution shift is the silent killer of robotic deployment — check that test set sensors, lighting, and operator behaviour match production.
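For the class-imbalance bullet, the focal loss of Lin 2017 is short enough to write out. The sketch below is the binary, single-prediction form with the paper's default γ = 2 and α = 0.25; the function name and the clamping constant are illustrative choices, and a training framework would use a vectorised, differentiable version.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction.

    p: predicted probability of the positive class, y: label in {0, 1}.
    The (1 - p_t)^gamma factor down-weights examples the model already
    classifies confidently, so rare hard examples dominate the gradient.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)     # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(0.9, 1)   # confident and correct
hard = binary_focal_loss(0.1, 1)   # confident and wrong
print(easy, hard)
```

Setting gamma to 0 recovers α-weighted cross-entropy, which is a convenient check when swapping the loss into an existing training loop.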

Distillation and quantisation

  • Knowledge distillation (Hinton 2015) — train YOLOv8m as teacher, use soft logits to train YOLOv8n student. Closes ~50 % of the n-vs-m accuracy gap.
  • INT8 post-training quantisation — typically loses 0.5-2 mAP on COCO. Run the quantiser on a calibration set drawn from production (not COCO val).
  • QAT (quantisation-aware training) — fine-tune with fake-quant nodes; recovers most of the INT8 gap.
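The soft-logit idea behind knowledge distillation can be illustrated on classification logits. This is a sketch of the Hinton 2015 formulation (temperature-softened softmax, KL divergence scaled by T²); detector distillation for YOLO-style heads involves box and objectness terms that are not shown. The temperature of 4 and the function names are assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis, with temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Mean KL(teacher || student) on temperature-softened distributions.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    eps = 1e-12                                   # guards log(0)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)),
                axis=-1)
    return float(np.mean(kl) * temperature ** 2)

teacher = [[1.0, 2.0, 3.0]]
print(distillation_loss([[1.0, 2.0, 3.0]], teacher))   # student agrees
print(distillation_loss([[3.0, 2.0, 1.0]], teacher))   # student disagrees
```

In training this term is usually mixed with the ordinary hard-label loss through a weighting factor chosen on a validation set.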

Camera placement beats model choice

Five extra centimetres of camera height eliminate self-occlusion. A 5° downward tilt collapses depth ambiguity. Glare on a windshield camera ruins every model. Fix the physics first.

5. Components & sourcing — frameworks, models, datasets

CV libraries (general)

| Library | Strengths | Notes |
| --- | --- | --- |
| OpenCV 4.10+ | Classical CV (calib3d, features2d, imgproc), some DNN inference | C++ / Python / Java; ubiquitous |
| PyTorch 2.4+ | Training, research, eager execution | Default in 2024+ |
| TorchVision | Detection (Faster/Mask R-CNN, RetinaNet), classification backbones | Maintained by Meta |
| Ultralytics | YOLOv5/v8/v9/v10/v11 wrapper | Trivial CLI, ONNX/TensorRT export |
| Detectron2 | Mask R-CNN, DensePose, PointRend, panoptic seg | Meta, research-grade |
| MMDetection / MMSegmentation | Modular OpenMMLab stack | 200+ model configs; Chinese academic standard |
| HuggingFace Transformers | DETR, SAM, ViTPose, depth networks, VLMs | One-line model load |
| MediaPipe | On-device hand/face/body pose | Google, real-time on mobile |
| ONNX Runtime | Cross-platform inference | CPU, CUDA, TensorRT, CoreML, DirectML EPs |
| TensorRT | NVIDIA-optimised inference | Best Jetson / GPU throughput |
| OpenVINO | Intel-optimised inference | NUC, embedded x86 |

Robotics-specific frameworks

| Framework | Role |
| --- | --- |
| ROS 2 + image_pipeline + vision_msgs | Camera transport, detection topic schemas (see [[Languages/Tier3/robotics-control]]) |
| NVIDIA Isaac SDK + Isaac Manipulator | Reference pipelines (FoundationPose, cuMotion) |
| NVIDIA DeepStream | Multi-camera production video analytics |
| Open3D | Point-cloud ops, ICP, RANSAC, normals, Poisson reconstruction |
| PCL (legacy) | Original point-cloud library; still used in older ROS stacks |
| GTSAM / g2o | Pose-graph and bundle adjustment back-ends (ORB-SLAM3 uses g2o) |
| Ceres Solver | General nonlinear LS, used by COLMAP |
| COLMAP / OpenSfM | Offline structure-from-motion |
| Kornia | Differentiable CV ops in PyTorch; learnable homographies |

6-DoF pose models and frameworks

| Model | Year | Approach | Notes |
| --- | --- | --- | --- |
| PoseCNN (Xiang 2018) | 2018 | RGB → translation + Hough quaternion | YCB-V benchmark |
| DenseFusion (Wang 2019) | 2019 | Per-pixel fusion of RGB + depth features | Strong on cluttered scenes |
| PVN3D (He 2020) | 2020 | 3D keypoints in point cloud | Cuts symmetric ambiguity |
| GDR-Net (Wang 2021) | 2021 | Geometry-guided regression on RGB | LineMod SOTA in 2021 |
| Megapose (Labbé 2022) | 2022 | Render-and-compare for unseen objects | CAD at test time only |
| FoundationPose (Wen 2024, CVPR) | 2024 | One model, any CAD, RGB-D | Production-ready zero-shot |
| GraspNet baseline (Fang 2020) | 2020 | Direct grasp pose from point cloud | 88 k grasps benchmark |
| GIGA (Jiang 2021) | 2021 | Implicit grasp affordance field | TSDF input |
| Contact-GraspNet (Sundermeyer 2021) | 2021 | Predicts contact grasp on scene cloud | Used in many cobot stacks |

Foundation models (vision)

| Model | Vendor | Use |
| --- | --- | --- |
| SAM / SAM 2 (Kirillov 2023 / Ravi 2024) | Meta | Promptable segmentation; image + video |
| CLIP (Radford 2021) | OpenAI | Image ↔ text embedding; open-vocab classification |
| DINOv2 (Oquab 2023) | Meta | Self-supervised features; great for retrieval |
| Depth Anything V1/V2 (Yang 2024) | TikTok / HKU | Zero-shot monocular metric depth |
| GPT-4V / GPT-4o Vision | OpenAI | VLM; cloud API |
| Claude Vision (Claude 3 family onward) | Anthropic | VLM; cloud API |
| Gemini 1.5/2.0 Pro Vision | Google | VLM; cloud API |
| LLaVA-1.5 / 1.6 / NeXT (Liu 2023+) | OSS | Local VLM; runs on RTX 4090 |
| BLIP-2 (Li 2023) | Salesforce | Image captioning, VQA |
| Florence-2 (Microsoft 2024) | Microsoft | Unified detection/segmentation/captioning |

Embedded inference accelerators

| Part | TOPS | Power | Notes |
| --- | --- | --- | --- |
| NVIDIA Jetson Orin Nano 8 GB | 40 (INT8) | 7-15 W | Hobby + prosumer |
| NVIDIA Jetson Orin NX 16 GB | 100 (INT8) | 10-25 W | Drone / AMR sweet-spot |
| NVIDIA Jetson AGX Orin 64 GB | 275 (INT8) | 15-60 W | Heavy perception |
| NVIDIA Thor (Drive) | ~1000 (INT8) | ~50-130 W | Production AV, 2025+ |
| Hailo-8 | 26 (INT8) | 2.5 W | M.2 / PCIe addon |
| Hailo-10H | 40 (INT8) | ~3.5 W | 2024 release |
| Google Coral Edge TPU | 4 (INT8) | 2 W | Quantised int8 only, model size capped |
| AMD/Xilinx Kria K26 / K24 | ~1-3 (DPU) | 5-15 W | FPGA-based, deterministic latency |
| Intel Movidius Myriad X / OpenVINO | ~1 (INT8) | 2 W | NUC + USB stick |
| Qualcomm RB5 / RB6 | 15 / 35+ (INT8) | 5-15 W | Snapdragon Flight drone reference |
| Apple M2/M4 Neural Engine | 16-38 (INT8) | shared | iPad / Mac robots, ARKit pipeline |
| TI TDA4VM / TDA4AEN | 8 (INT8) + DSP | 5-20 W | Automotive ECU |

Datasets

| Dataset | Domain | Notes |
| --- | --- | --- |
| COCO (Lin 2014) | 80-class detection, segmentation, keypoints | 330 k images; the universal benchmark |
| LVIS (Gupta 2019) | 1203-class long-tail detection | COCO superset for rare-object eval |
| Open Images V7 (Kuznetsova 2018+) | 600 classes, 9 M images | Largest free-use detection dataset |
| ImageNet-21k | 21 k-class classification | Pretraining backbone |
| YCB-Video (Xiang 2017) | 21 household objects, 6-DoF pose | Bin-picking standard |
| T-LESS (Hodaň 2017) | 30 texture-less industrial objects | Hard pose benchmark |
| LineMod / LineMod-Occluded (Hinterstoisser 2012) | 15 texture-less objects + occlusion | Original 6-DoF benchmark |
| BOP Challenge | 11 datasets unified | Annual 6-DoF pose competition |
| KITTI (Geiger 2012) | Driving stereo + Velodyne + GPS | First AV benchmark |
| nuScenes (Caesar 2020) | 6-cam + 1-LiDAR + 5-radar | Modern AV dataset |
| Waymo Open Dataset (Sun 2020) | 5-LiDAR + 5-cam | Highest-fidelity AV |
| DeepDrive / BDD100K | 100 k driving videos | Diverse weather |
| Habitat / iGibson / Replica | Photoreal indoor sim | Embodied AI benchmarks |
| RoboNet (Dasari 2019) | 15 M robot manipulation frames | Cross-robot learning |
| Berkeley Open Manipulation | Manipulation video | Skill learning |
| Cityscapes (Cordts 2016) | Urban street segmentation | Driving semantic seg |
| ADE20K (Zhou 2017) | 150-class scene segmentation | General segmentation |

6. Reference data

Classical CV vs deep CV — when to pick which

| Aspect | Classical | Deep learning |
| --- | --- | --- |
| Geometric ground truth (calibration, pose, SLAM front-end) | ✓ | rarely |
| Semantic understanding (object class, free space) |  | ✓ |
| Out-of-domain (novel sensor, no training data) | ✓ |  |
| Determinism / certification | ✓ | hard |
| Sub-pixel accuracy | ✓ (with refinement) | typically not |
| Open-vocabulary |  | ✓ (VLM) |
| Compute cost | low | medium-high |
| Sensitivity to lighting | high | trained-distribution-dependent |
| Symmetry / scale ambiguity handling | by design | learned, sometimes wrong |

Detector benchmark (COCO 2017 val, single-scale)

| Detector | Backbone | mAP@50-95 | FPS (V100/A100) | Year |
| --- | --- | --- | --- | --- |
| Faster R-CNN | ResNet-50 FPN | 40.2 | 26 | 2015 |
| RetinaNet | ResNet-50 FPN | 36.5 | 30 | 2017 |
| Mask R-CNN | ResNet-50 FPN | 37.9 box / 34.8 mask | 23 | 2017 |
| YOLOv4 | CSPDarknet53 | 43.5 | 65 | 2020 |
| YOLOv5l | CSPDarknet | 49.0 | 99 | 2020 |
| YOLOX-L | Modified CSPDarknet | 50.0 | 69 | 2021 |
| DETR | ResNet-50 | 42.0 | 28 | 2020 |
| Deformable DETR | ResNet-50 | 46.2 | 19 | 2021 |
| DINO-DETR | Swin-L | 63.3 | 12 | 2022 |
| RT-DETR-L | HGNetv2 | 53.0 | 114 | 2024 |
| YOLOv8n / s / m / l / x |  | 37.3 / 44.9 / 50.2 / 52.9 / 53.9 | very fast (see worked ex B) | 2023 |
| YOLOv9-C / E |  | 53.0 / 55.6 | similar | 2024 |
| YOLOv10-N / S / M / L / X |  | 38.5 / 46.3 / 51.1 / 53.2 / 54.4 | + 15-30% over v8 | 2024 |
| YOLOv11-N / S / M / L / X |  | 39.5 / 47.0 / 51.5 / 53.4 / 54.7 | similar to v10 | 2024 |

Segmentation benchmark (COCO, Cityscapes)

| Model | Task | Score | Notes |
| --- | --- | --- | --- |
| U-Net (Ronneberger 2015) | Binary biomedical seg | task-dep | small datasets |
| DeepLabV3+ (Chen 2018) | Cityscapes mIoU | 82.1 | strong outdoor |
| Mask2Former Swin-L (Cheng 2022) | COCO panoptic PQ | 57.8 | unified det + seg |
| SAM ViT-H (Kirillov 2023) | promptable, no class | n/a | 1.1 B masks pretrain |
| SAM 2 (Ravi 2024) | promptable image + video | n/a | video segmentation SOTA |
| OneFormer (Jain 2023) | universal | top-tier | single model 3 seg tasks |

Depth estimation benchmark

| Model | Type | KITTI / NYUv2 | Notes |
| --- | --- | --- | --- |
| MiDaS v3.1 (Ranftl 2020+) | Mono relative | strong | DPT backbone, robust |
| ZoeDepth (Bhat 2023) | Mono metric | 0.075 AbsRel NYU | combines MiDaS + metric heads |
| Metric3D V2 (Hu 2024) | Mono metric | top-tier | unified intrinsics handling |
| Depth Anything V2 (Yang 2024) | Mono relative + metric | SOTA zero-shot | trained 62 M images |
| RAFT-Stereo (Lipson 2021) | Stereo | KITTI top | iterative GRU |
| PSMNet (Chang 2018) | Stereo | KITTI ~2 % bad-3 | pyramid cost volume |
| FoundationStereo (2024) | Stereo zero-shot | new SOTA | foundation-model approach |
| NeRF (Mildenhall 2020) | Multi-view implicit | scene-specific | training per scene |
| 3D Gaussian Splatting (Kerbl 2023) | Multi-view explicit | real-time render | takes minutes to train, ms to render |

6-DoF pose benchmark (BOP Challenge 2024 — average recall)

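| Method | LineMod-O | YCB-V | T-LESS | Notes |
| --- | --- | --- | --- | --- |
| PVN3D | 0.72 | 0.85 | 0.67 | depth-keypoints |
| GDR-Net | 0.71 | 0.80 | 0.68 | RGB only |
| Megapose | 0.79 | 0.85 | 0.71 | unseen objects |
| FoundationPose | 0.91 | 0.93 | 0.89 | 2024 SOTA, RGB-D |
| FoundationPose RGB-only | 0.83 | 0.86 | 0.78 | no depth required |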

Accelerator inference throughput (additional, vs Sec 5)

| Model | Jetson Orin Nano INT8 | Orin NX INT8 | AGX Orin INT8 | RTX 4090 FP16 |
| --- | --- | --- | --- | --- |
| YOLOv8m | 30 fps | 70 fps | 150 fps | 800 fps |
| Mask2Former Swin-S | n/a (too big) | 8 fps | 25 fps | 90 fps |
| Depth Anything V2 Small | 15 fps | 35 fps | 80 fps | 400 fps |
| FoundationPose | 4 fps | 12 fps | 35 fps | 120 fps |
| SAM ViT-B | n/a | 3 fps | 8 fps | 35 fps |
| LightGlue (1024 kpts) | 30 fps | 80 fps | 180 fps | 700 fps |
| RAFT optical flow (1080p) | 4 fps | 12 fps | 35 fps | 120 fps |

Coordinate-frame conventions

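| System | +X | +Y | +Z |
| --- | --- | --- | --- |
| OpenCV camera | right | down | forward |
| ROS / REP-103 | forward | left | up |
| OpenGL / glTF | right | up | toward viewer |
| Unreal | forward | right | up |
| Blender | right | forward | up |
| Unity | right | up | forward (left-handed) |
| KITTI camera | right | down | forward (matches OpenCV) |
| KITTI velodyne | forward | left | up (matches ROS) |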

Confusing these has wasted more engineer-hours than any other CV bug.
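As an illustration of the most common conversion, the sketch below re-expresses points from the OpenCV camera convention (x right, y down, z forward) in the ROS body convention (x forward, y left, z up). It handles only the axis relabelling for a camera looking along the robot's forward axis; any mounting offset or tilt is a separate extrinsic transform. The constant and function names are hypothetical.

```python
import numpy as np

# x_ros (forward) = +z_cv, y_ros (left) = -x_cv, z_ros (up) = -y_cv
R_ROS_FROM_CV = np.array([[0.0, 0.0, 1.0],
                          [-1.0, 0.0, 0.0],
                          [0.0, -1.0, 0.0]])

def cv_to_ros(points_cv):
    """Re-express Nx3 points from OpenCV camera axes in ROS body axes."""
    return np.asarray(points_cv, dtype=float) @ R_ROS_FROM_CV.T

# A point 2 m ahead, 0.5 m to the right, 0.1 m below the optical axis
point_ros = cv_to_ros([[0.5, 0.1, 2.0]])
print(point_ros)
```

ROS also defines a separate optical-frame convention (frames with the `_optical` suffix) that already matches OpenCV, so it is worth checking which frame a driver publishes in before applying this rotation.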

7. Failure modes & debugging

  • Domain gap (sim → real) — model trained in Isaac fails on real RGB. Diagnose: side-by-side renders vs photographs; check sensor noise model, ISP gamma, lighting. Fix: domain randomisation, photometric tuning, modest fine-tune on real.
  • Catastrophic forgetting during fine-tune — model loses pre-training capabilities after a few epochs on a small target dataset. Fix: lower LR, freeze backbone, mix in original-domain data (replay), or use LoRA / adapter layers.
  • Model confidence ≠ accuracy — a softmax score of 0.95 means “well-trained on data like this”, not “95 % probability correct”. Calibrate with temperature scaling (Guo 2017) or, better, conformal prediction sets.
  • Spurious correlation / shortcut learning — model classifies tanks by photo time of day. Diagnose with Grad-CAM, integrated gradients, or counterfactual augmentation. Fix: targeted augmentation, mask out background, balance dataset.
  • Adversarial / patch attacks — a printed sticker on a stop sign defeats a detector. Robotics-safety mitigation: ensemble across modalities (LiDAR + radar + camera), out-of-distribution rejection, latency-bounded sanity checks.
  • Slow per-pixel ops in Python — Looping over pixels in a NumPy array runs at 1 fps. Vectorise via cv2.* calls, or move to GPU (CuPy, Torch tensors, CUDA streams).
  • Lens dirt — a fingerprint, dust, or water drop on the lens silently corrupts every downstream prediction. Sanity-check: histogram of central image patch vs corners; a per-camera “dirty-lens classifier” running at 1 Hz catches most cases.
  • HDR / saturation in extreme dynamic range — sunlight through a tunnel exit saturates the sensor for 100+ ms. Mitigation: HDR sensors (Sony IMX490, 120+ dB), bracketed exposures, sensor-level tone mapping.
  • Latency spikes from CPU contention — perception jitter caused by other processes. Pin CPU affinities; use isolcpus kernel parameter; route inference exclusively to GPU/NPU.
  • GPU / TensorRT memory leak — long-running pipeline OOMs after hours. Pre-allocate all tensors; reuse contexts; monitor with nvidia-smi --query-gpu=memory.used --loop=10.
  • Network bandwidth saturation when streaming 4K — see [[Robotics/sensors-perception]] worked example D. Compress on-camera (H.265, JPEG-XS), stream ROIs only, or do CV at the edge.
  • Coordinate-frame mistakes — see Sec 6 conventions table. Verify by projecting a known marker (AprilTag, ChArUco) and confirming the projected pixels match observation.
  • Pose ambiguity for symmetric objects — bottle SO(2)-symmetric about its axis; the rotation about that axis is unobservable. Don’t return a single pose — return a distribution (one rotation per symmetry class) and let the grasp planner pick.
  • Label bias / class skew — COCO has 100× more “person” annotations than “hair drier”; a model trained on COCO is biased toward common classes. For robotic deployment, train on dataset matched to deployment distribution.
  • Dataset leakage — train images contain the same scene under slightly different lighting as the test set. Measured accuracy is wildly optimistic. Strict scene-level splits, not image-level.
  • Photometric drift — sun angle changes; descriptor matching degrades over the day. Robust features (SuperPoint, ORB) > photometric VO (DSO) in outdoor settings.
  • Face / licence-plate visibility — privacy regulation (GDPR, CCPA) forbids storing identifiable images. Blur faces and plates on-device (MediaPipe, YOLO-face) before any persistence.
  • Stereo disparity collapse on textureless surfaces — blank walls, white floors, glass. Fix: active stereo (RealSense projector), LiDAR fusion, or learned matchers (RAFT-Stereo, CREStereo) that hallucinate plausible disparity.
  • Pose flip on 180° symmetric objects — a cylinder with a tiny asymmetry near one end may be matched flipped 50 % of the time. Augment training with the symmetric pose and use the symmetry-aware loss (Wang 2019).
  • AprilTag / ArUco false positives — fast-moving cameras motion-blur tags into false detections. Set marker decision-margin minimums; reject if reprojection error > 1 px.
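For the dataset-leakage item above, one simple way to enforce scene-level splits is to derive the split from a hash of the scene identifier, so every image of a scene lands on the same side. This is a sketch under the assumption that each image carries a scene ID; the function name, the 20 % test fraction, and the example records are hypothetical.

```python
import hashlib

def split_for_scene(scene_id, test_fraction=0.2):
    """Deterministically assign a whole scene to 'train' or 'test'."""
    digest = hashlib.sha256(scene_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000    # roughly uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Hypothetical (scene_id, image_name) records
images = [
    ("aisle_03", "img_0001.png"),
    ("aisle_03", "img_0002.png"),   # same scene, different lighting
    ("dock_11", "img_0003.png"),
]
assignment = {name: split_for_scene(scene) for scene, name in images}
print(assignment)
```

Because the assignment depends only on the scene ID, adding new images later never moves an existing scene across the train/test boundary.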

8. Case studies

Boston Dynamics Stretch — warehouse box manipulation

The Stretch robot (commercial release 2022) is purpose-built for unloading 23 kg boxes from trailers at ~800 boxes/hr. Its perception stack is a Detectron2-derived custom CNN trained on millions of synthetic + real labelled box images, running on an onboard x86 compute node (no Jetson — the platform has ample power budget). The detector segments box top-faces and front-faces from a wrist-mounted RGB-D camera (Intel RealSense-class); a downstream grasp planner picks the closest visible box, then the 7-DoF “smart gripper” arm executes the grasp.

Sim-to-real comes from Boston Dynamics’ internal simulator (an Isaac Sim derivative) plus a few hundred thousand real-trailer images collected from customer deployments. The system tolerates 30+% box overlap, missing labels, banged-up cardboard, and “weird stacks” (Brazilian-pallet diagonal patterns). The vision stack runs at ~10 Hz; the bottleneck is mechanical motion, not perception.

Skydio X10 — vision-only autonomous drone

The Skydio X10 (2023) deploys six Sony IMX-class CMOS sensors in a hemispherical hex configuration, fused into a 360° depth + tracking system on an onboard NVIDIA Jetson Orin NX. The perception software is proprietary (Skydio Autonomy Engine 3+) but reportedly combines stereo from each adjacent camera pair, a learned monocular depth network for far-field, and a fused 3D occupancy grid at 30 Hz.

The signature behaviour — follow-me through dense forest at 10 m/s — relies on a person-tracker (multi-camera person re-ID + Kalman filter) plus a real-time path planner that operates on the occupancy grid. Skydio claims sub-second-latency obstacle avoidance, no LiDAR, no radar. The architectural bet — that vision plus enough cameras can replace LiDAR — has proven out for small drones (different conclusion from Boston Dynamics Spot or Waymo, where mass/power budgets favour LiDAR). Production deployments include US DoT bridge inspection, military reconnaissance (Skydio X2D), and forest firefighting.

Covariant Brain — manipulation foundation model

Covariant (whose founders joined Amazon in 2024 under a technology-licensing deal) operates one of the largest robotic-manipulation foundation models in the field. Their Covariant Brain (published RSS 2022, updated 2023-2024) is a transformer-based vision-action model trained on millions of real bin-picks captured across customer warehouses (Knapp, GAP, USPS, RadialSpring). Input: RGB-D from a fixed overhead camera plus a wrist camera. Output: a grasp pose, an approach trajectory, and a confidence score.

The system claims human-level reliability on 10 000+ SKU bins with no per-SKU configuration — a sharp contrast with traditional industrial vision where every part requires CAD + manual setup. Covariant attributes this to scale: a single model trained on the union of all customer data generalises across customers, while a per-customer model trained on a 10 k-pick dataset would not. The platform shipped on RB-style cobots (Yaskawa, ABB, Universal Robots) before the 2024 Amazon deal broadened it onto Amazon’s internal fleet.

9. Cross-references

  • [[Robotics/sensors-perception]] — camera, LiDAR, depth-sensor hardware that feeds this layer; the bandwidth, sync, and calibration concerns are upstream of every model in this note.
  • [[Robotics/slam]] — companion note; SLAM is where this note’s feature-matching, depth estimation, and pose estimation get fused into a coherent map + trajectory.
  • [[Robotics/bayesian-estimation]] — planned; the Kalman / EKF / particle-filter mathematics for fusing detection observations over time into trackers (ByteTrack and BoT-SORT already use a linear Kalman filter for motion prediction but associate detections heuristically; production safety systems usually want an explicit EKF with modelled noise).
  • [[Robotics/end-effectors]] — planned; grasp planners consume 6-DoF poses from the FoundationPose / GraspNet outputs covered here.
  • [[Robotics/manipulator-design]] — planned; payload, reach, and wrist DoF constrain how a vision-derived grasp can be executed.
  • [[Robotics/path-planning]] — planned; the free-space mask and occupancy grid from segmentation feed the planner.
  • [[Robotics/kinematics-dh]] — extrinsic camera-to-base calibration uses the same SE(3) chain math.
  • [[Robotics/impedance-control]] — visually-servoed contact tasks blend the pose estimates here with force feedback there.
  • [[Engineering/microcontrollers]] — embedded inference targets (NXP S32, STM32MP, Ambarella).
  • [[Engineering/fpga-design]] — Xilinx Kria / Lattice CrossLink fixed-latency vision pipelines.
  • [[Languages/Tier3/3d-scene]] — point-cloud / mesh formats (PLY, USD, glTF, E57) used to store CAD references for ICP and renders for synthetic data.
  • [[Languages/Tier3/robotics-control]] — vision_msgs, sensor_msgs/Image, sensor_msgs/PointCloud2 schemas; ROS 2 conventions for perception topics.
  • [[Languages/Tier3/genai-llm-runtime]] — planned; VLM serving stacks (vLLM, TensorRT-LLM, llama.cpp) for local LLaVA / BLIP-2 inference on robots.

10. Citations

  • Hartley, R. & Zisserman, A. (2003). Multiple View Geometry in Computer Vision (2nd ed.). Cambridge University Press. The canonical reference for projective geometry, F/E matrices, calibration.
  • Szeliski, R. (2022). Computer Vision: Algorithms and Applications (2nd ed.). Springer. Free PDF at szeliski.org/Book. Definitive practitioner reference.
  • Forsyth, D. A. & Ponce, J. (2011). Computer Vision: A Modern Approach (2nd ed.). Pearson.
  • Lowe, D. (2004). “Distinctive Image Features from Scale-Invariant Keypoints.” IJCV 60(2), 91–110. DOI 10.1023/B:VISI.0000029664.99615.94 (SIFT).
  • Bay, H., Tuytelaars, T. & Van Gool, L. (2006). “SURF: Speeded Up Robust Features.” ECCV.
  • Rublee, E., Rabaud, V., Konolige, K. & Bradski, G. (2011). “ORB: An efficient alternative to SIFT or SURF.” ICCV.
  • Hirschmüller, H. (2008). “Stereo Processing by Semiglobal Matching and Mutual Information.” IEEE TPAMI 30(2), 328–341.
  • Nistér, D. (2004). “An Efficient Solution to the Five-Point Relative Pose Problem.” IEEE TPAMI 26(6), 756–770.
  • Gao, X. S., Hou, X. R., Tang, J. & Cheng, H. F. (2003). “Complete solution classification for the perspective-three-point problem.” IEEE TPAMI 25(8), 930–943 (P3P).
  • Tsai, R. & Lenz, R. (1989). “A New Technique for Fully Autonomous and Efficient 3D Robotics Hand/Eye Calibration.” IEEE Trans. Robotics & Automation 5(3), 345–358.
  • Park, F. C. & Martin, B. J. (1994). “Robot Sensor Calibration: Solving AX = XB on the Euclidean Group.” IEEE Trans. RA 10(5), 717–721.
  • Daniilidis, K. (1999). “Hand-Eye Calibration Using Dual Quaternions.” IJRR 18(3), 286–298.
  • Besl, P. J. & McKay, N. D. (1992). “A Method for Registration of 3-D Shapes.” IEEE TPAMI 14(2), 239–256 (ICP).
  • Chen, Y. & Medioni, G. (1992). “Object modelling by registration of multiple range images.” Image and Vision Computing 10(3), 145–155 (point-to-plane ICP).
  • Segal, A., Haehnel, D. & Thrun, S. (2009). “Generalized-ICP.” RSS (GICP).
  • Yang, H., Shi, J. & Carlone, L. (2020). “TEASER: Fast and Certifiable Point Cloud Registration.” IEEE T-RO 37(2), 314–333.
  • DeTone, D., Malisiewicz, T. & Rabinovich, A. (2018). “SuperPoint: Self-Supervised Interest Point Detection and Description.” CVPRW.
  • Sarlin, P.-E., DeTone, D., Malisiewicz, T. & Rabinovich, A. (2020). “SuperGlue: Learning Feature Matching with Graph Neural Networks.” CVPR.
  • Lindenberger, P., Sarlin, P.-E. & Pollefeys, M. (2023). “LightGlue: Local Feature Matching at Light Speed.” ICCV.
  • Sun, J. et al. (2021). “LoFTR: Detector-Free Local Feature Matching with Transformers.” CVPR.
  • Ronneberger, O., Fischer, P. & Brox, T. (2015). “U-Net: Convolutional Networks for Biomedical Image Segmentation.” MICCAI.
  • He, K., Gkioxari, G., Dollár, P. & Girshick, R. (2017). “Mask R-CNN.” ICCV.
  • Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). “Focal Loss for Dense Object Detection.” ICCV (RetinaNet).
  • Carion, N. et al. (2020). “End-to-End Object Detection with Transformers.” ECCV (DETR).
  • Zhao, Y. et al. (2024). “DETRs Beat YOLOs on Real-time Object Detection.” CVPR (RT-DETR).
  • Cheng, B. et al. (2022). “Masked-attention Mask Transformer for Universal Image Segmentation.” CVPR (Mask2Former).
  • Kirillov, A. et al. (2023). “Segment Anything.” ICCV (SAM).
  • Ravi, N. et al. (2024). “SAM 2: Segment Anything in Images and Videos.” arXiv:2408.00714.
  • Radford, A. et al. (2021). “Learning Transferable Visual Models from Natural Language Supervision.” ICML (CLIP).
  • Oquab, M. et al. (2023). “DINOv2: Learning Robust Visual Features without Supervision.” arXiv:2304.07193.
  • Ranftl, R., Lasinger, K., Hafner, D., Schindler, K. & Koltun, V. (2020). “Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer.” IEEE TPAMI (MiDaS).
  • Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J. & Zhao, H. (2024). “Depth Anything V2.” NeurIPS.
  • Mildenhall, B. et al. (2020). “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.” ECCV.
  • Kerbl, B., Kopanas, G., Leimkühler, T. & Drettakis, G. (2023). “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” SIGGRAPH.
  • Teed, Z. & Deng, J. (2020). “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow.” ECCV.
  • Lipson, L., Teed, Z. & Deng, J. (2021). “RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching.” 3DV.
  • Chang, J.-R. & Chen, Y.-S. (2018). “Pyramid Stereo Matching Network.” CVPR (PSMNet).
  • Xiang, Y., Schmidt, T., Narayanan, V. & Fox, D. (2018). “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes.” RSS.
  • Wang, C. et al. (2019). “DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion.” CVPR.
  • Labbé, Y., Manuelli, L., Mousavian, A., Tyree, S., Birchfield, S., Tremblay, J., Carpentier, J., Aubry, M., Fox, D. & Sivic, J. (2022). “MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare.” CoRL.
  • Wen, B., Yang, W., Kautz, J. & Birchfield, S. (2024). “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects.” CVPR.
  • Fang, H.-S., Wang, C., Gou, M. & Lu, C. (2020). “GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping.” CVPR.
  • Jiang, Z., Zhu, Y., Svetlik, M., Fang, K. & Zhu, Y. (2021). “Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via Implicit Representations.” RSS (GIGA).
  • Sundermeyer, M., Mousavian, A., Triebel, R. & Fox, D. (2021). “Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes.” ICRA.
  • Tobin, J. et al. (2017). “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World.” IROS.
  • Denninger, M. et al. (2019). “BlenderProc.” arXiv:1911.01911.
  • Hinton, G., Vinyals, O. & Dean, J. (2015). “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531.
  • Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). “On Calibration of Modern Neural Networks.” ICML.
  • Northcutt, C., Jiang, L. & Chuang, I. (2021). “Confident Learning: Estimating Uncertainty in Dataset Labels.” JAIR.
  • Liu, H., Li, C., Wu, Q. & Lee, Y. J. (2023). “Visual Instruction Tuning.” NeurIPS (LLaVA).
  • Li, J., Li, D., Savarese, S. & Hoi, S. (2023). “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.” ICML.
  • Lin, T.-Y. et al. (2014). “Microsoft COCO: Common Objects in Context.” ECCV.
  • Geiger, A., Lenz, P. & Urtasun, R. (2012). “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite.” CVPR.
  • Caesar, H. et al. (2020). “nuScenes: A Multimodal Dataset for Autonomous Driving.” CVPR.
  • Sun, P. et al. (2020). “Scalability in Perception for Autonomous Driving: Waymo Open Dataset.” CVPR.
  • Xiang, Y., Schmidt, T., Narayanan, V. & Fox, D. (2017). “YCB-Video Dataset.” (used in PoseCNN paper).
  • Hodaň, T. et al. (2017). “T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects.” WACV.
  • Hinterstoisser, S. et al. (2012). “Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes.” ACCV (LineMod).
  • Hodaň, T. et al. (2024). “BOP Challenge 2023 / 2024 Reports.” (annual).
  • OpenCV documentation, opencv.org (4.10+).
  • PyTorch / TorchVision documentation, pytorch.org.
  • Ultralytics YOLOv8 / v9 / v10 / v11 documentation, docs.ultralytics.com.
  • Detectron2 documentation, github.com/facebookresearch/detectron2.
  • NVIDIA Isaac Sim / Replicator documentation (2024).
  • NVIDIA TensorRT 10.x Developer Guide.
  • Open3D documentation, www.open3d.org.
  • Boston Dynamics (2024). Stretch Technical Brief.
  • Skydio (2023). Skydio X10 Product Brief.
  • Covariant (2024). “RFM-1: A Foundation Model for Robotics.” Company announcement, March 2024.