LiDAR, Depth Cameras & RGB — Robotics Exteroception
See also (Tier 3 family index): Perception Sensors
1. At a glance
Where [[Robotics/sensors-pose-motion]] answers “where am I?”, this note answers “what’s outside me?“. Exteroception — the sensing layer that maps the environment — is what separates a controllable machine from an autonomous one. Without it, a robot can servo joints to commanded angles but cannot avoid a wall, pick a part off a moving conveyor, or stop for a pedestrian.
Four sensor families dominate exteroceptive perception in 2026, plus three emerging:
- RGB cameras — pinhole-projection imagers, 1–60+ Mpx, 30–240 fps. Passive (need ambient light), texture-rich, no native depth. Bayer-filter CMOS sensors are the universal substrate (Sony IMX series, ON Semi AR series, OmniVision OV series).
- Depth cameras — RGB-D devices that emit their own illumination and recover Z per pixel:
- Stereo (passive triangulation between two RGB cameras) — ZED, RealSense D435/D455.
- Structured light (project a known IR pattern, observe deformation) — Kinect v1, RealSense D435 (assisted stereo).
- Time-of-Flight (ToF) — measure photon round-trip time directly (dToF: Kinect Azure, iPhone Lidar, Lucid Helios) or AMCW phase shift (iToF: Intel L515 (EoL), Microsoft AzureKinect-class).
- LiDAR — scanning or flash laser ranging at 200 m+:
- Mechanical spinning — 360° horizontal, fixed vertical lines (Velodyne HDL/VLP, Ouster OS series, Hesai Pandar). 5–25 Hz frame rate; 16 to 128+ channels.
- Solid-state MEMS or OPA — fixed FoV, microelectromechanical or optical-phased-array beam steering (Innoviz, Luminar Iris+, Hesai AT128). No moving parts at the macro scale; automotive-qualified.
- FMCW LiDAR — coherent detection, simultaneous range and radial velocity (Aeva Aeries II, Mobileye Lumotive). Immune to interference from other LiDARs.
- Flash LiDAR — single-shot wide-FoV illumination, no scanning; range-limited.
- mm-wave radar — 24, 60, 77, 79 GHz FMCW radar producing range–Doppler–azimuth tensors. Robust through fog, rain, dust, and snow. Continental ARS548, Bosch LRR, TI AWR2944. Lower angular resolution than camera or LiDAR; resolves radial velocity directly.
Emerging:
- Event cameras — per-pixel temporal-contrast (DVS / DAVIS / Metavision) with µs-class latency and HDR > 120 dB; Prophesee, iniVation. Sparse async output.
- Polarization cameras — Sony IMX264MZR / IMX250MZR per-pixel micropolarizer; specular vs diffuse separation, glare suppression, transparent-object handling.
- Hyperspectral / NIR — line-scan or snapshot tens-to-hundreds of bands; Headwall, Specim, Cubert. Material classification in agriculture, recycling, food sorting.
First ask before selecting:
- Range — 0.1 m (cobot fingertip ToF), 5 m (indoor mobile robot), 100 m (campus AGV), 250 m (highway autonomy)?
- Refresh rate — 10 Hz mapping, 30 Hz tracking, 100 Hz reactive avoidance, 1 kHz VR?
- Illumination & weather — controlled indoors, daylight outdoors, fog/rain/snow, night?
- Reflectance — black tyre vs white wall vs glass vs sky?
- Bandwidth & compute — what does the downstream perception stack ingest (and where does it run)?
- Cost, SWaP, certification — 80 k automotive LiDAR; cobot-class IP67 to military MIL-STD-810.
The dominant trend through 2026 is sensor fusion as the default architecture: there is no single sensor that wins on every axis, so production autonomy stacks combine cameras (semantics, density), LiDAR (geometric truth, range), radar (velocity, weather), and IMU (ego-motion, see [[Robotics/sensors-pose-motion]]).
2. First principles
RGB camera — projection model
A pinhole camera maps a 3D point P = (X, Y, Z) in the camera frame to a 2D pixel p = (u, v) via the intrinsic matrix K:
[u] [fx 0 cx] [X/Z]
[v] = [ 0 fy cy] [Y/Z]
[1] [ 0 0 1] [ 1 ]
with focal lengths f_x, f_y in pixels (= focal-length-mm × pixels-per-mm), principal point (c_x, c_y) at the optical axis, and the standard OpenCV convention +Z forward, +X right, +Y down. Real lenses add distortion — the radial-tangential Brown-Conrady model is the default in OpenCV calibration:
x' = x·(1 + k1·r² + k2·r⁴ + k3·r⁶) + 2·p1·x·y + p2·(r² + 2·x²)
y' = y·(1 + k1·r² + k2·r⁴ + k3·r⁶) + p1·(r² + 2·y²) + 2·p2·x·y
with r² = x² + y² in normalised coordinates. Wide-angle lenses (FoV > 100°) need a higher-order model — Kannala-Brandt fisheye (4 parameters), MEI omnidirectional, or division model.
The image sensor itself is a Bayer-pattern CMOS array: 2×2 RGGB mosaic over photodiodes; the missing channels at each pixel are interpolated (“demosaiced”, typically AHD or Malvar-He-Cutler). Image-signal processor (ISP) on the sensor or host applies black-level correction, lens shading correction, white balance, demosaic, denoise, sharpen, gamma, tone mapping. For computer vision, capture RAW Bayer or 12–16 bit linear before the ISP destroys photometric linearity.
Stereo depth — triangulation
Two rectified cameras with parallel optical axes, baseline B, focal length f (pixels):
Z = f · B / d
with disparity d = u_L − u_R the horizontal pixel offset of corresponding features. Disparity is computed by stereo matching: block-match (BM), semi-global match (SGBM, Hirschmüller 2008), or learned (PSMNet, RAFT-Stereo, FoundationStereo).
Differentiating gives the depth uncertainty:
σ_Z = (Z² / (f · B)) · σ_d
Depth accuracy degrades quadratically with range. Doubling baseline halves the σ_Z at any given range, but increases the minimum range (sub-baseline parallax becomes a vanishing disparity) and shrinks the stereo-overlap FoV.
Structured light
Project a known pseudo-random IR pattern (Kinect v1 used a fixed-position diffractive optical element); a single IR camera observes the pattern’s deformation in 3D and triangulates each pattern element against its design position. Effective baseline ≈ projector-to-sensor distance. Works well indoors at 0.5–5 m with controlled IR exposure; saturates under direct sunlight because solar IR overwhelms the projected pattern. Intel RealSense D-series uses active stereo (two IR cameras + IR pattern projector) — the pattern adds texture to featureless surfaces but disparity is still computed by stereo matching; the projector is effectively a textured “fill light”.
Time-of-flight
- Direct ToF (dToF) — emit a short laser pulse (ns) and time the photon arrival at a single-photon avalanche diode (SPAD) array. Range = c · Δt / 2. Sony IMX516, ams VL53L8, Apple LiDAR (Sony custom). Excellent accuracy (mm-class), wide dynamic range, sun-tolerant.
- Indirect ToF (iToF) — amplitude-modulate the illumination at frequency f_m (10–100 MHz), measure phase shift φ between emitted and reflected; Z = (c · φ) / (4π · f_m). Cheaper sensor (CMOS PMD pixel) but suffers range aliasing at d_max = c / (2 · f_m), multi-path interference, and sun saturation.
The fundamental ToF precision floor is photon-shot-noise limited:
σ_Z ≈ (c / (4π · f_m · SNR)) · √(BW) [for iToF]
σ_Z ≈ (c · σ_t / 2) [for dToF, σ_t = SPAD jitter]
A typical iToF at 80 MHz modulation with SNR = 100 (well-lit scene) achieves σ_Z ≈ 3 mm at 1 m. dToF with 200 ps SPAD jitter achieves σ_Z ≈ 30 mm independent of range (until photon-count drops).
LiDAR — scanning principles
Each laser pulse measures one range sample. Frame rate × points-per-frame = point rate (~10⁵ to 10⁷ points/s). Three optical-architecture families:
- Mechanical spinning — a stack of N laser-detector pairs rotates 360°. Vertical resolution = N (fixed, set by laser stack). Horizontal resolution = pulses-per-rotation. Velodyne VLP-16/32, Ouster OS0/1/2 (64–128 channels using VCSEL arrays + SPAD focal-plane), Hesai Pandar.
- Solid-state MEMS — a single (or few) laser-detector pair sweeps across a fixed FoV via a vibrating mirror (~1 mm Si MEMS). No 360° coverage but no macro moving parts — automotive-grade reliability. Innoviz Two, Luminar Iris+, Hesai AT128.
- Optical phased array (OPA) — array of phase-controlled emitters steers a beam electronically with zero moving parts. Mobileye/Lumotive in 2024–2026 ramp. Promising but immature compared to MEMS.
FMCW is orthogonal to scanning architecture: instead of a pulse, emit a frequency-chirped continuous wave; the beat between the transmitted and returned signal gives range and radial velocity simultaneously. Coherent detection at the photodiode rejects all incoherent light — sun-insensitive and immune to cross-talk from other LiDARs at the same wavelength. Aeva Aeries II, Mobileye, Scantinel. Cost has historically been the barrier.
mm-wave radar
77 GHz automotive radars are FMCW: emit a linear frequency ramp (“chirp”) of duration T_c and bandwidth B_w. The beat frequency f_b = (2 · R · B_w) / (c · T_c) gives range R; Doppler shift across multiple chirps gives radial velocity; an antenna array (MIMO virtual array) gives azimuth and elevation by digital beamforming.
range resolution ΔR = c / (2 · B_w)
velocity resolution Δv = λ / (2 · T_frame)
angular resolution Δθ ≈ λ / (N · d_ant)
A TI AWR2944 with B_w = 4 GHz achieves ΔR = 3.75 cm at 250 m max range; a 4 Tx × 4 Rx MIMO array gives 16 virtual antennas → ~3.5° azimuth resolution. Weather robustness comes from the cm-class wavelength (λ = 3.9 mm at 77 GHz) being much larger than typical fog droplets and raindrops; attenuation is ~0.1 dB/km in heavy rain vs ~30 dB/km for 905 nm LiDAR.
Event cameras
A DVS / DAVIS pixel asynchronously emits a “+1” or “−1” spike when its log-intensity changes by ≥ a contrast threshold C (typically 10–25 %). No frame rate; output is a stream of (x, y, t, polarity) events at µs timestamps. Dynamic range > 120 dB (vs ~70 dB for a standard CMOS); native temporal resolution 1 µs. Prophesee Genx320 / EVK4, iniVation DVXplorer. The event stream demands a different algorithmic stack (asynchronous CNNs, spiking networks, event-by-event SLAM) — see [[Robotics/computer-vision-robotics]].
3. Practical math / worked examples
Worked example A — Stereo depth accuracy at range
A Stereolabs ZED 2i has baseline B = 12 cm and rectified focal length f = 1400 px (at 1080p / HD mode, 110° HFoV lens). Block-matching disparity precision on a well-textured surface is σ_d ≈ 0.5 px (good lighting, sub-pixel refinement enabled).
At Z = 1 m : σ_Z = (1² · 0.5) / (1400 · 0.12) = 0.003 m = 3 mm
At Z = 5 m : σ_Z = (5² · 0.5) / (1400 · 0.12) = 0.074 m = 7.4 cm
At Z = 10 m: σ_Z = (10² · 0.5) / (1400 · 0.12) = 0.30 m = 30 cm
At Z = 20 m: σ_Z = (20² · 0.5) / (1400 · 0.12) = 1.19 m
Implication. Indoor mobile robot at 5 m: stereo is fine. Curbside autonomous vehicle at 50 m: stereo is useless — for the ZED 2i, σ_Z at 50 m is 7.4 m. Need LiDAR, or a stereo rig with B = 1 m (e.g. truck-roof-mounted), or fuse with monocular metric-depth networks (DepthAnything-V2, Metric3D, ZoeDepth) that exploit perspective cues to constrain absolute scale.
Baseline-tuning rule of thumb: pick B ≈ Z_max / 100 for ~1 % depth precision at maximum range. For Z_max = 20 m, B ≈ 20 cm; for Z_max = 100 m, B ≈ 1 m.
Worked example B — LiDAR point density on a wall
Velodyne VLP-16 specs: 16 lasers across 30° vertical FoV (so 2.0° vertical spacing), 0.1–0.4° horizontal angular resolution (configurable), 10 Hz spin rate, 600 000 points/s total firing rate (single-return mode).
At 30 m range, the lateral spacing of returns on a wall normal to the LiDAR is:
horizontal: 30 m · tan(0.2°) = 0.105 m ≈ 10.5 cm
vertical: 30 m · tan(2.0°) = 1.05 m
So one full sweep at 30 m gives a roughly 10 cm × 105 cm grid — very anisotropic, “rakes” with big vertical gaps. A pedestrian at 30 m (1.7 m tall, 0.4 m wide) gets hit by ~2 vertical lines × 4 horizontal columns = ~8 returns total. Detection works but classification is hard.
Compare Ouster OS1-128 at 30 m:
horizontal: 30 m · tan(0.18°) = 0.094 m
vertical: 30 m · tan(45°/128) = 0.184 m
Same pedestrian: ~9 vertical × 4 horizontal = ~36 returns, sufficient for class label from PointPillars / CenterPoint. Modern 128-channel sensors are 5–10× denser at any given range, which is why automotive LiDAR moved decisively from 16/32 channels in 2018 to 128–300+ channels in 2024.
Worked example C — Camera intrinsics calibration (OpenCV)
Standard pinhole calibration procedure with cv2.calibrateCamera:
- Print a 9×6 checkerboard with 25 mm square side. Mount flat on a rigid backing.
- Capture 20–30 views at varied angles and translations across the FoV. Ensure the board covers all four image quadrants in at least 5 views.
- Detect inner corners with
cv2.findChessboardCornersSB(better sub-pixel than the legacy version). - Call
cv2.calibrateCamera(objectPoints, imagePoints, imageSize, None, None)→ returns K (intrinsics 3×3), dist (5 distortion coefficients: k1, k2, p1, p2, k3), per-view rotation/translation, and per-view reprojection error.
Expected reprojection error:
| RMS reproj error | Diagnosis |
|---|---|
| < 0.2 px | Excellent — research-grade |
| 0.2–0.5 px | Good — production-fine |
| 0.5–1.0 px | Marginal — recapture with more views |
| > 1.0 px | Bad — likely motion blur, bad pattern detection, or wrong distortion model |
For wide-FoV lenses (> 100°), switch to cv2.fisheye.calibrate (Kannala-Brandt, 4 coeffs). For multi-camera rigs (stereo, multi-cam), calibrate intrinsics first, then extrinsics with cv2.stereoCalibrate — fix the intrinsics by passing flags=CALIB_FIX_INTRINSIC. Re-calibrate annually or after any mechanical shock; thermal drift of plastic lens housings can shift principal point by 1–3 px over 50 °C, enough to matter for SLAM loop closure.
Worked example D — Camera bandwidth at 4K @ 60 fps
A 4K UHD (3840×2160) sensor at 60 fps, 12-bit RAW:
Pixels/frame = 3840 · 2160 = 8.29 Mpx
Bits/frame = 8.29 Mpx · 12 = 99.5 Mb
Bytes/frame = 99.5 / 8 = 12.4 MB
Bytes/s (raw) = 12.4 MB · 60 fps = 746 MB/s
Transport options:
| Interface | Bandwidth | Notes |
|---|---|---|
| USB 3.2 Gen 2 | 1.25 GB/s | OK; uncompressed possible at 4K30 only without ISP compression |
| USB 3.2 Gen 2×2 | 2.5 GB/s | Rare on host side |
| 10 GbE | 1.0 GB/s usable | Industrial standard; GigE Vision protocol |
| MIPI CSI-2 D-PHY 4-lane | 9 Gbps (4×2.5 Gbps) | Direct sensor → SoC on embedded boards |
| MIPI CSI-2 C-PHY 3-trio | up to 22 Gbps | Newer Jetson Orin, Apple |
| GMSL2 (Maxim) | 6 Gbps | Automotive, single coax up to 15 m |
| FPD-Link III (TI) | 4.16 Gbps | Automotive, same role as GMSL |
| CoaXPress 12 | 12.5 Gbps | Industrial; one coax per lane |
For 8 cameras at 4K60 (a typical autonomous-vehicle perimeter), aggregate raw bandwidth is ~6 GB/s — necessitating either on-sensor JPEG/H.265 compression, ISP-side downsampling, or sustained Jetson AGX Orin / TI TDA4VM-class compute. Lossy compression destroys photometric linearity; ML inference is robust to this, but stereo matching and photometric SLAM (DSO, LSD-SLAM) suffer.
Worked example E — Radar range, velocity, angular resolution
TI AWR2944 cascaded MIMO radar at 77 GHz: B_w = 4 GHz, T_c = 25 µs chirp, T_frame = 50 ms, 4 Tx × 4 Rx → 16 virtual antennas with d_ant = λ/2.
ΔR = c / (2 · B_w) = 3·10⁸ / (2 · 4·10⁹) = 0.0375 m = 3.75 cm
Δv = λ / (2 · T_frame) = 0.0039 / (2 · 0.05) = 0.039 m/s ≈ 0.14 km/h
Δθ ≈ λ / (N · d_ant) = 0.0039 / (16 · 0.00195) = 0.125 rad ≈ 7.2° (uniform)
With non-uniform MIMO and super-resolution (MUSIC, ESPRIT), angular resolution drops to ~1–2° — competitive with low-end LiDAR. Range × resolution trade-off is fundamental: doubling chirp duration halves range resolution but doubles unambiguous velocity range.
4. Design heuristics
Range vs sensor-family mapping
| Range | Recommended primary | Backup |
|---|---|---|
| 0–0.5 m | ToF (VL53L8) or capacitive | Vision proximity |
| 0.5–2 m | Structured light (RealSense D435), dToF (Lucid Helios) | RGB monocular |
| 2–10 m | Active stereo (RealSense D455), passive stereo (ZED 2i) | Indoor LiDAR (Livox Mid-360) |
| 10–100 m | Spinning LiDAR (Ouster OS1, Hesai QT128) | Stereo with B = 0.5 m |
| 50–300 m | Solid-state LiDAR (Luminar Iris+, Innoviz Two) + mm-wave radar | Long-FL telephoto camera |
| > 300 m | mm-wave radar (Continental ARS548 LRR) | Adaptive long-range LiDAR |
Indoor vs outdoor
| Environment | Works | Marginal | Fails |
|---|---|---|---|
| Office, lab, warehouse | RGB, stereo, ToF, structured light, LiDAR | Magnetometer | — |
| Outdoor daylight | RGB, stereo, LiDAR, radar | iToF (sun saturation) | Structured light (Kinect pattern washes out) |
| Outdoor night | LiDAR, radar, NIR-illuminated camera | RGB without illumination | Passive stereo |
| Direct sunlight on lens | LiDAR (905 nm narrow-band filter), radar | RGB (lens flare) | iToF, sunlit IR ToF |
| Fog / mist | Radar, FMCW LiDAR (partial) | Direct-detection LiDAR | RGB, stereo |
| Heavy rain | Radar | LiDAR (degraded ~30–50 %) | RGB (water on lens) |
| Snow / falling snow | Radar | LiDAR (snowflakes = false returns) | RGB |
| Dust storm | Radar | LiDAR (saturated) | RGB |
Update rate vs application
| Application | Required rate |
|---|---|
| Static map building | 5–10 Hz |
| Mobile robot navigation (≤ 2 m/s) | 10–20 Hz |
| Autonomous vehicle (highway) | 10–25 Hz LiDAR, 20–30 Hz radar, 30–60 Hz camera |
| Cobot collision avoidance | 30–60 Hz |
| Drone obstacle avoidance (10 m/s) | 60–120 Hz |
| VR / AR head-tracking | 90+ Hz camera + 1 kHz IMU |
| High-speed picking | 240+ Hz event camera or strobed RGB |
| Autonomous racing (Indy AC) | 100+ Hz LiDAR, 200+ Hz camera |
Sensor synchronisation
Software timestamps over Ethernet/USB drift by 1–50 ms with non-deterministic latency. For any multi-sensor stack:
- Hardware trigger all sensors that support it from a common pulse — typically a 1 PPS signal from a GNSS receiver or an FPGA-generated sync line.
- PTP / IEEE 1588 (
[[Languages/Tier3/ros2-robotics-config]]) on every sensor’s network interface. Modern industrial cameras (Basler ace2, FLIR Blackfly S Mono GigE) support PTP timestamps directly. - Hardware timestamping at the capture buffer: e.g. NVIDIA Jetson MIPI CSI-2 has a per-frame hardware timestamp captured at start-of-frame, exposed through Argus/V4L2.
- For LiDAR, treat each point (not each frame) as having its own timestamp — points within a single 100 ms scan span 100 ms of ego-motion, and motion-deskewing with IMU integration is essential at any vehicle speed.
Camera selection ladder
| Requirement | Pick |
|---|---|
| Hobby / prototyping | Raspberry Pi Camera v3 (Sony IMX708), Arducam |
| Industrial machine vision | Basler ace2, FLIR Blackfly S, Allied Vision Alvium |
| Outdoor automotive | Sony IMX490, Sony IMX728, ON Semi AR0820AT (LFM, HDR, AEC-Q100) |
| Drone obstacle avoid | Global-shutter rolling-replacement (Sony IMX296, OmniVision OV9281) |
| HDR scene (tunnel exits, welding) | LFM-equipped sensors (Sony IMX390, IMX490) |
| High-speed | Sony IMX250 (5 Mpx, 163 fps), Vision Datum LU series |
| Low-light / NIR | Sony Starvis IMX462, IMX485; intensified Onsemi KAE |
| Polarization | Sony IMX250MZR / IMX264MZR (mono polar) |
| Event | Prophesee Genx320 (320×320, automotive), IMX636 (1280×720) |
Global vs rolling shutter. Rolling-shutter sensors expose row-by-row over typically 5–30 ms. On any moving robot or any sensor moving through space, this introduces rolling-shutter skew — straight lines bow, fast objects shear. SLAM and VIO algorithms break down or require explicit rolling-shutter compensation (Hedborg 2012). For drones, AGVs, automotive: pay the BoM premium for global shutter (~2× cost).
Stereo geometry tuning
- Baseline rule: B ≈ Z_max / 100 for ~1 % depth precision at range.
- Focal length trades FoV for angular pixel resolution; a 50° HFoV at 1080p ≈ 2000 px/rad ≈ 0.029°/pixel.
- Disparity search range d_max = f · B / Z_min sets compute cost; SGBM/RAFT cost scales linearly with disparity range. Limit by setting a sane minimum range (e.g. Z_min = 0.5 m for indoor robot, 5 m for highway).
- Texture is everything. Stereo matching collapses on blank walls, glass, ceilings. Hence “active stereo” with IR pattern projector (RealSense D435), or accept holes and fuse with other modalities.
GPU vs FPGA vs DSP
| Pipeline | Best fit | Why |
|---|---|---|
| Prototyping CNN inference | GPU (Jetson Orin, RTX) | Toolchain, batch size, flexibility |
| Production CNN inference on cost-sensitive embedded | NPU / accelerator (Hailo-8, Coral Edge TPU, Qualcomm AI engine) | TOPS/W |
| Safety-rated fixed-latency vision | FPGA (Xilinx Kria K26, Lattice CrossLink) | Deterministic latency, ISO 26262 |
| Stereo SGM at 4K60 | FPGA or fixed-function ISP block | Memory bandwidth, parallel disparity sweep |
| Event-camera processing | FPGA / neuromorphic (Loihi 2, Akida) | Asynchronous spike processing |
| Low-power VIO on drone | DSP + tightly-coupled IMU (Qualcomm Hexagon, Snapdragon Flight) | mW-scale power |
5. Components & sourcing — real parts
RGB cameras (industrial / robotics)
| Vendor | Family | Sensor format | Notes |
|---|---|---|---|
| Basler | ace 2, boost, dart | up to 24 Mpx | GigE Vision, USB3 Vision, MIPI-CSI variants |
| FLIR / Teledyne | Blackfly S, Oryx, Forge | up to 65 Mpx | High SNR; industrial standard |
| Allied Vision | Alvium 1500/1800, Mako, Manta | up to 31 Mpx | MIPI-CSI for embedded, GigE for stationary |
| Lucid Vision | Triton, Atlas | up to 65 Mpx | IP67-rated outdoor |
| IDS | uEye+, Ensenso | broad | German machine vision incumbent |
| Vieworks | VC, VP, VT | up to 250 Mpx | Inspection, high-resolution |
| The Imaging Source | DFK / DMK | 5–20 Mpx | Lower cost, USB3 |
| Arducam | Pi HQ, Orin module | embedded | MIPI-CSI for Jetson, Pi |
| Leopard Imaging | LI-IMX477, GMSL2 modules | varied | Automotive over GMSL |
| e-con Systems | See3CAM, NileCAM | varied | MIPI for Jetson / RB5 |
Underlying CMOS sensor families (sourced separately on production boards): Sony IMX477 (12 Mpx, 60 fps, used in Pi HQ Camera), IMX585 (8 Mpx Starvis-2 low light), IMX490 (5.4 Mpx 120 dB LED-flicker-mitigated automotive), IMX661 (127 Mpx APS-H, machine vision); ON Semi AR0820AT (8 Mpx HDR automotive); OmniVision OV9281 (1 Mpx mono global shutter, drone VIO).
Stereo and RGB-D cameras
| Vendor | Model | Tech | Range | Notes |
|---|---|---|---|---|
| Stereolabs | ZED 2i | Passive stereo, B = 12 cm | 0.3–20 m | 1080p120, IP66, integrated IMU |
| Stereolabs | ZED X / X Mini | Active stereo + GMSL2 | 0.2–20 m | Automotive/AGV; sun-tolerant |
| Intel RealSense | D435i / D435f | Active stereo + IR projector + IMU | 0.3–10 m | EoL announced; still in field |
| Intel RealSense | D455 / D456 | Active stereo, B = 9.5 cm | 0.6–20 m | D456 is IP65 / IP67 |
| Intel RealSense | D457 | GMSL2 industrial | 0.6–20 m | Designed for AGV / fixed installs |
| Orbbec | Femto Bolt, Femto Mega | dToF + RGB | 0.25–5.46 m | Azure Kinect successor |
| Orbbec | Gemini 2 / 335 | Structured light + RGB | 0.15–10 m | Indoor robotics |
| Lucid Vision | Helios 2 | dToF (Sony IMX556) | 0.3–8.3 m | IP67, GigE; sun-tolerant |
| Photoneo | PhoXi 3D Scanner, MotionCam 3D | Structured light | 0.5–6 m | High-precision bin picking |
| Zivid | Zivid 2, Zivid 2+ | Structured light | 0.3–3 m | Sub-mm bin picking |
| Mech-Mind | Mech-Eye Pro M / S | Structured light | 0.3–3 m | Chinese bin-picking standard |
| Pickit | Pickit M | Structured light | 0.4–2.5 m | Industrial bin pick |
| Microsoft | Azure Kinect (EoL 2023) | dToF + RGB + 7-mic | 0.5–5.5 m | Still in research use |
LiDAR — spinning
| Vendor | Model | Channels | Range | Notes |
|---|---|---|---|---|
| Velodyne (now Ouster) | VLP-16 (Puck) | 16 | 100 m | Legacy 2014 standard; cheap on eBay |
| Velodyne (now Ouster) | HDL-32E / HDL-64E | 32 / 64 | 100 / 120 m | DARPA-era; obsolete in new builds |
| Ouster | OS0-64 / OS0-128 | 64 / 128 | 35 m | 90° vertical FoV; near-field |
| Ouster | OS1-32 / OS1-128 | 32 / 128 | 200 m / 120 m | Mainstream AGV/AMR sensor |
| Ouster | OS2-64 / OS2-128 | 64 / 128 | 240 m | Long-range autonomy |
| Hesai | Pandar QT64 / QT128 | 64 / 128 | 30 / 50 m | Short-range automotive perception |
| Hesai | Pandar 40P / 64 / 128 | 40 / 64 / 128 | 200 m | Robotaxi standard in China |
| Hesai | Pandar XT32M2X | 32 | 300 m | Long-range single-return |
| RoboSense | RS-LiDAR-16 / 32 / 80 | 16 / 32 / 80 | 150–250 m | Lower-cost mechanical |
| Livox | Mid-360 / HAP / Avia | non-repetitive | 40–450 m | Petal-pattern, low-cost |
| Cepton | Vista-X90 / Nova | solid-state MMT | 150–200 m | Frictionless mirror tech |
LiDAR — solid-state and FMCW
| Vendor | Model | Architecture | Range | Notes |
|---|---|---|---|---|
| Innoviz | Innoviz One, Innoviz Two | MEMS solid-state | 250–300 m | Tier-1 automotive (BMW, VW) |
| Luminar | Iris, Iris+ | 1550 nm fiber laser + MEMS | 300–500 m | Volvo EX90, Mercedes |
| Hesai | AT128 / AT512 / ATX | MEMS-like 1D scanning | 200–300 m | Highest-volume L3 automotive |
| Aeva | Aeries II | FMCW MEMS | 500 m | Volkswagen, Daimler Truck |
| Mobileye | Lumotive OPA | OPA solid-state | varies | 2025–2026 ramp |
| Valeo | Scala 3 | mechanical scanning | 200 m | First-mass-production AD LiDAR (Audi A8 2018 onward) |
| Continental | HRL131 | flash + scanning hybrid | 150 m | Tier-1 automotive |
| Sense Photonics | Osprey / Sense One | flash dToF | 30–100 m | Industrial short-range |
| Cepton | Nova | MMT solid-state | 30 m | Compact, near-field |
mm-wave automotive radar
| Vendor | Model | Band | Notes |
|---|---|---|---|
| Continental | ARS548 LRR | 77 GHz | 4D imaging radar, 300 m |
| Continental | SRR520 SRR | 77 GHz | Short range, blind-spot |
| Bosch | LRR4 / LRR5 | 77 GHz | Mainstream ACC radar |
| Aptiv | RACam (radar + camera) | 77 GHz | Integrated front sensor |
| TI | AWR1843 / AWR2243 / AWR2944 | 77 GHz | Reference SoC / cascaded chips |
| NXP | TEF82xx / S32R45 | 77 GHz | RFCMOS + processor stack |
| Arbe | Phoenix | 79 GHz | 4D imaging, 2304 virtual channels |
| Uhnder | F1 (digital-code mod) | 77 GHz | First DCM (CDMA) automotive radar |
| Smartmicro | UMRR-96 | 77 GHz | AGV / industrial |
Event cameras and emerging
| Vendor | Model | Resolution | Notes |
|---|---|---|---|
| Prophesee | EVK4 (IMX636) | 1280×720 | Industrial dev kit |
| Prophesee | Genx320 | 320×320 | Automotive, ultra-low power |
| iniVation | DVXplorer / DAVIS346 | 640×480 / 346×260 | Research |
| Sony | IMX636 / IMX637 | up to HD | Direct Sony silicon (also in EVK4) |
Compute pairing
| Platform | TOPS / NPU | Camera lanes | Notes |
|---|---|---|---|
| NVIDIA Jetson Orin Nano (8 GB) | 40 TOPS | 4 × MIPI CSI-2 | Hobby + prosumer robotics |
| NVIDIA Jetson Orin NX 16 GB | 100 TOPS | 8 × MIPI CSI-2 | Drone / AMR |
| NVIDIA Jetson AGX Orin 64 GB | 275 TOPS | 16 × MIPI CSI-2 | Autonomous vehicle research |
| NVIDIA DRIVE Orin / Thor | 254 / 1000 TOPS | many | Production automotive |
| Qualcomm RB5 / RB6 | 15 / 30+ TOPS | 7 × MIPI CSI-2 | Snapdragon Flight drone |
| Intel NUC + Arc A380 | ~10 TFLOPS GPU | USB / PCIe | Indoor robotics |
| Hailo-8 / Hailo-10 | 26 / 40 TOPS | PCIe addon | Edge inference accelerator |
| AMD/Xilinx Kria K26 / KV260 | ~1.2 TOPS (Zynq DPU) | MIPI CSI-2, native | FPGA fixed-latency pipelines |
| Google Coral Edge TPU | 4 TOPS | USB/PCIe | Quantised int8 inference |
| TI TDA4VM / TDA4AEN | 8 TOPS + DSP | 4 × CSI-2 | Automotive ECU |
| Mobileye EyeQ6H | 34 TOPS | many | Robotaxi-grade |
| Apple M2 / M4 SoC | 16–38 TOPS NPU | shared LPDDR | ARKit, iPhone Lidar pipeline |
6. Reference data
Sensor-family comparison (quick reference)
| Family | Range | Accuracy | Refresh | Cost | Light required | Weather |
|---|---|---|---|---|---|---|
| RGB monocular | infinite (no depth) | n/a depth | 30–240 fps | 5k | yes | poor |
| Passive stereo | 0.3–30 m | 1 cm @ 3 m, 30 cm @ 10 m | 30–120 fps | 2k | yes | poor |
| Active stereo (IR) | 0.2–10 m | 1 cm @ 1 m, 10 cm @ 5 m | 30–90 fps | 2k | sun-degrades | indoor-fine |
| Structured light | 0.15–5 m | 1 mm–1 cm | 5–30 fps | $1–25k | sun-degrades | indoor-only |
| iToF | 0.3–8 m | 3 mm–3 cm | 15–60 fps | 3k | sun-saturates | indoor-fine |
| dToF | 0.1–10 m | 3 mm–3 cm | 15–60 fps | 5k | sun-tolerant | indoor + outdoor day |
| Spinning LiDAR | 0.5–300 m | 2–5 cm | 5–25 Hz | 80k | no | rain ~30 % loss |
| Solid-state LiDAR | 1–500 m | 3–10 cm | 10–25 Hz | 20k | no | rain ~30 % loss |
| FMCW LiDAR | 1–500 m | 3 cm + 0.1 m/s | 10–20 Hz | 30k | no | rain ~20 % loss |
| mm-wave radar | 0.2–300 m | 4 cm range, 0.1 m/s | 10–50 Hz | 500 | no | excellent |
| Event camera | n/a depth | 1 µs latency | async | 5k | no (HDR) | excellent (wide DR) |
LiDAR vendor / model table (2026)
| Vendor / Model | Channels | Max range (10 % refl) | FoV (H×V) | Points/s | Frame rate | Wavelength |
|---|---|---|---|---|---|---|
| Velodyne VLP-16 | 16 | 100 m | 360°×30° | 600 k | 5–20 Hz | 903 nm |
| Ouster OS1-128 | 128 | 200 m (at 10 %) | 360°×45° | 5.2 M | 10–20 Hz | 865 nm |
| Ouster OS2-128 | 128 | 240 m | 360°×22.5° | 5.2 M | 10 Hz | 865 nm |
| Hesai Pandar 128 | 128 | 200 m | 360°×40° | 3.5 M | 10–20 Hz | 905 nm |
| Hesai AT128 | 128 | 200 m | 120°×25.4° | 1.5 M | 10 Hz | 905 nm |
| Hesai QT128 | 128 | 50 m | 360°×104.2° | 3.5 M | 10 Hz | 905 nm |
| Livox Mid-360 | non-repeat | 40 m | 360°×59° | 200 k | 10 Hz | 905 nm |
| Innoviz Two | n/a (solid) | 300 m | 120°×40° | varies | 10–20 Hz | 905 nm |
| Luminar Iris+ | n/a | 500 m | 120°×26° | varies | 10–30 Hz | 1550 nm |
| Aeva Aeries II | FMCW | 500 m | 120°×30° | 1.6 M | 10–20 Hz | 1550 nm |
| RoboSense M1 | solid | 200 m | 120°×25° | 750 k | 10 Hz | 905 nm |
| Valeo Scala 3 | mech | 200 m | 133°×11° | varies | 25 Hz | 905 nm |
| Continental HRL131 | hybrid | 150 m | 120°×28° | varies | 25 Hz | 905 nm |
RGB-D camera comparison
| Model | Tech | Depth resolution | Frame rate | Min/Max range | IMU | Cost (USD) |
|---|---|---|---|---|---|---|
| RealSense D435i | Active stereo | 1280×720 | 90 fps | 0.3–3 m (typ) | yes | $300 |
| RealSense D455 | Active stereo | 1280×720 | 90 fps | 0.6–6 m | yes | $400 |
| RealSense D457 | Active stereo (GMSL) | 1280×720 | 90 fps | 0.6–20 m | yes | $700 |
| ZED 2i | Passive stereo | 2208×1242 | 100 fps | 0.3–20 m | yes | $500 |
| ZED X | Active+GMSL2 | 1920×1200 | 60 fps | 0.2–20 m | yes | $1500 |
| Orbbec Femto Bolt | dToF | 1024×1024 | 30 fps | 0.25–2.88 m | yes | $400 |
| Orbbec Femto Mega | dToF | 1024×1024 | 30 fps | 0.25–5.46 m | yes | $600 |
| Lucid Helios 2 | dToF | 640×480 | 30 fps | 0.3–8.3 m | no | $1700 |
| Photoneo PhoXi M | Structured light | 2064×1544 | 0.25 fps (full) | 0.5–2 m | no | $13k |
| Zivid 2+ M60 | Structured light | 1944×1200 | 5 fps | 0.6–2 m | no | $14k |
| Microsoft Azure Kinect (EoL) | dToF | 1024×1024 | 30 fps | 0.5–5.46 m | yes | $400 used |
Camera RAW bandwidth (uncompressed)
| Resolution | fps | Bit depth | Bandwidth |
|---|---|---|---|
| VGA 640×480 | 60 | 8 | 17.6 MB/s |
| 720p 1280×720 | 30 | 8 | 26.4 MB/s |
| 1080p 1920×1080 | 30 | 8 | 59.3 MB/s |
| 1080p 1920×1080 | 60 | 12 | 178 MB/s |
| 4K 3840×2160 | 30 | 12 | 373 MB/s |
| 4K 3840×2160 | 60 | 12 | 746 MB/s |
| 8K 7680×4320 | 30 | 12 | 1.49 GB/s |
| 12 Mpx | 60 | 12 | 1.08 GB/s |
| 24 Mpx | 30 | 12 | 1.08 GB/s |
| 50 Mpx | 30 | 12 | 2.25 GB/s |
GPU / accelerator inference rates (typical)
| Platform | YOLOv8-S @ 640 | PointPillars (LiDAR) | DepthAnything-V2 Small |
|---|---|---|---|
| Jetson Orin Nano (8 GB) | ~60 fps INT8 | ~10 Hz | ~15 fps |
| Jetson Orin NX 16 GB | ~140 fps INT8 | ~25 Hz | ~35 fps |
| Jetson AGX Orin 64 GB | ~330 fps INT8 | ~70 Hz | ~80 fps |
| Hailo-8 | ~150 fps INT8 | n/a (no point-net) | ~25 fps |
| Coral Edge TPU | ~30 fps INT8 (limited model size) | n/a | n/a |
| RTX 4090 | ~1500 fps FP16 | ~250 Hz | ~400 fps |
(Numbers vary substantially with batch size, INT8/FP16 quantisation, and TensorRT vs ONNX runtime. Treat as order-of-magnitude.)
Weather robustness matrix
| Sensor | Clear | Light rain | Heavy rain | Fog 50 m | Snow | Dust | Direct sun |
|---|---|---|---|---|---|---|---|
| RGB camera | ★★★★★ | ★★★ | ★ | ★ | ★ | ★ | ★★ (flare) |
| Stereo | ★★★★★ | ★★★ | ★ | ★ | ★ | ★ | ★★ |
| Structured light | ★★★★★ | n/a outdoor | n/a outdoor | n/a outdoor | n/a outdoor | n/a outdoor | ✗ |
| ToF (iToF) | ★★★★★ | ★★ | ★ | ★ | ★ | ★ | ✗ |
| ToF (dToF, narrowband) | ★★★★★ | ★★★ | ★★ | ★★ | ★★ | ★★ | ★★★ |
| LiDAR (905 nm) | ★★★★★ | ★★★★ | ★★★ | ★★ | ★★ | ★★ | ★★★★ |
| LiDAR (1550 nm) | ★★★★★ | ★★★★ | ★★★ | ★★★ | ★★★ | ★★★ | ★★★★ |
| FMCW LiDAR (1550 nm) | ★★★★★ | ★★★★ | ★★★ | ★★★ | ★★★ | ★★★ | ★★★★ |
| mm-wave radar | ★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| Event camera | ★★★★ | ★★★ | ★★ | ★ | ★★ | ★★ | ★★★★★ (HDR) |
7. Failure modes & debugging
- Lens flare and ghosting — sun, headlights, or bright LEDs in or near frame produce internal reflections off lens elements. Causes phantom features and false detections downstream. Mitigation: lens hood / baffle, anti-reflection multi-coating on the lens, LFM-equipped sensors for periodic LEDs.
- Rolling-shutter skew — straight vertical edges bow into S-curves on moving platforms; fast objects shear. Detection: stationary check pattern with the robot moving — should be straight. Fix: switch to global-shutter sensor (Sony IMX296, IMX392, IMX422; ON Semi AR0234) for any platform faster than ~1 m/s.
- Auto-exposure hunting — sun briefly behind a building, then in frame, sends AE chasing back and forth on a 100 ms scale, momentarily over/under-exposing every frame. Disable AE in mission mode; fix exposure from a calibrated lookup keyed to ambient lux, or use HDR sensors with 120+ dB intra-frame dynamic range.
- Depth holes — low texture (stereo) — blank walls, ceilings, glass. Stereo matchers return no correspondences. Fix: assisted stereo (IR pattern projector), fuse with ToF/LiDAR.
- Depth holes — specular reflections (ToF) — mirrors, polished metal, water. Photons bounce away from the sensor. Fix: polarization camera variant; LiDAR; flag-and-discard region.
- Depth holes — transparent surfaces — glass walls and windows transmit IR. Fix: same as specular; or treat glass as “ground truth wall” via secondary cues (CAD model, frame edges).
- LiDAR motion distortion (skewing) — a spinning LiDAR rotating at 10 Hz spans 100 ms; the robot moves several cm during that interval. Returns within a single “frame” come from different ego-positions. Mitigation: per-point timestamp + IMU-integrated ego-pose deskewing (KISS-ICP, FAST-LIO2 do this internally).
- iToF multi-path — light bounces off two surfaces before returning, giving a phase reading that is the sum of two distances. Manifests as ghost-depth halos behind objects or curved-looking corners. Fix: use multiple modulation frequencies + AI denoising; or switch to dToF.
- iToF sun saturation — outdoor sunlight contains broad-spectrum IR that overwhelms the photodetector before the modulated signal can be demodulated. Fix: indoor only, or use narrow-band 1550 nm dToF/LiDAR.
- Rain droplets on LiDAR window — falling drops generate spurious near-field returns; standing drops scatter the outgoing beam. Mitigation: hydrophobic coating, heated window, water-shedding geometry, dual-return filtering (drop = near return, scene = far return).
- Cross-talk between LiDARs at the same wavelength — two LiDARs of the same brand within line-of-sight may interpret each other’s pulses as their own returns. Manifests as random “ghost” points in the air. Fix: per-unit pulse coding (Innoviz, modern Hesai support this), FMCW (intrinsically immune), or wavelength separation.
- Bandwidth saturation — USB 3.0 host controller can sustain ~3.2 Gb/s real-world; running two D455s + a ZED 2i on the same USB3 root hub WILL drop frames intermittently. Diagnose with
dmesg | grep usband per-frame timestamps. Fix: separate USB host controllers (PCIe-USB cards), GigE Vision over wired Ethernet, MIPI CSI for embedded. - Time-sync drift in software-timestamped multi-camera rigs — observed as growing residuals in stereo / VIO over minutes. Fix: hardware trigger from a common source; PTP across Ethernet links; use hardware capture timestamps (V4L2, Argus, GigE Vision PTP).
- Lens calibration drift — plastic-mount lenses can shift 1–3 px in principal point over a 50 °C swing or a single 5 g impact. Manifests as SLAM map drift and degraded reprojection error. Mitigation: metal mounts; periodic recalibration; online self-calibration in VIO (VINS-Fusion, ORB-SLAM3 do this).
- GPU memory leaks in long-running perception pipelines — typical for ad-hoc Python/PyTorch wrappers that allocate tensors per frame without reuse. Monitor with
nvidia-smi,nsys. Fix: pre-allocate buffers, usetorch.cuda.empty_cache()between scenes, prefer C++/TensorRT for production. - Frame stale in pipeline — perception consumes a 100 ms-old image because the queue backed up. Fix: drop-on-write queues with timestamp gating; reject any frame older than 2× expected period.
- Coordinate-frame mistakes — the OpenCV camera convention is +Z forward, +X right, +Y down; ROS / REP-103 uses +Z up, +X forward, +Y left; OpenGL / Blender uses +Z toward viewer, +Y up; Unreal is +Z up, +X forward. Mismatched conventions produce flipped axes that are subtle bugs — features look “almost right” until you add a second sensor. Fix: nail down the frame_id of every published topic; static_transform_publisher diagrams; rviz to inspect.
- Stereo disparity bias on tilted floors — block-matchers assume horizontal epipolar lines; a misrectified rig produces a depth slope across the image. Fix: re-verify rectification after any shock; calibrate periodically.
- Radar ghost targets — multi-bounce returns (radar → guardrail → vehicle → radar) appear as targets that don’t exist. Fix: micro-Doppler signature filtering; map-based suppression of static reflectors.
- Radar angular ambiguity — sparse MIMO arrays alias; one true target can appear at multiple azimuths. Fix: super-resolution algorithms (MUSIC, IAA, compressed sensing); larger virtual array.
- Photometric vs geometric calibration drift — VIO drifts because feature descriptors degrade (sun angle, exposure changes), not because IMU is bad. Fix: photometric VO (DSO) is fragile here; ORB / SuperPoint features are more robust.
- Sensor bracket vibration resonance — a 6 mm 3D-printed bracket has its first mode at 30–80 Hz; a motor harmonic at 50 Hz excites it, smearing the camera image. Fix: stiffen the mount (FEM), add silicone damping, or move the sensor to a stiffer location.
8. Case studies
Tesla Autopilot HW3 / HW4 — vision-only autonomy
Tesla’s production fleet from 2018 onward runs an 8-camera vision-only perception stack (radar deprecated mid-2021 on most models, ultrasonics phased out in 2022). The eight cameras are three forward (narrow, main, wide), two side-pillar, two repeater (B-pillar), one rear, each based on a Sony IMX490-class 1.2 Mpx HDR automotive CMOS sensor with 120 dB dynamic range, LED flicker mitigation, and AEC-Q100 qualification. Capture is 36 fps per camera.
The HW3 compute board uses Tesla’s custom FSD Chip (two redundant NPUs, ~72 TOPS each, GF 14 nm) running a shared “Hydranet” multi-head CNN that consumes all 8 video streams in a unified bird’s-eye-view space (the occupancy network architecture from AI Day 2022). HW4 (HW3 successor, launched late 2023) roughly triples the compute and ups camera resolution to 5 Mpx. Tesla’s design thesis — controversial but consistent — is that LiDAR is unnecessary if camera-only neural networks can match the geometric inference; the company has staked its full-autonomy roadmap on this. The trade-off: vision-only stacks struggle in heavy rain, dense fog, and at night without lit infrastructure, where Waymo / Cruise / Zoox LiDAR stacks maintain performance.
Waymo Driver Gen-5 — multi-modal autonomy
The 5th-generation Waymo Driver (deployed in I-Pace and Jaguar test fleets, 2020 onward) carries one of the most sensor-rich production autonomy stacks of any commercial vehicle:
- 4 LiDARs — one long-range “Honeycomb” top-mounted spinning unit with 360° / 200+ m / ~1 cm range precision; one rooftop “Laser Bear” perimeter unit; two short-range corner LiDARs for the near-field around the vehicle.
- 29 cameras spanning long-range telephotos (for traffic-light reading at 250 m), perimeter cameras, and a high-resolution “peripheral vision” mast for cross-traffic.
- 6 radars (mostly 77 GHz FMCW) for adverse-weather redundancy and long-range Doppler.
- A custom Waymo compute platform fusing the lot.
The Honeycomb LiDAR achieves ~5 cm range accuracy at 100 m; classification accuracy on pedestrians at 100 m sits above 99 % in clear daylight, dropping to ~90 % in heavy rain. The Waymo Open Dataset (released 2019, expanded 2023–2024) is the publicly-traceable view into this stack’s performance.
Skydio X10 — autonomous drone obstacle avoidance
Skydio’s flagship enterprise drone (2023 release, succeeding X2D) is built around six Sony IMX-class CMOS sensors arranged as a true 360° hex-camera array — three on the top, three on the bottom — providing full hemispherical depth coverage with no blind spot. Each pair has a wide-baseline overlap (~10 cm), and the on-device visual SLAM stack runs on an NVIDIA Jetson Orin NX with a custom Skydio runtime, fusing the six-camera optical flow + a Bosch BMI088-class IMU + GNSS for sub-second-latency obstacle avoidance at 10 m/s flight speeds.
No LiDAR, no radar, no ultrasonic — pure vision. The architectural bet (similar to Tesla but for very different reasons) is that flight envelopes are bounded enough that vision-only depth at 5–25 m suffices, and the SWaP cost of even a small Livox Mid-360 LiDAR (265 g, 6 W) is unjustifiable on a 1.4 kg drone. The result is reliable autonomous flight through forests, bridges, and complex industrial sites — used by US DOT, fire departments, and military inspection units. Performance degrades at night (visible-light cameras with no on-board illumination); in 2026 Skydio introduced a NightSense IR mode using LWIR + active illumination.
Boston Dynamics Spot — robust outdoor perception
Each Spot platform carries 5 stereo camera assemblies (front, sides, rear) using PMD ToF + RGB sensors plus integrated IMU — Boston Dynamics’ “Stereo Camera” modules each combine a structured-light projector, two grayscale stereo cameras, and a color RGB camera. The combined 360° depth perception runs at 15 Hz and feeds the proprioceptive whole-body controller. For longer-range mapping in Spot Enterprise (mining, oil & gas), Boston Dynamics offers an optional Velodyne VLP-16 or Ouster OS0 spinning LiDAR via the EAP (Enhanced Autonomy Package) payload bay — this enables persistent map building over multi-kilometre patrol routes and integrates with the company’s Autowalk feature.
The Spot CORE compute is an Intel x86 module (initially NUC-class, upgraded in 2023); perception runs on-device with no off-board dependency in basic operation. Operators commonly add a Jetson Orin NX in the EAP payload bay for custom ML perception on top of the BD stack.
Ouster Gemini — robotaxi reference architecture
Ouster’s Gemini perception software stack (acquired from Velodyne, integrated 2024) is the de-facto reference for retrofitting commercial vehicles into AV testbeds. A typical Gemini deployment uses 2 × OS1-128 rooftop spinning LiDARs in a “X” configuration (eliminates blind spots from the LiDAR’s vertical FoV gap), 1 × OS0-128 front bumper short-range, 6–8 ARS548 radars at 77 GHz around the perimeter, and 6–10 Sony IMX490 / OmniVision OX08AT cameras. Compute is split between an NVIDIA DRIVE Orin platform (perception) and a separate planning ECU. This is the reference for shuttle / mining / agriculture autonomy programs, and is what most Tier-2 system integrators build atop.
9. Cross-references
[[Robotics/sensors-pose-motion]]— companion proprioceptive sensors (encoders, resolvers, IMUs); VIO and SLAM tightly couple these with the exteroceptive sensors here.[[Robotics/sensors-force-tactile]]— force, torque, and tactile sensing (planned); the third major exteroceptive modality not covered here.[[Robotics/slam]]— Simultaneous Localisation and Mapping (planned); the algorithmic layer that fuses these sensors into a coherent map and pose estimate.[[Robotics/computer-vision-robotics]]— feature extraction, object detection, semantic segmentation, depth networks (planned); consumes the RGB and depth streams from this note.[[Robotics/bayesian-estimation]]— Kalman / EKF / particle filter fusion (planned); the mathematics of combining multi-rate, multi-modal sensors with appropriate noise models.[[Robotics/kinematics-dh]]— extrinsic calibration of camera-to-body and camera-to-camera transforms uses the same SE(3) machinery.[[Engineering/op-amps]]— analog front-end for SPAD readout, photodiode transimpedance amplifiers in LiDAR, automotive radar IF stages.[[Engineering/electromagnetics-engineering]]— radar physics (FMCW chirp design, antenna arrays, beamforming); LiDAR optical-system trade-offs.[[Engineering/semiconductor-devices]]— CMOS image sensor physics, SPAD avalanche statistics, VCSEL laser-diode operation.[[Languages/Tier3/robotics-control]]— ROS 2 / DDS topic conventions forsensor_msgs/PointCloud2,sensor_msgs/Image,sensor_msgs/CameraInfo.[[Languages/Tier3/3d-scene]]— point-cloud and 3D-mesh formats (planned, USD/glTF/PLY/LAS/E57); the file-format layer for stored map data.
10. Citations
- Szeliski, R. (2022). Computer Vision: Algorithms and Applications (2nd ed.). Springer. Free PDF at szeliski.org/Book. The standard practitioner reference; chapters 11–12 cover stereo, structure-from-motion, depth.
- Hartley, R. & Zisserman, A. (2003). Multiple View Geometry in Computer Vision (2nd ed.). Cambridge University Press. The canonical reference for projective geometry, epipolar geometry, camera calibration.
- Forsyth, D. A. & Ponce, J. (2011). Computer Vision: A Modern Approach (2nd ed.). Pearson.
- Hirschmüller, H. (2008). “Stereo Processing by Semiglobal Matching and Mutual Information.” IEEE TPAMI 30(2), 328–341. DOI 10.1109/TPAMI.2007.1166. The SGBM/SGM algorithm at the heart of nearly all production stereo.
- Beraldin, J.-A., Blais, F. & Lohr, U. (2010). “Laser Scanning Technology.” In Airborne and Terrestrial Laser Scanning. Whittles Publishing.
- Geiger, A., Lenz, P. & Urtasun, R. (2012). “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite.” CVPR. The foundational AV perception benchmark using Velodyne HDL-64E.
- Sun, P. et al. (2020). “Scalability in Perception for Autonomous Driving: Waymo Open Dataset.” CVPR. arXiv:1912.04838.
- Caesar, H. et al. (2020). “nuScenes: A multimodal dataset for autonomous driving.” CVPR. The 6-camera, 1-LiDAR, 5-radar reference dataset.
- Hedborg, J. et al. (2012). “Rolling Shutter Bundle Adjustment.” CVPR. The standard rolling-shutter compensation for SfM.
- Gallego, G. et al. (2022). “Event-Based Vision: A Survey.” IEEE TPAMI 44(1). arXiv:1904.08405. Definitive event-camera reference.
- Velodyne (2019). VLP-16 User Manual and Programming Guide, rev D.
- Velodyne (2017). HDL-64E S3 User’s Manual rev J.
- Ouster (2024). OS1 / OS2 Sensor Hardware User Manual rev 7.
- Hesai (2023). Pandar XT32 / QT128 User Manual.
- Innoviz (2023). Innoviz Two Technical Brief.
- Luminar Technologies (2023). Iris+ LiDAR Datasheet.
- Aeva (2023). Aeries II FMCW LiDAR Datasheet.
- Intel (2024). Intel RealSense Product Family Datasheet, rev 26.
- Stereolabs (2024). ZED 2i Camera Specification.
- Prophesee (2023). Metavision EVK4 Datasheet (IMX636 sensor).
- Continental Engineering Services (2023). ARS548 RDI Premium Radar Datasheet.
- Texas Instruments (2024). AWR2944 Single-Chip 77-GHz FMCW Radar Sensor datasheet, SWRS257A.
- Bosch (2024). SMI230 Inertial Measurement Unit for ADAS product brief.
- Sony Semiconductor Solutions (2023). IMX490 / IMX490AAQR Diagonal 6.46 mm 5.4 Mpx HDR Image Sensor datasheet.
- NVIDIA (2024). Jetson Orin Series Datasheet — Jetson Orin Nano / NX / AGX Orin.
- Apple (2020). iPad Pro 2020 LiDAR Scanner — ARKit Documentation.
- Tesla (2022). AI Day 2022 presentation transcripts; (2024). We, Robot event materials.
- Waymo (2023). Waymo Driver — 5th Generation Technical Overview (blog series).
- Skydio (2023). Skydio X10 Product Brief.
- Boston Dynamics (2023). Spot Robot Hardware Specification, rev 4.0.
- ISO 13855:2024. Safety of machinery — Positioning of safeguards with respect to the approach speeds of parts of the human body. Sets minimum-distance requirements that drive sensor-range and refresh-rate spec for safety-rated perception.
- IEC 61496-3:2018. Safety of machinery — Electro-sensitive protective equipment — Particular requirements for active opto-electronic protective devices responsive to diffuse reflection (AOPDDR). The standard for safety-rated 2D LiDAR scanners (SICK S300, Keyence SZ-V).