Visual Servoing — IBVS, PBVS, Hybrid Schemes, Direct Visual Servoing
Closing the control loop directly through vision: Image-Based (IBVS) using pixel features, Position-Based (PBVS) using estimated 3D pose, hybrid 2.5D schemes, and the modern variants — direct photometric servoing, deep visual servoing with neural feature embeddings, and uncalibrated visual servoing. The classical paper-trail (Chaumette, Hutchinson, Espiau, Hashimoto) plus the contemporary deep-learning grafts that handle texture-less objects and large displacements. Foundational for any robotic task where the final pose must be defined relative to what the camera sees, not what a CAD model predicts.
See also
- computer-vision-robotics
- kinematics-dh
- manipulator-design
- end-effectors
- sensors-perception
- bin-picking
- bayesian-estimation
- impedance-control
1. At a glance
Visual servoing (VS) is robot control with vision in the feedback loop. The error signal is built from image measurements — pixel features, region descriptors, or 3D pose estimates — and the controller computes joint or Cartesian commands to drive that error to zero. Unlike “look-then-move” architectures (compute target pose once, then execute open-loop), VS continuously corrects through image feedback, so it tolerates calibration error, kinematic uncertainty, and moving targets that pure planning cannot.
Three canonical schemes dominate the literature and practice:
- Image-Based Visual Servoing (IBVS) — error defined in 2D pixel space; needs the interaction matrix (image Jacobian) relating pixel velocities to camera twist. Classical IBVS (Hashimoto 1991, Espiau 1992, Chaumette 1992, Chaumette & Hutchinson 2006/2007 tutorial) is the standard taught in textbooks.
- Position-Based Visual Servoing (PBVS) — features are reconstructed into 3D pose via known geometry (or learned estimator); error is defined in Cartesian SE(3). Requires accurate camera calibration + object model.
- Hybrid / 2.5D VS (Malis 1999) — decouple translation and rotation; do rotation control in pose space and translation in image space. Best behavior at long distances + accuracy at close.
More recent additions:
- Direct (photometric) visual servoing: the error is the difference in pixel intensities, with no features extracted. Suits surfaces where distinct features are hard to extract, provided the image still has usable intensity gradients (Bakthavatchalam 2015, Crombez 2017).
- Deep visual servoing — neural feature embeddings as the visual error, trained on either synthetic or self-supervised data (Bateux 2018, Saxena 2017, Yu 2020).
- Differentiable rendering VS — gradients through a renderer enable pose-from-image without an explicit pose-estimator (Manuelli 2020, NeRF-supervision 2022).
Where it sits. CV provides the perception input; kinematics provides the robot Jacobian J_r mapping joint velocities to end-effector twist; the interaction matrix L maps camera twist to image-feature velocity; the composite control law is q̇ = −λ · J_r⁺ · L⁺ · (s − s*), with the camera-to-end-effector twist transform omitted here for brevity (see §2.8). Manipulator, gripper, bin picking, and impedance control all build on VS for the final-approach phase.
First ask.
- Eye-in-hand or eye-to-hand? Eye-in-hand: the camera moves with the end-effector; used for most precision tasks. Eye-to-hand: a fixed camera gives a multi-object overview; less precise but stable.
- 2D, 2.5D, or 3D feature space? 2D (IBVS): simplest, no model required, but trajectories in 3D can be poor. 3D (PBVS): clean 3D trajectory but needs calibration + model. 2.5D: best of both.
- Visible features or featureless? Feature-based VS for textured objects with markers/SIFT/ORB; direct VS for surfaces without distinct features.
- How large is the displacement between initial and final pose? Small displacements: IBVS converges reliably. Large displacements: PBVS, or a path-planned approach to an IBVS-feasible zone.
2. First principles
2.1 The interaction matrix (image Jacobian) for a point
For an image feature s = (u, v) corresponding to a 3D point at Z relative to the camera, the velocity of the image point is related to camera-twist v_c = (v_x, v_y, v_z, ω_x, ω_y, ω_z) by:
[u̇] [ -1/Z 0 u/Z uv -(1+u²) v ]
[v̇] = [ 0 -1/Z v/Z 1+v² -uv -u ] · v_c
This 2×6 matrix is L_p (or L_s), the interaction matrix for a single point feature. (Coordinates u, v are normalized image coordinates; multiply by focal length for raw pixels.) For n point features, stack to get L of size 2n×6.
Inverse via pseudo-inverse:
v_c = −λ · L⁺ · (s − s*)
where λ > 0 is the gain, s* is the desired feature configuration, L⁺ = (L^T L)⁻¹ L^T for n ≥ 3 features.
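A minimal NumPy sketch of the point interaction matrix and the control law above. The function names (`point_interaction_matrix`, `ibvs_twist`) and the test geometry are illustrative assumptions, not part of any library, and the depths are taken as known.

```python
import numpy as np

def point_interaction_matrix(x, y, Z):
    """2x6 interaction matrix of one point at normalized coordinates (x, y) and depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

def stacked_interaction_matrix(points, depths):
    """Stack the 2x6 blocks of n points into a (2n)x6 matrix."""
    return np.vstack([point_interaction_matrix(x, y, Z)
                      for (x, y), Z in zip(points, depths)])

def ibvs_twist(points, desired, depths, gain=0.5):
    """Camera twist v_c = -gain * pinv(L) @ (s - s*)."""
    L = stacked_interaction_matrix(points, depths)
    error = (np.asarray(points, float) - np.asarray(desired, float)).ravel()
    return -gain * np.linalg.pinv(L) @ error

# Hypothetical test case: a square seen 10 % too large (camera slightly too close).
desired = np.array([[-0.2, -0.2], [0.2, -0.2], [0.2, 0.2], [-0.2, 0.2]])
current = 1.1 * desired
twist = ibvs_twist(current, desired, depths=[0.5] * 4)
```

In this case the error is parallel to the third column of L, so the commanded twist is a pure translation along the optical axis with negative v_z: the camera backs away until the square shrinks to its desired size.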
2.2 Three-points minimum
Six DoF needs 6 independent equations → at least 3 point features (giving 6 equations from 6 pixel components). In practice 4+ points are used because:
- Point features can be lost (occlusion, leave FOV).
- Redundant features improve noise rejection.
- Three points alone are degenerate for IBVS: up to four distinct camera poses produce the same three image points, and L becomes singular for some camera positions. A fourth point removes both problems.
2.3 Stability and the Chaumette conundrum
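IBVS with point features has two known problems. The first is local minima: with more than three points L has more rows than columns, so a nonzero error lying in the null space of L⁺ produces zero commanded velocity even though (s − s*) ≠ 0 (Chaumette 1998). The second is camera retreat, often called the Chaumette conundrum: when the required motion is a large rotation about the optical axis, IBVS drives each feature along a straight image line toward its goal, and the camera can only realize that by backing away from the target. The textbook example is four points at the corners of a square with the desired view rotated about the optical axis. At 180° the commanded motion is a pure retreat with no rotation at all; at smaller angles the retreat is combined with rotation and can push the robot toward its workspace limits.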
Mitigations:
- Use more features (5+).
- Switch to PBVS for large initial errors.
- Use partitioned control (Corke & Hutchinson 2001) — pull rotation out of IBVS and treat with separate law.
- Use 2.5D hybrid (Malis 1999).
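The retreat can be reproduced numerically. The sketch below is illustrative only: the square size, depth, and gain are assumed values. It rotates four square corners by 180° about the optical axis and evaluates the classical IBVS law.

```python
import numpy as np

def interaction_matrix(points, Z):
    """Stacked (2n)x6 point interaction matrix, all points at the same depth Z."""
    rows = []
    for x, y in points:
        rows.append([-1 / Z, 0, x / Z, x * y, -(1 + x * x), y])
        rows.append([0, -1 / Z, y / Z, 1 + y * y, -x * y, -x])
    return np.array(rows, dtype=float)

def rotate_about_optical_axis(points, angle_rad):
    """Rotate normalized image points about the image centre."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return points @ np.array([[c, -s], [s, c]]).T

Z, gain = 0.5, 0.5                                      # assumed depth (m) and gain
desired = np.array([[-0.2, -0.2], [0.2, -0.2], [0.2, 0.2], [-0.2, 0.2]])
current = rotate_about_optical_axis(desired, np.pi)     # view rotated by 180 degrees
error = (current - desired).ravel()
twist = -gain * np.linalg.pinv(interaction_matrix(current, Z)) @ error
v_z, omega_z = twist[2], twist[5]
```

The result is v_z = −2·λ·Z (pure backward motion) with ω_z = 0: the controller never commands the rotation that would actually solve the task, which is why partitioned or 2.5D schemes handle this case separately.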
2.4 PBVS pose error
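Given the measured pose (R, t) of the object w.r.t. the camera and the desired pose (R*, t*):
e_t = t − R · R*^T · t* (translation error in camera frame)
θu = log(R · R*^T) (rotation error as angle-axis vector)
Together these are the position and orientation of the desired camera frame expressed in the current camera frame, so the camera must move toward them. Camera twist command:
v_c = λ · [e_t; θu]
The sign is positive, unlike the IBVS law, because the error is defined here as “where to go” rather than “current minus desired”; defining the error with the inverse transform restores the familiar −λ form. The rotation part is the SO(3) → so(3) logarithm. The camera origin follows a straight line and the rotation follows a geodesic, which looks natural in 3D but may drive features out of the FOV.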
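A minimal sketch of the pose error and twist, assuming (R, t) and (R*, t*) are object-in-camera poses. With that convention e_t and θu describe the desired camera frame as seen from the current one, so the camera has to move along +e_t; the function names are hypothetical.

```python
import numpy as np

def rotation_log(R):
    """Angle-axis vector theta*u of a rotation matrix (valid for theta < pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-9:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * axis / (2.0 * np.sin(theta))

def pbvs_twist(R, t, R_des, t_des, gain=0.5):
    """Camera twist that moves the current camera frame toward the desired one."""
    R_rel = R @ R_des.T               # desired camera frame seen from the current one
    e_t = t - R_rel @ t_des
    theta_u = rotation_log(R_rel)
    return gain * np.concatenate([e_t, theta_u])

# Object 1.0 m ahead, desired stand-off 0.5 m, no rotation error.
twist = pbvs_twist(np.eye(3), np.array([0.0, 0.0, 1.0]),
                   np.eye(3), np.array([0.0, 0.0, 0.5]))
```

The example commands v_z = +0.25 m/s, i.e. the camera advances toward the object, which is the quickest sanity check on the sign convention.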
2.5 2.5D hybrid (Malis 1999)
Decompose:
- Use IBVS for translation (the two image coordinates of a reference point plus its log depth ratio, log(Z/Z*)).
- Use PBVS for rotation (axis-angle from estimated R).
- Combine in a 6×6 interaction matrix that is block-triangular, so the rotation loop is decoupled from translation.
Pros: avoids feature-leaving-FOV (rotation is bounded a priori); avoids Chaumette retreat (translation in image space is well-behaved); needs partial calibration only.
2.6 Direct (photometric) VS
Error is pixel intensity:
e = I(p, q) − I*(p)
stacked over every pixel p of the region of interest. The interaction matrix of each pixel becomes:
L_I = −∇I^T · L_p
with ∇I the image gradient at pixel p, expressed in the same normalized coordinates as L_p. The scheme is sensitive to lighting and viewpoint changes, but needs no feature extraction. Used in surface inspection, where distinct features are sparse but textures are reliable.
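The per-pixel rows can be assembled as sketched below, assuming a single constant depth Z for the whole region and pinhole intrinsics (fx, fy, cx, cy). The function name and the ramp image are illustrative; the pixel gradient is rescaled by the focal length so that it is a gradient in normalized coordinates.

```python
import numpy as np

def photometric_interaction_matrix(image, Z, fx, fy, cx, cy):
    """One row -(grad I)^T L_p per pixel; returns an (H*W)x6 matrix."""
    image = np.asarray(image, dtype=float)
    dI_dv, dI_du = np.gradient(image)          # gradients along rows (v) and columns (u)
    v_pix, u_pix = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    x = ((u_pix - cx) / fx).ravel()            # normalized coordinates of each pixel
    y = ((v_pix - cy) / fy).ravel()
    Ix = (fx * dI_du).ravel()                  # gradient w.r.t. normalized x
    Iy = (fy * dI_dv).ravel()                  # gradient w.r.t. normalized y
    zeros, inv_Z = np.zeros_like(x), np.full_like(x, 1.0 / Z)
    L_x = np.stack([-inv_Z, zeros, x * inv_Z, x * y, -(1 + x * x), y], axis=1)
    L_y = np.stack([zeros, -inv_Z, y * inv_Z, 1 + y * y, -x * y, -x], axis=1)
    return -(Ix[:, None] * L_x + Iy[:, None] * L_y)

# Synthetic 4x5 horizontal intensity ramp (2 grey levels per pixel).
ramp = 2.0 * np.tile(np.arange(5.0), (4, 1))
L_I = photometric_interaction_matrix(ramp, Z=1.0, fx=100.0, fy=100.0, cx=2.0, cy=1.5)
```

A uniform image yields an all-zero matrix, which is the numerical form of the caveat that photometric servoing still needs intensity gradients to work with.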
2.7 Deep visual servoing
Replace the hand-crafted feature s with a learned embedding f_θ(I) ∈ ℝ^d, where the interaction matrix is computed via either:
- Numeric Jacobian: perturb camera by small ε in 6 directions; compute (f(I_perturbed) − f(I)) / ε. Slow.
- Backpropagation through the encoder: ∂f / ∂I × ∂I / ∂v_c with the second factor learned or computed analytically.
- Self-supervised contrastive training (Bateux 2018, Yu 2020) so the embedding space is approximately linear in camera twist.
Advantages: robust to lighting / texture / partial occlusion. Disadvantages: less geometric guarantees than classical schemes.
2.8 Robot-Jacobian coupling
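Total chain: joint velocities → end-effector twist (via robot Jacobian J_r) → camera twist (via the fixed eye-in-hand transform) → image feature velocity (via interaction matrix L). Inverting the chain:
q̇ = J_r⁺ · ^eV_c · L⁺ · (−λ(s − s*))
Here ^eV_c is the 6×6 twist-transformation (adjoint) matrix built from the extrinsic ^eT_c = (^eR_c, ^et_c), not the 4×4 homogeneous transform itself, and J_r must be expressed in the end-effector frame; J_r⁺ reduces to J_r⁻¹ for a non-singular 6-DoF arm.
When eye-in-hand: ^eT_c is the fixed extrinsic. When eye-to-hand: relate the camera frame to the robot base frame via ^bT_c, calibrated offline. The “hand-eye calibration” problem solves AX=XB to find this extrinsic (Tsai-Lenz 1989, Daniilidis 1999, Park-Martin 1994).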
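A sketch of the camera-to-end-effector twist transform and the resulting joint-velocity mapping, assuming the extrinsic is given as a rotation R_ec and translation t_ec (camera frame expressed in the end-effector frame) and that the Jacobian J_e is expressed in the end-effector frame. All names are illustrative.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def twist_transform(R_ec, t_ec):
    """6x6 matrix mapping a twist in the camera frame to the end-effector frame."""
    V = np.zeros((6, 6))
    V[:3, :3] = R_ec
    V[:3, 3:] = skew(t_ec) @ R_ec      # lever-arm coupling from angular velocity
    V[3:, 3:] = R_ec
    return V

def joint_velocities(J_e, R_ec, t_ec, camera_twist):
    """q_dot = pinv(J_e) @ eV_c @ v_c."""
    return np.linalg.pinv(J_e) @ twist_transform(R_ec, t_ec) @ camera_twist

# Camera mounted 10 cm along the end-effector x-axis, spinning about its own z-axis.
V = twist_transform(np.eye(3), np.array([0.1, 0.0, 0.0]))
v_e = V @ np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
```

Even for a pure camera rotation the end-effector origin has to translate (here at −0.1 m/s along y), which is the lever-arm effect that a rotation-only transform would miss.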
3. Practical math — sized examples + control law
3.1 Worked example: 4-point IBVS to align a planar target
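Setup: eye-in-hand camera (640×480, fx=fy=500, principal point assumed at the image center, (cx, cy) = (320, 240)), 4 corners of a 100×100 mm planar target visible. Target initially at Z ≈ 0.5 m from the camera. Desired image: corners at (220,140), (420,140), (420,340), (220,340), i.e., the target centered and axis-aligned as a 200-pixel square, which corresponds to a desired depth Z* = 500 · 0.1 / 200 = 0.25 m.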
Steps:
- Detect 4 corners (Harris / Shi-Tomasi / ArUco). Get current pixel positions (u_i, v_i).
- Normalize: u_n = (u − cx) / fx, v_n = (v − cy) / fy.
- Build L (8×6) from §2.1 using each corner’s (u_n, v_n) and estimated Z (use depth sensor or known target size).
- Compute error e = (s − s*) ∈ ℝ⁸.
- v_c = −λ · L⁺ · e, with λ = 0.5.
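- Transform to end-effector twist: v_e = ^eV_c · v_c, with ^eV_c = [^eR_c, [^et_c]× · ^eR_c; 0, ^eR_c]. The lever-arm term [^et_c]× · ^eR_c vanishes only if the camera origin coincides with the end-effector origin, which a rigid mount does not guarantee.
- Joint velocities: q̇ = J_r⁻¹ · v_e, with J_r expressed in the end-effector frame.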
- Send q̇ to controller; iterate at 30–100 Hz.
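Expected behavior: with accurate depths the feature error decays roughly as exp(−λt), so with λ = 0.5 s⁻¹ it falls to about 8% of its initial value after 5 s (150 iterations at 30 Hz) and to 1% after about 9 s. The corners move along near-straight lines in the image while the camera path in 3D is curved (a well-known IBVS property).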
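The loop can be exercised in simulation before touching a robot. The sketch below is a toy model under stated assumptions: an ideal pinhole camera with principal point (320, 240), true depths available at every step, a free-flying camera (no robot Jacobian), a desired fronto-parallel view at Z* = 0.25 m, and an arbitrary initial pose 0.5 m away with a 15° tilt.

```python
import numpy as np

FX = FY = 500.0
CX, CY = 320.0, 240.0
GAIN, DT = 0.5, 1.0 / 30.0

def project(P):
    """Normalized image coordinates (n x 2) of 3D points (n x 3) in the camera frame."""
    return P[:, :2] / P[:, 2:3]

def interaction_matrix(s, Z):
    rows = []
    for (x, y), z in zip(s, Z):
        rows.append([-1 / z, 0, x / z, x * y, -(1 + x * x), y])
        rows.append([0, -1 / z, y / z, 1 + y * y, -x * y, -x])
    return np.array(rows, dtype=float)

def rot_x(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

# 100 mm square target, corner coordinates in the target frame (metres).
corners = 0.05 * np.array([[-1, -1, 0], [1, -1, 0], [1, 1, 0], [-1, 1, 0]], float)

# Desired view: fronto-parallel, centred, 0.25 m away.
s_des = project(corners + np.array([0.0, 0.0, 0.25]))
pixels_des = s_des * [FX, FY] + [CX, CY]

# Initial view: 0.5 m away, shifted sideways, tilted 15 degrees about the camera x-axis.
P = corners @ rot_x(np.radians(15)).T + np.array([0.04, -0.03, 0.5])

errors = []
for _ in range(600):                              # 20 s at 30 Hz
    s = project(P)
    e = (s - s_des).ravel()
    errors.append(np.linalg.norm(e))
    L = interaction_matrix(s, P[:, 2])            # true depths stand in for a depth sensor
    v_c = -GAIN * np.linalg.pinv(L) @ e
    v, w = v_c[:3], v_c[3:]
    # A static point seen from a moving camera obeys P_dot = -v - w x P (Euler step).
    P = P + DT * (-v - np.cross(w, P))
```

`pixels_des` reproduces the 200-pixel desired square, and `errors` gives the convergence curve to compare against the exponential decay expected for the chosen gain.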
3.2 PBVS variant
Same setup, but:
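- Estimate the 3D pose of the target via PnP (Lepetit’s EPnP, OpenCV solvePnP) using the 4 correspondences.
- e_t = t_current − R_current · R_desired^T · t_desired.
- θu = log(R_current · R_desired^T).
- v_c = λ · [e_t; θu] (positive sign: e_t and θu locate the desired camera frame relative to the current one, so the camera moves toward them).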
- Same transformations through robot Jacobian.
Trajectory of camera is straight line in 3D — corners take curved paths in image, may leave FOV near boundaries.
3.3 Adaptive Z (depth) estimation
The interaction matrix depends on Z. Three strategies:
- Constant Z (Z = Z*): use the desired-pose depth. Works for small displacements; fails for large.
- Online estimation: depth sensor (RealSense, ToF) gives Z directly per feature.
- Z estimated from feature motion (Papanikolopoulos & Khosla 1993): with measured image velocity ṡ and commanded camera velocity v_c, solve L_p(Z) · v_c = ṡ for Z. This needs a nonzero translational velocity, since pure rotation carries no depth information.
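The third strategy reduces to a one-unknown least-squares problem, because L_p is linear in 1/Z. The sketch below assumes noise-free velocities and a single point; `estimate_depth` is a hypothetical helper, not a library call.

```python
import numpy as np

def estimate_depth(x, y, s_dot, v_c):
    """Solve s_dot = (1/Z) * A @ v + B @ w for Z; returns None if depth is unobservable."""
    v = np.asarray(v_c[:3], dtype=float)
    w = np.asarray(v_c[3:], dtype=float)
    A = np.array([[-1.0, 0.0, x], [0.0, -1.0, y]])                 # translational part times Z
    B = np.array([[x * y, -(1 + x * x), y], [1 + y * y, -x * y, -x]])  # rotational part
    a = A @ v
    denom = a @ a
    if denom < 1e-12:
        return None                     # no translational parallax
    inv_Z = (a @ (np.asarray(s_dot, dtype=float) - B @ w)) / denom
    return 1.0 / inv_Z if inv_Z > 0 else None

# Synthetic check: generate s_dot from a known depth, then recover it.
x, y, Z_true = 0.1, -0.05, 0.4
v_c = np.array([0.02, 0.0, 0.05, 0.0, 0.1, 0.0])
L_p = np.array([[-1 / Z_true, 0, x / Z_true, x * y, -(1 + x * x), y],
                [0, -1 / Z_true, y / Z_true, 1 + y * y, -x * y, -x]])
Z_est = estimate_depth(x, y, L_p @ v_c, v_c)
```

In practice the measured ṡ is noisy, so the estimate is usually filtered over several frames rather than used raw.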
3.4 Hand-eye calibration (Tsai-Lenz)
Given multiple (^bT_e, ^cT_obj) pairs, solve:
AX = XB
where A = ^bT_{e,i}^{-1} · ^bT_{e,j}, B = ^cT_{obj,i} · ^cT_{obj,j}^{-1}, X = ^eT_c.
Tsai-Lenz two-step solution:
- Solve rotation: R_x from R_a · R_x = R_x · R_b using axis-angle ratio.
- Solve translation: t_x from (R_a − I) · t_x = R_x · t_b − t_a.
Modern alternative: dual-quaternion formulation (Daniilidis 1999), solve as a single linear system. OpenCV provides calibrateHandEye() with multiple methods.
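The AX = XB structure and the translation step can be checked on synthetic data. The sketch below assumes an eye-in-hand setup with a static target and takes the rotation R_x as already solved; all poses are made-up values and the helper names are hypothetical.

```python
import numpy as np

def rot(axis, angle):
    """Rotation matrix from axis-angle (Rodrigues formula)."""
    axis = np.asarray(axis, dtype=float) / np.linalg.norm(axis)
    K = np.array([[0, -axis[2], axis[1]], [axis[2], 0, -axis[0]], [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def make_T(R, t):
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def relative_motions(T_be, T_co):
    """A_k = inv(bTe_i) @ bTe_j and B_k = cTo_i @ inv(cTo_j) for consecutive poses."""
    A = [np.linalg.inv(T_be[i]) @ T_be[i + 1] for i in range(len(T_be) - 1)]
    B = [T_co[i] @ np.linalg.inv(T_co[i + 1]) for i in range(len(T_co) - 1)]
    return A, B

def solve_translation(A, B, R_x):
    """Least-squares t_x from the stacked equations (R_a - I) t_x = R_x t_b - t_a."""
    M = np.vstack([a[:3, :3] - np.eye(3) for a in A])
    rhs = np.concatenate([R_x @ b[:3, 3] - a[:3, 3] for a, b in zip(A, B)])
    return np.linalg.lstsq(M, rhs, rcond=None)[0]

# Synthetic ground truth: camera-in-hand extrinsic X and a fixed target pose in the base frame.
X = make_T(rot([0, 0, 1], 0.3), [0.03, 0.0, 0.08])
T_bo = make_T(rot([0, 1, 0], 0.1), [0.6, 0.0, 0.1])
T_be = [make_T(rot([1, 0, 0], 0.2), [0.40, 0.00, 0.30]),
        make_T(rot([0, 1, 0], -0.4), [0.35, 0.10, 0.35]),
        make_T(rot([0, 0, 1], 0.5), [0.45, -0.10, 0.30]),
        make_T(rot([1, 1, 0], 0.6), [0.40, 0.05, 0.40])]
T_co = [np.linalg.inv(X) @ np.linalg.inv(T) @ T_bo for T in T_be]   # what the camera would see

A, B = relative_motions(T_be, T_co)
t_x = solve_translation(A, B, X[:3, :3])
```

The translation is only recoverable when the relative motions rotate about at least two non-parallel axes, which is why calibration routines ask for varied wrist orientations.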
3.5 Latency budget
Vision pipeline at 30 Hz: capture (~5 ms exposure + 10 ms readout) + detection (5–30 ms) + tracking (1–5 ms), about 20–50 ms of itemized processing; the quoted total of ~40–60 ms per frame leaves room for transfer and scheduling overhead. Joint commands run at 1 kHz, so the VS loop is rate-limited by vision. For high-bandwidth tasks (peg-in-hole final approach), use predictive control + IMU motion compensation.
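How latency caps the usable gain can be seen with a scalar toy loop, sketched below. It assumes the error obeys e[k+1] = e[k] − λ·T·e[k−d], where d is the measurement delay in frames; the model and the numbers are illustrative, not a model of any particular robot.

```python
def simulate_delayed_loop(gain, frame_period, delay_frames, steps=300):
    """Scalar servo loop driven by a delayed error measurement; returns the error history."""
    history = [1.0] * (delay_frames + 1)          # error held constant before the loop starts
    for _ in range(steps):
        delayed = history[-1 - delay_frames]
        history.append(history[-1] - gain * frame_period * delayed)
    return history

T = 1.0 / 30.0                                     # 30 Hz vision
slow = simulate_delayed_loop(gain=0.5, frame_period=T, delay_frames=2)
fast = simulate_delayed_loop(gain=30.0, frame_period=T, delay_frames=2)
```

With a two-frame delay the λ = 0.5 loop decays smoothly, while λ = 30 (λ·T = 1) produces a growing oscillation. Without delay the same gain would reach zero error in one step, so the instability comes entirely from the latency.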
4. Design heuristics
- Eye-in-hand for precision; eye-to-hand for context. Eye-in-hand gives sub-pixel feedback at close range. Eye-to-hand sees the whole workspace but is poor for small alignment.
- Mark your target. ArUco / AprilTag fiducials remove detection ambiguity. Used in production robotics + research. Camera calibration via charuco boards is standard practice.
- Use enough features. 4 minimum, 6–10 typical, 20+ for robust real-time tracking. Spread them across the image to maximize observability: features clustered in one quadrant give a poorly conditioned L.
- Mix feature types. Points + lines + circles. Different features have different sensitivities to motion directions; combining stabilizes.
- Don’t normalize the camera too late. Pixel coordinates have non-uniform sensitivity (corner pixels at depth Z have different metric meaning than center). Always work in normalized image coordinates inside the controller.
- Saturate the gain. λ too high → instability + servo overshoot. λ too low → slow convergence. Typical λ = 0.3–1.0 for 30 Hz vision; tune by step response.
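- Use velocity feedforward. If the target is moving (tracked over time), add a term that cancels the feature motion it induces: v_c = −λ · L⁺ · (s − s*) − L⁺ · ∂ŝ/∂t, where ∂ŝ/∂t is the estimated feature velocity due to target motion alone. For a time-varying reference s*(t) with a static target, the feedforward term is + L⁺ · ṡ* instead.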
- Switch schemes by distance. PBVS or planned approach when far; IBVS when close. Cuts trajectory length while preserving final-pose accuracy.
- Plan the path through feature space, not configuration space. For VS at the final phase, the feature path determines whether features stay in FOV. Image-space path planning over the feature manifold is a niche but useful technique.
- Filter your features. Detected pixel positions are noisy. Low-pass filter or Kalman filter the features before computing error.
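The advice on spreading features can be checked by comparing the condition number of L for two layouts. The point sets and depth below are assumed values chosen for illustration.

```python
import numpy as np

def interaction_matrix(points, Z):
    """Stacked (2n)x6 point interaction matrix, all points at the same depth Z."""
    rows = []
    for x, y in points:
        rows.append([-1 / Z, 0, x / Z, x * y, -(1 + x * x), y])
        rows.append([0, -1 / Z, y / Z, 1 + y * y, -x * y, -x])
    return np.array(rows, dtype=float)

# Four points spread around the image centre vs. four points packed into one quadrant.
spread = np.array([[-0.2, -0.2], [0.2, -0.2], [0.2, 0.2], [-0.2, 0.2]])
clustered = np.array([[0.15, 0.15], [0.20, 0.15], [0.20, 0.20], [0.15, 0.20]])

cond_spread = np.linalg.cond(interaction_matrix(spread, 0.5))
cond_clustered = np.linalg.cond(interaction_matrix(clustered, 0.5))
```

The clustered layout gives a markedly larger condition number, so the same pixel noise is amplified into larger velocity commands. Logging cond(L) at run time is a cheap way to detect this before it shows up as jitter.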
5. Components & sourcing — vision + feature libraries
5.1 Camera + sensor sourcing
| Vendor / model | Resolution | Frame rate | Notes |
|---|---|---|---|
| FLIR Blackfly S USB3 (BFS-U3-13Y3C) | 1280×1024 | 170 fps | Industrial workhorse |
| Basler ace 2 USB3 / GigE (a2A2590-60ucBAS) | 2592×2048 | 60 fps | Wide industrial use |
| Intel RealSense D435i / D455 | 1280×720 RGB + depth | 90 fps | Built-in IMU |
| Intel RealSense D405 | 1280×720 stereo | 90 fps | Short-range (7–50 cm) for in-hand |
| Stereolabs Zed 2i / Zed X | 2K stereo | 60 fps | Outdoor + neural depth |
| Luxonis OAK-D / OAK-D-Pro | 1080p RGB + stereo depth | 30 fps | Onboard Movidius VPU |
| Ximea xiQ MQ013MG | 1280×1024 | 60 fps | Subminiature for in-hand |
| Sony Spresense + IMX458 | 1080p | 60 fps | DIY/research |
5.2 Fiducial / marker systems
| Marker | Notes |
|---|---|
| ArUco (OpenCV aruco module) | 4×4 to 7×7 ID grids; classic |
| AprilTag (Olson 2011, AprilRobotics) | Higher detection range; v3 + Tag36h11 standard |
| ChArUco board | Chessboard + ArUco fused; standard camera-calibration target |
| Reflective spherical markers (Vicon, OptiTrack) | Motion-capture grade; sub-mm accuracy |
| STag (Benligiray 2019) | More robust to perspective + scale |
5.3 Feature-extraction libraries
| Library | Detectors | License |
|---|---|---|
| OpenCV (C++ / Python) | ORB, SIFT (free since 2020), AKAZE, BRISK, Harris, GFTT, SURF (non-free) | Apache 2 (some Patents) |
| OpenGV (Kneip) | Geometric verification + PnP | BSD |
| SuperPoint (Magic Leap 2018) | Deep keypoints | Custom non-commercial license (official weights) |
| LightGlue (Lindenberger 2023) | Faster successor to SuperGlue | Apache 2 |
| DINOv2 (Meta 2023) | Self-supervised dense features | Apache 2 |
| LoFTR (Sun 2021) | Detector-free matching | Apache 2 |
| FoundationPose (NVIDIA 2024) | One-shot 6D pose estimation | NVIDIA / open |
| ViSP (INRIA, since 1996) | Reference visual-servoing toolbox | GPL |
| ARToolKit / ARToolKitX | Classical AR tracking | LGPL |
5.4 Robot integration
| Stack | VS support |
|---|---|
| ROS 2 + vision_visp | Bridge ROS images ↔ ViSP servo |
| MoveIt 2 + visual servoing plugins | Limited; usually direct interfacing |
| Franka FCI + libfranka | Direct Cartesian velocity command interface |
| Universal Robots RTDE | 125 Hz (CB3) / 500 Hz (e-Series) Cartesian / joint velocity command |
| KUKA KSS / Sunrise FRI | 250 Hz / 1 kHz interfaces |
| ABB EGM (Externally Guided Motion) | 250 Hz |
5.5 Camera calibration toolchains
- OpenCV: cv2.calibrateCamera (Zhang's method, charuco/checkerboard)
- ROS camera_calibration (built on OpenCV)
- Kalibr (Furgale, ETH) — supports camera-IMU, multi-camera, multi-modal
- COLMAP — structure-from-motion, useful for extrinsic recovery
- Multical (HKUST) — multi-camera rig calibration
- AprilGrid (AprilRobotics) — calibration target for Kalibr
- Vicalib (NASA Ames) — used for Astrobee NavCam calibration
- Matlab Camera Calibration Toolbox (Bouguet) — historical reference
5.6 Hand-eye solver implementations
- OpenCV cv2.calibrateHandEye, with method = TSAI / PARK / HORAUD / ANDREFF / DANIILIDIS (cv2.CALIB_HAND_EYE_* constants)
- ROS easy_handeye package (wraps OpenCV)
- ViSP: vpHandEyeCalibration class; visp_hand2eye_calibration ROS package (part of vision_visp)
- HandEyeCalibration (MATLAB Robotics Toolbox)
- Custom CV+SE(3) optimization (Pinocchio, Sophus, Manif libraries)
6. Reference data
6.1 Method comparison
| Scheme | Trajectory | Calibration | Local minima | Speed |
|---|---|---|---|---|
| IBVS (point) | Curved in 3D | Coarse | Yes (Chaumette retreat) | Fast |
| IBVS (line / circle / moment) | Better behaved | Coarse | Less | Slower |
| PBVS | Straight in 3D | Exact extrinsic + intrinsic + model | None (in pose space) | Moderate |
| 2.5D hybrid | Good in both spaces | Partial | None | Moderate |
| Direct (photometric) | Smooth | Intrinsic only | Texture-dependent | Fast |
| Deep VS | Learned | None (in some variants) | Network-dependent | Moderate |
6.2 Notable visual-servoing milestones
| Year | Project | Contribution |
|---|---|---|
| 1973 | Shirai-Inoue | First closed-loop vision robot |
| 1979 | Hill-Park | Block-stacking robot at SRI |
| 1991 | Hashimoto et al. | First systematic IBVS framework |
| 1992 | Espiau, Chaumette, Rives | “Task function” formulation |
| 1996 | Hutchinson, Hager, Corke | “Tutorial on Visual Servo Control” (IEEE TRA) |
| 1999 | Malis, Chaumette, Boudet | 2.5D hybrid VS |
| 2001 | Corke & Hutchinson | Partitioned IBVS |
| 2006/2007 | Chaumette, Hutchinson | Two-part Tutorial in IEEE RAM (modern reference) |
| 2011 | AprilTag (Olson) | Robust fiducial standard |
| 2018 | Bateux et al. | Deep VS with CNN features |
| 2020 | Manuelli et al. | Keypoint-Affordance with dense descriptors |
| 2022 | Yen-Chen, Florence et al. | NeRF-Supervision: dense descriptors learned from NeRF |
| 2024 | NVIDIA FoundationPose | One-shot 6D pose for any object |
6.3 ViSP example task code (pseudo)
#include <visp3/visual_features/vpFeaturePoint.h>
#include <visp3/vs/vpServo.h>

// Sketch only: grab_image(), track_features(), send_to_robot() and converged are placeholders.
// u, v are normalized image coordinates (not pixels); Z is depth in metres.
vpServo servo;
servo.setServo(vpServo::EYEINHAND_CAMERA);           // camera mounted on the end-effector
servo.setInteractionMatrixType(vpServo::CURRENT);    // recompute L each step
// Define 4 point features
vpFeaturePoint p[4], pd[4];                          // current, desired
for (int i = 0; i < 4; i++) {
  p[i].buildFrom(u[i], v[i], Z[i]);
  pd[i].buildFrom(u_des[i], v_des[i], Z_des[i]);
  servo.addFeature(p[i], pd[i]);
}
servo.setLambda(0.5);  // adaptive: servo.setLambda(vpAdaptiveGain(4.0, 0.4, 30.0));
while (!converged) {
  grab_image();
  track_features();                                  // update u[i], v[i], Z[i]
  for (int i = 0; i < 4; i++) p[i].buildFrom(u[i], v[i], Z[i]);
  vpColVector v_c = servo.computeControlLaw();       // 6D camera twist
  send_to_robot(v_c);
}
6.4 Stability + convergence properties
| Initial error magnitude | IBVS | PBVS | Hybrid |
|---|---|---|---|
| Small (< 10° rotation, < 10 cm trans) | Stable, fast | Stable, fast | Stable, fast |
| Moderate (30°, 30 cm) | Slow / oscillatory | Stable | Stable |
| Large (90°+) | Risk: feature exit FOV | Risk: 3D pose drift | Best (rotation in pose, trans in image) |
| Approaching singularity | L conditioning poor | Not affected | Mixed |
| Featureless surface | Fails (no features) | Fails (no PnP) | Fails; use direct VS |
7. Failure modes & debugging
- Features exit FOV during convergence — common IBVS failure. Detect: tracker loses point. Fix: more features (corners + center), partitioned control, or pre-plan a feasible approach.
- Singular interaction matrix — features collinear or co-planar in degenerate ways. Detect: condition number of L spikes. Fix: re-position camera; add diverse features (line + point + circle).
- Hand-eye calibration error — VS converges to wrong pose offset. Detect: residual after VS settles. Fix: recalibrate; verify checkerboard images; use Tsai-Lenz or dual-quaternion methods.
- Lighting change mid-task — detection threshold misclassifies features. Fix: auto-exposure off, fixed gain, controlled lighting; or feature-invariant descriptors (ORB > Harris).
- Z (depth) estimation drift — IBVS gain becomes wrong as Z error grows. Fix: use measured Z from depth sensor; or adapt Z from feature motion.
- Latency-induced oscillation — vision is 30 Hz, robot at 1 kHz, controller lags. Fix: motion compensation between image samples; predict via IMU; reduce gain λ.
- Outlier features — single mis-detected corner biases the pseudo-inverse. Fix: RANSAC on features, or weighted least squares with M-estimator (Huber, Tukey).
- Chaumette retreat — camera moves backward to align features. Fix: switch to 2.5D or partitioned control; use distance threshold to gate IBVS.
- Camera saturation / motion blur: fast motion produces blurred features. Fix: shorter exposure (and brighter lighting); higher sensor gain (not the controller gain λ); trigger on hardware sync; or use event cameras (DAVIS, Prophesee).
- Marker tag occlusion — ArUco partly hidden by gripper. Fix: multiple markers, redundancy; or switch to texture-based tracking.
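The M-estimator fix for outlier features can be sketched as a weighted pseudo-inverse. The version below is a simplified single-pass scheme, assuming the per-row feature errors serve as residuals and the noise scale comes from the median absolute deviation; the function names are hypothetical and production code typically iterates the weights.

```python
import numpy as np

def huber_weights(residuals, k=1.345):
    """Huber weights: 1 for inliers, k/u for residuals more than k scaled deviations out."""
    r = np.asarray(residuals, dtype=float)
    deviation = np.abs(r - np.median(r))
    scale = 1.4826 * np.median(deviation) + 1e-12      # robust sigma estimate (MAD)
    u = deviation / scale
    return np.minimum(1.0, k / np.maximum(u, 1e-12))

def weighted_twist(L, error, weights, gain=0.5):
    """v_c = -gain * pinv(W L) @ (W e), with one weight per feature row."""
    W = np.diag(weights)
    return -gain * np.linalg.pinv(W @ L) @ (W @ error)

# Four consistent residuals and one gross outlier (e.g. a mis-detected corner).
weights = huber_weights([0.010, 0.012, 0.011, 0.0105, 0.5])
```

The four consistent rows keep weight 1 while the outlier is scaled to well under 1%, so it barely influences the commanded twist. With all weights equal to 1 the law reduces to the ordinary pseudo-inverse.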
7.1 Adaptive gain schemes
Standard fixed-λ control: v = −λ · L⁺ · e. Adaptive variants:
- Step-by-step: λ = λ_max when ||e|| < threshold; λ = λ_min when far. Avoids saturation at large error.
- Exponential: λ = λ_∞ + (λ_0 − λ_∞) · exp(−k ||e||), the pattern used by ViSP’s vpAdaptiveGain, where k = λ'_0 / (λ_0 − λ_∞) and λ'_0 is the slope at zero error.
- State-dependent: λ derived from current camera-target distance (PBVS only).
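A sketch of the exponential schedule, parameterized by gain at zero error, gain at infinite error, and slope at zero error. The default values mirror the (4.0, 0.4, 30.0) triple used in the ViSP snippet of §6.3; the function itself is a stand-alone illustration, not ViSP code.

```python
import math

def adaptive_gain(error_norm, gain_zero=4.0, gain_inf=0.4, slope_zero=30.0):
    """lambda(x) = (l0 - linf) * exp(-(slope0 / (l0 - linf)) * x) + linf."""
    span = gain_zero - gain_inf
    return span * math.exp(-slope_zero / span * error_norm) + gain_inf

gain_near = adaptive_gain(0.0)     # high gain close to convergence
gain_mid = adaptive_gain(0.05)
gain_far = adaptive_gain(10.0)     # low gain for large errors
```

The gain is largest near convergence, where the commanded velocities are small anyway, and drops toward gain_inf for large errors, which limits the initial velocity without slowing the final approach.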
7.2 Singularity-robust pseudo-inverse
When L is near-singular, naive pseudo-inverse blows up. Use damped least-squares (DLS / Levenberg-Marquardt):
L⁺_dls = (L^T L + α² I)⁻¹ L^T
with α = damping factor (0.01–0.1 typical). Or compute via SVD with small-singular-value truncation. Standard in production VS.
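A minimal sketch of the damped pseudo-inverse, using a linear solve rather than an explicit matrix inverse. The rank-deficient test matrix is synthetic and the function name is illustrative.

```python
import numpy as np

def damped_pinv(L, damping=0.05):
    """Damped least-squares inverse (L^T L + damping^2 I)^-1 L^T."""
    L = np.asarray(L, dtype=float)
    n = L.shape[1]
    return np.linalg.solve(L.T @ L + damping ** 2 * np.eye(n), L.T)

# Rank-deficient example: the sixth column duplicates the first.
rng = np.random.default_rng(0)
L_singular = rng.normal(size=(8, 6))
L_singular[:, 5] = L_singular[:, 0]
L_dls = damped_pinv(L_singular, damping=0.05)
gain_bound = 1.0 / (2 * 0.05)      # largest possible amplification, 1 / (2 * damping)
```

Each singular value σ of L maps to σ / (σ² + α²), which never exceeds 1/(2α), so the commanded velocity stays bounded even when L loses rank. The price is a small bias near well-conditioned configurations, which is why α is kept small.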
8. Case studies
8.1 Universal Robots peg-in-hole (commercial)
A Universal Robots arm with Robotiq’s Force Copilot URCap plus a wrist-camera package combines eye-in-hand VS with force feedback for peg insertion. Workflow:
- Vision-based approach to ±2 mm tolerance using AprilTag on socket.
- Force-controlled insertion with impedance control (see impedance-control) at low Z stiffness.
- Tolerance: ~0.1 mm final alignment for round pegs, ~0.05 mm for D-shaped.
UR uses IBVS with the wrist-mounted RealSense D405 (short-range, 7–50 cm). Standard production deployment across automotive sub-assembly.
8.2 Surgical da Vinci with vision feedback
Intuitive Surgical’s da Vinci system: a multi-arm tele-operated surgical robot with a stereo endoscope. The surgeon controls primary motion through the master controllers; autonomous sub-tasks (knot tying, suture pickup) have been demonstrated in research using VS with stereo-derived 3D features. The “Smart Tissue Autonomous Robot” (STAR, Krieger 2020), a separate research platform rather than a da Vinci product, demonstrated supervised autonomous suturing on porcine intestine with VS-based alignment. The visual targets are anatomical features: sutures are tracked frame-to-frame with bespoke detectors.
8.3 Hutchinson tutorial benchmark (Robotnik / INRIA ViSP)
ViSP (Visual Servoing Platform, INRIA) ships with reference examples that became the de facto teaching demos:
- 4-point IBVS planar
- Sphere IBVS (3D feature)
- Cylinder IBVS (line feature)
- Eye-in-hand PBVS on planar target
- 2.5D hybrid
All run on a Franka Panda or UR5 with a wrist camera; reproducible in any university lab.
8.4 NVIDIA FoundationPose + VS for novel objects (2024)
FoundationPose: a 6D pose estimator and tracker for novel objects, given either a CAD model or a small set of reference images. Combined with PBVS:
- Show robot a new object once (image or scan).
- FoundationPose tracks pose in subsequent frames at 30+ Hz.
- PBVS drives gripper to a learned grasp pose relative to the object.
Bypasses the per-object fiducial marker requirement; enables fast deployment in unstructured environments. Open-sourced by NVIDIA in 2024.
8.5 ABB visual servoing on car body — automotive welding
ABB IRB 7600 + camera + laser pointer on a body-in-white welding line. The robot pre-positions to nominal weld point (open-loop from CAD), then visual servoing corrects for body-position variation up to ±15 mm. The vision module measures the seam edge and adjusts the welding TCP (tool-center-point) within 50 ms before each weld initiates. Standard practice across automotive OEMs since ~2005; the controller runs hand-tuned IBVS or industrial PBVS implementation.
8.6 Soft robotics + photometric servoing (Crombez 2017)
Eye-in-hand on a 6-DoF arm tracking a featureless surface. Direct photometric VS uses pixel intensity differences (no feature extraction). The robot aligns its camera parallel to the surface (~1 cm standoff) by minimizing sum-of-squared-differences (SSD) between current image and reference. Worked at 30 Hz on a 320×240 ROI. Useful for surface-inspection tasks where the surface has no distinctive features (painted metal, polished glass).
8.7 Drone visual landing (DJI / Skydio)
Skydio X10D and X2 use eye-down camera + IBVS-style alignment to land on a 1.2 m × 1.2 m landing pad with an AprilTag-derived marker. Approach:
- Long-range descent guided by GPS to ~3 m above pad.
- Switch to vision-only at < 3 m AGL.
- Detect ArUco/Skydio-marker.
- IBVS center the marker in the frame while descending.
- Touchdown when marker covers ≥ 60% of frame + IMU detects contact.
Used in autonomous return-to-home + docking-on-vehicle (military). DJI Matrice 350 RTK + RTK Beacon implements similar protocol.
8.8 Industrial assembly with hybrid VS + force
Comau / Stäubli / Fanuc industrial cells routinely use:
- Open-loop traverse to nominal grasp pose (fast, ~50 mm tolerance).
- Vision-based final approach for ±2 mm correction (camera mounted on Z-axis).
- Force-feedback insertion for sub-mm alignment (impedance control).
Standard sequence for engine-block bolt-insertion: ~3 s open-loop traverse + 1.5 s VS + 0.5 s force-insertion. Achievable cycle time 5 s with > 99.9% success.
8.9 Pick-and-place with deep features (Mahler 2017 Dex-Net + VS)
UC Berkeley AUTOLAB’s Dex-Net family: trained deep grasp-quality network on synthetic depth images. Pipeline:
- RGBD camera fixed overhead.
- Dex-Net inference: per-pixel grasp success likelihood + 6-DoF grasp pose.
- Robot transitions to predicted grasp via open-loop motion plus optional VS final correction.
Not strictly “VS” but uses vision in the loop with learned features. Standard in academic warehouse research. Extends to ManiSkill / RAPID benchmarks.
Adjacent
- digital-control
- classical-control
- realtime-embedded
- signal-processing-dsp
- photonics
- electromagnetics-engineering
- microcontrollers
- system-identification
Comparison: VS vs alternatives
Open-loop motion (CAD-based plan):
+ Simple, fast
- Sensitive to calibration drift, fixture variation
- Used as a coarse approach before VS
Visual servoing:
+ Tolerates kinematic uncertainty, calibration drift
+ Adapts to moving targets
- Requires sustained visibility of features
- Vision-loop latency limits bandwidth
Force-only feedback (impedance):
+ Robust to vision failure (lighting, occlusion)
- Cannot disambiguate before contact
Hybrid (VS + force):
+ Vision for pre-contact alignment
+ Force for post-contact insertion
- Standard combination in production
Learning-based (end-to-end neural):
+ Generalizes to novel objects (foundation models)
+ Handles texture-less surfaces
- Less interpretable; difficult to verify
- Standard for general manipulation 2024+
Citations
- Hashimoto K., Kimoto T., Ebine T., Kimura H. (1991) “Manipulator Control with Image-Based Visual Servo.” ICRA.
- Espiau B., Chaumette F., Rives P. (1992) “A New Approach to Visual Servoing in Robotics.” IEEE Trans Robotics & Automation 8(3).
- Chaumette F. (1998) “Potential Problems of Stability and Convergence in Image-Based and Position-Based Visual Servoing.” in The Confluence of Vision and Control, LNCIS 237.
- Hutchinson S., Hager G., Corke P. (1996) “A Tutorial on Visual Servo Control.” IEEE Trans Robotics & Automation 12(5).
- Chaumette F., Hutchinson S. (2006) “Visual Servo Control I: Basic Approaches.” IEEE Robotics & Automation Magazine 13(4).
- Chaumette F., Hutchinson S. (2007) “Visual Servo Control II: Advanced Approaches.” IEEE Robotics & Automation Magazine 14(1).
- Malis E., Chaumette F., Boudet S. (1999) “2½D Visual Servoing.” IEEE Trans Robotics & Automation 15(2).
- Corke P., Hutchinson S. (2001) “A New Partitioned Approach to Image-Based Visual Servo Control.” IEEE Trans Robotics & Automation 17(4).
- Tsai R.Y., Lenz R.K. (1989) “A New Technique for Fully Autonomous and Efficient 3D Robotics Hand/Eye Calibration.” IEEE Trans Robotics & Automation 5(3).
- Daniilidis K. (1999) “Hand-Eye Calibration Using Dual Quaternions.” Int J Robotics Research 18(3).
- Park F.C., Martin B.J. (1994) “Robot Sensor Calibration: Solving AX=XB on the Euclidean Group.” IEEE Trans Robotics & Automation 10(5).
- Papanikolopoulos N.P., Khosla P.K. (1993) “Adaptive Robotic Visual Tracking: Theory and Experiments.” IEEE Trans Automatic Control 38(3).
- Bakthavatchalam M., Chaumette F., Marchand E. (2015) “Photometric Moments: New Promising Candidates for Visual Servoing.” ICRA.
- Crombez N., Mouaddib E.M., Caron G., Chaumette F. (2017) “Direct Visual Servoing Using Spline-Based Photometric Functions.” ICRA.
- Bateux Q., Marchand E., Leitner J., Chaumette F., Corke P. (2018) “Training Deep Neural Networks for Visual Servoing.” ICRA.
- Manuelli L., Li Y., Florence P., Tedrake R. (2020) “Keypoints into the Future: Self-Supervised Correspondence in Model-Based Reinforcement Learning.” CoRL.
- Lepetit V., Moreno-Noguer F., Fua P. (2009) “EPnP: An Accurate O(n) Solution to the PnP Problem.” IJCV 81.
- Olson E. (2011) “AprilTag: A Robust and Flexible Visual Fiducial System.” ICRA.
- ViSP project (INRIA, Marchand and Chaumette). visp.inria.fr.
- Wen B., Yang W., Kautz J., Birchfield S. (2024) “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects.” CVPR.