Visual Servoing — IBVS, PBVS, Hybrid Schemes, Direct Visual Servoing

Closing the control loop directly through vision: Image-Based (IBVS) using pixel features, Position-Based (PBVS) using estimated 3D pose, hybrid 2.5D schemes, and the modern variants — direct photometric servoing, deep visual servoing with neural feature embeddings, and uncalibrated visual servoing. The classical paper-trail (Chaumette, Hutchinson, Espiau, Hashimoto) plus the contemporary deep-learning grafts that handle texture-less objects and large displacements. Foundational for any robotic task where the final pose must be defined relative to what the camera sees, not what a CAD model predicts.

1. At a glance

Visual servoing (VS) is robot control with vision in the feedback loop. The error signal is built from image measurements — pixel features, region descriptors, or 3D pose estimates — and the controller computes joint or Cartesian commands to drive that error to zero. Unlike “look-then-move” architectures (compute target pose once, then execute open-loop), VS continuously corrects through image feedback, so it tolerates calibration error, kinematic uncertainty, and moving targets that pure planning cannot.

Three canonical schemes dominate the literature and practice:

  • Image-Based Visual Servoing (IBVS) — error defined in 2D pixel space; needs the interaction matrix (image Jacobian) relating pixel velocities to camera twist. Classical IBVS (Hashimoto 1991, Espiau 1992, Chaumette 1992, Chaumette & Hutchinson 2006/2007 tutorial) is the standard taught in textbooks.
  • Position-Based Visual Servoing (PBVS) — features are reconstructed into 3D pose via known geometry (or learned estimator); error is defined in Cartesian SE(3). Requires accurate camera calibration + object model.
  • Hybrid / 2.5D VS (Malis 1999) — decouple translation and rotation; do rotation control in pose space and translation in image space. Best behavior at long distances + accuracy at close.

More recent additions:

  • Direct (photometric) visual servoing — error is pixel intensity differences, no features extracted. Suited to scenes where discrete geometric features are hard to extract, provided the image still has enough intensity gradient (Bakthavatchalam 2015, Crombez 2017).
  • Deep visual servoing — neural feature embeddings as the visual error, trained on either synthetic or self-supervised data (Bateux 2018, Saxena 2017, Yu 2020).
  • Differentiable rendering VS — gradients through a renderer enable pose-from-image without an explicit pose-estimator (Manuelli 2020, NeRF-supervision 2022).

Where it sits. CV provides the perception input; kinematics provides the robot Jacobian J_r mapping joint velocities to end-effector twist; the interaction matrix L maps camera twist to image-feature velocity; the composite control law is q̇ = −λ · J_r⁻¹ · ^eT_c · L⁺ · (s − s*), with gain λ > 0, the pseudo-inverse L⁺, and the camera-to-end-effector twist transform ^eT_c (see §2.8). Manipulator, gripper, bin picking, and impedance control all build on VS for the final-approach phase.

First ask.

  • Eye-in-hand or eye-to-hand? Eye-in-hand: camera moves with the end-effector; used for most precision tasks. Eye-to-hand: fixed camera; gives a multi-object overview, less precise but stable.
  • 2D, 2.5D, or 3D feature space? 2D (IBVS): simplest, no model required, but trajectories in 3D can be poor. 3D (PBVS): clean 3D trajectory but needs calibration + model. 2.5D: best of both.
  • Visible features or featureless? Feature-based VS for textured objects with markers/SIFT/ORB; direct VS for surfaces where discrete features are hard to extract.
  • How far is the initial pose from the final pose? Small displacements: IBVS converges reliably. Large displacements: PBVS or a path-planned approach to an IBVS-feasible zone.

2. First principles

2.1 The interaction matrix (image Jacobian) for a point

For an image feature s = (u, v) corresponding to a 3D point at Z relative to the camera, the velocity of the image point is related to camera-twist v_c = (v_x, v_y, v_z, ω_x, ω_y, ω_z) by:

[u̇]   [ -1/Z    0    u/Z    uv     -(1+u²)    v ]
[v̇] = [   0  -1/Z    v/Z   1+v²     -uv      -u ] · v_c

This 2×6 matrix is L_p (or L_s), the interaction matrix for a single point feature. (Coordinates u, v are normalized image coordinates; multiply by focal length for raw pixels.) For n point features, stack to get L of size 2n×6.
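The matrix above translates directly into code. The following is a minimal NumPy sketch, not a reference implementation; the function names and the sample square at 0.5 m are illustrative assumptions.

```python
import numpy as np

def interaction_matrix_point(u, v, Z):
    """2x6 interaction matrix of one point, normalized coordinates (u, v), depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, u / Z, u * v, -(1.0 + u * u), v],
        [0.0, -1.0 / Z, v / Z, 1.0 + v * v, -u * v, -u],
    ])

def stacked_interaction_matrix(points, depths):
    """Stack the per-point matrices into a (2n)x6 matrix."""
    return np.vstack([interaction_matrix_point(u, v, Z)
                      for (u, v), Z in zip(points, depths)])

# Four corners of a square seen fronto-parallel at 0.5 m (hypothetical values).
pts = [(-0.1, -0.1), (0.1, -0.1), (0.1, 0.1), (-0.1, 0.1)]
L = stacked_interaction_matrix(pts, [0.5] * 4)
print(L.shape, np.linalg.matrix_rank(L))
```

With four non-collinear points the stacked matrix has full column rank, which is what the pseudo-inverse in the control law relies on.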

Inverse via pseudo-inverse:

v_c = −λ · L⁺ · (s − s*)

where λ > 0 is the gain, s* is the desired feature configuration, L⁺ = (L^T L)⁻¹ L^T for n ≥ 3 features.
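As a sketch of this control law, the snippet below computes the camera twist for four points that are all shifted to the right of their desired positions. The helper names, the feature values, and the depth of 0.5 m are assumptions chosen for illustration.

```python
import numpy as np

def point_rows(u, v, Z):
    """The two interaction-matrix rows of one point feature."""
    return [[-1 / Z, 0, u / Z, u * v, -(1 + u * u), v],
            [0, -1 / Z, v / Z, 1 + v * v, -u * v, -u]]

def ibvs_velocity(L, s, s_star, lam=0.5):
    """Camera twist v_c = -lam * pinv(L) @ (s - s_star)."""
    e = np.asarray(s, dtype=float) - np.asarray(s_star, dtype=float)
    return -lam * np.linalg.pinv(L) @ e

# Hypothetical desired features: corners of a square, flattened as (u1, v1, u2, v2, ...).
s_star = np.array([-0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1, 0.1])
s = s_star + np.tile([0.02, 0.0], 4)       # every point shifted right by 0.02
L = np.array([row for u, v in s.reshape(-1, 2) for row in point_rows(u, v, 0.5)])
v_c = ibvs_velocity(L, s, s_star)
print(np.round(v_c, 4))
```

Because a uniform image shift lies in the column space of L, the predicted feature velocity L · v_c equals −λ(s − s*) exactly in this example; for a general error only its projection onto the range of L is corrected.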

2.2 Three-points minimum

Six DoF needs 6 independent equations → at least 3 point features (giving 6 equations from 6 pixel components). In practice 4+ points are used because:

  • Point features can be lost (occlusion, leave FOV).
  • Redundant features improve noise rejection.
  • Three points alone are ambiguous: up to four distinct camera poses produce the same image of three points, and L becomes singular for some configurations, so a fourth point is needed to pin down the pose.

2.3 Stability and the Chaumette conundrum

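IBVS with point features has known local minima: configurations where (s − s*) ≠ 0 but L⁺(s − s*) = 0, so the camera stops short of the goal. These can arise once more than three points are used, because the stacked L then has more rows than columns and the error can fall in the null space of L⁺. A separate, well-known pathology is camera retreat (the “Chaumette conundrum”). The textbook example: features at corners of a square, desired pose differing from the current one by a large rotation about the optical axis. IBVS moves each image point in a straight line toward its goal, which forces the camera to back away before approaching; as the rotation nears 180° the commanded motion becomes a pure backward translation and the robot runs into its workspace limits.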

Mitigations:

  • Use more features (5+).
  • Switch to PBVS for large initial errors.
  • Use partitioned control (Corke & Hutchinson 2001) — pull translation along and rotation about the optical axis out of the IBVS law and control them with separate laws.
  • Use 2.5D hybrid (Malis 1999).

2.4 PBVS pose error

Given measured pose (R, t) of the object w.r.t. camera and desired (R*, t*):

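e_t = R · R*^T · t* − t     (position of the current camera relative to the desired one, in the current camera frame)
θu = log(R* · R^T)      (rotation of the current camera relative to the desired one, as an angle-axis vector)

Both vanish when (R, t) = (R*, t*). Sanity check: with R = R* = I, t = (0, 0, 1) and t* = (0, 0, 0.5), e_t = (0, 0, −0.5), so the law below commands forward motion along the optical axis, as it should.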

Camera twist command:

v_c = −λ · [e_t; θu]

The error is defined exactly on SE(3), with the rotation part obtained from the logarithm map SO(3) → so(3). Trajectories are geodesics in pose space (straight-line translation + great-arc rotation), which look natural but may drive features out of FOV because nothing in the law constrains the image.
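A minimal sketch of the PBVS command follows. It assumes the error is taken as the pose of the current camera relative to the desired one, so that v_c = −λ · [e_t; θu] drives the camera toward the goal; `so3_log` and `pbvs_twist` are illustrative names, and the log map shown is only valid for rotation angles below π.

```python
import numpy as np

def so3_log(R):
    """Axis-angle vector theta*u of a rotation matrix (valid for theta < pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-9:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * w / (2.0 * np.sin(theta))

def pbvs_twist(R, t, R_star, t_star, lam=0.5):
    """Camera twist from the measured object pose (R, t) and the desired pose (R_star, t_star)."""
    e_t = R @ R_star.T @ t_star - t      # current camera origin w.r.t. desired, current frame
    theta_u = so3_log(R_star @ R.T)      # rotation of current camera w.r.t. desired
    return -lam * np.concatenate([e_t, theta_u])

# Object 1 m ahead, desired standoff 0.5 m: the camera should move forward (+z).
I3 = np.eye(3)
twist = pbvs_twist(I3, np.array([0.0, 0.0, 1.0]), I3, np.array([0.0, 0.0, 0.5]))
print(twist)
```

The forward-motion case is a quick way to catch sign mistakes in a PBVS implementation before it is connected to a robot.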

2.5 2.5D hybrid (Malis 1999)

Decompose:

  • Use image-space control for translation (extended image coordinates (x, y, log Z) of a reference point).
  • Use PBVS for rotation (axis-angle from estimated R).
  • Combine in a 6×6 interaction matrix that is block-triangular, so the rotation loop is decoupled from translation.

Pros: avoids feature-leaving-FOV (rotation is bounded a priori); avoids Chaumette retreat (translation in image space is well-behaved); needs partial calibration only.

2.6 Direct (photometric) VS

Error is pixel intensity:

e = I(p, q) − I*(p)

summed over a region. The interaction matrix becomes:

L_I = −∇I · L_p

with ∇I the image gradient at pixel p. Sensitive to lighting + viewpoint change but does not need feature extraction. Used in surface inspection, where features are sparse but textures are reliable.
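The per-pixel rows of L_I can be assembled as below. This is a sketch under simplifying assumptions: a single constant depth Z for the whole region, gradients from `np.gradient`, and hypothetical intrinsics passed in by the caller.

```python
import numpy as np

def photometric_interaction_matrix(image, Z, fx, fy, cx, cy):
    """One row per pixel: L_I = -(dI/dx * L_x + dI/dy * L_y), assuming constant depth Z."""
    Iy, Ix = np.gradient(image.astype(float))        # intensity change per pixel (rows, cols)
    rows, cols = np.indices(image.shape)
    x = ((cols - cx) / fx).ravel()                   # normalized image coordinates
    y = ((rows - cy) / fy).ravel()
    gx, gy = (Ix * fx).ravel(), (Iy * fy).ravel()    # gradient w.r.t. normalized coordinates
    one, zero = np.ones_like(x), np.zeros_like(x)
    Lx = np.stack([-one / Z, zero, x / Z, x * y, -(1 + x**2), y], axis=1)
    Ly = np.stack([zero, -one / Z, y / Z, 1 + y**2, -x * y, -x], axis=1)
    return -(gx[:, None] * Lx + gy[:, None] * Ly)

# A horizontal intensity ramp: only motions that shift the image horizontally are observable.
ramp = np.tile(np.arange(10.0), (8, 1))
L_I = photometric_interaction_matrix(ramp, Z=0.5, fx=100.0, fy=100.0, cx=4.5, cy=3.5)
print(L_I.shape)
```

A uniform image produces an all-zero matrix, which is the numerical face of the requirement that the scene contain intensity gradients.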

2.7 Deep visual servoing

Replace the hand-crafted feature s with a learned embedding f_θ(I) ∈ ℝ^d. The interaction matrix is then obtained in one of these ways:

  • Numeric Jacobian: perturb camera by small ε in 6 directions; compute (f(I_perturbed) − f(I)) / ε. Slow.
  • Backpropagation through the encoder: ∂f / ∂I × ∂I / ∂v_c with the second factor learned or computed analytically.
  • Self-supervised contrastive training (Bateux 2018, Yu 2020) so the embedding space is approximately linear in camera twist.

Advantages: robust to lighting / texture / partial occlusion. Disadvantages: fewer geometric guarantees than classical schemes.
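The numeric-Jacobian option can be illustrated on a toy feature. In the sketch below a simple point projection stands in for the encoder f_θ (an assumption made so the result can be checked against the analytic L_p of §2.1); with a real network, `features` would be replaced by rendering or capturing an image after the perturbation and running the encoder.

```python
import numpy as np

def project(X):
    """Normalized image coordinates of a 3D point given in the camera frame."""
    return np.array([X[0] / X[2], X[1] / X[2]])

def features(X, twist, dt):
    """Feature after the camera moves with `twist` for a short time dt (static point)."""
    v, w = twist[:3], twist[3:]
    X_new = X + dt * (-v - np.cross(w, X))    # point velocity seen from the moving camera
    return project(X_new)

def numeric_interaction_matrix(X, eps=1e-6):
    """Finite-difference Jacobian: perturb the camera along each of the 6 twist axes."""
    J = np.zeros((2, 6))
    f0 = project(X)
    for k in range(6):
        twist = np.zeros(6)
        twist[k] = 1.0
        J[:, k] = (features(X, twist, eps) - f0) / eps
    return J

def analytic_interaction_matrix(X):
    u, v, Z = X[0] / X[2], X[1] / X[2], X[2]
    return np.array([[-1 / Z, 0, u / Z, u * v, -(1 + u * u), v],
                     [0, -1 / Z, v / Z, 1 + v * v, -u * v, -u]])

X = np.array([0.1, -0.05, 0.5])
print(np.abs(numeric_interaction_matrix(X) - analytic_interaction_matrix(X)).max())
```

Six perturbations per control step is what makes this option slow on a physical robot, where each perturbation is a real motion or a rendered view.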

2.8 Robot-Jacobian coupling

Total chain: joint velocities → end-effector twist (via robot Jacobian J_r) → camera twist (via fixed eye-in-hand transform ^cT_e) → image feature velocity (via interaction matrix L).

q̇ = J_r⁻¹ · ^eT_c · L⁺ · (−λ(s − s*))

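When eye-in-hand: ^eT_c is the fixed extrinsic, applied to twists as the 6×6 velocity-twist (adjoint) matrix built from its rotation and translation. J_r⁻¹ assumes a square, non-singular Jacobian; use the pseudo-inverse for redundant arms. When eye-to-hand: relate camera frame to robot base frame via ^bT_c, calibrate offline. The “hand-eye calibration” problem solves AX=XB to find this extrinsic (Tsai-Lenz 1989, Daniilidis 1999, Park-Martin 1994).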
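The camera-to-end-effector step acts on twists, so it needs a 6×6 matrix rather than the 4×4 homogeneous transform. The sketch below assumes twists ordered as (v, ω) and takes e_R_c, e_t_c as the rotation and translation of the camera frame expressed in the end-effector frame; the function names are illustrative.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x."""
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]], dtype=float)

def twist_transform(R, t):
    """6x6 matrix mapping a twist (v, w) expressed in frame c to frame e."""
    V = np.zeros((6, 6))
    V[:3, :3] = R
    V[:3, 3:] = skew(t) @ R      # lever-arm coupling of rotation into translation
    V[3:, 3:] = R
    return V

def joint_velocity(J_r, e_R_c, e_t_c, v_c):
    """Joint rates for a camera twist; the pseudo-inverse also covers non-square Jacobians."""
    v_e = twist_transform(e_R_c, e_t_c) @ v_c
    return np.linalg.pinv(J_r) @ v_e

# Camera mounted 0.1 m along the end-effector x axis, rotating about its own z axis.
v_c = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
v_e = twist_transform(np.eye(3), np.array([0.1, 0.0, 0.0])) @ v_c
print(v_e)
```

Even with aligned axes, a pure camera rotation shows up as a translation of the end-effector origin, which is why the offset cannot be dropped.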

3. Practical math — sized examples + control law

3.1 Worked example: 4-point IBVS to align a planar target

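Setup: eye-in-hand camera (640×480, fx=fy=500, principal point assumed at (320, 240)), 4 corners of a 100×100 mm planar target visible. Target initially at Z ≈ 0.5 m from camera. Desired image: corners at (200,120), (440,120), (440,360), (200,360), i.e., a 240 px square centered on the principal point with the target axis-aligned, which corresponds to a fronto-parallel standoff of Z* = fx · 0.1 m / 240 px ≈ 0.21 m.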

Steps:

  1. Detect 4 corners (Harris / Shi-Tomasi / ArUco). Get current pixel positions (u_i, v_i).
  2. Normalize: u_n = (u − cx) / fx, v_n = (v − cy) / fy.
  3. Build L (8×6) from §2.1 using each corner’s (u_n, v_n) and estimated Z (use depth sensor or known target size).
  4. Compute error e = (s − s*) ∈ ℝ⁸.
  5. v_c = −λ · L⁺ · e, with λ = 0.5.
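  6. Transform to end-effector twist: v_e = ^eT_c · v_c, using the twist transform from §2.8. This reduces to a pure rotation ^eR_c only when the camera origin coincides with the end-effector origin; otherwise the lever arm between them couples camera rotation into end-effector translation.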
  7. Joint velocities: q̇ = J_r⁻¹ · v_e.
  8. Send q̇ to controller; iterate at 30–100 Hz.

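Expected behavior: with the twist applied as a velocity, the error decays roughly as exp(−λt), so λ = 0.5 s⁻¹ gives a 2 s time constant and about 6 s (~180 iterations at 30 Hz) to fall below 5% of the initial error; a larger λ shortens this at the cost of stability margin. Trajectory of camera in 3D is a curve, not a straight line: image-space straight, 3D-space curved (well-known IBVS property).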
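The loop can be tried in simulation before touching a robot. The sketch below replaces the arm with a free-flying camera (steps 6 and 7 are skipped), uses the true depths in L, and assumes a 100 mm square target, a hypothetical start pose about 0.5 m away, and a desired fronto-parallel view at roughly 0.21 m; all of these values are illustrative.

```python
import numpy as np

def skew(w):
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]], dtype=float)

def so3_exp(w):
    """Rodrigues formula: rotation matrix for the axis-angle vector w."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3) + skew(w)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def observe(P_world, R_wc, p_wc):
    """Flattened normalized features and depths of world points seen by the camera."""
    X = (P_world - p_wc) @ R_wc              # each row is R_wc.T @ (P - p)
    return (X[:, :2] / X[:, 2:3]).ravel(), X[:, 2]

def interaction_matrix(s, Z):
    rows = []
    for (u, v), z in zip(s.reshape(-1, 2), Z):
        rows.append([-1 / z, 0, u / z, u * v, -(1 + u * u), v])
        rows.append([0, -1 / z, v / z, 1 + v * v, -u * v, -u])
    return np.array(rows)

# World frame = desired camera frame; target corners sit at the desired standoff.
Z_star = 500 * 0.1 / 240
P = np.array([[-0.05, -0.05, Z_star], [0.05, -0.05, Z_star],
              [0.05, 0.05, Z_star], [-0.05, 0.05, Z_star]])
s_star, _ = observe(P, np.eye(3), np.zeros(3))

R = so3_exp(np.array([0.05, -0.08, 0.17]))   # initial orientation error (about 11 degrees)
p = np.array([0.05, -0.03, -0.29])           # initial position, about 0.5 m from the target
lam, dt = 0.5, 1 / 30
errors = []
for _ in range(600):
    s, Z = observe(P, R, p)
    e = s - s_star
    errors.append(np.linalg.norm(e))
    v_c = -lam * np.linalg.pinv(interaction_matrix(s, Z)) @ e
    p = p + R @ v_c[:3] * dt                 # integrate translation (camera frame to world)
    R = R @ so3_exp(v_c[3:] * dt)            # integrate rotation
print(errors[0], errors[-1])
```

Logging `p` at each step and plotting it shows the curved 3D path described above, while the feature tracks in the image stay close to straight lines.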

3.2 PBVS variant

Same setup, but:

  1. Estimate 3D pose of target via PnP (Lepetit’s EPnP, OpenCV solvePnP) using 4 correspondences.
  2. e_t = R_current · R_desired^T · t_desired − t_current.
  3. θu = log(R_desired · R_current^T).
  4. v_c = −λ · [e_t; θu].
  5. Same transformations through robot Jacobian.

Trajectory of camera is straight line in 3D — corners take curved paths in image, may leave FOV near boundaries.

3.3 Adaptive Z (depth) estimation

The interaction matrix depends on Z. Three strategies:

  • Constant Z (Z = Z*): use the desired-pose depth. Works for small displacements; fails for large.
  • Online estimation: depth sensor (RealSense, ToF) gives Z directly per feature.
  • Z estimated from feature motion (Papanikolopoulos 1993): with measured image velocity ṡ and commanded camera velocity v_c, solve for Z from L_p(Z) · v_c = ṡ. The equation is linear in 1/Z and needs a non-zero translational velocity.
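The third strategy reduces to a one-unknown least-squares problem, because only the translational columns of L_p depend on Z. A minimal sketch for a single point, with illustrative names and thresholds:

```python
import numpy as np

def estimate_depth(u, v, s_dot, v_c):
    """Least-squares depth of one point from its measured feature velocity s_dot (2,)
    and the camera twist v_c (6,). Returns None if the motion carries no depth information."""
    A = np.array([[-1.0, 0.0, u], [0.0, -1.0, v]])                     # translational part times Z
    B = np.array([[u * v, -(1 + u * u), v], [1 + v * v, -u * v, -u]])  # rotational part
    a = A @ v_c[:3]
    b = np.asarray(s_dot, dtype=float) - B @ v_c[3:]
    if a @ a < 1e-12:
        return None                       # pure rotation: depth is unobservable
    inv_Z = (a @ b) / (a @ a)
    return 1.0 / inv_Z if inv_Z > 0 else None

# Synthetic check: generate s_dot with a known depth and recover it.
u, v, Z_true = 0.1, -0.2, 0.4
v_c = np.array([0.05, 0.0, 0.1, 0.01, -0.02, 0.03])
L_p = np.array([[-1 / Z_true, 0, u / Z_true, u * v, -(1 + u * u), v],
                [0, -1 / Z_true, v / Z_true, 1 + v * v, -u * v, -u]])
print(estimate_depth(u, v, L_p @ v_c, v_c))
```

In practice ṡ comes from differencing noisy detections, so the estimate is usually low-pass filtered or fused over several frames.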

3.4 Hand-eye calibration (Tsai-Lenz)

Given multiple (^bT_e, ^cT_obj) pairs, solve:

AX = XB

where A = ^bT_{e,i}^{-1} · ^bT_{e,j}, B = ^cT_{obj,i} · ^cT_{obj,j}^{-1}, X = ^eT_c.

Tsai-Lenz two-step solution:

  1. Solve rotation: R_x from R_a · R_x = R_x · R_b using axis-angle ratio.
  2. Solve translation: t_x from (R_a − I) · t_x = R_x · t_b − t_a.
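The translation step is a stacked linear least-squares problem once R_x is known. The sketch below covers only step 2, on synthetic data with a made-up extrinsic, and assumes R_x has already been obtained from step 1; at least two motions with non-parallel rotation axes are needed for a unique answer.

```python
import numpy as np

def rot(w):
    """Rodrigues formula for an axis-angle vector w (norm > 0)."""
    theta = np.linalg.norm(w)
    k = np.asarray(w, dtype=float) / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def solve_translation(R_x, motions):
    """Least-squares t_x from the stacked equations (R_a - I) t_x = R_x t_b - t_a.
    motions: iterable of (R_a, t_a, t_b) relative-motion triples."""
    M = np.vstack([R_a - np.eye(3) for R_a, _, _ in motions])
    d = np.concatenate([R_x @ t_b - t_a for _, t_a, t_b in motions])
    t_x, *_ = np.linalg.lstsq(M, d, rcond=None)
    return t_x

# Synthetic check with a made-up extrinsic X = (R_x, t_x_true).
R_x, t_x_true = rot([0.1, -0.2, 0.3]), np.array([0.03, -0.01, 0.08])
motions = []
for w, t_a in [([0.4, 0.0, 0.1], [0.1, 0.0, 0.0]),
               ([0.0, 0.5, -0.2], [0.0, 0.2, 0.1]),
               ([0.3, 0.3, 0.0], [-0.1, 0.0, 0.2])]:
    R_a, t_a = rot(w), np.array(t_a)
    t_b = R_x.T @ (R_a @ t_x_true + t_a - t_x_true)   # consistent with A X = X B
    motions.append((R_a, t_a, t_b))
print(solve_translation(R_x, motions))
```

With real data the residual of this system is a useful calibration-quality indicator: a large residual usually points to poor pose diversity or bad target detections.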

Modern alternative: dual-quaternion formulation (Daniilidis 1999), solve as a single linear system. OpenCV provides calibrateHandEye() with multiple methods.

3.5 Latency budget

Vision pipeline at 30 Hz: capture (~5 ms exposure + 10 ms readout) + detection (5–30 ms) + tracking (1–5 ms). The listed stages sum to roughly 20–50 ms; allow ~40–60 ms per frame end to end once transfer and command latency are included. Joint command at 1 kHz. The VS loop is rate-limited by vision. For high-bandwidth tasks (peg-in-hole final approach), use predictive control + IMU motion compensation.

4. Design heuristics

  • Eye-in-hand for precision; eye-to-hand for context. Eye-in-hand gives sub-pixel feedback at close range. Eye-to-hand sees the whole workspace but is poor for small alignment.
  • Mark your target. ArUco / AprilTag fiducials remove detection ambiguity. Used in production robotics + research. Camera calibration via charuco boards is standard practice.
  • Use enough features. 4 minimum, 6–10 typical, 20+ for robust real-time tracking. Spread across image to maximize observability: features all in one quadrant give a poorly conditioned L.
  • Mix feature types. Points + lines + circles. Different features have different sensitivities to motion directions; combining stabilizes.
  • Don’t normalize the camera too late. Pixel coordinates have non-uniform sensitivity (corner pixels at depth Z have different metric meaning than center). Always work in normalized image coordinates inside the controller.
  • Saturate the gain. λ too high → instability + servo overshoot. λ too low → slow convergence. Typical λ = 0.3–1.0 for 30 Hz vision; tune by step response.
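  • Use velocity feedforward. If the target is moving (tracked over time), include the estimated feature velocity it induces: v_c = −λ · L⁺ · (s − s*) − L⁺ · ∂ê/∂t, where ∂ê/∂t is the estimated error variation due to target motion. For a time-varying reference s*(t) and a static target this term is −ṡ*, giving v_c = −λ · L⁺ · (s − s*) + L⁺ · ṡ*.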
  • Switch schemes by distance. PBVS or planned approach when far; IBVS when close. Cuts trajectory length while preserving final-pose accuracy.
  • Plan path through feature space, not configuration space. For VS at the final phase, the feature path determines whether features stay in FOV. Image-space path planning (e.g., Mezouar & Chaumette 2002) is now a niche but useful technique.
  • Filter your features. Detected pixel positions are noisy. Low-pass filter or Kalman filter the features before computing error.
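The advice to spread features across the image can be checked numerically through the condition number of L. The two point sets below are made-up examples at an assumed depth of 0.5 m: one spans most of the image, the other is squeezed into one quadrant.

```python
import numpy as np

def stacked_L(points, Z):
    """Stacked point interaction matrix for normalized points at a common depth Z."""
    rows = []
    for u, v in points:
        rows.append([-1 / Z, 0, u / Z, u * v, -(1 + u * u), v])
        rows.append([0, -1 / Z, v / Z, 1 + v * v, -u * v, -u])
    return np.array(rows)

spread = [(-0.3, -0.2), (0.3, -0.2), (0.3, 0.2), (-0.3, 0.2)]
clustered = [(0.20, 0.10), (0.25, 0.10), (0.25, 0.15), (0.20, 0.15)]

cond_spread = np.linalg.cond(stacked_L(spread, 0.5))
cond_clustered = np.linalg.cond(stacked_L(clustered, 0.5))
print(cond_spread, cond_clustered)
```

The clustered set yields a markedly larger condition number, meaning pixel noise is amplified more strongly in the commanded twist.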

5. Components & sourcing — vision + feature libraries

5.1 Camera + sensor sourcing

Vendor / model | Resolution | Frame rate | Notes
FLIR Blackfly S USB3 (BFS-U3-13Y3C) | 1280×1024 | 170 fps | Industrial workhorse
Basler ace 2 USB3 / GigE (a2A2590-60ucBAS) | 2592×2048 | 60 fps | Wide industrial use
Intel RealSense D435i / D455 | 1280×720 RGB + depth | 90 fps | Built-in IMU
Intel RealSense D405 | 1280×720 stereo | 90 fps | Short-range (7–50 cm) for in-hand
Stereolabs Zed 2i / Zed X | 2K stereo | 60 fps | Outdoor + neural depth
Luxonis OAK-D / OAK-D-Pro | 1080p RGB + stereo depth | 30 fps | Onboard Movidius VPU
Ximea xiQ MQ013MG | 1280×1024 | 60 fps | Subminiature for in-hand
Sony Spresense + IMX458 | 1080p | 60 fps | DIY/research

5.2 Fiducial / marker systems

Marker | Notes
ArUco (OpenCV aruco module) | 4×4 to 7×7 ID grids; classic
AprilTag (Olson 2011, AprilRobotics) | Higher detection range; v3 + Tag36h11 standard
ChArUco board | Chessboard + ArUco fused; standard camera-calibration target
Reflective spherical markers (Vicon, OptiTrack) | Motion-capture grade; sub-mm accuracy
STag (Benligiray 2019) | More robust to perspective + scale

5.3 Feature-extraction libraries

Library | Detectors | License
OpenCV (C++ / Python) | ORB, SIFT (free since 2020), AKAZE, BRISK, Harris, GFTT, SURF (non-free) | Apache 2 (some patents)
OpenGV (Kneip) | Geometric verification + PnP | BSD
SuperPoint (Magic Leap 2018) | Deep keypoints | Magic Leap research-only license (original release)
LightGlue (Lindenberger 2023) | Faster successor to SuperGlue | Apache 2
DINOv2 (Meta 2023) | Self-supervised dense features | Apache 2
LoFTR (Sun 2021) | Detector-free matching | Apache 2
FoundationPose (NVIDIA 2024) | One-shot 6D pose estimation | NVIDIA / open
ViSP (INRIA, since 1996) | Reference visual-servoing toolbox | GPL
ARToolKit / ARToolKitX | Classical AR tracking | LGPL

5.4 Robot integration

Stack | VS support
ROS 2 + vision_visp | Bridge ROS images ↔ ViSP servo
MoveIt 2 + visual servoing plugins | Limited; usually direct interfacing
Franka FCI + libfranka | Direct Cartesian velocity command interface
Universal Robots RTDE | 125 Hz Cartesian / joint velocity command
KUKA KSS / Sunrise FRI | 250 Hz / 1 kHz interfaces
ABB EGM (Externally Guided Motion) | 250 Hz

5.5 Camera calibration toolchains

- OpenCV: cv2.calibrateCamera (Zhang's method, charuco/checkerboard)
- ROS camera_calibration (built on OpenCV)
- Kalibr (Furgale, ETH) — supports camera-IMU, multi-camera, multi-modal
- COLMAP — structure-from-motion, useful for extrinsic recovery
- Multical (HKUST) — multi-camera rig calibration
- AprilGrid (AprilRobotics) — calibration target for Kalibr
- Vicalib (NASA Ames) — used for Astrobee NavCam calibration
- Matlab Camera Calibration Toolbox (Bouguet) — historical reference

5.6 Hand-eye solver implementations

- OpenCV cv2.calibrateHandEye:
    method: TSAI / PARK / HORAUD / ANDREFF / DANIILIDIS
- ROS easy_handeye package (wraps OpenCV)
- visp.hand_to_eye_calibration (ViSP)
- HandEyeCalibration (MATLAB Robotics Toolbox)
- Custom CV+SE(3) optimization (Pinocchio, Sophus, Manif libraries)

6. Reference data

6.1 Method comparison

Scheme | Trajectory | Calibration | Local minima | Speed
IBVS (point) | Curved in 3D | Coarse | Yes (Chaumette retreat) | Fast
IBVS (line / circle / moment) | Better behaved | Coarse | Less | Slower
PBVS | Straight in 3D | Exact extrinsic + intrinsic + model | None (in pose space) | Moderate
2.5D hybrid | Good in both spaces | Partial | None | Moderate
Direct (photometric) | Smooth | Intrinsic only | Texture-dependent | Fast
Deep VS | Learned | None (in some variants) | Network-dependent | Moderate

6.2 Notable visual-servoing milestones

Year | Project | Contribution
1973 | Shirai-Inoue | First closed-loop vision robot
1979 | Hill-Park | Block-stacking robot at SRI; introduced the term “visual servoing”
1991 | Hashimoto et al. | First systematic IBVS framework
1992 | Espiau, Chaumette, Rives | “Task function” formulation
1996 | Hutchinson, Hager, Corke | “Tutorial on Visual Servo Control” (IEEE TRA)
1999 | Malis, Chaumette, Boudet | 2.5D hybrid VS
2001 | Corke & Hutchinson | Partitioned IBVS
2006/2007 | Chaumette, Hutchinson | Two-part Tutorial in IEEE RAM (modern reference)
2011 | AprilTag (Olson) | Robust fiducial standard
2018 | Bateux et al. | Deep VS with CNN features
2020 | Manuelli et al. | Keypoint-Affordance with dense descriptors
2022 | Yen-Chen, Florence et al. | NeRF-Supervision (dense descriptors learned from NeRFs)
2024 | NVIDIA FoundationPose | One-shot 6D pose for any object

6.3 ViSP example task code (pseudo)

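#include <visp3/core/vpColVector.h>
#include <visp3/visual_features/vpFeaturePoint.h>
#include <visp3/vs/vpServo.h>

// grab_image(), track_features(), send_to_robot(), the flag `converged` and the
// arrays u, v, Z, u_des, v_des, Z_des are application-specific placeholders.
vpServo servo;
servo.setServo(vpServo::EYE_IN_HAND_CAMERA);       // output is a camera-frame twist
servo.setInteractionMatrixType(vpServo::CURRENT);  // recompute L each step

// Define 4 point features (normalized coordinates and depth)
vpFeaturePoint p[4], pd[4];  // current, desired
for (int i = 0; i < 4; i++) {
    p[i].buildFrom(u[i], v[i], Z[i]);
    pd[i].buildFrom(u_des[i], v_des[i], Z_des[i]);
    servo.addFeature(p[i], pd[i]);
}

servo.setLambda(0.5);  // adaptive: servo.setLambda(vpAdaptiveGain(4.0, 0.4, 30.0));

while (!converged) {
    grab_image();
    track_features();      // update u[i], v[i] (and Z[i] if measured)
    for (int i = 0; i < 4; i++) p[i].buildFrom(u[i], v[i], Z[i]);
    vpColVector v_c = servo.computeControlLaw();  // named v_c to avoid shadowing v[]
    send_to_robot(v_c);    // 6D camera twist (vx, vy, vz, wx, wy, wz)
}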

6.4 Stability + convergence properties

Initial error magnitude | IBVS | PBVS | Hybrid
Small (< 10° rotation, < 10 cm trans) | Stable, fast | Stable, fast | Stable, fast
Moderate (30°, 30 cm) | Slow / oscillatory | Stable | Stable
Large (90°+) | Risk: feature exit FOV | Risk: 3D pose drift | Best (rotation in pose, trans in image)
Approaching singularity | L conditioning poor | Not affected | Mixed
Featureless surface | Fails (no features) | Fails (no PnP) | Fails; use direct VS

7. Failure modes & debugging

  • Features exit FOV during convergence — common IBVS failure. Detect: tracker loses point. Fix: more features (corners + center), partitioned control, or pre-plan a feasible approach.
  • Singular interaction matrix — features collinear or co-planar in degenerate ways. Detect: condition number of L spikes. Fix: re-position camera; add diverse features (line + point + circle).
  • Hand-eye calibration error — VS converges to wrong pose offset. Detect: residual after VS settles. Fix: recalibrate; verify checkerboard images; use Tsai-Lenz or dual-quaternion methods.
  • Lighting change mid-task — detection threshold misclassifies features. Fix: auto-exposure off, fixed gain, controlled lighting; or illumination-robust descriptors (ORB rather than raw Harris corner responses).
  • Z (depth) estimation drift — IBVS gain becomes wrong as Z error grows. Fix: use measured Z from depth sensor; or adapt Z from feature motion.
  • Latency-induced oscillation — vision is 30 Hz, robot at 1 kHz, controller lags. Fix: motion compensation between image samples; predict via IMU; reduce gain λ.
  • Outlier features — single mis-detected corner biases the pseudo-inverse. Fix: RANSAC on features, or weighted least squares with M-estimator (Huber, Tukey).
  • Chaumette retreat — camera moves backward to align features. Fix: switch to 2.5D or partitioned control; use distance threshold to gate IBVS.
  • Camera saturation / motion blur — fast motion produces blurred features. Fix: shorter exposure (with brighter lighting or higher sensor gain to compensate); trigger on hardware sync; or use event cameras (DAVIS, Prophesee).
  • Marker tag occlusion — ArUco partly hidden by gripper. Fix: multiple markers, redundancy; or switch to texture-based tracking.

7.1 Adaptive gain schemes

Standard fixed-λ control: v = −λ · L⁺ · e. Adaptive variants:

  • Step-by-step: λ = λ_max when ||e|| < threshold; λ = λ_min when far. Avoids saturation at large error.
  • Exponential: λ = λ_∞ + (λ_0 − λ_∞) · exp(−k ||e||). ViSP’s vpAdaptiveGain standard pattern.
  • State-dependent: λ derived from current camera-target distance (PBVS only).
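The exponential schedule is a one-line function. The sketch below assumes the three-parameter form (gain at zero error, gain at large error, slope at zero error) that matches the vpAdaptiveGain(4.0, 0.4, 30.0) call shown in §6.3, with k = slope / (λ_0 − λ_∞); treat that mapping as an assumption to check against the ViSP documentation.

```python
import math

def adaptive_gain(err_norm, lam_0=4.0, lam_inf=0.4, slope_0=30.0):
    """Exponential gain schedule: lam_0 at zero error, lam_inf for large error.
    slope_0 is the magnitude of the gain's slope at zero error."""
    k = slope_0 / (lam_0 - lam_inf)
    return lam_inf + (lam_0 - lam_inf) * math.exp(-k * err_norm)

for err in (0.0, 0.05, 0.2, 1.0):
    print(err, round(adaptive_gain(err), 3))
```

The gain is high near convergence, where the commanded velocity is small anyway, and low when the error is large, which limits the initial velocity demand.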

7.2 Singularity-robust pseudo-inverse

When L is near-singular, naive pseudo-inverse blows up. Use damped least-squares (DLS / Levenberg-Marquardt):

L⁺_dls = (L^T L + α² I)⁻¹ L^T

with α = damping factor (0.01–0.1 typical). Or compute via SVD with small-singular-value truncation. Standard in production VS.
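A minimal NumPy sketch of the damped pseudo-inverse, using a linear solve instead of an explicit inverse; the function name and the toy matrices are illustrative.

```python
import numpy as np

def dls_pinv(L, alpha=0.05):
    """Damped least-squares pseudo-inverse (L^T L + alpha^2 I)^-1 L^T."""
    n = L.shape[1]
    return np.linalg.solve(L.T @ L + alpha**2 * np.eye(n), L.T)

# Nearly singular toy matrix: the plain pseudo-inverse has an entry of 1e8.
L_bad = np.array([[1.0, 0.0], [0.0, 1e-8]])
print(np.abs(np.linalg.pinv(L_bad)).max())
print(np.abs(dls_pinv(L_bad, alpha=0.05)).max())
```

Each singular value σ is mapped to σ / (σ² + α²), which is bounded by 1 / (2α), so the commanded twist stays finite near a singularity at the price of a small tracking bias.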

8. Case studies

8.1 Universal Robots peg-in-hole (commercial)

A Universal Robots arm with a wrist camera and a force-control URCap (e.g., Robotiq’s Force Copilot) combines eye-in-hand VS with force feedback for peg insertion. Workflow:

  1. Vision-based approach to ±2 mm tolerance using AprilTag on socket.
  2. Force-controlled insertion with impedance control (see impedance-control) at low Z stiffness.
  3. Tolerance: ~0.1 mm final alignment for round pegs, ~0.05 mm for D-shaped.

UR uses IBVS with the wrist-mounted RealSense D405 (short-range, 7–50 cm). Standard production deployment across automotive sub-assembly.

8.2 Surgical da Vinci with vision feedback

Intuitive Surgical’s da Vinci system is a tele-operated surgical robot with a stereo endoscope. The surgeon controls primary motion through the master console; autonomous sub-tasks (knot tying, suture pickup) using VS with stereo-derived 3D features have been demonstrated in research settings rather than in clinical use. Separately, the “Smart Tissue Autonomous Robot” (STAR, Krieger and colleagues) demonstrated supervised autonomous suturing on porcine intestine with vision-based alignment. In both cases the targets are anatomical features and sutures, tracked frame-to-frame with bespoke detectors.

8.3 Hutchinson tutorial benchmark (Robotnik / INRIA ViSP)

ViSP (Visual Servoing Platform, INRIA) ships with reference examples that became the de facto teaching demos:

  • 4-point IBVS planar
  • Sphere IBVS (3D feature)
  • Cylinder IBVS (line feature)
  • Eye-in-hand PBVS on planar target
  • 2.5D hybrid

All run on a Franka Panda or UR5 with a wrist camera; reproducible in any university lab.

8.4 NVIDIA FoundationPose + VS for novel objects (2024)

FoundationPose: 6D pose estimator and tracker for novel objects, given either a CAD model or a small set of reference images. Combined with PBVS:

  1. Show robot a new object once (image or scan).
  2. FoundationPose tracks pose in subsequent frames at 30+ Hz.
  3. PBVS drives gripper to a learned grasp pose relative to the object.

Bypasses the per-object fiducial marker requirement; enables fast deployment in unstructured environments. Open-sourced by NVIDIA in 2024.

8.5 ABB visual servoing on car body — automotive welding

ABB IRB 7600 + camera + laser pointer on a body-in-white welding line. The robot pre-positions to the nominal weld point (open-loop from CAD), then visual servoing corrects for body-position variation up to ±15 mm. The vision module measures the seam edge and adjusts the welding TCP (tool-center-point) within 50 ms before each weld initiates. Standard practice across automotive OEMs since ~2005; the controller runs a hand-tuned IBVS or an industrial PBVS implementation.

8.6 Soft robotics + photometric servoing (Crombez 2017)

Eye-in-hand on a 6-DoF arm tracking a surface without extractable features. Direct photometric VS uses pixel intensity differences (no feature extraction). The robot aligns its camera parallel to the surface (~1 cm standoff) by minimizing sum-of-squared-differences (SSD) between current image and reference. Worked at 30 Hz on a 320×240 ROI. Useful for surface-inspection tasks where the surface has no distinctive features (painted metal, polished glass); the method still needs some intensity gradient in the image to produce a usable error signal.

8.7 Drone visual landing (DJI / Skydio)

Skydio X10D and X2 use eye-down camera + IBVS-style alignment to land on a 1.2 m × 1.2 m landing pad with an AprilTag-derived marker. Approach:

  1. Long-range descent guided by GPS to ~3 m above pad.
  2. Switch to vision-only at < 3 m AGL.
  3. Detect the pad marker.
  4. IBVS center the marker in the frame while descending.
  5. Touchdown when marker covers ≥ 60% of frame + IMU detects contact.

Used in autonomous return-to-home + docking-on-vehicle (military). DJI Matrice 350 RTK + RTK Beacon implements similar protocol.

8.8 Industrial assembly with hybrid VS + force

Comau / Stäubli / Fanuc industrial cells routinely use:

  • Open-loop traverse to nominal grasp pose (fast, ~50 mm tolerance).
  • Vision-based final approach for ±2 mm correction (camera mounted on Z-axis).
  • Force-feedback insertion for sub-mm alignment (impedance control).

Standard sequence for engine-block bolt-insertion: ~3 s open-loop traverse + 1.5 s VS + 0.5 s force-insertion. Achievable cycle time 5 s with > 99.9% success.

8.9 Pick-and-place with deep features (Mahler 2017 Dex-Net + VS)

UC Berkeley AUTOLAB’s Dex-Net family: trained deep grasp-quality network on synthetic depth images. Pipeline:

  1. RGBD camera fixed overhead.
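  2. Dex-Net inference: grasp-quality (success-probability) scores for candidate grasps sampled from the depth image; the original Dex-Net 2.0 ranks planar top-down parallel-jaw grasps.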
  3. Robot transitions to predicted grasp via open-loop motion plus optional VS final correction.

Not strictly “VS” but uses vision in the loop with learned features. Standard in academic warehouse research. Extends to ManiSkill / RAPID benchmarks.

Adjacent

Comparison: VS vs alternatives

Open-loop motion (CAD-based plan):
  + Simple, fast
  - Sensitive to calibration drift, fixture variation
  - Used as a coarse approach before VS

Visual servoing:
  + Tolerates kinematic uncertainty, calibration drift
  + Adapts to moving targets
  - Requires sustained visibility of features
  - Vision-loop latency limits bandwidth

Force-only feedback (impedance):
  + Robust to vision failure (lighting, occlusion)
  - Cannot disambiguate before contact

Hybrid (VS + force):
  + Vision for pre-contact alignment
  + Force for post-contact insertion
  - Standard combination in production

Learning-based (end-to-end neural):
  + Generalizes to novel objects (foundation models)
  + Handles texture-less surfaces
  - Less interpretable; difficult to verify
  - Standard for general manipulation 2024+

Citations

  • Hashimoto K., Kimoto T., Ebine T., Kimura H. (1991) “Manipulator Control with Image-Based Visual Servo.” ICRA.
  • Espiau B., Chaumette F., Rives P. (1992) “A New Approach to Visual Servoing in Robotics.” IEEE Trans Robotics & Automation 8(3).
  • Chaumette F. (1998) “Potential Problems of Stability and Convergence in Image-Based and Position-Based Visual Servoing.” in The Confluence of Vision and Control, LNCIS 237.
  • Hutchinson S., Hager G., Corke P. (1996) “A Tutorial on Visual Servo Control.” IEEE Trans Robotics & Automation 12(5).
  • Chaumette F., Hutchinson S. (2006) “Visual Servo Control I: Basic Approaches.” IEEE Robotics & Automation Magazine 13(4).
  • Chaumette F., Hutchinson S. (2007) “Visual Servo Control II: Advanced Approaches.” IEEE Robotics & Automation Magazine 14(1).
  • Malis E., Chaumette F., Boudet S. (1999) “2½D Visual Servoing.” IEEE Trans Robotics & Automation 15(2).
  • Corke P., Hutchinson S. (2001) “A New Partitioned Approach to Image-Based Visual Servo Control.” IEEE Trans Robotics & Automation 17(4).
  • Tsai R.Y., Lenz R.K. (1989) “A New Technique for Fully Autonomous and Efficient 3D Robotics Hand/Eye Calibration.” IEEE Trans Robotics & Automation 5(3).
  • Daniilidis K. (1999) “Hand-Eye Calibration Using Dual Quaternions.” Int J Robotics Research 18(3).
  • Park F.C., Martin B.J. (1994) “Robot Sensor Calibration: Solving AX=XB on the Euclidean Group.” IEEE Trans Robotics & Automation 10(5).
  • Papanikolopoulos N.P., Khosla P.K. (1993) “Adaptive Robotic Visual Tracking: Theory and Experiments.” IEEE Trans Automatic Control 38(3).
  • Bakthavatchalam M., Chaumette F., Marchand E. (2015) “Photometric Moments: New Promising Candidates for Visual Servoing.” ICRA.
  • Crombez N., Mouaddib E.M., Caron G., Chaumette F. (2017) “Direct Visual Servoing Using Spline-Based Photometric Functions.” ICRA.
  • Bateux Q., Marchand E., Leitner J., Chaumette F., Corke P. (2018) “Training Deep Neural Networks for Visual Servoing.” ICRA.
  • Manuelli L., Li Y., Florence P., Tedrake R. (2020) “Keypoints into the Future: Self-Supervised Correspondence in Model-Based Reinforcement Learning.” CoRL.
  • Lepetit V., Moreno-Noguer F., Fua P. (2009) “EPnP: An Accurate O(n) Solution to the PnP Problem.” IJCV 81.
  • Olson E. (2011) “AprilTag: A Robust and Flexible Visual Fiducial System.” ICRA.
  • ViSP project (INRIA, Marchand and Chaumette) visp.inria.fr.
  • Wen B., Yang W., Kautz J., Birchfield S. (2024) “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects.” CVPR.