Visual Servoing — IBVS, PBVS, Hybrid Schemes, Direct Visual Servoing

Closing the control loop directly through vision: Image-Based (IBVS) using pixel features, Position-Based (PBVS) using estimated 3D pose, hybrid 2.5D schemes, and the modern variants — direct photometric servoing, deep visual servoing with neural feature embeddings, and uncalibrated visual servoing. The classical paper-trail (Chaumette, Hutchinson, Espiau, Hashimoto) plus the contemporary deep-learning grafts that handle texture-less objects and large displacements. Foundational for any robotic task where the final pose must be defined relative to what the camera sees, not what a CAD model predicts.

1. At a glance

Visual servoing (VS) is robot control with vision in the feedback loop. The error signal is built from image measurements — pixel features, region descriptors, or 3D pose estimates — and the controller computes joint or Cartesian commands to drive that error to zero. Unlike “look-then-move” architectures (compute target pose once, then execute open-loop), VS continuously corrects through image feedback, so it tolerates calibration error, kinematic uncertainty, and moving targets that pure planning cannot.

Three canonical schemes dominate the literature and practice:

Image-Based Visual Servoing (IBVS) — error defined in 2D pixel space; needs the interaction matrix (image Jacobian) relating pixel velocities to camera twist. Classical IBVS (Hashimoto 1991, Espiau 1992, Chaumette 1992, Chaumette & Hutchinson 2006/2007 tutorial) is the standard taught in textbooks.
Position-Based Visual Servoing (PBVS) — features are reconstructed into 3D pose via known geometry (or learned estimator); error is defined in Cartesian SE(3). Requires accurate camera calibration + object model.
Hybrid / 2.5D VS (Malis 1999) — decouple translation and rotation; do rotation control in pose space and translation in image space. Best behavior at long distances + accuracy at close.

More recent additions (mid-2010s onward):

Direct (photometric) visual servoing: error is pixel intensity differences, no features extracted. Works on objects that lack distinctive extractable features, provided the image still has usable intensity gradients (Bakthavatchalam 2015, Crombez 2017).
Deep visual servoing — neural feature embeddings as the visual error, trained on either synthetic or self-supervised data (Bateux 2018, Saxena 2017, Yu 2020).
Differentiable rendering VS — gradients through a renderer enable pose-from-image without an explicit pose-estimator (Manuelli 2020, NeRF-supervision 2022).

Where it sits. CV provides the perception input; kinematics provides the robot Jacobian J_r mapping joint velocities to end-effector twist; the interaction matrix L maps camera twist to image-feature velocity. Ignoring the fixed camera-to-end-effector twist transform (§2.8), the composite control law is q̇ = −λ · J_r⁺ · L⁺ · (s − s*), with gain λ > 0 and ⁺ denoting the pseudo-inverse. Manipulator, gripper, bin picking, and impedance control all build on VS for the final-approach phase.

First ask.

Eye-in-hand or eye-to-hand? Eye-in-hand: camera moves with the end-effector, the usual choice for precision tasks. Eye-to-hand: fixed camera giving a multi-object overview; less precise but stable.
2D, 2.5D, or 3D feature space? 2D (IBVS): simplest, no model required, but trajectories in 3D can be poor. 3D (PBVS): clean 3D trajectory but needs calibration + model. 2.5D: best of both.
Visible features or featureless? Feature-based VS for textured objects with markers/SIFT/ORB; direct VS for surfaces without distinctive features.
How much disparity between initial and final pose? Small displacements: IBVS converges reliably. Large displacements: PBVS or path-planned approach to an IBVS-feasible zone.

2. First principles

2.1 The interaction matrix (image Jacobian) for a point

For an image feature s = (u, v) corresponding to a 3D point at Z relative to the camera, the velocity of the image point is related to camera-twist v_c = (v_x, v_y, v_z, ω_x, ω_y, ω_z) by:

[u̇]   [ -1/Z    0    u/Z    uv     -(1+u²)    v ]
[v̇] = [   0  -1/Z    v/Z   1+v²     -uv      -u ] · v_c

This 2×6 matrix is L_p (or L_s), the interaction matrix for a single point feature. (Coordinates u, v are normalized image coordinates; multiply by focal length for raw pixels.) For n point features, stack to get L of size 2n×6.

Inverse via pseudo-inverse:

v_c = −λ · L⁺ · (s − s*)

where λ > 0 is the gain, s* is the desired feature configuration, and L⁺ = (L^T L)⁻¹ L^T when L has full column rank (which requires n ≥ 3 features).
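The stacking and pseudo-inverse above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `point_interaction_matrix` and `ibvs_twist` and the test configuration (a square seen at half its desired size, all depths set to 1 m) are assumptions made for the example.

```python
import numpy as np

def point_interaction_matrix(x, y, Z):
    """2x6 interaction matrix L_p of one point (normalized coords x, y; depth Z)."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

def ibvs_twist(s, s_star, depths, lam=0.5):
    """Camera twist v_c = -lam * pinv(L) @ (s - s*) for n stacked point features."""
    s = np.asarray(s, dtype=float)            # shape (n, 2): current features
    s_star = np.asarray(s_star, dtype=float)  # shape (n, 2): desired features
    L = np.vstack([point_interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(s, depths)])   # shape (2n, 6)
    error = (s - s_star).reshape(-1)
    return -lam * np.linalg.pinv(L) @ error

# Hypothetical check: the four corners appear at half their desired size,
# i.e. the target is too far away, so the camera should advance along +z.
s_star = np.array([[-0.1, -0.1], [0.1, -0.1], [0.1, 0.1], [-0.1, 0.1]])
s = 0.5 * s_star
v_c = ibvs_twist(s, s_star, depths=[1.0, 1.0, 1.0, 1.0])
print(v_c)
```

In this symmetric case the only significant component of `v_c` is a forward velocity along the optical axis, which matches the intuition that a too-small image of the target means the camera must move closer.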

2.2 Three-points minimum

Six DoF needs 6 independent equations → at least 3 point features (giving 6 equations from 6 pixel components). In practice 4+ points are used because:

Point features can be lost (occlusion, leave FOV).
Redundant features improve noise rejection.
Three points alone are fragile: the same three image points are consistent with up to four distinct camera poses, and L becomes singular for some configurations, so a fourth point is needed for a unique, well-conditioned solution.

2.3 Stability and the Chaumette conundrum

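IBVS with point features has two known problems. Local minima: with more than three points, L has more rows than columns, so there are configurations where (s − s*) ≠ 0 lies in the null space of L⁺ and the commanded twist is zero away from the goal. Camera retreat (the “Chaumette conundrum”): when the required motion is a large rotation about the optical axis, IBVS moves the image points along straight lines, which forces the camera to back away along the optical axis. The textbook example is four features at the corners of a square with the camera facing the plane; as the required rotation approaches 180° the retreat grows without bound, and features can also leave the FOV along the way.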

Mitigations:

Use more features (5+).
Switch to PBVS for large initial errors.
Use partitioned control (Corke & Hutchinson 2001) — pull rotation out of IBVS and treat with separate law.
Use 2.5D hybrid (Malis 1999).

2.4 PBVS pose error

Given measured pose (R, t) of the object w.r.t. camera and desired (R*, t*):

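e_t = t − R · R*^T · t* (translation error: origin of the desired camera frame, expressed in the current camera frame)
θu = log(R · R*^T) (rotation error as an angle-axis vector: rotation from the current to the desired camera frame)

Camera twist command:

v_c = λ · [e_t; θu]

The gain enters with a positive sign because, with these definitions, e_t and θu already point from the current camera frame toward the desired one (the object's coordinates in the camera frame move opposite to the camera). Defining the error the other way round, as [R · R*^T · t* − t; log(R* · R^T)], gives the familiar v_c = −λ · e form. The rotation error uses the SO(3) → so(3) logarithm, so no small-angle approximation is involved. Trajectories are geodesics in pose space (straight lines + great-arc rotations), which look natural but may drive features out of FOV.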
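A short simulation helps to check the sign convention, which is easy to get wrong. The sketch below assumes that (R, t) is the object pose in the camera frame, uses e_t = t − R · R*^T · t* and θu = log(R · R*^T) as defined above, and applies a positive gain because both quantities point from the current camera frame toward the desired one. The helper names and the initial pose are illustrative assumptions; verify the sign against your own frame conventions before using it on hardware.

```python
import numpy as np

def so3_exp(w):
    """Rodrigues formula: rotation matrix for a rotation vector w."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def so3_log(R):
    """Rotation vector theta*u of R (valid for angles below pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-9:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * axis

def pbvs_twist(R, t, R_star, t_star, lam=1.0):
    """Camera-frame twist that moves the camera toward the desired pose."""
    R_rel = R @ R_star.T             # desired camera frame seen from the current one
    e_t = t - R_rel @ t_star         # origin of the desired camera frame
    theta_u = so3_log(R_rel)
    return lam * np.concatenate([e_t, theta_u])

# Simulated camera pose in the object frame (hypothetical starting point).
R_wc = so3_exp(np.array([0.2, -0.3, 0.5]))
p_wc = np.array([0.2, -0.1, -1.0])
R_star, t_star = np.eye(3), np.array([0.0, 0.0, 0.5])
dt = 1.0 / 30.0
for _ in range(400):
    R, t = R_wc.T, -R_wc.T @ p_wc            # object pose in the camera frame
    twist = pbvs_twist(R, t, R_star, t_star)
    p_wc = p_wc + (R_wc @ twist[:3]) * dt    # integrate translation
    R_wc = R_wc @ so3_exp(twist[3:] * dt)    # integrate rotation (body frame)
R_final, t_final = R_wc.T, -R_wc.T @ p_wc
```

With the opposite sign the simulated camera moves away from the target instead of toward it, which is a quick way to detect a convention mismatch.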

2.5 2.5D hybrid (Malis 1999)

Decompose:

Use IBVS for translation (the two image coordinates of a reference point, extended by a depth ratio such as log(Z/Z*)).
Use PBVS for rotation (axis-angle from estimated R).
Combine in a 6×6 interaction matrix that is block upper-triangular, so the rotation loop is decoupled from translation.

Pros: the reference point stays in the FOV because it is controlled directly in the image; avoids Chaumette retreat because rotation is handled in pose space; needs partial calibration only.

2.6 Direct (photometric) VS

Error is pixel intensity:

e = I(p, q) − I*(p)

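stacked over all pixels p of a region, with I(p, q) the intensity at pixel p for the current robot configuration q and I* the reference image; the cost is the sum of squared differences. The per-pixel interaction matrix becomes: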

L_I = −∇I · L_p

with ∇I the image gradient at pixel p. Sensitive to lighting + viewpoint change but does not need feature extraction. Used in surface inspection, where features are sparse but textures are reliable.
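To make L_I = −∇I · L_p concrete, the sketch below builds one row per pixel and checks the prediction against a synthetic image shift. Several things here are assumptions of the example rather than part of any particular method: a constant depth Z for all pixels, gradients from `np.gradient`, a made-up smooth test pattern, and the intrinsics (fx = fy = 100, principal point at pixel 16).

```python
import numpy as np

def photometric_interaction_matrix(image, Z, fx, fy, cx, cy):
    """One row L_I = -(grad I) . L_p per pixel, assuming constant depth Z."""
    h, w = image.shape
    dI_dv, dI_du = np.gradient(image.astype(float))  # gradients along rows, cols
    v_pix, u_pix = np.mgrid[0:h, 0:w]
    x = ((u_pix - cx) / fx).ravel()                  # normalized coordinates
    y = ((v_pix - cy) / fy).ravel()
    Ix = (dI_du * fx).ravel()                        # gradient w.r.t. normalized x
    Iy = (dI_dv * fy).ravel()                        # gradient w.r.t. normalized y
    one, zero = np.ones_like(x), np.zeros_like(x)
    Lx = np.stack([-one / Z, zero, x / Z, x * y, -(1 + x * x), y], axis=1)
    Ly = np.stack([zero, -one / Z, y / Z, 1 + y * y, -x * y, -x], axis=1)
    return -(Ix[:, None] * Lx + Iy[:, None] * Ly)    # shape (h*w, 6)

def pattern(x, y):
    """Hypothetical smooth texture used only for this check."""
    return np.sin(8 * x) + np.cos(6 * y)

fx = fy = 100.0
cx = cy = 16.0
Z = 1.0
v_pix, u_pix = np.mgrid[0:32, 0:32]
x_n, y_n = (u_pix - cx) / fx, (v_pix - cy) / fy
I_ref = pattern(x_n, y_n)
delta = 1e-3                                  # camera moves +delta along its x axis
I_cur = pattern(x_n + delta / Z, y_n)         # scene content shifts by -delta/Z
L_I = photometric_interaction_matrix(I_ref, Z, fx, fy, cx, cy)
predicted_change = L_I @ np.array([delta, 0, 0, 0, 0, 0])
actual_change = (I_cur - I_ref).ravel()
```

The servo step then has the same form as in the feature-based case, v_c = −λ · L_I⁺ · (I − I*), with the intensity difference flattened in the same pixel order as the rows of L_I.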

2.7 Deep visual servoing

Replace the hand-crafted feature s with a learned embedding f_θ(I) ∈ ℝ^d, where the interaction matrix is computed via either:

Numeric Jacobian: perturb camera by small ε in 6 directions; compute (f(I_perturbed) − f(I)) / ε. Slow.
Backpropagation through the encoder: ∂f / ∂I × ∂I / ∂v_c with the second factor learned or computed analytically.
Self-supervised contrastive training (Bateux 2018, Yu 2020) so the embedding space is approximately linear in camera twist.

Advantages: robust to lighting / texture / partial occlusion. Disadvantages: fewer geometric guarantees than classical schemes.
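The numeric-Jacobian option can be illustrated on a case where the answer is known. In the sketch below the "feature function" is plain point projection rather than a neural embedding, so the finite-difference result can be compared with the analytic point interaction matrix of §2.1; with a learned f_θ the same loop would call the encoder on re-rendered or re-captured images. All function names are hypothetical.

```python
import numpy as np

def rodrigues(w):
    """Rotation matrix for a rotation vector w."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def project(points_cam):
    """Stand-in feature function: normalized image coordinates, flattened."""
    return (points_cam[:, :2] / points_cam[:, 2:3]).reshape(-1)

def numeric_interaction_matrix(feature_fn, points_cam, eps=1e-6):
    """Finite-difference Jacobian of feature_fn w.r.t. a camera twist."""
    f0 = feature_fn(points_cam)
    J = np.zeros((f0.size, 6))
    for i in range(6):
        twist = np.zeros(6)
        twist[i] = eps
        R = rodrigues(twist[3:])
        moved = (points_cam - twist[:3]) @ R   # each row becomes R^T (P - t)
        J[:, i] = (feature_fn(moved) - f0) / eps
    return J

def analytic_point_matrix(points_cam):
    """Stacked 2x6 point interaction matrices for comparison."""
    rows = []
    for X, Y, Z in points_cam:
        x, y = X / Z, Y / Z
        rows.append([-1 / Z, 0, x / Z, x * y, -(1 + x * x), y])
        rows.append([0, -1 / Z, y / Z, 1 + y * y, -x * y, -x])
    return np.array(rows)

points = np.array([[0.1, -0.05, 0.8], [-0.2, 0.1, 1.2], [0.05, 0.15, 1.0]])
L_num = numeric_interaction_matrix(project, points)
L_ana = analytic_point_matrix(points)
max_abs_diff = float(np.max(np.abs(L_num - L_ana)))
```

Six perturbations per control step is why the text calls this approach slow: each one needs a fresh image (or render) and a forward pass.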

2.8 Robot-Jacobian coupling

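Total chain: joint velocities → end-effector twist (via robot Jacobian J_r) → camera twist (via the fixed eye-in-hand twist transform) → image feature velocity (via interaction matrix L).

q̇ = J_r⁺ · ^eV_c · L⁺ · (−λ(s − s*))

Here ^eV_c = [R, [t]× R; 0, R] is the 6×6 twist (adjoint) matrix built from the extrinsic ^eT_c = (R, t); a 4×4 homogeneous transform cannot be applied to a twist directly. J_r is expressed in the end-effector frame, and J_r⁺ reduces to J_r⁻¹ for a non-singular 6-DoF arm.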

When eye-in-hand: ^eT_c is the fixed extrinsic. When eye-to-hand: relate camera frame to robot base frame via ^bT_c, calibrate offline. The “hand-eye calibration” problem solves AX=XB to find this extrinsic (Tsai-Lenz 1989, Daniilidis 1999, Park-Martin 1994).
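The camera-to-end-effector step is where the lever arm between the two frames enters. The sketch below builds the 6×6 twist transform from an extrinsic (R, t) and maps a camera twist to joint velocities; the identity Jacobian and the 10 cm offset are placeholders chosen for the example, not values from any real robot.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x."""
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

def twist_transform(R, t):
    """6x6 matrix mapping a twist (v, w) in frame c to frame e, given eT_c = (R, t)."""
    V = np.zeros((6, 6))
    V[:3, :3] = R
    V[:3, 3:] = skew(t) @ R   # lever-arm coupling from angular to linear velocity
    V[3:, 3:] = R
    return V

def joint_velocity(J_e, R_ec, t_ec, v_c):
    """q_dot = pinv(J_e) @ eV_c @ v_c, with J_e expressed in the end-effector frame."""
    v_e = twist_transform(R_ec, t_ec) @ v_c
    return np.linalg.pinv(J_e) @ v_e, v_e

# Hypothetical setup: camera 10 cm along the end-effector x axis, same orientation.
R_ec = np.eye(3)
t_ec = np.array([0.1, 0.0, 0.0])
v_c = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])   # pure spin about the camera z axis
J_e = np.eye(6)                                   # placeholder Jacobian
q_dot, v_e = joint_velocity(J_e, R_ec, t_ec, v_c)
```

Even though the camera only spins about its own axis, the end-effector origin has to translate (here along −y) because it sits 10 cm away from the camera center; dropping the `skew(t) @ R` block would lose that term.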

3. Practical math — sized examples + control law

3.1 Worked example: 4-point IBVS to align a planar target

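Setup: eye-in-hand camera (640×480, fx=fy=500, principal point assumed at the image center, cx=320, cy=240), 4 corners of a 100×100 mm planar target visible. Target at Z ≈ 0.5 m from camera. Desired image: corners at (270,190), (370,190), (370,290), (270,290), i.e., target perfectly centered + axis-aligned (a 100 mm square at Z* = 0.5 m projects to 500 · 0.1 / 0.5 = 100 px per side).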

Steps:

Detect 4 corners (Harris / Shi-Tomasi / ArUco). Get current pixel positions (u_i, v_i).
Normalize: u_n = (u − cx) / fx, v_n = (v − cy) / fy.
Build L (8×6) from §2.1 using each corner’s (u_n, v_n) and estimated Z (use depth sensor or known target size).
Compute error e = (s − s*) ∈ ℝ⁸.
v_c = −λ · L⁺ · e, with λ = 0.5.
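Transform to end-effector twist: v_e = ^eV_c · v_c, using the 6×6 twist matrix of §2.8 built from the hand-eye extrinsic. A rotation-only mapping ^eR_c is valid only when the camera origin coincides with the end-effector frame origin; a rigid mount still has a lever arm.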
Joint velocities: q̇ = J_r⁻¹ · v_e.
Send q̇ to controller; iterate at 30–100 Hz.

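Expected behavior: the feature error decays roughly as exp(−λt), so with λ = 0.5 s⁻¹ the time constant is 2 s and the corners reach about 1% of their initial error after ≈ 9 s (≈ 280 iterations at 30 Hz); raise λ for faster settling. Trajectory of camera in 3D is a curve, not straight line: image-space straight, 3D-space curved (well-known IBVS property).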
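The steps above can be exercised in a closed-loop simulation. The sketch below assumes a principal point at (320, 240), a desired view with the 100 mm square fronto-parallel at 0.5 m (a 100 px square centered in the image), perfect depth knowledge, and a made-up initial offset; the robot is abstracted away by moving the target points directly in the camera frame.

```python
import numpy as np

FX = FY = 500.0
CX, CY = 320.0, 240.0          # assumed principal point

def rodrigues(w):
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def to_pixels(points_cam):
    x = points_cam[:, 0] / points_cam[:, 2]
    y = points_cam[:, 1] / points_cam[:, 2]
    return np.stack([FX * x + CX, FY * y + CY], axis=1)

def interaction_matrix(points_cam):
    rows = []
    for X, Y, Z in points_cam:
        x, y = X / Z, Y / Z
        rows.append([-1 / Z, 0, x / Z, x * y, -(1 + x * x), y])
        rows.append([0, -1 / Z, y / Z, 1 + y * y, -x * y, -x])
    return np.array(rows)      # shape (8, 6) for four corners

h = 0.05                       # half side of the 100 mm target
target = np.array([[-h, -h, 0], [h, -h, 0], [h, h, 0], [-h, h, 0]])
desired_px = to_pixels(target + [0.0, 0.0, 0.5])
s_star = (desired_px - [CX, CY]) / [FX, FY]

# Hypothetical initial view: target shifted, slightly rotated, 0.6 m away.
points = target @ rodrigues(np.array([0.05, -0.05, 0.1])).T + [0.05, -0.03, 0.6]

lam, dt = 0.5, 1.0 / 30.0
errors_px = []
for _ in range(400):
    px = to_pixels(points)
    errors_px.append(float(np.max(np.abs(px - desired_px))))
    s = (px - [CX, CY]) / [FX, FY]                       # normalize
    L = interaction_matrix(points)
    v_c = -lam * np.linalg.pinv(L) @ (s - s_star).reshape(-1)
    # Camera moves by v_c*dt, so points move the opposite way in its frame.
    points = (points - v_c[:3] * dt) @ rodrigues(v_c[3:] * dt)
final_error_px = float(np.max(np.abs(to_pixels(points) - desired_px)))
```

Plotting `errors_px` against time shows the near-exponential decay with time constant 1/λ; on a real system the depth would come from a sensor or an estimate, and v_c would pass through the twist transform and robot Jacobian before reaching the joints.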

3.2 PBVS variant

Same setup, but:

Estimate 3D pose of target via PnP (Lepetit’s EPnP, OpenCV solvePnP) using 4 correspondences.
e_t = t_current − R_current · R_desired^T · t_desired.
θu = log(R_current · R_desired^T).
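v_c = λ · [e_t; θu] (positive gain: with these definitions e_t and θu point from the current camera frame toward the desired one).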
Same transformations through robot Jacobian.

Trajectory of camera is straight line in 3D — corners take curved paths in image, may leave FOV near boundaries.

3.3 Adaptive Z (depth) estimation

The interaction matrix depends on Z. Three strategies:

Constant Z (Z = Z*): use the desired-pose depth. Works for small displacements; fails for large.
Online estimation: depth sensor (RealSense, ToF) gives Z directly per feature.
Z estimated from feature motion (Papanikolopoulos 1993): with measured image velocity ṡ and commanded camera velocity v_c, solve for Z from L_p(Z) · v_c = ṡ. The equation is linear in 1/Z and needs a non-zero translational velocity to be observable.
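The third strategy reduces to a one-unknown least-squares problem, because only the translational columns of L_p depend on Z. The sketch below is one straightforward way to set it up; the function name, the guard thresholds, and the test values are assumptions of the example.

```python
import numpy as np

def estimate_depth(x, y, s_dot, v_c):
    """Solve L_p(Z) @ v_c = s_dot for Z; returns None if depth is unobservable."""
    v, w = np.asarray(v_c[:3]), np.asarray(v_c[3:])
    A = np.array([[-1.0, 0.0, x], [0.0, -1.0, y]])          # multiplies v / Z
    B = np.array([[x * y, -(1 + x * x), y],
                  [1 + y * y, -x * y, -x]])                  # rotational part
    a = A @ v                     # translational flow, to be scaled by 1/Z
    b = np.asarray(s_dot) - B @ w # measured flow minus rotational flow
    denom = a @ a
    if denom < 1e-12:             # no translation: depth cannot be recovered
        return None
    inv_Z = (a @ b) / denom       # least-squares estimate of 1/Z
    return 1.0 / inv_Z if inv_Z > 1e-9 else None

# Synthetic check with a known depth (values are hypothetical).
x, y, Z_true = 0.1, -0.2, 0.8
v_c = np.array([0.05, -0.02, 0.1, 0.01, -0.03, 0.02])
A = np.array([[-1.0, 0.0, x], [0.0, -1.0, y]])
B = np.array([[x * y, -(1 + x * x), y], [1 + y * y, -x * y, -x]])
s_dot = A @ v_c[:3] / Z_true + B @ v_c[3:]
Z_est = estimate_depth(x, y, s_dot, v_c)
```

With real measurements ṡ is noisy, so the estimate is usually low-pass filtered or accumulated over several frames before it is fed back into L.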

3.4 Hand-eye calibration (Tsai-Lenz)

Given multiple (^bT_e, ^cT_obj) pairs, solve:

AX = XB

where A = ^bT_{e,i}^{-1} · ^bT_{e,j}, B = ^cT_{obj,i} · ^cT_{obj,j}^{-1}, X = ^eT_c.

Tsai-Lenz two-step solution:

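Solve rotation: R_x from R_a · R_x = R_x · R_b, using the rotation axes of the motion pairs (the axis of R_a equals R_x times the axis of R_b); at least two motions with non-parallel axes are required.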
Solve translation: t_x from (R_a − I) · t_x = R_x · t_b − t_a.

Modern alternative: dual-quaternion formulation (Daniilidis 1999), solve as a single linear system. OpenCV provides calibrateHandEye() with multiple methods.
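The two-step structure can be sketched with NumPy on synthetic, noise-free data. Note the swap: for the rotation step this example aligns rotation vectors with an SVD (closer in spirit to Park and Martin than to Tsai-Lenz's original parameterization), because it is short and easy to verify; the translation step is the linear system given above. All poses are randomly generated assumptions.

```python
import numpy as np

def so3_exp(w):
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def so3_log(R):
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-9:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * axis

def make_T(R, t):
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def solve_ax_xb(A_list, B_list):
    """Two-step AX = XB: rotation by axis alignment, then linear translation."""
    alphas = np.array([so3_log(A[:3, :3]) for A in A_list])
    betas = np.array([so3_log(B[:3, :3]) for B in B_list])
    U, _, Vt = np.linalg.svd(betas.T @ alphas)       # find R_x with alpha = R_x beta
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R_x = Vt.T @ D @ U.T
    M = np.vstack([A[:3, :3] - np.eye(3) for A in A_list])
    rhs = np.concatenate([R_x @ B[:3, 3] - A[:3, 3]
                          for A, B in zip(A_list, B_list)])
    t_x = np.linalg.lstsq(M, rhs, rcond=None)[0]     # (R_a - I) t_x = R_x t_b - t_a
    return make_T(R_x, t_x)

# Synthetic data: fixed target in the base frame, six hypothetical robot poses.
rng = np.random.default_rng(0)
X_true = make_T(so3_exp(np.array([0.1, -0.2, 0.3])), np.array([0.03, -0.01, 0.08]))
T_b_obj = make_T(so3_exp(np.array([0.0, 0.1, -0.4])), np.array([0.6, 0.1, 0.0]))
T_b_e = [make_T(so3_exp(rng.uniform(-0.6, 0.6, 3)), rng.uniform(-0.3, 0.3, 3))
         for _ in range(6)]
T_c_obj = [np.linalg.inv(X_true) @ np.linalg.inv(T) @ T_b_obj for T in T_b_e]
A_list = [np.linalg.inv(T_b_e[i]) @ T_b_e[i + 1] for i in range(5)]
B_list = [T_c_obj[i] @ np.linalg.inv(T_c_obj[i + 1]) for i in range(5)]
X_est = solve_ax_xb(A_list, B_list)
```

On real data the measured poses are noisy, so more motion pairs with large, varied rotations help, and a library routine such as OpenCV's calibrateHandEye() is the more practical choice.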

3.5 Latency budget

Vision pipeline at 30 Hz: capture (~5 ms exposure + 10 ms readout) + detection (5–30 ms) + tracking (1–5 ms) adds up to roughly 20–50 ms; budget ~40–60 ms per frame once image transfer and scheduling overhead are included. Joint command at 1 kHz. The VS loop is rate-limited by vision. For high-bandwidth tasks (peg-in-hole final approach), use predictive control + IMU motion compensation.

4. Design heuristics

Eye-in-hand for precision; eye-to-hand for context. Eye-in-hand gives sub-pixel feedback at close range. Eye-to-hand sees the whole workspace but is poor for small alignment.
Mark your target. ArUco / AprilTag fiducials remove detection ambiguity. Used in production robotics + research. Camera calibration via charuco boards is standard practice.
Use enough features. 4 minimum, 6–10 typical, 20+ for robust real-time tracking. Spread them across the image to maximize observability: features clustered in one quadrant give a poorly conditioned L.
Mix feature types. Points + lines + circles. Different features have different sensitivities to motion directions; combining stabilizes.
Normalize image coordinates early. The interaction matrix of §2.1 is written for normalized coordinates, so feeding it raw pixel values silently mixes in the focal length and principal point. Always convert pixels to normalized image coordinates before they enter the controller.
Saturate the gain. λ too high → instability + servo overshoot. λ too low → slow convergence. Typical λ = 0.3–1.0 for 30 Hz vision; tune by step response.
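Use velocity feedforward. If the target is moving (tracked over time), compensate the feature motion it causes: v_c = −λ · L⁺ · (s − s*) − L⁺ · ∂ŝ/∂t, where ∂ŝ/∂t is the estimated feature velocity due to the target's own motion. When instead the reference s*(t) varies in time, the feedforward term is +L⁺ · ṡ*.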
Switch schemes by distance. PBVS or planned approach when far; IBVS when close. Cuts trajectory length while preserving final-pose accuracy.
Plan path through feature space, not configuration space. For VS at the final phase, the feature path determines whether features stay in FOV. Path planning directly in image (feature) space is a niche but useful technique.
Filter your features. Detected pixel positions are noisy. Low-pass filter or Kalman filter the features before computing error.

5. Components & sourcing — vision + feature libraries

5.1 Camera + sensor sourcing

Vendor / model	Resolution	Frame rate	Notes
FLIR Blackfly S USB3 (BFS-U3-13Y3C)	1280×1024	170 fps	Industrial workhorse
Basler ace 2 USB3 / GigE (a2A2590-60ucBAS)	2592×2048	60 fps	Wide industrial use
Intel RealSense D435i / D455	1280×720 RGB + depth	90 fps	Built-in IMU
Intel RealSense D405	1280×720 stereo	90 fps	Short-range (7–50 cm) for in-hand
Stereolabs Zed 2i / Zed X	2K stereo	60 fps	Outdoor + neural depth
Luxonis OAK-D / OAK-D-Pro	1080p RGB + stereo depth	30 fps	Onboard Movidius VPU
Ximea xiQ MQ013MG	1280×1024	60 fps	Subminiature for in-hand
Sony Spresense + IMX458	1080p	60 fps	DIY/research

5.2 Fiducial / marker systems

Marker	Notes
ArUco (OpenCV `aruco` module)	4×4 to 7×7 ID grids; classic
AprilTag (Olson 2011, AprilRobotics)	Higher detection range; v3 + Tag36h11 standard
ChArUco board	Chessboard + ArUco fused; standard camera-calibration target
Reflective spherical markers (Vicon, OptiTrack)	Motion-capture grade; sub-mm accuracy
STag (Benligiray 2019)	More robust to perspective + scale

5.3 Feature-extraction libraries

Library	Detectors	License
OpenCV (C++ / Python)	ORB, SIFT (free since 2020), AKAZE, BRISK, Harris, GFTT, SURF (non-free)	Apache 2 (some Patents)
OpenGV (Kneip)	Geometric verification + PnP	BSD
SuperPoint (Magic Leap 2018)	Deep keypoints	Magic Leap non-commercial research license
LightGlue (Lindenberger 2023)	Faster successor to SuperGlue	Apache 2
DINOv2 (Meta 2023)	Self-supervised dense features	Apache 2
LoFTR (Sun 2021)	Detector-free matching	Apache 2
FoundationPose (NVIDIA 2024)	One-shot 6D pose estimation	NVIDIA / open
ViSP (INRIA, since 1996)	Reference visual-servoing toolbox	GPL
ARToolKit / ARToolKitX	Classical AR tracking	LGPL

5.4 Robot integration

Stack	VS support
ROS 2 + `vision_visp`	Bridge ROS images ↔ ViSP servo
MoveIt 2 + visual servoing plugins	Limited; usually direct interfacing
Franka FCI + libfranka	Direct Cartesian velocity command interface
Universal Robots RTDE	125 Hz (CB3) / 500 Hz (e-Series) Cartesian / joint velocity command
KUKA KSS / Sunrise FRI	250 Hz / 1 kHz interfaces
ABB EGM (Externally Guided Motion)	250 Hz

5.5 Camera calibration toolchains

- OpenCV: cv2.calibrateCamera (Zhang's method, charuco/checkerboard)
- ROS camera_calibration (built on OpenCV)
- Kalibr (Furgale, ETH) — supports camera-IMU, multi-camera, multi-modal
- COLMAP — structure-from-motion, useful for extrinsic recovery
- Multical (HKUST) — multi-camera rig calibration
- AprilGrid (AprilRobotics) — calibration target for Kalibr
- Vicalib (NASA Ames) — used for Astrobee NavCam calibration
- Matlab Camera Calibration Toolbox (Bouguet) — historical reference

5.6 Hand-eye solver implementations

- OpenCV cv2.calibrateHandEye:
    method: TSAI / PARK / HORAUD / ANDREFF / DANIILIDIS
- ROS easy_handeye package (wraps OpenCV)
- vpHandEyeCalibration (ViSP)
- HandEyeCalibration (MATLAB Robotics Toolbox)
- Custom CV+SE(3) optimization (Pinocchio, Sophus, Manif libraries)

6. Reference data

6.1 Method comparison

Scheme	Trajectory	Calibration	Local minima	Speed
IBVS (point)	Curved in 3D	Coarse	Yes (Chaumette retreat)	Fast
IBVS (line / circle / moment)	Better behaved	Coarse	Less	Slower
PBVS	Straight in 3D	Exact extrinsic + intrinsic + model	None (in pose space)	Moderate
2.5D hybrid	Good in both spaces	Partial	None	Moderate
Direct (photometric)	Smooth	Intrinsic only	Texture-dependent	Fast
Deep VS	Learned	None (in some variants)	Network-dependent	Moderate

6.2 Notable visual-servoing milestones

Year	Project	Contribution
1973	Shirai-Inoue	First closed-loop vision robot
1979	Hill-Park	Block-stacking robot at SRI; introduced the term “visual servoing”
1991	Hashimoto et al.	First systematic IBVS framework
1992	Espiau, Chaumette, Rives	“Task function” formulation
1996	Hutchinson, Hager, Corke	“Tutorial on Visual Servo Control” (IEEE TRA)
1999	Malis, Chaumette, Boudet	2.5D hybrid VS
2001	Corke & Hutchinson	Partitioned IBVS
2006/2007	Chaumette, Hutchinson	Two-part Tutorial in IEEE RAM (modern reference)
2011	AprilTag (Olson)	Robust fiducial standard
2018	Bateux et al.	Deep VS with CNN features
2020	Manuelli et al.	Keypoint-Affordance with dense descriptors
2022	Florence et al.	NeRF-supervised servoing
2024	NVIDIA FoundationPose	One-shot 6D pose for any object

6.3 ViSP example task code (pseudo)

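#include <visp3/core/vpColVector.h>
#include <visp3/visual_features/vpFeaturePoint.h>
#include <visp3/vs/vpAdaptiveGain.h>
#include <visp3/vs/vpServo.h>

// Sketch only: grab_image(), track_features(), send_to_robot() and the arrays
// x, y, Z, x_des, y_des, Z_des are application-specific placeholders.
vpServo servo;
servo.setServoType(vpServo::EYEINHAND_CAMERA);      // output is a camera-frame twist
servo.setInteractionMatrixType(vpServo::CURRENT);   // recompute L each step

// Define 4 point features in normalized image coordinates (not raw pixels)
vpFeaturePoint p[4], pd[4];  // current, desired
for (int i = 0; i < 4; i++) {
    p[i].buildFrom(x[i], y[i], Z[i]);
    pd[i].buildFrom(x_des[i], y_des[i], Z_des[i]);
    servo.addFeature(p[i], pd[i]);
}

servo.setLambda(0.5);  // adaptive: servo.setLambda(vpAdaptiveGain(4.0, 0.4, 30.0));

bool converged = false;
while (!converged) {
    grab_image();
    track_features();      // update x[i], y[i] (and Z[i] if depth is measured)
    for (int i = 0; i < 4; i++) p[i].buildFrom(x[i], y[i], Z[i]);
    vpColVector v = servo.computeControlLaw();
    send_to_robot(v);      // 6D camera twist
    converged = servo.getError().sumSquare() < 1e-6;  // threshold is an assumption
}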

6.4 Stability + convergence properties

Initial error magnitude	IBVS	PBVS	Hybrid
Small (< 10° rotation, < 10 cm trans)	Stable, fast	Stable, fast	Stable, fast
Moderate (30°, 30 cm)	Slow / oscillatory	Stable	Stable
Large (90°+)	Risk: feature exit FOV	Risk: 3D pose drift	Best (rotation in pose, trans in image)
Approaching singularity	L conditioning poor	Not affected	Mixed
Featureless surface	Fails (no features)	Fails (no PnP)	Fails; use direct VS

7. Failure modes & debugging

Features exit FOV during convergence — common IBVS failure. Detect: tracker loses point. Fix: more features (corners + center), partitioned control, or pre-plan a feasible approach.
Singular interaction matrix — features collinear or co-planar in degenerate ways. Detect: condition number of L spikes. Fix: re-position camera; add diverse features (line + point + circle).
Hand-eye calibration error — VS converges to wrong pose offset. Detect: residual after VS settles. Fix: recalibrate; verify checkerboard images; use Tsai-Lenz or dual-quaternion methods.
Lighting change mid-task: a fixed detection threshold misclassifies features. Fix: auto-exposure off, fixed sensor gain, controlled lighting; or use illumination-tolerant descriptors (ORB copes better than raw Harris corner responses).
Z (depth) estimation drift — IBVS gain becomes wrong as Z error grows. Fix: use measured Z from depth sensor; or adapt Z from feature motion.
Latency-induced oscillation — vision is 30 Hz, robot at 1 kHz, controller lags. Fix: motion compensation between image samples; predict via IMU; reduce gain λ.
Outlier features — single mis-detected corner biases the pseudo-inverse. Fix: RANSAC on features, or weighted least squares with M-estimator (Huber, Tukey).
Chaumette retreat — camera moves backward to align features. Fix: switch to 2.5D or partitioned control; use distance threshold to gate IBVS.
Camera saturation / motion blur: fast motion produces blurred features. Fix: shorter exposure compensated by brighter lighting or higher sensor gain (not the servo gain λ); trigger on hardware sync; or use event cameras (DAVIS, Prophesee).
Marker tag occlusion — ArUco partly hidden by gripper. Fix: multiple markers, redundancy; or switch to texture-based tracking.

7.1 Adaptive gain schemes

Standard fixed-λ control: v = −λ · L⁺ · e. Adaptive variants:

Step-by-step: λ = λ_max when ||e|| < threshold; λ = λ_min when far. Avoids saturation at large error.
Exponential: λ = λ_∞ + (λ_0 − λ_∞) · exp(−k ||e||). ViSP’s vpAdaptiveGain standard pattern.
State-dependent: λ derived from current camera-target distance (PBVS only).
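The exponential schedule is easy to prototype. The sketch below parameterizes it by the gain at zero error, the gain at large error, and the initial slope, so that k = slope / (λ_0 − λ_∞); the default numbers reuse the 4.0, 0.4, 30.0 triple that appears in the ViSP snippet of §6.3, but that this parameterization matches vpAdaptiveGain exactly is an assumption to check against the ViSP documentation.

```python
import math

def adaptive_gain(error_norm, gain_at_zero=4.0, gain_at_infinity=0.4,
                  slope_at_zero=30.0):
    """lambda(e) = lam_inf + (lam_0 - lam_inf) * exp(-k * e), k = slope / (lam_0 - lam_inf)."""
    span = gain_at_zero - gain_at_infinity
    k = slope_at_zero / span          # chosen so that d(lambda)/de at e = 0 is -slope
    return gain_at_infinity + span * math.exp(-k * error_norm)

# Gain for a few error norms (hypothetical values): high near the goal, low far away.
gains = [adaptive_gain(e) for e in (0.0, 0.05, 0.5, 10.0)]
for e, g in zip((0.0, 0.05, 0.5, 10.0), gains):
    print(f"||e|| = {e:5.2f}  ->  lambda = {g:.3f}")
```

The low gain at large error limits the commanded velocity at the start of the motion, while the high gain near zero error removes the slow exponential tail of a fixed-λ loop.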

7.2 Singularity-robust pseudo-inverse

When L is near-singular, naive pseudo-inverse blows up. Use damped least-squares (DLS / Levenberg-Marquardt):

L⁺_dls = (L^T L + α² I)⁻¹ L^T

with α = damping factor (0.01–0.1 typical). Or compute via SVD with small-singular-value truncation. Standard in production VS.
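A small numerical comparison shows what the damping buys. The sketch below implements the damped inverse (L^T L + α² I)⁻¹ L^T with a linear solve and applies it to a deliberately ill-conditioned toy matrix; the matrix and the value α = 0.05 are assumptions for illustration.

```python
import numpy as np

def damped_pinv(L, alpha=0.05):
    """Damped least-squares inverse (L^T L + alpha^2 I)^-1 L^T."""
    n = L.shape[1]
    return np.linalg.solve(L.T @ L + alpha**2 * np.eye(n), L.T)

# Toy matrix with one tiny singular value (hypothetical near-singular case).
L_bad = np.array([[1.0, 0.0],
                  [0.0, 1e-8],
                  [0.0, 0.0]])
plain = np.linalg.pinv(L_bad)             # inverts 1e-8 into a gain of 1e8
damped = damped_pinv(L_bad, alpha=0.05)   # each gain is sigma / (sigma^2 + alpha^2)

print("largest entry, plain pinv :", np.max(np.abs(plain)))
print("largest entry, damped     :", np.max(np.abs(damped)))
```

Each singular value σ is mapped to σ / (σ² + α²), which never exceeds 1 / (2α), so the commanded twist stays bounded near a singularity at the cost of a small tracking bias in well-conditioned directions.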

8. Case studies

8.1 Universal Robots peg-in-hole (commercial)

On Universal Robots arms, a URCap such as Robotiq’s Force Copilot combined with a camera package implements eye-in-hand VS + force feedback for peg insertion. Workflow:

Vision-based approach to ±2 mm tolerance using AprilTag on socket.
Force-controlled insertion with impedance control (see impedance-control) at low Z stiffness.
Tolerance: ~0.1 mm final alignment for round pegs, ~0.05 mm for D-shaped.

UR uses IBVS with the wrist-mounted RealSense D405 (short-range, 7–50 cm). Standard production deployment across automotive sub-assembly.

8.2 Surgical da Vinci with vision feedback

Intuitive Surgical’s da Vinci system: 3-arm tele-operated surgical robot with stereo endoscope. The surgeon controls primary motion via master haptic devices; autonomous sub-tasks (knot tying, suture pickup) have been explored in research using VS with stereo-derived 3D features. The “Smart Tissue Autonomous Robot” (STAR, Krieger 2020), a separate research system rather than a da Vinci product, demonstrated supervised autonomous suturing on porcine intestine with VS-based alignment. This is real eye-in-hand VS with anatomical features as the targets: sutures are tracked frame-to-frame with bespoke detectors.

8.3 Hutchinson tutorial benchmark (Robotnik / INRIA ViSP)

ViSP (Visual Servoing Platform, INRIA) ships with reference examples that became the de facto teaching demos:

4-point IBVS planar
Sphere IBVS (3D feature)
Cylinder IBVS (line feature)
Eye-in-hand PBVS on planar target
2.5D hybrid

All run on a Franka Panda or UR5 with a wrist camera; reproducible in any university lab.

8.4 NVIDIA FoundationPose + VS for novel objects (2024)

FoundationPose: 6D pose estimator and tracker for novel objects, given either a CAD model or a small set of reference images. Combined with PBVS:

Show robot a new object once (image or scan).
FoundationPose tracks pose in subsequent frames at 30+ Hz.
PBVS drives gripper to a learned grasp pose relative to the object.

Bypasses the per-object fiducial marker requirement; enables fast deployment in unstructured environments. Open-sourced by NVIDIA in 2024.

8.5 ABB visual servoing on car body — automotive welding

ABB IRB 7600 + camera + laser pointer on a body-in-white welding line. The robot pre-positions to nominal weld point (open-loop from CAD), then visual servoing corrects for body-position variation up to ±15 mm. The vision module measures the seam edge and adjusts the welding TCP (tool-center-point) within 50 ms before each weld initiates. Standard practice across automotive OEMs since ~2005; the controller runs hand-tuned IBVS or industrial PBVS implementation.

8.6 Soft robotics + photometric servoing (Crombez 2017)

Eye-in-hand on a 6-DoF arm tracking a featureless surface. Direct photometric VS uses pixel intensity differences (no feature extraction). The robot aligns its camera parallel to the surface (~1 cm standoff) by minimizing sum-of-squared-differences (SSD) between current image and reference. Worked at 30 Hz on a 320×240 ROI. Useful for surface-inspection tasks where the surface has no distinctive features (painted metal, polished glass).

8.7 Drone visual landing (DJI / Skydio)

Skydio X10D and X2 use eye-down camera + IBVS-style alignment to land on a 1.2 m × 1.2 m landing pad with an AprilTag-derived marker. Approach:

Long-range descent guided by GPS to ~3 m above pad.
Switch to vision-only at < 3 m AGL.
Detect ArUco/Skydio-marker.
IBVS center the marker in the frame while descending.
Touchdown when marker covers ≥ 60% of frame + IMU detects contact.

Used in autonomous return-to-home + docking-on-vehicle (military). DJI Matrice 350 RTK + RTK Beacon implements similar protocol.

8.8 Industrial assembly with hybrid VS + force

Comau / Stäubli / Fanuc industrial cells routinely use:

Open-loop traverse to nominal grasp pose (fast, ~50 mm tolerance).
Vision-based final approach for ±2 mm correction (camera mounted on Z-axis).
Force-feedback insertion for sub-mm alignment (impedance control).

Standard sequence for engine-block bolt-insertion: ~3 s open-loop traverse + 1.5 s VS + 0.5 s force-insertion. Achievable cycle time 5 s with > 99.9% success.

8.9 Pick-and-place with deep features (Mahler 2017 Dex-Net + VS)

UC Berkeley AUTOLAB’s Dex-Net family: trained deep grasp-quality network on synthetic depth images. Pipeline:

RGBD camera fixed overhead.
Dex-Net inference: per-pixel grasp success likelihood + 6-DoF grasp pose.
Robot transitions to predicted grasp via open-loop motion plus optional VS final correction.

Not strictly “VS” but uses vision in the loop with learned features. Standard in academic warehouse research. Extends to ManiSkill / RAPID benchmarks.

Adjacent

Comparison: VS vs alternatives

Open-loop motion (CAD-based plan):
  + Simple, fast
  - Sensitive to calibration drift, fixture variation
  Note: used as a coarse approach before VS

Visual servoing:
  + Tolerates kinematic uncertainty, calibration drift
  + Adapts to moving targets
  - Requires sustained visibility of features
  - Vision-loop latency limits bandwidth

Force-only feedback (impedance):
  + Robust to vision failure (lighting, occlusion)
  - Cannot disambiguate before contact

Hybrid (VS + force):
  + Vision for pre-contact alignment
  + Force for post-contact insertion
  Note: standard combination in production

Learning-based (end-to-end neural):
  + Generalizes to novel objects (foundation models)
  + Handles texture-less surfaces
  - Less interpretable; difficult to verify
  Note: standard for general manipulation 2024+

Citations

Hashimoto K., Kimoto T., Ebine T., Kimura H. (1991) “Manipulator Control with Image-Based Visual Servo.” ICRA.
Espiau B., Chaumette F., Rives P. (1992) “A New Approach to Visual Servoing in Robotics.” IEEE Trans Robotics & Automation 8(3).
Chaumette F. (1998) “Potential Problems of Stability and Convergence in Image-Based and Position-Based Visual Servoing.” in The Confluence of Vision and Control, LNCIS 237.
Hutchinson S., Hager G., Corke P. (1996) “A Tutorial on Visual Servo Control.” IEEE Trans Robotics & Automation 12(5).
Chaumette F., Hutchinson S. (2006) “Visual Servo Control I: Basic Approaches.” IEEE Robotics & Automation Magazine 13(4).
Chaumette F., Hutchinson S. (2007) “Visual Servo Control II: Advanced Approaches.” IEEE Robotics & Automation Magazine 14(1).
Malis E., Chaumette F., Boudet S. (1999) “2½D Visual Servoing.” IEEE Trans Robotics & Automation 15(2).
Corke P., Hutchinson S. (2001) “A New Partitioned Approach to Image-Based Visual Servo Control.” IEEE Trans Robotics & Automation 17(4).
Tsai R.Y., Lenz R.K. (1989) “A New Technique for Fully Autonomous and Efficient 3D Robotics Hand/Eye Calibration.” IEEE Trans Robotics & Automation 5(3).
Daniilidis K. (1999) “Hand-Eye Calibration Using Dual Quaternions.” Int J Robotics Research 18(3).
Park F.C., Martin B.J. (1994) “Robot Sensor Calibration: Solving AX=XB on the Euclidean Group.” IEEE Trans Robotics & Automation 10(5).
Papanikolopoulos N.P., Khosla P.K. (1993) “Adaptive Robotic Visual Tracking: Theory and Experiments.” IEEE Trans Automatic Control 38(3).
Bakthavatchalam M., Chaumette F., Marchand E. (2015) “Photometric Moments: New Promising Candidates for Visual Servoing.” ICRA.
Crombez N., Mouaddib E.M., Caron G., Chaumette F. (2017) “Direct Visual Servoing Using Spline-Based Photometric Functions.” ICRA.
Bateux Q., Marchand E., Leitner J., Chaumette F., Corke P. (2018) “Training Deep Neural Networks for Visual Servoing.” ICRA.
Manuelli L., Li Y., Florence P., Tedrake R. (2020) “Keypoints into the Future: Self-Supervised Correspondence in Model-Based Reinforcement Learning.” CoRL.
Lepetit V., Moreno-Noguer F., Fua P. (2009) “EPnP: An Accurate O(n) Solution to the PnP Problem.” IJCV 81.
Olson E. (2011) “AprilTag: A Robust and Flexible Visual Fiducial System.” ICRA.
ViSP project (INRIA, Marchand and Chaumette) visp.inria.fr.
Wen B., Yang W., Kautz J., Birchfield S. (2024) “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects.” CVPR.
