ROS 2 Architecture & Runtime — Robotics Reference

Scope. ROS 2 as a runtime system — nodes, executors, the DDS middleware, QoS, lifecycle, composition, the launch system, workspaces, and the operational gotchas that decide whether a robot ships. The DSL surface (URDF, Xacro, behaviour-tree XML, launch.py) lives in [[Languages/Tier3/ros2-robotics-config]] and [[Languages/Tier3/robotics-control]]. ros2_control’s controller plugins, command/state interfaces, and the FOC-adjacent current loops sit in [[Robotics/pid-control]]. This note is the systems spine that the other Robotics notes plug into.

1. At a glance

ROS 2 = open-source robotics middleware + tools + ecosystem. Replaces ROS 1 (Melodic 2018, Noetic 2020, EOL May 2025) with a DDS-based, real-time-capable, production-grade architecture. Maintained by Open Robotics → Open Source Robotics Foundation, with industry working groups (Apex.AI, Bosch, eProsima, iRobot, Nvidia, PickNik). Stable on Ubuntu (primary), Windows 10/11, macOS, and embedded Linux (Yocto, NXP, Raspberry Pi).

As of 2026-05, the active distro landscape:

| Distro | Release | EOL | LTS? | Notes |
| --- | --- | --- | --- | --- |
| Foxy Fitzroy | Jun 2020 | Jun 2023 | yes | dead; legacy projects only |
| Galactic | May 2021 | Dec 2022 | no | dead |
| Humble Hawksbill | May 2022 | May 2027 | yes | still the most-deployed LTS in industry |
| Iron Irwini | May 2023 | Nov 2024 | no | dead; many docs still cite it |
| Jazzy Jalisco | May 2024 | May 2029 | yes | current “safe” LTS for new builds |
| Kilted Kaiju | May 2025 | Dec 2026 | no | bleeding-edge non-LTS |
| Lyrical Luth | May 2026 | May 2031 | yes | latest LTS, default for new projects in 2026 |

ROS 2 inherited ROS 1’s concepts but threw out its implementation:

  • No roscore — discovery is peer-to-peer via DDS (or a Discovery Server).
  • Topics, services, parameters retained; actions rewritten on top of topics+services.
  • C++ and Python first-class via rclcpp / rclpy; both built on the same C-level rcl layer.
  • Build system: colcon (Python) replaces catkin_make; ament_cmake / ament_python are the package conventions.
  • Real-time path — preemptable kernel, lock-free executors, pre-allocated memory, zero-copy intra-process.
  • Multi-machine, multi-vendor middleware — Cyclone DDS, Fast DDS, RTI Connext, Iceoryx shm.
  • Lifecycle nodes (the “managed nodes” design) — explicit state machine for safe init / shutdown.
  • Security — DDS-Security plugins (SROS 2): authentication, access control, encryption.

Where it sits in the design stack. Sensors → drivers (ROS 2 publishers) → state estimation (robot_localization, SLAM nodes) → planner (Nav2, MoveIt 2) → controller (ros2_control + custom controllers) → hardware abstraction (hardware_interface) → motor drives. tf2 weaves transforms across all of it. rosbag2 records everything for replay.

First ask before applying ROS 2:

  1. Distro? Lyrical Luth (LTS, 2026–2031) for new projects; Humble (LTS until 2027) if your platform already runs it.
  2. RMW? Cyclone DDS (this note’s default pick; EPL 2.0 / EDL 1.0, broad), Fast DDS (eProsima, official partner, the stock RMW in most distros), Connext (commercial, certified), or Iceoryx (shm, tightest latency).
  3. Real-time? PREEMPT_RT kernel + isolated CPUs + SCHED_FIFO + pre-allocated memory + a low-overhead executor (StaticSingleThreadedExecutor or EventsExecutor).
  4. Multi-machine? Same RMW on all hosts, identical ROS_DOMAIN_ID, firewall open for DDS discovery (UDP 7400+).
  5. Embedded? micro-ROS (DDS-XRCE) on STM32 / Renesas RA / ESP32 / Zephyr, agent on a Linux host.
  6. Safety/regulated? Apex.OS (commercial certified fork), or DDS-Security + lifecycle + redundancy patterns.

2. First principles

Nodes and processes

A node is the unit of computation. A node owns publishers, subscribers, service servers/clients, action servers/clients, parameters, and a clock. Multiple nodes can live in one OS process (ComponentContainer / composition), each with its own rclcpp::Node instance. The node-to-process map is independent of the topic graph — that is what makes ROS 2 “composable.”

Topics — anonymous pub-sub

A topic is a typed multicast bus identified by a string name (/cmd_vel) and a message type (geometry_msgs/msg/Twist). Publishers and subscribers do not know about each other; DDS discovery matches them by name + type + compatible QoS. Many-to-many is supported (multiple publishers on the same topic, multiple subscribers).

Message types are defined in .msg IDL → generated C++/Python at build time. ROS 2’s rosidl toolchain emits both the type-support library and serializers per RMW.

Services — synchronous request/response

A service is a one-to-one (per-client-per-server) call: /get_state, /set_parameters. Request and response are separate typed messages. Servers return synchronously in their executor thread; clients can wait or use a future. Use services for idempotent, fast calls; never for long-running work — that’s what actions are for.

Actions — long-running goal/feedback/result

An action wraps the goal-feedback-result pattern over three services + two topics: goal, cancel, result services; feedback, status topics. The client can monitor progress, cancel, or query status. Used for Nav2.NavigateToPose, MoveIt.MoveGroup, FollowJointTrajectory. Implemented in rclcpp_action / rclpy.action.
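To make the services-plus-topics layout concrete, here is a minimal sketch that lists the hidden endpoints an action server exposes under its name. The `_action/...` suffixes are the ones normally visible when hidden topics and services are listed; the helper name `action_endpoints` is hypothetical and not part of any ROS 2 API.

```python
def action_endpoints(action_name: str) -> dict:
    """Return the hidden service and topic names behind one action."""
    base = action_name.rstrip("/") + "/_action"
    return {
        # Request/response parts of the action protocol.
        "services": [
            f"{base}/send_goal",
            f"{base}/cancel_goal",
            f"{base}/get_result",
        ],
        # Streaming parts of the action protocol.
        "topics": [
            f"{base}/feedback",
            f"{base}/status",
        ],
    }


endpoints = action_endpoints("/navigate_to_pose")
for kind, names in endpoints.items():
    for name in names:
        print(kind, name)
```

Because these are ordinary services and topics, QoS mismatches and remapping mistakes affect actions in the same way they affect any other endpoint.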

Parameters

A parameter is a per-node key-value (typed: bool / int / double / string / arrays of these). Declared at node startup; readable / writable via service at runtime (with set_parameters_atomically). YAML files (params.yaml) loaded via launch are the canonical config mechanism. Parameter change callbacks let nodes react to runtime tweaks.

Lifecycle nodes (managed nodes)

A lifecycle node is a node with a managed state machine:

unconfigured ──configure──▶ inactive ──activate──▶ active
     ▲                        │  ▲                   │
     └────────cleanup─────────┘  └────deactivate─────┘

unconfigured / inactive / active ──shutdown──▶ finalized

Each transition fires a callback (on_configure, on_activate, …) where the node can allocate, open hardware, subscribe, or release. Critical for safe init order (hardware drivers come up before controllers) and clean shutdown (controllers stop before motors are de-energised). Used heavily by Nav2 (the whole stack is lifecycle-managed), ros2_control, and any autonomy stack that needs deterministic startup.
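The state machine can be modelled in a few lines, which is handy for unit-testing bringup order without a robot. This is a simplified sketch, not the rclcpp_lifecycle implementation: it omits the intermediate transition states and error handling, and the names `LifecycleModel` and `bring_up` are hypothetical.

```python
# (current state, transition) -> next primary state
TRANSITIONS = {
    ("unconfigured", "configure"): "inactive",
    ("inactive", "activate"): "active",
    ("active", "deactivate"): "inactive",
    ("inactive", "cleanup"): "unconfigured",
    ("unconfigured", "shutdown"): "finalized",
    ("inactive", "shutdown"): "finalized",
    ("active", "shutdown"): "finalized",
}


class LifecycleModel:
    """Toy model of one managed node's primary states."""

    def __init__(self, name: str):
        self.name = name
        self.state = "unconfigured"

    def trigger(self, transition: str) -> str:
        key = (self.state, transition)
        if key not in TRANSITIONS:
            raise ValueError(f"{transition!r} is not valid from {self.state!r}")
        self.state = TRANSITIONS[key]
        return self.state


def bring_up(nodes: list) -> None:
    """Configure every node in order, then activate in the same order."""
    for node in nodes:
        node.trigger("configure")
    for node in nodes:
        node.trigger("activate")


driver = LifecycleModel("motor_driver")
controller = LifecycleModel("controller")
bring_up([driver, controller])  # driver comes up before controller
print(driver.state, controller.state)
```

Shutdown would walk the same list in reverse, so the controller deactivates before the driver releases the hardware.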

Composition

Multiple nodes can share one process via ComponentContainer. A composed graph benefits from:

  • Intra-process communication — pointer-passing of shared messages between matched pub/sub in the same process. With IntraProcessSetting::Enable and reference-counted shared pointers, copy and serialize overhead disappear (≈ 5 µs vs ≈ 100 µs).
  • Shared address space — debug with one debugger, profile with one perf trace.
  • Reduced memory — one DDS participant, one set of OS threads (more or less).

Cost: a crash in any node tears down the whole container. The pattern: compose performance-critical chains (sensor → filter → publisher), isolate safety-critical / risky / experimental nodes.

DDS — what’s underneath

ROS 2’s wire protocol is OMG DDS (Data Distribution Service, v1.4) and DDSI-RTPS (Real-Time Publish-Subscribe v2.5, 2022). DDS is an industrial pub-sub standard from aerospace/defence (LAS-MAS, Aegis combat system, NASA). Key properties:

  • Peer-to-peer discovery via multicast (Simple Participant / Endpoint Discovery Protocols, SPDP and SEDP) or static config.
  • Strongly-typed topics with QoS contracts; producer must offer ≥ consumer requests for the match to be valid.
  • Reliable or best-effort delivery, configurable history depth, durability for late joiners.
  • Multicast UDP by default; fall back to unicast on networks that block multicast (Wi-Fi, AWS VPC, cellular).

The RMW layer (ROS Middleware interface) is a thin abstraction over DDS. Switch RMW by setting RMW_IMPLEMENTATION=rmw_cyclonedds_cpp (or rmw_fastrtps_cpp, rmw_connextdds). Inter-RMW interoperability is limited: matched pub/sub need the same RMW family in practice. The community periodically discusses a “ROS 2 gateway” but as of 2026-05 none ships in core.

QoS — the contract that decides whether your message arrives

QoS policies on publisher (offered) must be ≥ subscriber (requested) for the pair to match. The policies you actually configure:

| Policy | Values | Use |
| --- | --- | --- |
| reliability | RELIABLE or BEST_EFFORT | RELIABLE for commands/safety; BEST_EFFORT for high-rate sensor streams |
| durability | TRANSIENT_LOCAL or VOLATILE | TRANSIENT_LOCAL for late-join (static config, /tf_static); VOLATILE for live data |
| history | KEEP_LAST(N) or KEEP_ALL | KEEP_LAST(1–10) typical; KEEP_ALL only with bounded memory |
| depth | int | Buffer size when KEEP_LAST |
| deadline | duration | Triggers callback if no message in this window (liveness check) |
| liveliness | AUTOMATIC or MANUAL_BY_TOPIC | Heartbeat strategy |
| lifespan | duration | Drop stale messages on the wire |

Pre-baked QoS profiles (in rclcpp::QoS):

  • SensorDataQoS — BEST_EFFORT, KEEP_LAST(5), VOLATILE.
  • ServicesQoS — RELIABLE, KEEP_LAST(10), VOLATILE.
  • ParametersQoS — RELIABLE, KEEP_LAST(1000), VOLATILE.
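  • Default QoS (rmw_qos_profile_default) — RELIABLE, KEEP_LAST(10), VOLATILE. SystemDefaultsQoS instead defers every policy to the RMW vendor’s own defaults.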
  • Static-transforms QoS — RELIABLE, KEEP_LAST(1), TRANSIENT_LOCAL (so late joiners see /tf_static).

Executors — how callbacks actually run

An executor is the loop that pulls events (subscription messages, timers, service requests, action callbacks) from the wait set and dispatches them to user code:

  • SingleThreadedExecutor — one thread, one callback at a time. Simple; default for small nodes. A blocking callback stalls everything.
  • MultiThreadedExecutor — N threads from a pool dispatch callbacks. Concurrency requires a CallbackGroup: Mutually Exclusive (default, callbacks in the same group never run concurrently) or Reentrant (callbacks may run in parallel, including the same callback re-entering).
  • StaticSingleThreadedExecutor — pre-computes the wait set; lower per-iteration overhead, but no dynamic addition of entities.
  • EventsExecutor (in rclcpp::experimental since Iron) — event-driven rather than wait-set-polling; lower jitter, suitable for tight real-time loops.

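A common bug: a service callback in the default mutually exclusive group makes a blocking call to another service on the same single-threaded executor → deadlock, because the response can never be dispatched while the callback holds the only thread. Fix with a MultiThreadedExecutor plus a separate (or reentrant) callback group for the client, or switch to an asynchronous call.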
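The nested-call deadlock can be reproduced without ROS. The sketch below is an analogy built on Python's `concurrent.futures`, not on rclcpp or rclpy: the worker pool stands in for the executor, and a task that blocks on a second task stands in for a service callback that waits on another service. The function names and the 0.2 s timeout are arbitrary choices for the demonstration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def nested_call_result(workers: int, timeout_s: float = 0.2) -> str:
    """Run a task that blocks on a second task in the same pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:

        def inner() -> str:
            # Stands in for the second service's callback.
            return "response"

        def outer() -> str:
            # Stands in for a service callback that waits synchronously.
            future = pool.submit(inner)
            try:
                return future.result(timeout=timeout_s)
            except FutureTimeout:
                # With one worker, inner() can never start while we wait.
                return "deadlock"

        return pool.submit(outer).result()


print("1 worker :", nested_call_result(1))
print("2 workers:", nested_call_result(2))
```

With one worker the inner task only runs after the outer one gives up, which is the same starvation a single-threaded executor produces. A second thread resolves it here, just as a MultiThreadedExecutor with a separate callback group does in ROS 2.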

3. Practical math and worked examples

Example A — Topic latency budget

Producer publishes PointCloud2 chunks at 100 Hz, each chunk 1024 KiB (1 MiB), to a single subscriber:

  • Payload rate — 100 × 1024 KiB = 100 MiB/s ≈ 839 Mbit/s of application data.
  • Cyclone DDS over loopback — measured end-to-end ≈ 150 µs with default UDP socket buffers, no fragmentation issues until ~64 KiB; above that, RTPS fragments → 30+ fragments per message and reassembly overhead adds ~50 µs.
  • Fast DDS shared-memory transport — intra-host ≈ 30 µs (skips UDP; uses POSIX shm + condition variables).
  • Iceoryx (rmw_iceoryx) — zero-copy with the loaned-message API (borrow_loaned_message) ≈ 8 µs. The subscriber receives a shared-memory pointer; no copy, no serialize.
  • Cross-host 1 GbE — wire time alone is ≈ 8.4 ms for 1 MiB at line rate (8.39 Mbit ÷ 1 Gbit/s), plus ~50–150 µs DDS overhead → ≈ 8.5 ms typical, and 100 Hz of these fills ≈ 84% of the link; 10 GbE drops the link portion to ≈ 0.84 ms.

Numbers are from performance_test benchmarks on a Ryzen 7 PRO 5850U, Ubuntu 22.04, kernel 5.15. Yours will vary ±2× with CPU governor, socket buffer sizing, and competing traffic.

Implication: if you need < 50 µs intra-host pipelines (e.g., RT control with sensor in the loop), you need composition + intra-process or Iceoryx. UDP DDS is fine for sensor → planner (kHz-scale tolerable) but not for the FOC inner current loop (that lives in firmware on the drive, see [[Robotics/pid-control]] §6).
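The link-level part of this budget is plain arithmetic and worth checking before any benchmarking. A minimal sketch follows, assuming decimal link rates (1 GbE = 10^9 bit/s) and ignoring Ethernet, IP, UDP, and RTPS header overhead; the function names are illustrative only.

```python
MIB = 1024 * 1024  # bytes in one MiB


def payload_rate_mbit_s(msg_bytes: int, rate_hz: float) -> float:
    """Application payload rate in Mbit/s (decimal megabits)."""
    return msg_bytes * rate_hz * 8 / 1e6


def wire_time_us(msg_bytes: int, link_bit_s: float) -> float:
    """Time to clock one message onto the link, in microseconds."""
    return msg_bytes * 8 / link_bit_s * 1e6


rate = payload_rate_mbit_s(MIB, 100)   # 1 MiB messages at 100 Hz
t_1g = wire_time_us(MIB, 1e9)          # 1 GbE
t_10g = wire_time_us(MIB, 10e9)        # 10 GbE
utilisation_1g = rate / 1000.0         # fraction of a 1 Gbit/s link

print(f"payload rate   : {rate:.1f} Mbit/s")
print(f"1 GbE wire time: {t_1g / 1000:.2f} ms")
print(f"10 GbE wire time: {t_10g / 1000:.2f} ms")
print(f"1 GbE utilisation: {utilisation_1g:.0%}")
```

Wire time scales linearly with message size, so splitting or downsampling the cloud before it crosses hosts buys latency directly.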

Example B — Composition vs separate processes

Pipeline: 5 perception nodes + 5 planning nodes, all passing 100 KB messages at 50 Hz, fully serial (each consumes from the previous).

Separate processes (10 RMW participants, full serialization):

  • Per-hop cost ≈ 80 µs (serialize 30 + UDP loopback 30 + deserialize 20).
  • End-to-end ≈ 9 hops × 80 µs ≈ 720 µs per cycle.
  • Memory: 10 × ~30 MB DDS participant footprint = ~300 MB just in middleware.

Single ComponentContainer with intra-process (same node graph, one process):

  • Per-hop cost ≈ 5 µs (std::shared_ptr pass).
  • End-to-end ≈ 9 × 5 µs ≈ 45 µs per cycle — 16× faster.
  • Memory: one DDS participant ≈ 30 MB; shared message buffers ref-counted.

Tradeoff to budget: a segfault in any of the 10 composed nodes tears down the whole pipeline. In production, compose the performance-sensitive chain and put the lifecycle manager, parameter bridge, and rosbag2 recorder in separate processes.
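The comparison above reduces to hops times per-hop cost. A small sketch of that arithmetic, using only the figures quoted in this example (the function name is illustrative):

```python
def pipeline_latency_us(n_nodes: int, per_hop_us: float) -> float:
    """End-to-end latency of a fully serial chain of n_nodes."""
    hops = n_nodes - 1  # 10 nodes in series are joined by 9 hops
    return hops * per_hop_us


separate = pipeline_latency_us(10, 80)   # separate processes
composed = pipeline_latency_us(10, 5)    # one container, intra-process
speedup = separate / composed
cycle_us = 1e6 / 50                      # 50 Hz cycle = 20 000 us
budget_share = separate / cycle_us       # share of the cycle spent on transport

print(separate, composed, speedup, budget_share)
```

At 50 Hz even the separate-process layout spends under 4% of the cycle on transport, so composition matters most when the rate or the hop count grows.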

Example C — QoS pair for safety-critical odometry

Pipeline: wheel-odometry node → EKF (robot_localization) → controller. 100 Hz nominal; missing samples must be detected within 20 ms.

Producer (wheel-odom node) offered QoS:

  • reliability: RELIABLE
  • history: KEEP_LAST, depth: 10
  • durability: VOLATILE
  • deadline: 20 ms
  • lifespan: 100 ms

Consumer (EKF) requested QoS:

  • reliability: RELIABLE
  • history: KEEP_LAST, depth: 5
  • durability: VOLATILE
  • deadline: 20 ms

Compatibility check — for each policy, the producer’s offer must be at least as strict as the consumer’s request. RELIABLE ≥ RELIABLE ✓; offered deadline 20 ms ≤ requested 20 ms ✓ (the consumer accepts producers that promise to publish at least this often); durability VOLATILE = VOLATILE ✓. History depth and lifespan take no part in matching. Match.

A missed deadline fires the RequestedDeadlineMissed callback on the subscriber — the EKF can degrade to dead-reckoning and raise a diagnostics warning to the safety supervisor. The same pattern wraps the controller’s cmd_vel consumer, allowing a watchdog to e-stop on stale commands.

If the producer instead offered BEST_EFFORT — match would fail. The subscriber would silently receive nothing. This is the single most common “topic-is-there-but-no-messages” bug in ROS 2.
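The matching rules used in this example can be written down as a small checker. This is a simplified sketch covering reliability, durability, and deadline only (liveliness is omitted); the `QoS` dataclass and `incompatibilities` function are hypothetical helpers, not rclpy types.

```python
import math
from dataclasses import dataclass

# Higher rank = stronger guarantee.
RELIABILITY_RANK = {"BEST_EFFORT": 0, "RELIABLE": 1}
DURABILITY_RANK = {"VOLATILE": 0, "TRANSIENT_LOCAL": 1}


@dataclass(frozen=True)
class QoS:
    reliability: str = "RELIABLE"
    durability: str = "VOLATILE"
    deadline_ms: float = math.inf  # inf means "no deadline"


def incompatibilities(offered: QoS, requested: QoS) -> list:
    """Return the policies that stop a publisher/subscriber pair matching."""
    problems = []
    if RELIABILITY_RANK[offered.reliability] < RELIABILITY_RANK[requested.reliability]:
        problems.append("reliability")
    if DURABILITY_RANK[offered.durability] < DURABILITY_RANK[requested.durability]:
        problems.append("durability")
    # The publisher must promise a period no longer than the subscriber demands.
    if offered.deadline_ms > requested.deadline_ms:
        problems.append("deadline")
    return problems


odom_pub = QoS("RELIABLE", "VOLATILE", 20.0)
ekf_sub = QoS("RELIABLE", "VOLATILE", 20.0)
lossy_pub = QoS("BEST_EFFORT", "VOLATILE", 20.0)

print(incompatibilities(odom_pub, ekf_sub))   # pair from the example
print(incompatibilities(lossy_pub, ekf_sub))  # BEST_EFFORT offered
```

Running a check like this over a launch description in CI catches the mismatch before anyone is staring at an empty topic on the robot.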

4. Design heuristics

RMW selection

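  • Default: Cyclone DDS (rmw_cyclonedds_cpp). Eclipse-licensed, broad install base, used by Nav2 / MoveIt 2 tutorials, and the official default in Galactic (Fast DDS is the out-of-the-box RMW in Humble, Iron, Jazzy, and Kilted, so set RMW_IMPLEMENTATION explicitly). Mature.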
  • eProsima / certification path: Fast DDS (rmw_fastrtps_cpp). Apache 2; eProsima is an official ROS 2 partner; supports Discovery Server (centralised discovery — good for fleets), shared-memory transport (intra-host ≈30 µs), and security plugin.
  • Commercial / safety-critical: RTI Connext DDS (rmw_connextdds). Paid; certified to DO-178C / IEC 61508; deployed in Boeing, Lockheed, Siemens energy products.
  • Lowest intra-host latency: Eclipse Iceoryx (rmw_iceoryx_cpp) — Bosch-originated shared-memory pub-sub, zero-copy, < 10 µs latency. Doesn’t work cross-host on its own; pair with Cyclone for cross-host plus Iceoryx intra-host (the “Gateway” pattern).

Lock the RMW per project; mismatched RMW across hosts in the same fleet rarely communicates despite both speaking DDSI-RTPS (interop is OMG-spec but vendor-extension flags rarely line up).

QoS choice

  • Sensor streams: BEST_EFFORT, KEEP_LAST(1–5), VOLATILE. Lossy is fine; freshness wins.
  • Commands and safety: RELIABLE, KEEP_LAST(10), VOLATILE, deadline set.
  • Static config (TF static, robot description): RELIABLE, KEEP_LAST(1), TRANSIENT_LOCAL. Late-join subscribers see the latched value.
  • Diagnostics: RELIABLE, KEEP_LAST(100), VOLATILE; lifespan = 5 s so old warnings drop.
  • Large messages (PointCloud2, Image): BEST_EFFORT + KEEP_LAST(1) intra-host; switch to RELIABLE only if drops cause downstream failure.

Naming conventions (REP 105 / REP 144)

  • Frames: world, map, odom, base_link, base_footprint, imu_link, camera_link, <sensor>_optical_frame.
  • Topics: /cmd_vel, /odom, /scan, /imu/data, /camera/image_raw, /camera/camera_info, /joint_states, /tf, /tf_static.
  • Services: /get_X, /set_X, /list_X, /clear_X.
  • Per-robot namespace: /robot1/cmd_vel for multi-robot; configure via launch.
  • Node names: match the package, e.g. wheel_odometry_node, nav2_controller. Keep them short; they appear in graphs.
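For multi-robot setups the per-robot namespace is usually applied by launch, but the resulting names are easy to compute and validate up front. The sketch below is a simplified illustration: `namespaced` is a hypothetical helper, and its token rule (letters, digits, underscores, no leading digit) covers only the basic ROS 2 naming constraints.

```python
import re

# One name token: alphanumerics and underscores, not starting with a digit.
_TOKEN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def namespaced(robot: str, topic: str) -> str:
    """Prefix a topic with a robot namespace, validating every token."""
    tokens = [robot.strip("/")] + topic.strip("/").split("/")
    for token in tokens:
        if not _TOKEN.match(token):
            raise ValueError(f"invalid name token: {token!r}")
    return "/" + "/".join(tokens)


FLEET = ["robot1", "robot2"]
cmd_topics = [namespaced(robot, "/cmd_vel") for robot in FLEET]
print(cmd_topics)
print(namespaced("robot1", "/imu/data"))
```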

Workspace layout

A colcon workspace:

my_ws/
├── src/
│   ├── my_package_a/
│   │   ├── package.xml
│   │   ├── CMakeLists.txt
│   │   └── src/...
│   └── my_package_b/
├── build/         # colcon build artefacts
├── install/       # final installed packages (this is sourced)
└── log/           # per-build logs

Overlay/underlay pattern: source /opt/ros/lyrical/setup.bash (the system underlay) → source install/setup.bash (your overlay). The overlay shadows the underlay package-by-package.

colcon build --packages-select my_package_a --symlink-install
source install/setup.bash
ros2 launch my_package_a bringup.launch.py

--symlink-install makes Python and config files editable without rebuilding.

Real-time

  • Kernel: PREEMPT_RT (merged into mainline piece by piece over many releases; selectable in a stock mainline build since 6.12 on x86-64, arm64, and RISC-V). Use a real-time kernel image (Ubuntu Pro RT).
  • CPU isolation: isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3 on the kernel command line. Pin RT threads to those CPUs with sched_setaffinity.
  • Scheduling: SCHED_FIFO priority 80–95 for control threads; let GUI / logging stay SCHED_OTHER.
  • Memory: pre-allocate publishers’ message pools (rclcpp::PublisherOptions::allocator); use mlockall(MCL_CURRENT | MCL_FUTURE) to lock pages; avoid heap in the RT path.
  • Executor: StaticSingleThreadedExecutor or EventsExecutor; pin with sched_setaffinity.
  • RMW: Iceoryx for zero-jitter shm; Cyclone DDS with its network_interface_address pinned otherwise.
  • Result: 100 µs–250 µs worst-case loop jitter is achievable on commodity x86 with PREEMPT_RT; sub-50 µs needs a Xenomai or Codesys-grade stack outside ROS 2.
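Jitter figures like the ones above only mean something if they are measured the same way each time. A minimal sketch, assuming jitter is defined as the absolute deviation of each wake-up interval from the nominal period; the timestamps are invented for illustration and the function name is hypothetical.

```python
def loop_jitter_us(wakeups_us: list, period_us: float) -> dict:
    """Worst-case and mean deviation of loop intervals from the nominal period."""
    intervals = [b - a for a, b in zip(wakeups_us, wakeups_us[1:])]
    deviations = [abs(interval - period_us) for interval in intervals]
    return {
        "worst_us": max(deviations),
        "mean_us": sum(deviations) / len(deviations),
    }


# Invented wake-up timestamps (microseconds) for a nominal 1 kHz loop.
samples = [0, 1000, 2010, 2995, 4180, 5000]
stats = loop_jitter_us(samples, 1000)
print(stats)
```

In practice the timestamps would come from a steady clock sampled at the top of the control callback, or from an LTTng trace, collected over minutes under load so the worst case has a chance to show up.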

Embedded / micro-ROS

For MCUs (Cortex-M, ESP32, Renesas RA, Zephyr boards), full DDS is too heavy (≥ 30 MB RAM). micro-ROS uses DDS-XRCE (DDS for eXtremely Resource-Constrained Environments), a thin client protocol → an “agent” on a Linux host that bridges to full DDS. Memory footprint: < 100 KB RAM, < 200 KB flash. Supports rclc (C-only client library), publishers/subscribers/services, parameters, and (basic) lifecycle. Hooks into FreeRTOS / Zephyr / NuttX / bare-metal.

Pattern: STM32H7 motor controller runs micro-ROS publishing /joint_states and subscribing /joint_commands over USB-CDC or Ethernet → micro-ROS agent on the robot’s onboard Linux SBC bridges to the full DDS graph.

Security

DDS-Security plugin (SROS2) provides:

  • Authentication — X.509 certificates per node identity.
  • Access control — XML governance + permissions files restrict who can publish/subscribe to which topics.
  • Cryptographic protection — AES-GCM payload encryption.

Generate the keystore with ros2 security create_keystore; set ROS_SECURITY_KEYSTORE to its path, plus ROS_SECURITY_ENABLE=true and ROS_SECURITY_STRATEGY=Enforce. Cost: ~5–15% extra latency, certificate management overhead, increased complexity in fleet rollout. Often part of meeting IEC 62443 requirements in industrial deployments; optional for closed networks.

Multi-machine and fleets

  • ROS_DOMAIN_ID (int 0–101 useful, 0–232 valid) partitions DDS traffic at the wire level. Different domain IDs = different UDP port ranges = no cross-talk. Always set a non-default domain ID in fleets to avoid accidental cross-vehicle traffic.
  • Firewall: open UDP 7400 + 250 × domain for SPDP multicast discovery, plus the per-participant unicast ports above it (7400 + 250 × domain + 10 + 2 × participant for discovery, + 11 + 2 × participant for user data); tighter rules require static endpoint config.
  • Multi-NIC hosts: pin DDS to a specific interface (CYCLONEDDS_URI) — otherwise DDS may bind to a docker bridge or VPN interface and silently fail.
  • Wi-Fi: multicast on Wi-Fi is unreliable; use Fast DDS Discovery Server (centralised) or static endpoint config.
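Firewall rules are easier to write once the ports for a given domain are spelled out. The sketch below applies the default DDSI-RTPS port mapping (port base 7400, domain gain 250, participant gain 2, offsets 0/10/1/11). Vendors let you override these constants, so treat the output as the default case only; the function name is illustrative.

```python
# Default DDSI-RTPS port-mapping constants.
PB, DG, PG = 7400, 250, 2        # port base, domain gain, participant gain
D0, D1, D2, D3 = 0, 10, 1, 11    # per-traffic-class offsets


def rtps_ports(domain_id: int, participant_id: int = 0) -> dict:
    """UDP ports used by one participant in one domain (default mapping)."""
    base = PB + DG * domain_id
    return {
        "discovery_multicast": base + D0,
        "user_multicast": base + D2,
        "discovery_unicast": base + D1 + PG * participant_id,
        "user_unicast": base + D3 + PG * participant_id,
    }


print(rtps_ports(0))      # default domain, first participant
print(rtps_ports(1, 1))   # domain 1, second participant on the host
```

Each additional process on a host takes the next participant ID, so a firewall rule has to cover a range of unicast ports rather than a single one.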

Logging and tracing

  • Logging: RCLCPP_INFO(get_logger(), "fmt", args…); structured via the spdlog backend; per-node severity tuning. Output to console + log files in ~/.ros/log/.
  • Tracing: tracetools integrates with LTTng; record at microsecond resolution which callback fired when, executor scheduling, DDS publish/receive — essential for diagnosing executor-induced jitter.
  • rosbag2 records arbitrary topics to MCAP (preferred since Iron) or sqlite3. Replay with ros2 bag play --clock publishes /clock, so nodes with use_sim_time=true run on the recorded timeline.

Time

  • rclcpp::Time and rclcpp::Clock support multiple sources: RCL_SYSTEM_TIME, RCL_ROS_TIME (settable, used by sim), RCL_STEADY_TIME.
  • Set use_sim_time:=true parameter cluster-wide when replaying bags or running Gazebo; all clocks then read from /clock.
  • Cross-host clock sync: PTP (IEEE 1588) for sub-microsecond; NTP / chrony for ~ms. Required for any sensor fusion that timestamps events on different hosts.
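A quick way to decide between NTP and PTP is to convert the clock offset into a position error for a moving platform. The sketch below does that conversion; the 1.5 m/s speed and the offsets are assumed values chosen for illustration, and the function name is hypothetical.

```python
def timestamp_position_error_mm(speed_m_s: float, clock_offset_ms: float) -> float:
    """Apparent position error caused by a clock offset between two hosts.

    Metres per second times milliseconds gives millimetres directly.
    """
    return speed_m_s * clock_offset_ms


ntp_error = timestamp_position_error_mm(1.5, 20.0)    # 20 ms offset (NTP-class)
ptp_error = timestamp_position_error_mm(1.5, 0.001)   # 1 us offset (PTP-class)
print(f"NTP-class offset: {ntp_error} mm")
print(f"PTP-class offset: {ptp_error} mm")
```

If the first number is larger than the sensor's own accuracy, the clock sync is the weakest link in the fusion pipeline.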

5. Components & sourcing

Real packages, real install lines, real CI patterns.

Core ROS 2

sudo apt install ros-lyrical-ros-base       # minimal (no RViz, no demos)
sudo apt install ros-lyrical-desktop-full   # RViz + demos + perception

System install path: /opt/ros/lyrical/. Sourced via setup.bash.

RMW packages

  • ros-lyrical-rmw-cyclonedds-cpp — default, Eclipse Cyclone DDS.
  • ros-lyrical-rmw-fastrtps-cpp — eProsima Fast DDS, swap with export RMW_IMPLEMENTATION=rmw_fastrtps_cpp.
  • rmw_connextdds — commercial, install via RTI installer; binds to Connext Pro.
  • rmw_iceoryx_cpp — Eclipse Iceoryx; build from source or use Apex/Bosch images.

Build

  • colcon (pip install -U colcon-common-extensions or apt install python3-colcon-common-extensions).
  • rosdep for system-package resolution (rosdep install --from-paths src --ignore-src -y).
  • vcstool (vcs import < repos.yaml) for multi-repo workspaces.

Visualization

  • RViz 2 (ros-lyrical-rviz2) — the canonical 3D viewer; displays for TF, occupancy grids, point clouds, markers, robot model.
  • Foxglove Studio (foxglove.dev) — modern web/desktop viewer; MCAP-native, far better at large bags than RViz; free desktop, paid team cloud.
  • PlotJuggler (ros-lyrical-plotjuggler-ros) — multi-topic time-series plotter; the standard for tuning controllers and reviewing bag data.
  • rqt (ros-lyrical-rqt-common-plugins) — Qt-based GUI dispatcher; rqt_graph, rqt_topic, rqt_console, rqt_reconfigure.

Bag and replay

  • rosbag2 (ros-lyrical-ros2bag + ros-lyrical-rosbag2-storage-mcap) — MCAP is the recommended storage as of Iron+.
  • mcap CLI (Foxglove) — mcap info, mcap convert, mcap merge — works on bags outside ROS.

micro-ROS

  • micro-ROS/micro_ros_setup GitHub — installer for ESP32, STM32, Renesas RA, Zephyr, NuttX, FreeRTOS targets.
  • STM32CubeMX / IDE component; ESP-IDF component; Zephyr west module.
  • Agent: micro_ros_agent runs on the Linux host (USB-CDC, UDP, serial, or CAN FD transports).

Navigation, manipulation, and transforms

  • Nav2 (ros-lyrical-navigation2) — Steve Macenski et al.’s production navigation stack: planner, controller, costmap, behaviour-tree executor, recoveries.
  • slam_toolbox — Steve Macenski’s 2D LiDAR SLAM; the default 2D map builder under Nav2.
  • MoveIt 2 (ros-lyrical-moveit) — arm motion planning, kinematics, Ruckig-based trajectory smoothing; PickNik commercial support.
  • tf2_ros — coordinate frame transforms (broadcaster + listener + buffer).

Drivers (200+ packages)

Pattern: ros-${distro}-${vendor}-${device}. Examples:

  • ros-lyrical-velodyne — Velodyne LiDAR (VLP-16/32).
  • ros-lyrical-livox-ros-driver2 — Livox Mid-360 / Avia.
  • ros-lyrical-realsense2-camera — Intel RealSense D435/D455.
  • ros-lyrical-zed-ros2-wrapper — Stereolabs ZED 2/X.
  • ros-lyrical-vesc — VESC motor controllers (mobile bases).
  • ros-lyrical-ur-robot-driver — Universal Robots UR3e/5e/10e/16e/20.
  • ros-lyrical-franka-ros2 — Franka Emika Panda / Research 3.

Simulators

  • Gazebo Sim (gz-harmonic / gz-ionic) — the new modular Gazebo; integrates via ros_gz_bridge.
  • Nvidia Isaac Sim + isaac_ros packages — GPU-accelerated, photoreal, USD scenes.
  • Webots — open-source, free; webots_ros2.
  • MuJoCo MJX — fast for legged/contact-rich; mujoco_ros2_control.

Diagnostics

  • ros2 doctor — health check (RMW match, network, packages).
  • ros2 topic hz / bw / delay / echo — runtime introspection.
  • performance_metrics and performance_test — benchmark RMW latency / throughput.
  • tracetools + tracetools_analysis — LTTng tracing of executor and DDS.

IDE

  • VS Code + Microsoft “ROS” extension + C++ + Python + URDF preview.
  • CLion + JetBrains ROS plugin (community).
  • Vim/Emacs + clangd + pylsp + colcon make targets.

6. Reference data

Distro EOL matrix (as of 2026-05)

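| Distro | Released | EOL | LTS | Default RMW | Ubuntu |
| --- | --- | --- | --- | --- | --- |
| Foxy | 2020-06 | 2023-06 | Y | Fast DDS | 20.04 |
| Galactic | 2021-05 | 2022-12 | N | Cyclone | 20.04 |
| Humble | 2022-05 | 2027-05 | Y | Fast DDS | 22.04 |
| Iron | 2023-05 | 2024-11 | N | Fast DDS | 22.04 |
| Jazzy | 2024-05 | 2029-05 | Y | Fast DDS | 24.04 |
| Kilted | 2025-05 | 2026-12 | N | Fast DDS | 24.04 |
| Lyrical Luth | 2026-05 | 2031-05 | Y | Cyclone | 24.04 / 26.04 |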

RMW comparison

| RMW | License | Intra-host latency | Cross-host | Shm? | Security | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| Cyclone DDS | EPL/EDL | ~150 µs UDP | yes | optional | yes | default; broad fleet |
| Fast DDS | Apache 2 | ~30 µs (shm transport) | yes | yes (built-in) | yes | discovery-server fleets, certification path |
| RTI Connext | Commercial | ~80 µs | yes | yes | DO-178C, IEC 61508 | safety-critical aerospace/defence |
| Iceoryx | Apache 2 | < 10 µs (true zero-copy) | no (intra-host only) | yes (only mode) | no | tight RT pipelines on one host |

QoS policy reference

| Policy | Default profile | Match rule |
| --- | --- | --- |
| reliability | RELIABLE | producer ≥ consumer (R > BE) |
| durability | VOLATILE | producer ≥ consumer (TL > V) |
| deadline | Infinite | producer offered ≤ consumer requested |
| liveliness | AUTOMATIC | producer ≥ consumer (manual_by_topic > automatic); offered lease ≤ requested lease |
| lifespan | Infinite | not part of matching; expired samples are dropped |
| history | KEEP_LAST(10) | not part of matching, but bounds buffer |
| depth | 10 | local buffer size |

Common packages by domain

| Domain | Packages |
| --- | --- |
| Perception | image_transport, cv_bridge, pcl_ros, depthai-ros, vision_msgs |
| State estimation | robot_localization (EKF/UKF for /odom→/map), imu_filter_madgwick, gtsam_ros2 |
| SLAM | slam_toolbox, cartographer_ros, rtabmap_ros, lio_sam |
| Planning | nav2_* (bringup, planner, controller, behavior_tree), moveit2, pilz_industrial_motion_planner |
| Control | ros2_control, ros2_controllers (joint_trajectory, diff_drive, ackermann, mecanum), control_toolbox |
| Drivers | ros2_canopen, ros2_socketcan, serial_driver, joystick_drivers, usb_cam, realsense2_camera |
| Simulation | ros_gz, gazebo_ros2_control, isaac_ros_*, webots_ros2_* |
| Tools | rqt_*, rviz2, plotjuggler, foxglove_bridge, rosbag2, tracetools |

Topic conventions (REP 105 / REP 144 / REP 145)

| Topic | Type | Frame | Rate |
| --- | --- | --- | --- |
| /tf | tf2_msgs/TFMessage | mixed | event-driven |
| /tf_static | tf2_msgs/TFMessage | static | latched |
| /cmd_vel | geometry_msgs/Twist | base_link | 10–100 Hz |
| /odom | nav_msgs/Odometry | odom → base_link | 50–100 Hz |
| /joint_states | sensor_msgs/JointState | joint frames | 50–1000 Hz |
| /imu/data | sensor_msgs/Imu | imu_link | 100–1000 Hz |
| /scan | sensor_msgs/LaserScan | laser_frame | 10–40 Hz |
| /camera/image_raw | sensor_msgs/Image | camera_optical_frame | 30–60 Hz |
| /camera/camera_info | sensor_msgs/CameraInfo | camera_optical_frame | 30–60 Hz |
| /diagnostics | diagnostic_msgs/DiagnosticArray | | 1 Hz |

7. Failure modes and debugging

Topic-graph mismatches

  • Topic name typo / wrong remap → ros2 topic list shows the topic, but ros2 node info <node> shows it’s bound to the wrong name. Fix in the launch file’s remappings=[('/old', '/new')].
  • QoS incompatibility → publisher and subscriber both exist on the right topic, but no messages flow. ros2 topic info /topic --verbose lists every endpoint with its QoS profile; compare the publisher (offered) entries against the subscription (requested) entries. RELIABLE vs BEST_EFFORT is the #1 culprit; durability is #2.
  • Type mismatch → silent on the same topic name with different message types; ros2 topic info shows the type registered.

Discovery problems

  • No nodes seen across hosts → check ROS_DOMAIN_ID matches; check RMW_IMPLEMENTATION matches; check firewall (UDP 7400 + 250*domain). Run ros2 multicast receive on one host and ros2 multicast send on the other.
  • Discovery storm in fleets (50+ nodes) → switch to Fast DDS Discovery Server mode, which centralises discovery and removes the O(N²) multicast spam.
  • Nodes appear briefly then vanish → liveliness lease duration misconfigured; or DDS participant being created and destroyed by composition.
  • Wrong NIC bound → docker0 / VPN interface intercepts multicast. Pin with CYCLONEDDS_URI=<XML with interface=eth0> or FASTRTPS_DEFAULT_PROFILES_FILE.
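For the wrong-NIC case, the Cyclone DDS configuration can be passed inline through CYCLONEDDS_URI. The sketch below builds such a snippet with the standard library. It assumes the `General/Interfaces/NetworkInterface` schema used by recent Cyclone DDS releases (older releases used a `NetworkInterfaceAddress` element instead), so check it against the version you ship; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def cyclonedds_interface_xml(interface: str) -> str:
    """Build an inline Cyclone DDS config that pins one network interface."""
    root = ET.Element("CycloneDDS")
    domain = ET.SubElement(root, "Domain", {"Id": "any"})
    general = ET.SubElement(domain, "General")
    interfaces = ET.SubElement(general, "Interfaces")
    # Assumed schema: <NetworkInterface name="..."/> selects the NIC by name.
    ET.SubElement(interfaces, "NetworkInterface", {"name": interface})
    return ET.tostring(root, encoding="unicode")


uri = cyclonedds_interface_xml("eth0")
print(f"export CYCLONEDDS_URI='{uri}'")
```

Generating the snippet from the deployment inventory keeps the interface name in one place instead of scattered across shell profiles.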

Time and timing

  • use_sim_time mismatch across nodes → some nodes use wall time, others use /clock; transforms timestamped in disagreeing clocks fail tf2 lookups. Always set the parameter cluster-wide (in the YAML, not just on individual nodes).
  • Cross-host clock drift → if NTP only, expect ±5–50 ms; for tight sensor fusion you need PTP (chrony-ptp or ptp4l + phc2sys).
  • Bag replay rate drift → if the disk is slow, bag plays back slower than real-time; ros2 bag play --rate 1.0 only sets the requested rate, so measure the achieved rate with ros2 topic hz. Use NVMe storage for high-rate bags.

Lifecycle problems

  • Lifecycle deadlock → on_configure throws or never returns; transition blocks indefinitely. Wrap callbacks in try/catch returning TRANSITION_CALLBACK_FAILURE; add a 30 s timeout in the lifecycle manager.
  • Nav2 nodes won’t transition → check the lifecycle manager’s bringup order (Nav2 LifecycleManager brings up nodes in the order listed; map_server before amcl before planner before controller).

Memory and buffers

  • Slow subscriber, growing memory → KEEP_ALL history with no consumer-side flow control. Switch to KEEP_LAST(N) with a tight depth.
  • Late-join sees nothing on /tf_static → /tf_static must be TRANSIENT_LOCAL on both ends; the default profile in tf2_ros::StaticTransformBroadcaster is correct, but custom publishers often miss this.
  • Memory leaks in long runs → use valgrind --tool=memcheck (slow) or heaptrack; the usual offenders are message buffers in keep-all, or rclpy GIL-locked references.
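The slow-subscriber case is predictable from rates alone. A minimal sketch of the backlog arithmetic follows, assuming an unbounded KEEP_ALL queue and steady publish and consume rates; the 30 Hz, 20 Hz, 1 MiB, and 2 GiB figures are invented for illustration.

```python
MIB = 1024 * 1024


def backlog_growth_mib_s(pub_hz: float, consume_hz: float, msg_bytes: int) -> float:
    """Queue growth in MiB/s when the subscriber is slower than the publisher."""
    surplus_hz = max(0.0, pub_hz - consume_hz)
    return surplus_hz * msg_bytes / MIB


def seconds_until(limit_mib: float, growth_mib_s: float) -> float:
    """Time until the backlog reaches limit_mib; infinite if it never grows."""
    return float("inf") if growth_mib_s == 0 else limit_mib / growth_mib_s


growth = backlog_growth_mib_s(30, 20, 1 * MIB)   # 30 Hz in, 20 Hz out, 1 MiB each
time_to_2gib = seconds_until(2048, growth)
print(f"growth: {growth} MiB/s, 2 GiB reached after {time_to_2gib} s")
```

With KEEP_LAST(N) the same mismatch costs at most N messages of memory and shows up as dropped samples instead of a slow climb toward the OOM killer.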

Performance

  • High CPU at idle → spin loop in a custom node, often a while(rclcpp::ok()) with no sleep. Use executor.spin(), not spin_some in a busy loop.
  • Latency spikes → page faults from heap allocation in the RT path. Pre-allocate; mlockall; use real-time-safe loggers.
  • Python rclpy slow → GIL serialises callbacks; high-rate topics in Python lose messages. Move the hot path to C++; keep Python for orchestration and config.
  • spin_some starvation → in a custom loop calling executor.spin_some(), slow callbacks crowd out fast ones. Use MultiThreadedExecutor with reentrant callback groups.

Build

  • Stale colcon cache → colcon build --packages-select X --cmake-clean-cache; if that fails, rm -rf build/X install/X and rebuild.
  • Bloom-released package conflict → mixing apt-installed packages with workspace-built ones of the same name; the overlay wins for symbols but the cmake export may pick up the wrong path. colcon list --topological-order shows the build order.

Composition / executor

  • Composed nodes share a fault — a segfault in any of 10 nodes in a ComponentContainer kills the container. Separate safety-critical from experimental code.
  • spin_some re-entrancy — calling spin_some from inside a callback dispatched by the same executor → undefined behaviour. Use a different executor or callback group.

Inter-RMW

  • Mixed RMW deployment (Cyclone on robot, Fast DDS on workstation) → topic graph appears empty across the boundary, sometimes partially populated. All hosts in a domain must use the same RMW family in practice. The community gateway proposal is still draft as of 2026-05.

8. Case studies

TurtleBot 4 — the ROS 2 reference design (Clearpath / iRobot 2022)

The TurtleBot 4 is the canonical first ROS 2 robot for any new lab: a Create 3 mobile base + a Raspberry Pi 4 + an OAK-D camera (OAK-D-Pro on the Standard model, OAK-D-Lite on the Lite model), Lyrical-Luth-supported stock as of 2026-05. The architecture is a faithful exemplar of “what ROS 2 looks like on a real robot”:

  • Create 3 runs ROS 2 natively on its own onboard processor and links to the RPi over USB-C Ethernet or Wi-Fi, so it appears in the graph as ordinary ROS 2 nodes. The driver topics /cmd_vel, /odom, /imu, /wheel_status are first-class DDS topics.
  • Compute: RPi 4 (4 GB) running Ubuntu 22.04 + Humble or Ubuntu 24.04 + Jazzy; Cyclone DDS.
  • Stack: slam_toolbox for 2D mapping → Nav2 for path planning + controller, all lifecycle-managed under nav2_lifecycle_manager.
  • Why it matters: it bundles every “production” ROS 2 pattern — a ROS 2-native base at the actuator boundary, lifecycle-managed Nav2, MCAP bagging, Foxglove or RViz for visualisation, all reachable from a laptop on the same Wi-Fi (Discovery Server config is the documented workaround for flaky multicast).

The TurtleBot 4 has become the de facto syllabus reference for university robotics courses since 2023 — when a student asks “how do I make my robot do X?”, the answer is usually “look at how TurtleBot 4 does X, then change one node.”

Boston Dynamics Spot — community ROS 2 bridge (bdaiinstitute/spot_ros2)

Spot’s official SDK is gRPC + Protobuf, not ROS 2 — Boston Dynamics ships it that way for backward compatibility with non-ROS customers. The community-maintained spot_ros2 driver wraps the SDK:

  • A spot_driver lifecycle node holds the gRPC client → robot’s onboard compute over Wi-Fi or Ethernet (PoE on Spot Enterprise).
  • Publishes the standard ROS 2 topics: /odom, /tf (full kinematic tree from URDF), /joint_states, /cameras/<location>/image, /depth/<location>/image, /lidar/points (Velodyne VLP-16 on Spot Enterprise).
  • Provides action servers for Sit, Stand, WalkTo, ExecuteBehavior — wrapping BD’s choreographer.
  • Why it matters: it’s the working pattern for “company has a proprietary SDK, we want it in the ROS 2 graph” — write a lifecycle node bridge with the proprietary client inside and ROS 2 publishers/services/actions outside. The same pattern shows up for ABB IRC5 (abb_robot_driver), KUKA KRC (kuka_drivers), Fanuc R-30iB (fanuc_driver).
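The bridge pattern described above can be reduced to a skeleton: the proprietary client lives inside, lifecycle hooks control when it connects, and unit conversion happens at the boundary. The sketch below is plain Python with no ROS or vendor SDK imports. `FakeVendorClient`, `VendorPose`, and `BridgeNode` are hypothetical stand-ins, and the publisher is modelled as an injected callable rather than a real ROS 2 publisher.

```python
import math
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class VendorPose:
    """Hypothetical SDK pose type, in vendor units (mm, degrees)."""
    x_mm: float
    y_mm: float
    heading_deg: float


@dataclass
class FakeVendorClient:
    """Hypothetical stand-in for a proprietary SDK client."""
    connected: bool = False
    sent: list = field(default_factory=list)

    def connect(self):
        self.connected = True

    def disconnect(self):
        self.connected = False

    def read_pose(self):
        return VendorPose(1500.0, -250.0, 90.0)

    def send_velocity(self, forward_mm_s, turn_deg_s):
        self.sent.append((forward_mm_s, turn_deg_s))


class BridgeNode:
    """Vendor client inside, ROS-style interfaces (SI units) outside."""

    def __init__(self, client, publish_odom: Callable[[dict], None]):
        self._client = client
        self._publish_odom = publish_odom
        self._active = False

    # Lifecycle hooks: connect on configure, only move data while active.
    def on_configure(self):
        self._client.connect()

    def on_activate(self):
        self._active = True

    def on_deactivate(self):
        self._active = False

    def on_cleanup(self):
        self._client.disconnect()

    def poll_once(self):
        """Read vendor state and publish it in metres and radians."""
        if not self._active:
            return
        pose = self._client.read_pose()
        self._publish_odom({
            "x": pose.x_mm / 1000.0,
            "y": pose.y_mm / 1000.0,
            "yaw": math.radians(pose.heading_deg),
        })

    def on_cmd_vel(self, linear_x, angular_z):
        """Forward a velocity command (m/s, rad/s) in vendor units."""
        if not self._active:
            return False  # commands are dropped unless the node is active
        self._client.send_velocity(linear_x * 1000.0, math.degrees(angular_z))
        return True


published = []
bridge = BridgeNode(FakeVendorClient(), published.append)
bridge.on_configure()
bridge.on_activate()
bridge.poll_once()
print(published)
```

Keeping the conversion and the active/inactive gate in one class means the rest of the graph never sees vendor units or a half-connected robot.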

The bridge is GPLv3 / Apache 2, and Boston Dynamics has explicitly endorsed it for research and educational use. Integration with Nav2 and MoveIt 2 works out of the box for navigation, but the legged manipulator requires custom controllers.

Autoware Universe + ros2_control on autonomous vehicles (Tier IV, Apex.AI)

Autoware Universe is the open-source autonomous-driving stack maintained by the Autoware Foundation (Tier IV, Apex.AI, Arm, Toyota, others). It is the largest ROS 2 deployment by code volume (1.2 M LOC as of 2026-05), spanning perception, prediction, planning, and control. It has been running on Tier IV’s robobus pilots in Tokyo and Nagoya since 2022.

  • RMW: Apex.OS (Apex.AI’s safety-certified fork of ROS 2 + Cyclone DDS), or Fast DDS in non-certified deployments.
  • Compute: Nvidia Drive Orin (~250 TOPS) or industrial PC (i7/i9 + RTX GPU); on cars with redundancy, two compute nodes synchronised via a watchdog (one primary, one hot-spare).
  • ros2_control hardware interface drives the vehicle (throttle, brake, steering) via CAN: autoware_universe::vehicle_interface exposes /control/command/control_cmd and /vehicle/status/*; each vehicle vendor implements the hardware_interface::SystemInterface for their drive-by-wire kit (Pacifica Pacmod, Lexus RX450h custom, Aichi MOC).
  • Lifecycle: every Autoware subsystem (Perception, Localisation, Planning, Control) is a lifecycle group; the supervisor enforces dependency order at startup.
  • Tracing: LTTng + tracetools, integrated into Autoware’s CI to detect callback regressions ≥ 50 µs.
  • Real-time guarantee: 100 Hz control loop hard deadline; planner runs 10 Hz; perception 10–20 Hz. Achieved with PREEMPT_RT + Iceoryx intra-process for the perception → prediction → planner chain.
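The CI gate on callback regressions can be sketched as a comparison of per-callback durations between a baseline run and a candidate run. The example below assumes the durations have already been extracted from traces into plain lists of microseconds (the LTTng and tracetools side is not shown), and it uses the median as the comparison statistic, which is an assumption rather than Autoware's documented method. Callback names and sample values are invented.

```python
from statistics import median

THRESHOLD_US = 50.0  # from the text: regressions of 50 µs or more are flagged


def find_regressions(baseline_us, candidate_us, threshold_us=THRESHOLD_US):
    """Return {callback: slowdown_us} for callbacks that got slower.

    Both arguments map a callback name to a list of measured durations in
    microseconds. Only callbacks present in both runs are compared.
    """
    regressions = {}
    for name in sorted(baseline_us.keys() & candidate_us.keys()):
        slowdown = median(candidate_us[name]) - median(baseline_us[name])
        if slowdown >= threshold_us:
            regressions[name] = slowdown
    return regressions


# Invented sample data: the planner callback slows down, control does not.
baseline = {
    "planner_cb": [900, 910, 905],
    "control_cb": [120, 118, 122],
}
candidate = {
    "planner_cb": [960, 955, 970],
    "control_cb": [125, 121, 130],
}

for name, slowdown in find_regressions(baseline, candidate).items():
    print(f"{name}: median slower by {slowdown} us")
```

A CI job would fail the build when the returned dictionary is non-empty. In practice the statistic (median, p99, max) matters as much as the threshold, since a hard 100 Hz deadline is broken by worst-case latency rather than typical latency.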

Why it matters: this is the upper envelope of what ROS 2 supports, namely life-safety-critical software, multi-vendor hardware, and a certification trace, deployed at city scale. Apex.OS’s safety case is the public evidence that the architecture can be taken to ISO 26262 ASIL D (Apex.OS itself is ASIL D; mainline ROS 2 is QM with engineering controls).

9. Cross-references

Robotics (this library):

  • [[Robotics/comm-buses]] — companion systems note on the physical-layer buses (CAN, EtherCAT, I²C, SPI, UART, USB) under the ROS 2 driver/hardware-interface layer.
  • [[Robotics/pid-control]] — ros2_control’s controller plugins and the discrete PID loops they run.
  • [[Robotics/slam]] — slam_toolbox, cartographer_ros, rtabmap_ros, lio_sam are all ROS 2 nodes consuming /scan, /imu/data, /odom.
  • [[Robotics/path-planning]] — Nav2’s global / local planners, costmaps, behaviour-tree executor.
  • [[Robotics/trajectory-generation]] — MoveIt 2’s Ruckig-based trajectory smoother sits inside a JointTrajectoryController.
  • [[Robotics/manipulator-design]] — URDF + Xacro robot descriptions, ros2_control wiring of arms.
  • [[Robotics/mobile-base-wheeled]] — diff_drive_controller, ackermann_controller, mecanum_drive_controller in ros2_controllers.
  • [[Robotics/sensors-perception]] — camera, LiDAR, depth-sensor drivers and their ROS 2 message conventions.
  • [[Robotics/sensors-pose-motion]] — IMU, encoder, GPS drivers publishing /imu/data, /joint_states, /gps/fix.
  • planned [[Robotics/safety-standards]] — IEC 61508 / ISO 26262 / ISO 13849 path for ROS 2 (Apex.OS, mainline + engineering controls).

Engineering (foundations):

  • [[Engineering/realtime-embedded]] — PREEMPT_RT, SCHED_FIFO, RTOS schedulers, jitter analysis underlying ROS 2’s RT story.
  • [[Engineering/microcontrollers]] — STM32 / Renesas RA / ESP32 / RP2040, where micro-ROS runs.
  • [[Engineering/digital-control]] — sample rates, aliasing, Tustin discretization for the control loops inside ros2_control.
  • planned [[Engineering/Tier3/connector-families]] — UDP, multicast, NIC tuning, PTP/NTP that DDS rides on.
  • planned [[Engineering/realtime-embedded]] — consensus, time sync, fault tolerance for multi-robot fleets.

Languages:

  • [[Languages/Tier3/ros2-robotics-config]] — DSL surface: URDF, Xacro, behaviour-tree XML, launch.py, rclcpp/rclpy idioms, package.xml, CMakeLists.txt patterns.
  • [[Languages/Tier3/robotics-control]] — ros2_control plugin manifest, controller_manager YAML, the spawner script.
  • [[Languages/Tier3/3d-scene]] — point-cloud and mesh formats consumed by RViz and Foxglove.

10. Citations

Foundational papers

  • Macenski, S., Foote, T., Gerkey, B., Lalancette, C. & Woodall, W. (2022). “Robot Operating System 2: Design, Architecture, and Uses In The Wild.” Science Robotics 7(66), eabm6074. DOI 10.1126/scirobotics.abm6074. The canonical ROS 2 design paper.
  • Quigley, M., Conley, K., Gerkey, B., Faust, J., Foote, T., Leibs, J., Wheeler, R. & Ng, A. (2009). “ROS: an open-source Robot Operating System.” ICRA Workshop on Open Source Software. The original ROS 1 paper; concept lineage.
  • Maruyama, Y., Kato, S. & Azumi, T. (2016). “Exploring the performance of ROS2.” EMSOFT 2016, 5:1–5:10. DOI 10.1145/2968478.2968502. Early ROS 2 latency / jitter measurements.
  • Casini, D., Blass, T., Lütkebohle, I. & Brandenburg, B. (2019). “Response-Time Analysis of ROS 2 Processing Chains under Reservation-Based Scheduling.” ECRTS 2019. Formal analysis of ROS 2 executor scheduling.
  • Pham, T. P., Kounev, S. et al. (2022). “Reactive Microservices and ROS 2: Performance Insights.” ICAS 2022. Throughput benchmarking across RMWs.

Standards and REPs

  • OMG (2015). Data Distribution Service for Real-time Systems, Version 1.4. https://www.omg.org/spec/DDS/1.4/.
  • OMG (2022). Real-Time Publish-Subscribe Protocol DDS Interoperability Wire Protocol, Version 2.5 (DDSI-RTPS). https://www.omg.org/spec/DDSI-RTPS/2.5/.
  • ROS Enhancement Proposals: REP 105 (Coordinate Frames for Mobile Platforms), REP 144 (ROS Package Naming), REP 145 (Conventions for IMU Sensor Drivers), REP 2003 (Sensor Data and Map QoS Settings), REP 2007 (Type Adaptation Feature), REP 2014 (Benchmarking Performance in ROS 2). https://ros.org/reps/.
  • ISO 26262:2018. Road vehicles — Functional safety. Relevant to Apex.OS and Autoware certification.
  • IEC 61508:2010. Functional safety of electrical/electronic/programmable electronic safety-related systems.

Books

  • Martín Rico, F. (2023). A Concise Introduction to Robot Programming with ROS 2. Chapman & Hall / CRC. The current short text.
  • Newman, W. S. (2024). A Systematic Approach to Learning Robot Programming with ROS 2 (2nd ed.). CRC Press. Comprehensive textbook.
  • Quigley, M., Gerkey, B. & Smart, W. (2015). Programming Robots with ROS. O’Reilly. ROS 1 era; concepts mostly transfer.
  • Mahtani, A., Sanchez, L., Fernandez, E. & Martinez, A. (2018). ROS Robotics Projects (2nd ed.). Packt.

Software documentation

Open Robotics primary refs

