ROS 2 Architecture & Runtime — Robotics Reference
Scope. ROS 2 as a runtime system: nodes, executors, the DDS middleware, QoS, lifecycle, composition, the launch system, workspaces, and the operational gotchas that decide whether a robot ships. The DSL surface (URDF, Xacro, behaviour-tree XML, launch.py) lives in [[Languages/Tier3/ros2-robotics-config]] and [[Languages/Tier3/robotics-control]]. ros2_control’s controller plugins, command/state interfaces, and the FOC-adjacent current loops sit in [[Robotics/pid-control]]. This note is the systems spine that the other Robotics notes plug into.
1. At a glance
ROS 2 = open-source robotics middleware + tools + ecosystem. Replaces ROS 1 (Melodic 2018, Noetic 2020, EOL May 2025) with a DDS-based, real-time-capable, production-grade architecture. Maintained by Open Robotics → Open Source Robotics Foundation, with industry working groups (Apex.AI, Bosch, eProsima, iRobot, Nvidia, PickNik). Stable on Ubuntu (primary), Windows 10/11, macOS, and embedded Linux (Yocto, NXP, Raspberry Pi).
As of 2026-05, the active distro landscape:
| Distro | Release | EOL | LTS? | Notes |
|---|---|---|---|---|
| Foxy Fitzroy | Jun 2020 | Jun 2023 | yes | dead; legacy projects only |
| Galactic | May 2021 | Dec 2022 | no | dead |
| Humble Hawksbill | May 2022 | May 2027 | yes | still the most-deployed LTS in industry |
| Iron Irwini | May 2023 | Dec 2024 | no | dead; many docs still cite it |
| Jazzy Jalisco | May 2024 | May 2029 | yes | current “safe” LTS for new builds |
| Kilted Kaiju | May 2025 | Dec 2026 | no | bleeding-edge non-LTS |
| Lyrical Luth | May 2026 | May 2031 | yes | latest LTS, default for new projects in 2026 |
ROS 2 inherited ROS 1’s concepts but threw out its implementation:
- No `roscore`: discovery is peer-to-peer via DDS (or a Discovery Server).
- Topics, services, parameters retained; actions rewritten on top of topics+services.
- C++ and Python first-class via `rclcpp` / `rclpy`; both built on the same C-level `rcl` layer.
- Build system: `colcon` (Python) replaces `catkin_make`; `ament_cmake` / `ament_python` are the package conventions.
- Real-time path: preemptable kernel, lock-free executors, pre-allocated memory, zero-copy intra-process.
- Multi-machine, multi-vendor middleware: Cyclone DDS, Fast DDS, RTI Connext, Iceoryx shm.
- Lifecycle nodes (the managed-node design from design.ros2.org): explicit state machine for safe init / shutdown.
- Security: DDS-Security plugins for authentication, access control, and encryption.
Where it sits in the design stack. Sensors → drivers (ROS 2 publishers) → state estimation (robot_localization, SLAM nodes) → planner (Nav2, MoveIt 2) → controller (ros2_control + custom controllers) → hardware abstraction (hardware_interface) → motor drives. tf2 weaves transforms across all of it. rosbag2 records everything for replay.
First ask before applying ROS 2:
- Distro? Lyrical Luth (LTS, 2026–2031) for new projects; Humble (LTS until 2027) if your platform already runs it.
- RMW? Cyclone DDS (default, EPL-2.0 / EDL-1.0, broad), Fast DDS (eProsima, official partner), Connext (commercial, certified), or Iceoryx (shm, tightest latency).
- Real-time? PREEMPT_RT kernel + isolated CPUs + SCHED_FIFO + pre-allocated memory + multi-threaded executor.
- Multi-machine? Same RMW on all hosts, identical `ROS_DOMAIN_ID`, firewall open for DDS discovery (UDP 7400+).
- Embedded? micro-ROS (DDS-XRCE) on STM32 / Renesas RA / ESP32 / Zephyr, agent on a Linux host.
- Safety/regulated? Apex.OS (commercial certified fork), or DDS-Security + lifecycle + redundancy patterns.
2. First principles
Nodes and processes
A node is the unit of computation. A node owns publishers, subscribers, service servers/clients, action servers/clients, parameters, and a clock. Multiple nodes can live in one OS process (ComponentContainer / composition), each with its own rclcpp::Node instance. The node-to-process map is independent of the topic graph — that is what makes ROS 2 “composable.”
Topics — anonymous pub-sub
A topic is a typed multicast bus identified by a string name (/cmd_vel) and a message type (geometry_msgs/msg/Twist). Publishers and subscribers do not know about each other; DDS discovery matches them by name + type + compatible QoS. Many-to-many is supported (multiple publishers on the same topic, multiple subscribers).
Message types are defined in .msg IDL → generated C++/Python at build time. ROS 2’s rosidl toolchain emits both the type-support library and serializers per RMW.
Services — synchronous request/response
A service is a one-to-one (per-client-per-server) call: /get_state, /set_parameters. Request and response are separate typed messages. Servers return synchronously in their executor thread; clients can wait or use a future. Use services for idempotent, fast calls; never for long-running work — that’s what actions are for.
Actions — long-running goal/feedback/result
An action wraps the goal-feedback-result pattern over three services + two topics: goal, cancel, result services; feedback, status topics. The client can monitor progress, cancel, or query status. Used for Nav2.NavigateToPose, MoveIt.MoveGroup, FollowJointTrajectory. Implemented in rclcpp_action / rclpy.action.
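As a rough illustration of how one action name fans out into services and topics, the sketch below derives the hidden endpoint names that ROS 2 places under `<action>/_action/`. The helper name `action_endpoints` is made up for this note; it is plain string handling, not an rclpy call.

```python
def action_endpoints(action_name: str) -> dict:
    """Return the service and topic names that back one action.

    `action_endpoints` is a hypothetical helper for illustration only.
    ROS 2 hides these endpoints under the `_action` sub-namespace.
    """
    base = action_name.rstrip("/") + "/_action"
    return {
        # Three services: send a goal, cancel it, fetch the result.
        "services": [
            f"{base}/send_goal",
            f"{base}/cancel_goal",
            f"{base}/get_result",
        ],
        # Two topics: streaming feedback and goal status updates.
        "topics": [
            f"{base}/feedback",
            f"{base}/status",
        ],
    }


if __name__ == "__main__":
    for kind, names in action_endpoints("/navigate_to_pose").items():
        for name in names:
            print(kind, name)
```

On a live system, `ros2 topic list --include-hidden-topics` should show the two topics for any running action server.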
Parameters
A parameter is a per-node key-value (typed: bool / int / double / string / arrays of these). Declared at node startup; readable / writable via service at runtime (with set_parameters_atomically). YAML files (params.yaml) loaded via launch are the canonical config mechanism. Parameter change callbacks let nodes react to runtime tweaks.
Lifecycle nodes (managed nodes)
A lifecycle node is a node with a managed state machine:
- unconfigured ──configure──▶ inactive
- inactive ──activate──▶ active
- active ──deactivate──▶ inactive
- inactive ──cleanup──▶ unconfigured
- unconfigured / inactive / active ──shutdown──▶ finalized
Each transition fires a callback (on_configure, on_activate, …) where the node can allocate, open hardware, subscribe, or release. Critical for safe init order (hardware drivers come up before controllers) and clean shutdown (controllers stop before motors are de-energised). Used heavily by Nav2 (the whole stack is lifecycle-managed), ros2_control, and any autonomy stack that needs deterministic startup.
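The primary states and transitions can be sketched as a small table-driven state machine. This is a simplified model under stated assumptions: the class `LifecycleSketch` and the `TRANSITIONS` table are illustrative names, the intermediate transition states (configuring, activating, and so on) and error handling are left out, and nothing here touches rclpy.

```python
from enum import Enum


class State(Enum):
    UNCONFIGURED = "unconfigured"
    INACTIVE = "inactive"
    ACTIVE = "active"
    FINALIZED = "finalized"


# (current state, transition name) -> next primary state
TRANSITIONS = {
    (State.UNCONFIGURED, "configure"): State.INACTIVE,
    (State.INACTIVE, "activate"): State.ACTIVE,
    (State.ACTIVE, "deactivate"): State.INACTIVE,
    (State.INACTIVE, "cleanup"): State.UNCONFIGURED,
    (State.UNCONFIGURED, "shutdown"): State.FINALIZED,
    (State.INACTIVE, "shutdown"): State.FINALIZED,
    (State.ACTIVE, "shutdown"): State.FINALIZED,
}


class LifecycleSketch:
    """Hypothetical stand-in for a managed node; records which callback fired."""

    def __init__(self):
        self.state = State.UNCONFIGURED
        self.fired = []

    def trigger(self, transition: str) -> State:
        key = (self.state, transition)
        if key not in TRANSITIONS:
            raise ValueError(f"'{transition}' is not valid from {self.state.value}")
        # A real node would allocate, open hardware, or release here.
        self.fired.append(f"on_{transition}")
        self.state = TRANSITIONS[key]
        return self.state


if __name__ == "__main__":
    node = LifecycleSketch()
    for step in ("configure", "activate", "deactivate", "cleanup", "shutdown"):
        print(step, "->", node.trigger(step).value)
```

Rejecting invalid transitions with an exception mirrors why a lifecycle manager can enforce bring-up order: a controller cannot be activated before it has been configured.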
Composition
Multiple nodes can share one process via ComponentContainer. A composed graph benefits from:
- Intra-process communication: pointer-passing of shared messages between matched pub/sub in the same process. With `IntraProcessSetting::Enable` and reference-counted shared pointers, copy and serialize overhead disappear (≈ 5 µs vs ≈ 100 µs).
- Shared address space: debug with one debugger, profile with one perf trace.
- Reduced memory: one DDS participant, one set of OS threads (more or less).
Cost: a crash in any node tears down the whole container. The pattern: compose performance-critical chains (sensor → filter → publisher), isolate safety-critical / risky / experimental nodes.
DDS — what’s underneath
ROS 2’s wire protocol is OMG DDS (Data Distribution Service, v1.4) and DDSI-RTPS (Real-Time Publish-Subscribe v2.5, 2022). DDS is an industrial pub-sub standard from aerospace/defence (LAS-MAS, Aegis combat system, NASA). Key properties:
- Peer-to-peer discovery via multicast (Simple Discovery Protocol) or static config.
- Strongly-typed topics with QoS contracts; producer must offer ≥ consumer requests for the match to be valid.
- Reliable or best-effort delivery, configurable history depth, durability for late joiners.
- Multicast UDP by default; fall back to unicast on networks that block multicast (Wi-Fi, AWS VPC, cellular).
The RMW layer (ROS MiddleWare interface) is a thin abstraction over DDS. Switch RMW by setting RMW_IMPLEMENTATION=rmw_cyclonedds_cpp (or rmw_fastrtps_cpp, rmw_connextdds). Inter-RMW interoperability is limited: matched pub/sub need the same RMW family in practice. The community periodically discusses a “ROS 2 gateway” but as of 2026-05 none ships in core.
QoS — the contract that decides whether your message arrives
QoS policies on publisher (offered) must be ≥ subscriber (requested) for the pair to match. The policies you actually configure:
| Policy | Values | Use |
|---|---|---|
| reliability | RELIABLE / BEST_EFFORT | RELIABLE for commands/safety; BEST_EFFORT for high-rate sensor streams |
| durability | TRANSIENT_LOCAL / VOLATILE | TRANSIENT_LOCAL for late-join (static config, /tf_static); VOLATILE for live data |
| history | KEEP_LAST(N) / KEEP_ALL | KEEP_LAST(1–10) typical; KEEP_ALL only with bounded memory |
| depth | int | Buffer size when KEEP_LAST |
| deadline | duration | Triggers callback if no message in this window (liveness check) |
| liveliness | AUTOMATIC / MANUAL_BY_TOPIC | Heartbeat strategy |
| lifespan | duration | Drop stale messages on the wire |
Pre-baked QoS profiles (in rclcpp::QoS):
- `SensorDataQoS`: BEST_EFFORT, KEEP_LAST(5), VOLATILE.
- `ServicesQoS`: RELIABLE, KEEP_LAST(10), VOLATILE.
- `ParametersQoS`: RELIABLE, KEEP_LAST(1000), VOLATILE.
- `SystemDefaultsQoS`: defers every policy to the RMW vendor's own defaults; the plain default profile (e.g. `rclcpp::QoS(10)`) is RELIABLE, KEEP_LAST(10), VOLATILE.
- Static-transforms QoS: RELIABLE, KEEP_LAST(1), TRANSIENT_LOCAL (so late joiners see /tf_static).
Executors — how callbacks actually run
An executor is the loop that pulls events (subscription messages, timers, service requests, action callbacks) from the wait set and dispatches them to user code:
- SingleThreadedExecutor: one thread, one callback at a time. Simple; default for small nodes. A blocking callback stalls everything.
- MultiThreadedExecutor: N threads from a pool dispatch callbacks. Concurrency requires a `CallbackGroup`: Mutually Exclusive (default, callbacks in the same group never run concurrently) or Reentrant (callbacks may run in parallel, including the same callback re-entering).
- StaticSingleThreadedExecutor: pre-computes the wait set; lower per-iteration overhead, but no dynamic addition of entities.
- EventsExecutor (Lyrical+): event-driven rather than wait-set-polling; lower jitter, suitable for tight real-time loops.
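A common bug: a callback in the default mutually exclusive group makes a blocking service call on the same executor, so the response can never be dispatched → deadlock. Fix by putting the client in its own callback group on a MultiThreadedExecutor, or by calling asynchronously and handling the future in a callback.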
3. Practical math and worked examples
Example A — Topic latency budget
Producer publishes PointCloud2 chunks at 100 Hz, each chunk 1024 KiB (1 MiB), to a single subscriber:
- Payload rate: 100 × 1024 KiB = 100 MiB/s ≈ 839 Mbit/s of application data.
- Cyclone DDS over loopback — measured end-to-end ≈ 150 µs with default UDP socket buffers, no fragmentation issues until ~64 KiB; above that, RTPS fragments → 30+ fragments per message and reassembly overhead adds ~50 µs.
- Fast DDS shared-memory transport — intra-host ≈ 30 µs (skips UDP; uses POSIX shm + condition variables).
- Iceoryx (rmw_iceoryx): zero-copy with the loaned-message API (`borrow_loaned_message`) ≈ 8 µs. The subscriber receives a shared-memory pointer; no copy, no serialize.
- Cross-host 1 GbE: a 1 MiB message takes ≈ 8.4 ms on the wire at line rate (8,388,608 bits ÷ 1 Gbit/s), plus ~50–150 µs DDS overhead → ≈ 8.5 ms typical, and the 100 Hz stream alone consumes ~84% of the link; 10 GbE drops the wire time to ≈ 0.84 ms.
Numbers are from performance_test benchmarks on a Ryzen 7 PRO 5850U, Ubuntu 22.04, kernel 5.15. Yours will vary ±2× with CPU governor, socket buffer sizing, and competing traffic.
Implication: if you need < 50 µs intra-host pipelines (e.g., RT control with sensor in the loop), you need composition + intra-process or Iceoryx. UDP DDS is fine for sensor → planner links, where hundreds of microseconds of latency are tolerable, but not for the FOC inner current loop (that lives in firmware on the drive, see [[Robotics/pid-control]] §6).
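The raw arithmetic behind a budget like this is easy to check by hand. The sketch below computes payload rate, pure wire time, and link utilisation for the 1 MiB at 100 Hz case; the function names are illustrative, and the model ignores protocol headers, fragmentation, and DDS overhead.

```python
MIB = 1024 * 1024  # bytes in one MiB


def payload_rate_mbit_s(msg_bytes: int, rate_hz: float) -> float:
    """Application payload rate in Mbit/s (decimal megabits)."""
    return msg_bytes * 8 * rate_hz / 1e6


def wire_time_us(msg_bytes: int, link_bit_s: float) -> float:
    """Time to clock one message onto a link at line rate, in microseconds."""
    return msg_bytes * 8 / link_bit_s * 1e6


def link_utilisation(msg_bytes: int, rate_hz: float, link_bit_s: float) -> float:
    """Fraction of the link consumed by payload alone (0.0 to 1.0+)."""
    return msg_bytes * 8 * rate_hz / link_bit_s


if __name__ == "__main__":
    print(f"payload rate : {payload_rate_mbit_s(MIB, 100):.1f} Mbit/s")
    print(f"1 GbE wire   : {wire_time_us(MIB, 1e9):.0f} us per message")
    print(f"10 GbE wire  : {wire_time_us(MIB, 10e9):.0f} us per message")
    print(f"1 GbE load   : {link_utilisation(MIB, 100, 1e9):.0%}")
```

For this stream the wire time on 1 GbE is in the millisecond range and the link is mostly saturated by payload, which is a useful sanity check before attributing cross-host latency to the middleware.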
Example B — Composition vs separate processes
Pipeline: 5 perception nodes + 5 planning nodes, all passing 100 KB messages at 50 Hz, fully serial (each consumes from the previous).
Separate processes (10 RMW participants, full serialization):
- Per-hop cost ≈ 80 µs (serialize 30 + UDP loopback 30 + deserialize 20).
- End-to-end ≈ 9 hops × 80 µs ≈ 720 µs per cycle.
- Memory: 10 × ~30 MB DDS participant footprint = ~300 MB just in middleware.
Single ComponentContainer with intra-process (same node graph, one process):
- Per-hop cost ≈ 5 µs (`std::shared_ptr` pass).
- End-to-end ≈ 9 × 5 µs ≈ 45 µs per cycle, 16× faster.
- Memory: one DDS participant ≈ 30 MB; shared message buffers ref-counted.
Tradeoff to budget: a segfault in any of the 10 composed nodes tears down the whole pipeline. In production, compose the performance-sensitive chain and put the lifecycle manager, parameter bridge, and rosbag2 recorder in separate processes.
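The comparison in Example B reduces to hops times per-hop cost. The sketch below reproduces that arithmetic with the figures quoted above and also expresses each result as a share of the 20 ms cycle at 50 Hz; the helper names are made up for illustration and the per-hop costs are this note's estimates, not measurements from the code.

```python
def end_to_end_us(node_count: int, per_hop_us: float) -> float:
    """Latency of a fully serial chain: one hop between each pair of nodes."""
    return (node_count - 1) * per_hop_us


def cycle_share(latency_us: float, rate_hz: float) -> float:
    """Fraction of one cycle period spent on transport."""
    return latency_us / (1e6 / rate_hz)


if __name__ == "__main__":
    nodes, rate_hz = 10, 50
    separate = end_to_end_us(nodes, 30 + 30 + 20)  # serialize + loopback + deserialize
    composed = end_to_end_us(nodes, 5)             # shared-pointer hand-off

    print(f"separate processes: {separate:.0f} us ({cycle_share(separate, rate_hz):.1%} of cycle)")
    print(f"composed          : {composed:.0f} us ({cycle_share(composed, rate_hz):.2%} of cycle)")
    print(f"speed-up          : {separate / composed:.0f}x")
```

Seen as a share of the cycle, even the separate-process case is a few percent at 50 Hz, so the case for composition here rests as much on memory footprint and jitter as on mean latency.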
Example C — QoS pair for safety-critical odometry
Pipeline: wheel-odometry node → EKF (robot_localization) → controller. 100 Hz nominal; missing samples must be detected within 20 ms.
Producer (wheel-odom node) offered QoS:
- `reliability: RELIABLE`
- `history: KEEP_LAST`, `depth: 10`
- `durability: VOLATILE`
- `deadline: 20 ms`
- `lifespan: 100 ms`
Consumer (EKF) requested QoS:
- `reliability: RELIABLE`
- `history: KEEP_LAST`, `depth: 5`
- `durability: VOLATILE`
- `deadline: 20 ms`
Compatibility check — for each policy, the producer’s offer must be ≥ the consumer’s request. RELIABLE ≥ RELIABLE ✓, deadline = deadline ✓ (the consumer accepts producers that publish at least this often), durability VOLATILE = VOLATILE ✓. Match.
A missed deadline fires the RequestedDeadlineMissed callback on the subscriber — the EKF can degrade to dead-reckoning and raise a diagnostics warning to the safety supervisor. The same pattern wraps the controller’s cmd_vel consumer, allowing a watchdog to e-stop on stale commands.
If the producer instead offered BEST_EFFORT, the match would fail and the subscriber would receive nothing; the only hint is an incompatible-QoS warning in the node logs. This is the single most common “topic-is-there-but-no-messages” bug in ROS 2.
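The offered-versus-requested check can be sketched in a few lines. This is a simplified model, not the RMW's implementation: the `Qos` dataclass and `incompatible_policies` function are hypothetical names, only the four policies that take part in matching are modelled, and the liveliness lease duration is omitted.

```python
import math
from dataclasses import dataclass
from typing import List

# Higher rank = stronger guarantee. An offer must be at least the request.
RELIABILITY = {"BEST_EFFORT": 0, "RELIABLE": 1}
DURABILITY = {"VOLATILE": 0, "TRANSIENT_LOCAL": 1}
LIVELINESS = {"AUTOMATIC": 0, "MANUAL_BY_TOPIC": 1}


@dataclass(frozen=True)
class Qos:
    """Illustrative subset of a QoS profile (not an rclpy type)."""
    reliability: str = "RELIABLE"
    durability: str = "VOLATILE"
    deadline_ms: float = math.inf  # inf means "no deadline"
    liveliness: str = "AUTOMATIC"


def incompatible_policies(offered: Qos, requested: Qos) -> List[str]:
    """Return the policies that stop a publisher/subscriber pair from matching."""
    problems = []
    if RELIABILITY[offered.reliability] < RELIABILITY[requested.reliability]:
        problems.append("reliability")
    if DURABILITY[offered.durability] < DURABILITY[requested.durability]:
        problems.append("durability")
    # The publisher must promise a period no longer than the subscriber expects.
    if offered.deadline_ms > requested.deadline_ms:
        problems.append("deadline")
    if LIVELINESS[offered.liveliness] < LIVELINESS[requested.liveliness]:
        problems.append("liveliness")
    return problems


if __name__ == "__main__":
    odom_pub = Qos(deadline_ms=20)
    ekf_sub = Qos(deadline_ms=20)
    print("Example C:", incompatible_policies(odom_pub, ekf_sub) or "match")
    lossy_pub = Qos(reliability="BEST_EFFORT", deadline_ms=20)
    print("BEST_EFFORT offer:", incompatible_policies(lossy_pub, ekf_sub))
```

History and depth are deliberately absent: they size local buffers and do not affect whether the pair matches.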
4. Design heuristics
RMW selection
- Default: Cyclone DDS (`rmw_cyclonedds_cpp`). Eclipse-licensed, broad install base, used by Nav2 / MoveIt 2 tutorials; it was the official default in Galactic, while Humble and Jazzy default to Fast DDS. Mature.
- eProsima / certification path: Fast DDS (`rmw_fastrtps_cpp`). Apache 2; eProsima is an official ROS 2 partner; supports Discovery Server (centralised discovery, good for fleets), shared-memory transport (intra-host ≈30 µs), and security plugin.
- Commercial / safety-critical: RTI Connext DDS (`rmw_connextdds`). Paid; certified to DO-178C / IEC 61508; deployed in Boeing, Lockheed, Siemens energy products. Apex.OS (the safety-certified fork of ROS 2) builds on Cyclone DDS and Iceoryx instead.
- Lowest intra-host latency: Eclipse Iceoryx (`rmw_iceoryx_cpp`), a Bosch-originated shared-memory pub-sub, zero-copy, < 10 µs latency. Doesn’t work cross-host on its own; pair with Cyclone for cross-host plus Iceoryx intra-host (the “Gateway” pattern).
Lock the RMW per project; mismatched RMW across hosts in the same fleet rarely communicates despite both speaking DDSI-RTPS (interop is OMG-spec but vendor-extension flags rarely line up).
QoS choice
- Sensor streams: BEST_EFFORT, KEEP_LAST(1–5), VOLATILE. Lossy is fine; freshness wins.
- Commands and safety: RELIABLE, KEEP_LAST(10), VOLATILE, deadline set.
- Static config (TF static, robot description): RELIABLE, KEEP_LAST(1), TRANSIENT_LOCAL. Late-join subscribers see the latched value.
- Diagnostics: RELIABLE, KEEP_LAST(100), VOLATILE; lifespan = 5 s so old warnings drop.
- Large messages (PointCloud2, Image): BEST_EFFORT + KEEP_LAST(1) intra-host; switch to RELIABLE only if drops cause downstream failure.
Naming conventions (REP 105 / REP 144)
- Frames: `world`, `map`, `odom`, `base_link`, `base_footprint`, `imu_link`, `camera_link`, `<sensor>_optical_frame`.
- Topics: `/cmd_vel`, `/odom`, `/scan`, `/imu/data`, `/camera/image_raw`, `/camera/camera_info`, `/joint_states`, `/tf`, `/tf_static`.
- Services: `/get_X`, `/set_X`, `/list_X`, `/clear_X`.
- Per-robot namespace: `/robot1/cmd_vel` for multi-robot; configure via launch.
- Node names: match the package, e.g. `wheel_odometry_node`, `nav2_controller`. Keep them short; they appear in graphs.
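How a per-robot namespace turns a relative topic name into a fully qualified one can be sketched as follows. This is a simplified model of ROS 2 name resolution: `resolve_name` is a hypothetical helper, and remap rules and `{node}`-style substitutions are ignored.

```python
def resolve_name(name: str, namespace: str = "/", node_name: str = "node") -> str:
    """Expand a topic or service name the way a namespaced node would.

    - absolute names ("/tf") are left untouched
    - private names ("~/status") expand under namespace + node name
    - relative names ("cmd_vel") are prefixed with the namespace
    """
    ns = "/" + namespace.strip("/")   # normalise "robot1", "/robot1/", "/"
    prefix = ns.rstrip("/")           # "" for the root namespace
    if name.startswith("/"):
        return name
    if name == "~" or name.startswith("~/"):
        return f"{prefix}/{node_name}{name[1:]}"
    return f"{prefix}/{name}"


if __name__ == "__main__":
    for topic in ("cmd_vel", "/tf", "~/status"):
        print(topic, "->", resolve_name(topic, "/robot1", "wheel_odometry_node"))
```

This is why drivers that publish relative names such as `cmd_vel` can be reused unchanged across a fleet, while names hard-coded with a leading slash (as `/tf` conventionally is) stay global.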
Workspace layout
A colcon workspace:
my_ws/
├── src/
│ ├── my_package_a/
│ │ ├── package.xml
│ │ ├── CMakeLists.txt
│ │ └── src/...
│ └── my_package_b/
├── build/ # colcon build artefacts
├── install/ # final installed packages (this is sourced)
└── log/ # per-build logs
Overlay/underlay pattern: source /opt/ros/lyrical/setup.bash (the system underlay) → source install/setup.bash (your overlay). The overlay shadows the underlay package-by-package.
colcon build --packages-select my_package_a --symlink-install
source install/setup.bash
ros2 launch my_package_a bringup.launch.py

`--symlink-install` makes Python and config files editable without rebuilding.
Real-time
- Kernel: PREEMPT_RT (merged into mainline in Linux 6.12 for x86-64, arm64, and RISC-V; older kernels need the out-of-tree patch). Use a real-time kernel image (Ubuntu Pro RT).
- CPU isolation: `isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3` on the kernel command line. Pin RT threads to those CPUs with `sched_setaffinity`.
- Scheduling: `SCHED_FIFO` priority 80–95 for control threads; let GUI / logging stay SCHED_OTHER.
- Memory: pre-allocate publishers’ message pools (`rclcpp::PublisherOptions::allocator`); use `mlockall(MCL_CURRENT | MCL_FUTURE)` to lock pages; avoid heap in the RT path.
- Executor: `StaticSingleThreadedExecutor` or `EventsExecutor`; pin with `sched_setaffinity`.
- RMW: Iceoryx for low-jitter shm; Cyclone DDS with its `network_interface_address` pinned otherwise.
- Result: 100 µs–250 µs worst-case loop jitter is achievable on commodity x86 with PREEMPT_RT; sub-50 µs needs a Xenomai or Codesys-grade stack outside ROS 2.
Embedded / micro-ROS
For MCUs (Cortex-M, ESP32, Renesas RA, Zephyr boards), full DDS is too heavy (≥ 30 MB RAM). micro-ROS uses DDS-XRCE (Extremely Resource-Constrained Environment), a thin client protocol → an “agent” on a Linux host that bridges to full DDS. Memory footprint: < 100 KB RAM, < 200 KB flash. Supports rclc (C-only client library), publishers/subscribers/services, parameters, and (basic) lifecycle. Hooks into FreeRTOS / Zephyr / NuttX / bare-metal.
Pattern: STM32H7 motor controller runs micro-ROS publishing /joint_states and subscribing /joint_commands over USB-CDC or Ethernet → micro-ROS agent on the robot’s onboard Linux SBC bridges to the full DDS graph.
Security
DDS-Security plugin (SROS2) provides:
- Authentication — X.509 certificates per node identity.
- Access control — XML governance + permissions files restrict who can publish/subscribe to which topics.
- Cryptographic protection — AES-GCM payload encryption.
Generate keystore with ros2 security create_keystore; point ROS_SECURITY_KEYSTORE at it and set ROS_SECURITY_ENABLE=true ROS_SECURITY_STRATEGY=Enforce. Cost: ~5–15% extra latency, certificate management overhead, increased complexity in fleet rollout. It is one way to meet the authentication and encryption requirements of IEC 62443 in industrial deployments; optional for closed networks.
Multi-machine and fleets
- `ROS_DOMAIN_ID` (int 0–101 useful, 0–232 valid) partitions DDS traffic at the wire level. Different domain IDs = different UDP port ranges = no cross-talk. Always set a non-default domain ID in fleets to avoid accidental cross-vehicle traffic.
- Firewall: open UDP 7400 + 250 × domain + N for SPDP/SEDP discovery and user traffic; tighter rules require static endpoint config.
- Multi-NIC hosts: pin DDS to a specific interface (`CYCLONEDDS_URI`); otherwise DDS may bind to a docker bridge or VPN interface and silently fail.
- Wi-Fi: multicast on Wi-Fi is unreliable; use Fast DDS Discovery Server (centralised) or static endpoint config.
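For firewall rules it helps to compute the actual ports. The sketch below applies the default RTPS port mapping (port base 7400, domain gain 250, participant gain 2, offsets 0/10/1/11); `rtps_ports` is an illustrative helper, and vendors allow these constants to be overridden, so treat the output as the default case only.

```python
# Default RTPS port-mapping constants.
PORT_BASE = 7400
DOMAIN_GAIN = 250
PARTICIPANT_GAIN = 2
D0, D1, D2, D3 = 0, 10, 1, 11  # offsets for the four port kinds


def rtps_ports(domain_id: int, participant_id: int = 0) -> dict:
    """Default UDP ports used by one DDS participant in one domain."""
    base = PORT_BASE + DOMAIN_GAIN * domain_id
    ports = {
        "discovery_multicast": base + D0,
        "user_multicast": base + D2,
        "discovery_unicast": base + D1 + PARTICIPANT_GAIN * participant_id,
        "user_unicast": base + D3 + PARTICIPANT_GAIN * participant_id,
    }
    if max(ports.values()) > 65535:
        raise ValueError(f"domain {domain_id} overflows the UDP port range")
    return ports


if __name__ == "__main__":
    for domain in (0, 42):
        print(domain, rtps_ports(domain))
```

The overflow check shows where the 0–232 limit comes from: domain 233 would need ports above 65535. Each additional participant on a host takes the next pair of unicast ports, which is the "+ N" in the firewall rule.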
Logging and tracing
- Logging: `RCLCPP_INFO(get_logger(), "fmt", args…)`; structured via the spdlog backend; per-node severity tuning. Output to console + log files in `~/.ros/log/`.
- Tracing: `tracetools` integrates with LTTng; record at microsecond resolution which callback fired when, executor scheduling, DDS publish/receive. Essential for diagnosing executor-induced jitter.
- rosbag2 records arbitrary topics to MCAP (preferred since Iron) or sqlite3. Replay with `ros2 bag play --clock` publishes bag time on `/clock`; nodes with `use_sim_time=true` then follow it.
Time
- `rclcpp::Time` and `rclcpp::Clock` support multiple sources: `RCL_SYSTEM_TIME`, `RCL_ROS_TIME` (settable, used by sim), `RCL_STEADY_TIME`.
- Set the `use_sim_time:=true` parameter cluster-wide when replaying bags or running Gazebo; all clocks then read from `/clock`.
- Cross-host clock sync: PTP (IEEE 1588) for sub-microsecond; NTP / chrony for ~ms. Required for any sensor fusion that timestamps events on different hosts.
5. Components & sourcing
Real packages, real install lines, real CI patterns.
Core ROS 2
sudo apt install ros-lyrical-ros-base # minimal (no RViz, no demos)
sudo apt install ros-lyrical-desktop-full # RViz + demos + perception

System install path: /opt/ros/lyrical/. Sourced via setup.bash.
RMW packages
- `ros-lyrical-rmw-cyclonedds-cpp`: default, Eclipse Cyclone DDS.
- `ros-lyrical-rmw-fastrtps-cpp`: eProsima Fast DDS, swap with `export RMW_IMPLEMENTATION=rmw_fastrtps_cpp`.
- `rmw_connextdds`: commercial, install via RTI installer; binds to Connext Pro.
- `rmw_iceoryx_cpp`: Eclipse Iceoryx; build from source or use Apex/Bosch images.
Build
- `colcon` (`pip install -U colcon-common-extensions` or `apt install python3-colcon-common-extensions`).
- `rosdep` for system-package resolution (`rosdep install --from-paths src --ignore-src -y`).
- `vcstool` (`vcs import < repos.yaml`) for multi-repo workspaces.
Visualization
- RViz 2 (`ros-lyrical-rviz2`): the canonical 3D viewer; displays for TF, occupancy grids, point clouds, markers, robot model.
- Foxglove Studio (foxglove.dev): modern web/desktop viewer; MCAP-native, far better at large bags than RViz; free desktop, paid team cloud.
- PlotJuggler (`ros-lyrical-plotjuggler-ros`): multi-topic time-series plotter; the standard for tuning controllers and reviewing bag data.
- rqt (`ros-lyrical-rqt-common-plugins`): Qt-based GUI dispatcher; rqt_graph, rqt_topic, rqt_console, rqt_reconfigure.
Bag and replay
- `rosbag2` (`ros-lyrical-ros2bag` + `ros-lyrical-rosbag2-storage-mcap`): MCAP is the recommended storage as of Iron+.
- `mcap` CLI (Foxglove): `mcap info`, `mcap convert`, `mcap merge`; works on bags outside ROS.
micro-ROS
- `micro-ROS/micro_ros_setup` GitHub: installer for ESP32, STM32, Renesas RA, Zephyr, NuttX, FreeRTOS targets.
- STM32CubeMX / IDE component; ESP-IDF component; Zephyr west module.
- Agent: `micro_ros_agent` runs on the Linux host (USB-CDC, UDP, serial, or CAN FD transports).
Navigation and manipulation
- Nav2 (`ros-lyrical-navigation2`): Steve Macenski et al.’s production navigation stack: planner, controller, costmap, behaviour-tree executor, recoveries.
- `slam_toolbox`: Steve Macenski’s 2D LiDAR SLAM; the default 2D map builder under Nav2.
- MoveIt 2 (`ros-lyrical-moveit`): arm motion planning, kinematics, Ruckig-based trajectory smoothing; PickNik commercial support.
- `tf2_ros`: coordinate frame transforms (broadcaster + listener + buffer).
Drivers (200+ packages)
Pattern: ros-${distro}-${vendor}-${device}. Examples:
- `ros-lyrical-velodyne`: Velodyne LiDAR (VLP-16/32).
- `ros-lyrical-livox-ros-driver2`: Livox Mid-360 / Avia.
- `ros-lyrical-realsense2-camera`: Intel RealSense D435/D455.
- `ros-lyrical-zed-ros2-wrapper`: Stereolabs ZED 2/X.
- `ros-lyrical-vesc`: VESC motor controllers (mobile bases).
- `ros-lyrical-ur-robot-driver`: Universal Robots UR3e/5e/10e/16e/20.
- `ros-lyrical-franka-ros2`: Franka Emika Panda / Research 3.
Simulators
- Gazebo Sim (gz-harmonic / gz-ionic): the new modular Gazebo; integrates via `ros_gz_bridge`.
- Nvidia Isaac Sim + `isaac_ros` packages: GPU-accelerated, photoreal, USD scenes.
- Webots: open-source, free; `webots_ros2`.
- MuJoCo MJX: fast for legged/contact-rich; `mujoco_ros2_control`.
Diagnostics
- `ros2 doctor`: health check (RMW match, network, packages).
- `ros2 topic hz / bw / delay / echo`: runtime introspection.
- `performance_metrics` and `performance_test`: benchmark RMW latency / throughput.
- `tracetools` + `tracetools_analysis`: LTTng tracing of executor and DDS.
IDE
- VS Code + Microsoft “ROS” extension + C++ + Python + URDF preview.
- CLion + JetBrains ROS plugin (community).
- Vim/Emacs + `clangd` + `pylsp` + colcon make targets.
6. Reference data
Distro EOL matrix (as of 2026-05)
| Distro | Released | EOL | LTS | Default RMW | Ubuntu |
|---|---|---|---|---|---|
| Foxy | 2020-06 | 2023-06 | Y | Fast DDS | 20.04 |
| Galactic | 2021-05 | 2022-12 | N | Cyclone | 20.04 |
| Humble | 2022-05 | 2027-05 | Y | Fast DDS | 22.04 |
| Iron | 2023-05 | 2024-12 | N | Fast DDS | 22.04 |
| Jazzy | 2024-05 | 2029-05 | Y | Fast DDS | 24.04 |
| Kilted | 2025-05 | 2026-12 | N | Fast DDS | 24.04 |
| Lyrical Luth | 2026-05 | 2031-05 | Y | Cyclone | 24.04 / 26.04 |
RMW comparison
| RMW | License | Intra-host latency | Cross-host | Shm? | Security | Best for |
|---|---|---|---|---|---|---|
| Cyclone DDS | EPL/EDL | ~150 µs UDP | yes | optional | yes | default; broad fleet |
| Fast DDS | Apache 2 | ~30 µs (shm transport) | yes | yes (built-in) | yes | discovery-server fleets, certification path |
| RTI Connext | Commercial | ~80 µs | yes | yes | DO-178C, IEC 61508 | safety-critical aerospace/defence |
| Iceoryx | Apache 2 | < 10 µs (true zero-copy) | no (intra-host only) | yes (only mode) | no | tight RT pipelines on one host |
QoS policy reference
| Policy | Default profile | Match rule |
|---|---|---|
| reliability | RELIABLE | producer ≥ consumer (R > BE) |
| durability | VOLATILE | producer ≥ consumer (TL > V) |
| deadline | Infinite | producer offered ≤ consumer requested |
| liveliness | AUTOMATIC | producer ≥ consumer (manual_by_topic > automatic); offered lease ≤ requested lease |
| lifespan | Infinite | not part of matching; expires undelivered samples on the publisher side |
| history | KEEP_LAST(10) | not part of matching, but bounds buffer |
| depth | 10 | local buffer size |
Common packages by domain
| Domain | Packages |
|---|---|
| Perception | image_transport, cv_bridge, pcl_ros, depthai-ros, vision_msgs |
| State estimation | robot_localization (EKF/UKF for /odom→/map), imu_filter_madgwick, gtsam_ros2 |
| SLAM | slam_toolbox, cartographer_ros, rtabmap_ros, lio_sam |
| Planning | nav2_* (bringup, planner, controller, behavior_tree), moveit2, pilz_industrial_motion_planner |
| Control | ros2_control, ros2_controllers (joint_trajectory, diff_drive, ackermann, mecanum), control_toolbox |
| Drivers | ros2_canopen, ros2_socketcan, serial_driver, joystick_drivers, usb_cam, realsense2_camera |
| Simulation | ros_gz, gazebo_ros2_control, isaac_ros_*, webots_ros2_* |
| Tools | rqt_*, rviz2, plotjuggler, foxglove_bridge, rosbag2, tracetools |
Topic conventions (REP 105 / REP 144 / REP 145)
| Topic | Type | Frame | Rate |
|---|---|---|---|
| /tf | tf2_msgs/TFMessage | mixed | event-driven |
| /tf_static | tf2_msgs/TFMessage | static | latched |
| /cmd_vel | geometry_msgs/Twist | base_link | 10–100 Hz |
| /odom | nav_msgs/Odometry | odom→base_link | 50–100 Hz |
| /joint_states | sensor_msgs/JointState | joint frames | 50–1000 Hz |
| /imu/data | sensor_msgs/Imu | imu_link | 100–1000 Hz |
| /scan | sensor_msgs/LaserScan | laser_frame | 10–40 Hz |
| /camera/image_raw | sensor_msgs/Image | camera_optical_frame | 30–60 Hz |
| /camera/camera_info | sensor_msgs/CameraInfo | camera_optical_frame | 30–60 Hz |
| /diagnostics | diagnostic_msgs/DiagnosticArray | n/a | 1 Hz |
7. Failure modes and debugging
Topic-graph mismatches
- Topic name typo / wrong remap → `ros2 topic list` shows the topic, but `ros2 node info <node>` shows it’s bound to the wrong name. Fix in the launch file’s `remappings=[('/old', '/new')]`.
- QoS incompatibility → publisher and subscriber both exist on the right topic, but no messages flow. `ros2 topic info /topic --verbose` lists the QoS profile of every endpoint, so offered and requested can be compared side by side; the nodes also log an incompatible-QoS warning naming the offending policy. RELIABLE vs BEST_EFFORT is the #1 culprit; durability is #2.
- Type mismatch → silent on the same topic name with different message types; `ros2 topic info` shows the type registered.
Discovery problems
- No nodes seen across hosts → check `ROS_DOMAIN_ID` matches; check `RMW_IMPLEMENTATION` matches; check firewall (UDP 7400 + 250*domain). Run `ros2 multicast receive` on one host and `ros2 multicast send` on the other.
- Discovery storm in fleets (50+ nodes) → switch to Fast DDS Discovery Server mode, which centralises discovery and removes the O(N²) multicast spam.
- Nodes appear briefly then vanish → liveliness lease duration misconfigured; or DDS participant being created and destroyed by composition.
- Wrong NIC bound → docker0 / VPN interface intercepts multicast. Pin with `CYCLONEDDS_URI=<XML with interface=eth0>` or `FASTRTPS_DEFAULT_PROFILES_FILE`.
Time and timing
- `use_sim_time` mismatch across nodes → some nodes use wall time, others use `/clock`; transforms timestamped in disagreeing clocks fail tf2 lookups. Always set the parameter cluster-wide (in the YAML, not just on individual nodes).
- Cross-host clock drift → if NTP only, expect ±5–50 ms; for tight sensor fusion you need PTP (chrony-ptp or `ptp4l` + `phc2sys`).
- Bag replay rate drift → if the disk is slow, bag plays back slower than real-time; `ros2 bag play --rate 1.0` only sets the requested rate, so compare `ros2 topic hz` during replay against the recorded rate. Use NVMe storage for high-rate bags.
Lifecycle problems
- Lifecycle deadlock → `on_configure` throws or never returns; transition blocks indefinitely. Wrap callbacks in try/catch returning `TRANSITION_CALLBACK_FAILURE`; add a 30 s timeout in the lifecycle manager.
- Nav2 nodes won’t transition → check the lifecycle manager’s bringup order (`Nav2 LifecycleManager` brings up nodes in the order listed; map_server before amcl before planner before controller).
Memory and buffers
- Slow subscriber, growing memory → KEEP_ALL history with no consumer-side flow control. Switch to KEEP_LAST(N) with a tight depth.
- Late-join sees nothing on /tf_static → `/tf_static` must be TRANSIENT_LOCAL on both ends; the default profile in `tf2_ros::StaticTransformBroadcaster` is correct, but custom publishers often miss this.
- Memory leaks in long runs → use `valgrind --tool=memcheck` (slow) or `heaptrack`; the usual offenders are message buffers in keep-all, or rclpy GIL-locked references.
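The difference between an unbounded and a bounded history is easy to see in a toy simulation. The sketch below models a subscriber that consumes more slowly than the publisher produces; `simulate_backlog` is a hypothetical helper, `depth=None` stands in for an unbounded keep-all queue, and an integer depth stands in for KEEP_LAST(depth) with oldest-first eviction. It assumes the consumer is no faster than the publisher.

```python
from collections import deque
from typing import Optional, Tuple


def simulate_backlog(
    publish_hz: float,
    consume_hz: float,
    seconds: float,
    depth: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (messages still queued, messages dropped) after the run."""
    queue = deque(maxlen=depth)  # maxlen=None means unbounded
    dropped = 0
    credit = 0.0  # fractional messages the consumer may take so far

    for seq in range(int(publish_hz * seconds)):
        if depth is not None and len(queue) == depth:
            dropped += 1  # append below evicts the oldest sample
        queue.append(seq)

        credit += consume_hz / publish_hz
        while credit >= 1.0 and queue:
            queue.popleft()
            credit -= 1.0

    return len(queue), dropped


if __name__ == "__main__":
    print("unbounded  :", simulate_backlog(100, 50, 10))
    print("KEEP_LAST 5:", simulate_backlog(100, 50, 10, depth=5))
```

With a 100 Hz publisher and a 50 Hz consumer, the unbounded queue grows without limit for as long as the run lasts, while the bounded queue stays at or below its depth and pays for that with dropped samples.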
Performance
- High CPU at idle → spin loop in a custom node, often a `while(rclcpp::ok())` with no sleep. Use `executor.spin()`, not `spin_some` in a busy loop.
- Latency spikes → page faults from heap allocation in the RT path. Pre-allocate; `mlockall`; use real-time-safe loggers.
- Python rclpy slow → GIL serialises callbacks; high-rate topics in Python lose messages. Move the hot path to C++; keep Python for orchestration and config.
- `spin_some` starvation → in a custom loop calling `executor.spin_some()`, slow callbacks crowd out fast ones. Use `MultiThreadedExecutor` with reentrant callback groups.
Build
- Stale colcon cache → `colcon build --packages-select X --cmake-clean-cache`; if that fails, `rm -rf build/X install/X` and rebuild.
- Bloom-released package conflict → mixing apt-installed packages with workspace-built ones of the same name; the overlay wins for symbols but the cmake export may pick up the wrong path. `colcon list --topological-order` shows the build order.
Composition / executor
- Composed nodes share a fault: a segfault in any of 10 nodes in a `ComponentContainer` kills the container. Separate safety-critical from experimental code.
- `spin_some` re-entrancy: calling `spin_some` from inside a callback dispatched by the same executor → undefined behaviour. Use a different executor or callback group.
Inter-RMW
- Mixed RMW deployment (Cyclone on robot, Fast DDS on workstation) → topic graph appears empty across the boundary, sometimes partially populated. All hosts in a domain must use the same RMW family in practice. The community gateway proposal is still draft as of 2026-05.
8. Case studies
TurtleBot 4 — the ROS 2 reference design (Clearpath / iRobot 2022)
The TurtleBot 4 is the canonical first ROS 2 robot for any new lab: a Create 3 mobile base + a Raspberry Pi 4 + an RPLIDAR A1 + an OAK-D camera (OAK-D-Pro on the Standard, OAK-D-Lite on the Lite), Lyrical-Luth-supported stock as of 2026-05. The architecture is a faithful exemplar of “what ROS 2 looks like on a real robot”:
- The Create 3 base runs ROS 2 natively and links to the RPi over USB-C (Ethernet gadget) or Wi-Fi, so it appears in the graph as a set of ordinary ROS 2 nodes. The driver topics `/cmd_vel`, `/odom`, `/imu`, `/wheel_status` are first-class DDS topics.
- Compute: RPi 4 (4 GB) running Ubuntu 22.04 + Humble or Ubuntu 24.04 + Jazzy; Cyclone DDS.
- Stack: `slam_toolbox` for 2D mapping → Nav2 for path planning + controller, all lifecycle-managed under `nav2_lifecycle_manager`.
- Why it matters: it bundles every “production” ROS 2 pattern: a ROS 2-native base at the actuator boundary, lifecycle-managed Nav2, MCAP bagging, Foxglove or RViz for visualisation, all reachable from a laptop on the same Wi-Fi (Discovery Server config is the documented workaround for flaky multicast).
The TurtleBot 4 has become the de facto syllabus reference for university robotics courses since 2023 — when a student asks “how do I make my robot do X?”, the answer is usually “look at how TurtleBot 4 does X, then change one node.”
Boston Dynamics Spot — community ROS 2 bridge (bdaiinstitute/spot_ros2)
Spot’s official SDK is gRPC + Protobuf, not ROS 2 — Boston Dynamics ships it that way for backward compatibility with non-ROS customers. The community-maintained spot_ros2 driver wraps the SDK:
- A spot_driver lifecycle node holds the gRPC client → robot’s onboard compute over Wi-Fi or Ethernet (PoE on Spot Enterprise).
- Publishes the standard ROS 2 topics: /odom, /tf (full kinematic tree from URDF), /joint_states, /cameras/<location>/image, /depth/<location>/image, /lidar/points (Velodyne VLP-16 on Spot Enterprise).
- Provides action servers for Sit, Stand, WalkTo, ExecuteBehavior, wrapping BD’s choreographer.
- Why it matters: it’s the working pattern for “company has a proprietary SDK, we want it in the ROS 2 graph”. Write a lifecycle node bridge with the proprietary client inside and ROS 2 publishers/services/actions outside. The same pattern shows up for ABB IRC5 (abb_robot_driver), KUKA KRC (kuka_drivers), Fanuc R-30iB (fanuc_driver).
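The bridge pattern in the last bullet can be outlined without any ROS or vendor dependency. Everything named below is hypothetical: `FakeVendorClient` and `VendorPose` stand in for a proprietary SDK, `BridgeNode` mimics lifecycle callbacks by name only, and the published message is a plain dict rather than a real ROS 2 message type. The one concrete idea it demonstrates is the boundary: vendor units and connection handling stay inside, SI-unit messages come out.

```python
import math
from dataclasses import dataclass


@dataclass
class VendorPose:
    """Hypothetical vendor type: millimetres and degrees."""
    x_mm: float
    y_mm: float
    heading_deg: float


class FakeVendorClient:
    """Hypothetical stand-in for a proprietary SDK client."""

    def __init__(self):
        self.connected = False

    def connect(self):
        self.connected = True

    def disconnect(self):
        self.connected = False

    def read_pose(self):
        if not self.connected:
            raise ConnectionError("vendor client is not connected")
        return VendorPose(x_mm=1500.0, y_mm=-250.0, heading_deg=90.0)


class BridgeNode:
    """Vendor client inside, topic-style publishing outside."""

    def __init__(self, client, publish):
        self._client = client
        self._publish = publish  # callable(topic, message)
        self._active = False

    def on_configure(self):
        self._client.connect()

    def on_activate(self):
        self._active = True

    def on_deactivate(self):
        self._active = False

    def on_cleanup(self):
        self._client.disconnect()

    def poll_once(self):
        """Read one vendor sample and publish it in metres and radians."""
        if not self._active:
            return False
        pose = self._client.read_pose()
        self._publish("/odom", {
            "x": pose.x_mm / 1000.0,
            "y": pose.y_mm / 1000.0,
            "yaw": math.radians(pose.heading_deg),
        })
        return True
```

Tying the connection to configure/cleanup and the publishing to activate/deactivate means a supervisor can hold the link open while output is paused.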
The bridge is GPLv3 / Apache 2; Boston Dynamics has explicitly endorsed it for research and educational use; integration with Nav2 and MoveIt 2 works out of the box for navigation but requires custom controllers for the legged manipulator.
Autoware Universe + ros2_control on autonomous vehicles (Tier IV, Apex.AI)
Autoware Universe is the open-source autonomous-driving stack maintained by the Autoware Foundation (Tier IV, Apex.AI, Arm, Toyota, others). It is the largest ROS 2 deployment by code volume (1.2 M LOC as of 2026-05), spanning perception, prediction, planning, and control. It has been running on Tier IV’s robobus pilots in Tokyo and Nagoya since 2022.
- RMW: Apex.OS (Apex.AI’s safety-certified fork of ROS 2 + Cyclone DDS), or Fast DDS in non-certified deployments.
- Compute: Nvidia Drive Orin (~250 TOPS) or industrial PC (i7/i9 + RTX GPU); on cars with redundancy, two compute nodes synchronised via a watchdog (one primary, one hot-spare).
- ros2_control hardware interface drives the vehicle (throttle, brake, steering) via CAN: autoware_universe::vehicle_interface exposes /control/command/control_cmd and /vehicle/status/*; each vehicle vendor implements the hardware_interface::SystemInterface for their drive-by-wire kit (Pacifica Pacmod, Lexus RX450h custom, Aichi MOC).
- Lifecycle: every Autoware subsystem (Perception, Localisation, Planning, Control) is a lifecycle group; the supervisor enforces dependency order at startup.
- Tracing: LTTng + tracetools, integrated into Autoware’s CI to detect callback regressions ≥ 50 µs.
- Real-time guarantee: 100 Hz control loop hard deadline; planner runs 10 Hz; perception 10–20 Hz. Achieved with PREEMPT_RT + Iceoryx intra-process for the perception → prediction → planner chain.
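The timing figures in the bullets above reduce to simple arithmetic: a 100 Hz loop has a 10 ms period, and a CI gate flags any callback whose duration grew by 50 µs or more. The sketch below is not Autoware's tooling; the function names, callback names, and durations are made up for illustration, and it assumes per-callback durations have already been extracted from traces into plain dicts.

```python
def period_us(rate_hz):
    """Loop period in microseconds for a given rate in hertz."""
    return 1_000_000 / rate_hz


def find_regressions(baseline_us, candidate_us, threshold_us=50):
    """Return {callback: slowdown_us} for slowdowns at or above the threshold."""
    regressions = {}
    for name, base in baseline_us.items():
        if name not in candidate_us:
            continue  # callback removed or renamed; nothing to compare
        delta = candidate_us[name] - base
        if delta >= threshold_us:
            regressions[name] = delta
    return regressions


if __name__ == "__main__":
    # Illustrative durations in microseconds, not measured data.
    baseline = {"control_cb": 800, "planner_cb": 42_000}
    candidate = {"control_cb": 850, "planner_cb": 42_010}
    print(f"control period: {period_us(100):.0f} us")
    print(find_regressions(baseline, candidate))
```

Comparing against the loop period as well as the absolute threshold is a natural extension: a 50 µs slowdown is 0.5 % of a 10 ms control budget but negligible for a 100 ms planner cycle.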
Why it matters: this is the upper envelope of what ROS 2 supports — life-safety-critical software, multi-vendor hardware, certification trace, deployed at city scale. Apex.OS’s safety case for ROS 2 is the public proof that the architecture is fit for ISO 26262 ASIL D adjacency (Apex.OS itself is ASIL D; mainline ROS 2 is QM with engineering controls).
9. Cross-references
Robotics (this library):
- [[Robotics/comm-buses]]: companion systems note on the physical-layer buses (CAN, EtherCAT, I²C, SPI, UART, USB) under the ROS 2 driver/hardware-interface layer.
- [[Robotics/pid-control]]: ros2_control’s controller plugins and the discrete PID loops they run.
- [[Robotics/slam]]: slam_toolbox, cartographer_ros, rtabmap_ros, lio_sam are all ROS 2 nodes consuming /scan, /imu/data, /odom.
- [[Robotics/path-planning]]: Nav2’s global / local planners, costmaps, behaviour-tree executor.
- [[Robotics/trajectory-generation]]: MoveIt 2’s Ruckig-based trajectory smoother sits inside a JointTrajectoryController.
- [[Robotics/manipulator-design]]: URDF + Xacro robot descriptions, ros2_control wiring of arms.
- [[Robotics/mobile-base-wheeled]]: diff_drive_controller, ackermann_controller, mecanum_drive_controller in ros2_controllers.
- [[Robotics/sensors-perception]]: camera, LiDAR, depth-sensor drivers and their ROS 2 message conventions.
- [[Robotics/sensors-pose-motion]]: IMU, encoder, GPS drivers publishing /imu/data, /joint_states, /gps/fix.
- planned [[Robotics/safety-standards]]: IEC 61508 / ISO 26262 / ISO 13849 path for ROS 2 (Apex.OS, mainline + engineering controls).
Engineering (foundations):
- [[Engineering/realtime-embedded]]: PREEMPT_RT, SCHED_FIFO, RTOS schedulers, jitter analysis underlying ROS 2’s RT story.
- [[Engineering/microcontrollers]]: STM32 / Renesas RA / ESP32 / RP2040, where micro-ROS runs.
- [[Engineering/digital-control]]: sample rates, aliasing, Tustin discretization for the control loops inside ros2_control.
- planned [[Engineering/Tier3/connector-families]]: UDP, multicast, NIC tuning, PTP/NTP that DDS rides on.
- planned [[Engineering/realtime-embedded]]: consensus, time sync, fault tolerance for multi-robot fleets.
Languages:
- [[Languages/Tier3/ros2-robotics-config]]: DSL surface: URDF, Xacro, behaviour-tree XML, launch.py, rclcpp/rclpy idioms, package.xml, CMakeLists.txt patterns.
- [[Languages/Tier3/robotics-control]]: ros2_control plugin manifest, controller_manager YAML, the spawner script.
- [[Languages/Tier3/3d-scene]]: point-cloud and mesh formats consumed by RViz and Foxglove.
10. Citations
Foundational papers
- Macenski, S., Foote, T., Gerkey, B., Lalancette, C. & Woodall, W. (2022). “Robot Operating System 2: Design, Architecture, and Uses In The Wild.” Science Robotics 7(66), eabm6074. DOI 10.1126/scirobotics.abm6074. The canonical ROS 2 design paper.
- Quigley, M., Conley, K., Gerkey, B., Faust, J., Foote, T., Leibs, J., Wheeler, R. & Ng, A. (2009). “ROS: an open-source Robot Operating System.” ICRA Workshop on Open Source Software. The original ROS 1 paper; concept lineage.
- Maruyama, Y., Kato, S. & Azumi, T. (2016). “Exploring the performance of ROS2.” EMSOFT 2016, 5:1–5:10. DOI 10.1145/2968478.2968502. Early ROS 2 latency / jitter measurements.
- Casini, D., Blass, T., Lütkebohle, I. & Brandenburg, B. (2019). “Response-Time Analysis of ROS 2 Processing Chains under Reservation-Based Scheduling.” ECRTS 2019. Formal analysis of ROS 2 executor scheduling.
- Pham, T. P., Kounev, S. et al. (2022). “Reactive Microservices and ROS 2: Performance Insights.” ICAS 2022. Throughput benchmarking across RMWs.
Standards and REPs
- OMG (2015). Data Distribution Service for Real-time Systems, Version 1.4. https://www.omg.org/spec/DDS/1.4/.
- OMG (2024). Real-Time Publish-Subscribe Protocol DDS Interoperability Wire Protocol, Version 2.5 (DDSI-RTPS). https://www.omg.org/spec/DDSI-RTPS/2.5/.
- ROS Enhancement Proposals: REP 105 (Coordinate Frames for Mobile Platforms), REP 144 (ROS Package Naming), REP 145 (Conventions for IMU Sensor Drivers), REP 2003 (ROS 2 Policies), REP 2007 (Managed Lifecycle Nodes), REP 2014 (Hardware Interfaces and Controllers for ros2_control). https://ros.org/reps/.
- ISO 26262:2018. Road vehicles — Functional safety. Relevant to Apex.OS and Autoware certification.
- IEC 61508:2010. Functional safety of electrical/electronic/programmable electronic safety-related systems.
Books
- Martín Rico, F. (2023). A Concise Introduction to Robot Programming with ROS 2. Chapman & Hall / CRC. The current short text.
- Newman, W. S. (2024). A Systematic Approach to Learning Robot Programming with ROS 2 (2nd ed.). CRC Press. Comprehensive textbook.
- Quigley, M., Gerkey, B. & Smart, W. (2015). Programming Robots with ROS. O’Reilly. ROS 1 era; concepts mostly transfer.
- Mahtani, A., Sanchez, L., Fernandez, E. & Martinez, A. (2018). ROS Robotics Projects (2nd ed.). Packt.
Software documentation
- ROS 2 official documentation — https://docs.ros.org/en/lyrical/ (and per-distro mirrors).
- ROS 2 release timeline — https://docs.ros.org/en/rolling/Releases.html.
- ros2_control documentation — https://control.ros.org/.
- Nav2 documentation — https://navigation.ros.org/.
- MoveIt 2 documentation — https://moveit.picknik.ai/.
- Eclipse Cyclone DDS — https://cyclonedds.io/.
- eProsima Fast DDS — https://fast-dds.docs.eprosima.com/.
- RTI Connext DDS — https://www.rti.com/products/connext-dds-professional.
- Eclipse Iceoryx — https://iceoryx.io/.
- Apex.OS — https://www.apex.ai/apex-os.
- Autoware — https://autoware.org/.
- micro-ROS — https://micro.ros.org/.
- Foxglove — https://foxglove.dev/.
- MCAP file format — https://mcap.dev/.
Open Robotics primary refs
- Open Robotics REP index — https://www.ros.org/reps/rep-0000.html.
- ROS 2 design pages (architecture rationale) — https://design.ros2.org/.
- Open Robotics blog (Macenski, Lalancette posts on Discovery Server, composition, lifecycle) — https://www.openrobotics.org/blog.