Lock-Free Data Structures & RDMA

The frontier of single-machine and multi-machine performance has, for two decades, been about the same thing: avoiding the kernel, avoiding waits, and getting closer to bare silicon. On a single node this means lock-free / wait-free concurrent data structures built on atomic primitives. Across nodes it means kernel-bypass networking — DPDK in userspace and RDMA on the NIC itself.

By 2026 the lessons of these fields have moved into mainstream territory: HPC and AI clusters are dominated by InfiniBand/NDR and RoCE; financial trading runs on Solarflare/AMD AOC ef_vi or DPDK; key-value stores like Aerospike, ScyllaDB, and SingleStore lean on lock-free primitives and RDMA. The accelerator era (NVLink, CXL, UALink, UCIe) is the natural continuation — even sub-microsecond network hops are too slow when you can talk over a 900 GB/s coherent link.

This note catalogs the theory, the algorithms, the libraries, and the hardware.

1. The starting line — locks

Before lock-free, the standard concurrency primitives:

  • Mutexpthread_mutex_t, std::mutex. Uncontended acquire on x86 Linux is roughly 25-50 ns (a single locked CAS plus a few cache-coherence round-trips); contended acquire devolves into a kernel futex syscall (~1-10 μs) and an eventual context switch (~3-10 μs). Modern futex-based mutexes spin briefly before sleeping.
  • Spinlock — busy-wait via a lock xchg or lock cmpxchg loop. Better than mutex when the critical section is genuinely short (~< 1 μs) and the thread will not be preempted while holding the lock. In the Linux kernel spinlock_t always disables preemption while held.
  • Ticket lock, MCS lock, CLH lock — FIFO-fair spinlocks. MCS (Mellor-Crummey-Scott 1991) and CLH (Craig-Landin-Hagersten 1993) provide local-spinning, scalable behavior on NUMA. The Linux kernel qspinlock (2014, Long-Peters-Wang) is an MCS variant.
  • Reader-writer lockspthread_rwlock_t, std::shared_mutex. Allow multiple readers or one writer. Performance is worse than expected at high reader count because the read-side still touches a shared cacheline.
  • Condition variables, semaphores, barriers — higher-level synchronization built on the same primitives.

Standard threading APIs: POSIX pthreads, C11 <threads.h> (thrd_create), C++11 std::thread, Windows API, Win32 Slim Reader-Writer locks, Java java.util.concurrent.

2. Memory ordering — the model below the language

Modern hardware does not deliver sequentially consistent memory by default. Atomics and ordering are how a programmer regains control.

2.1 Models

  • Sequential consistency (SC) — Leslie Lamport 1979, “How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs”. There exists a total interleaving of all operations consistent with each thread’s program order. The simplest mental model, but expensive to implement.
  • x86-TSO — the x86 actual model, formalized by Sewell, Sarkar, Owens et al. (POPL 2009-2010). Effectively SC plus a per-CPU store buffer that allows a later load to bypass an earlier store. Almost all common idioms (Dekker’s algorithm, etc.) require an mfence to be correct.
  • ARMv8 / RISC-V relaxed (RVWMO) — substantially weaker. Stores and loads can be reordered freely except where ordering primitives prohibit it. Synchronization uses load-acquire / store-release instructions (ldar, stlr on ARMv8; lr.aq / sc.rl on RISC-V) or explicit fences (dmb ish, dsb ish, fence rw,rw).
  • PowerPC — also relaxed; uses lwsync, sync, eieio.

2.2 The C/C++/Rust memory model

C++11 introduced a portable memory model on top of these hardware models — defined in <atomic>, mirrored by C11, Rust core::sync::atomic, Java since JSR-133 (2004).

std::memory_order enum:

  • relaxed — atomicity only, no ordering.
  • consume — data-dependency ordering. Practically always implemented as acquire because compilers cannot reliably track dependencies. Effectively deprecated in C++17/20 wording.
  • acquire — load fences subsequent operations on this thread.
  • release — store fences preceding operations.
  • acq_rel — for RMW operations; both.
  • seq_cst — full sequential consistency. Default. On x86 means lock prefix or mfence after a store.

The fundamental rule: a store-release on location L synchronizes with a load-acquire that reads the released value from L. This creates a happens-before edge between the two threads — everything sequenced before the release in thread A is visible to anything sequenced after the acquire in thread B.

3. Atomic operations

Hardware atomic primitives by ISA:

  • x86lock cmpxchg (CAS), lock xadd (atomic fetch-add), lock xchg (swap), lock cmpxchg16b (double-word CAS on 16-byte aligned). All carry implicit full barriers.
  • ARMv8 — Load-Linked / Store-Conditional (ldxr / stxr) loop; ARMv8.1 added single-instruction LSE atomics (casal, ldadd, swp).
  • RISC-Vlr.w / sc.w LL/SC loop; A extension also provides AMO instructions (amoadd.w, amoswap.w).
  • PowerPClwarx / stwcx. LL/SC.

Software primitives across the major languages:

  • C/C++: std::atomic<T>; operations load, store, exchange, compare_exchange_weak, compare_exchange_strong, fetch_add, fetch_and, fetch_or, fetch_xor.
  • Java: java.util.concurrent.atomic.AtomicInteger / AtomicReference / VarHandle.
  • Rust: core::sync::atomic::{AtomicUsize, AtomicPtr, AtomicI32, ...}.

3.1 CAS vs LL/SC

  • CAS is what x86 has natively. The classic ABA problem: between read and CAS, another thread changes A → B → A. CAS thinks nothing changed, but in fact the pointer (or version) referenced different state in between.
  • LL/SC as on ARM/PowerPC/RISC-V doesn’t intrinsically suffer ABA — the SC fails if any write touched the cacheline. But LL/SC has its own pathologies (spurious failures, context-switch sensitivity), and in practice modern ARM CPUs implement LSE single-instruction atomics that effectively are CAS.

3.2 Double CAS and friends

  • DCAS / DWCAS — double-word CAS, cmpxchg16b on x86-64 (16 bytes); used for tagged pointers (pointer + version counter together).
  • MCAS — multi-word CAS, software-emulated via descriptors (Harris-Fraser-Pratt 2002).
  • TX/TSX — Intel Transactional Synchronization Extensions (Haswell 2013); HLE / RTM. After multiple security vulnerabilities (TAA) and microarchitectural issues, broadly disabled by microcode on most modern Intel parts.

4. The progress hierarchy

Maurice Herlihy formalized progress conditions for concurrent objects in his 1988 paper “Impossibility and Universality Results for Wait-Free Synchronization”.

  • Wait-free — every operation by every thread completes in a bounded number of its own steps regardless of contention. The strongest guarantee.
  • Lock-free — at any moment some thread is making progress. An individual thread may starve, but the system never gets stuck.
  • Obstruction-free — a thread executing in isolation (no contention) will complete in finite steps. The weakest non-blocking property.

In practice lock-free is the usual target — wait-free is often achievable but at significant constant-factor cost.

5. Classic lock-free algorithms

5.1 Stacks and queues

  • Treiber stack (1986). The canonical lock-free stack: head is an atomic pointer; push is a CAS-loop installing a new head whose next is the prior head; pop is a CAS-loop atomically swinging head to head->next. ABA is the primary hazard.
  • Michael-Scott queue (Michael-Scott PODC 1996). Foundational lock-free FIFO. Uses two pointers, head and tail, with CAS-based enqueue and dequeue and a “help the laggard” tail-update step. Java’s ConcurrentLinkedQueue is essentially M-S.
  • LCRQ — Linked Concurrent Ring Queues (Morrison-Afek 2013). High-throughput lock-free queue based on linked rings; outperforms M-S in throughput.
  • MPMCQueue — Vyukov / Folly. Bounded multi-producer multi-consumer ring buffer.

5.2 Hash tables

  • Cliff Click’s NonBlockingHashMap (2007). A lock-free hash table for Java; widely used in JVM-based databases (Cassandra historically used variants).
  • Hopscotch hashing — Herlihy-Shavit-Tzafrir, DISC 2008. Open-addressing hash with bounded probe distance; cache-friendly; concurrent variant.
  • Junction — Jeff Preshing 2016; open-source C++ lock-free hash tables.
  • folly::ConcurrentHashMap, F14, AtomicHashMap — Facebook.
  • Java ConcurrentHashMap — Doug Lea; redesigned in JDK 8 to use CAS + tree bins.

5.3 Skip lists and trees

  • Concurrent skip lists — Sundell-Tsigas 2003. Lock-free skip list. Java’s ConcurrentSkipListMap (Lea, JDK 6) is the canonical implementation.
  • Lehman-Yao B+-tree (1981). Lock-coupling concurrent B+-tree, the basis of nearly every concurrent B-tree since.
  • OLC — Optimistic Lock Coupling — Leis-Haubenschild-Kemper-Neumann 2019. Versioning + optimistic read; very fast in practice.
  • Bw-Tree — Levandoski-Lomet-Sengupta, ICDE 2013, Microsoft Research. Latch-free B-tree on top of a mapping table; used in SQL Server Hekaton.
  • Lock-free linked list — Harris, DISC 2001. Two-step deletion with mark-and-then-detach.
  • Lock-free BST — Ellen-Fatourou-Ruppert-van Breugel, PODC 2010.

6. Memory reclamation — the hard part

The single hardest problem with lock-free data structures is when can I free this node? Other threads may still hold pointers into it.

  • Hazard pointers — Maged Michael 2004, “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects”. Each thread publishes the pointers it is currently accessing in a per-thread slot. A retirer scans hazard pointers before freeing. Bounded memory. folly::hazptr is a production-grade implementation.
  • Epoch-based reclamation (EBR) — Keir Fraser, “Practical Lock-Freedom” 2004 (DPhil thesis). Threads enter/leave critical sections; the system tracks a global epoch counter; nodes retired in epoch N are freed once all threads have passed epoch N+2. Used in crossbeam-epoch and many production systems.
  • DEBRA — Distributed Epoch-Based Reclamation — Trevor Brown, PODC 2015. Faster, signal-handling-aware variant.
  • RCU — Read-Copy-Update — Paul McKenney, OLS 1998. The reader path is essentially free; writers copy, update, and wait for a grace period. In the Linux kernel since 2002; foundational for VFS, networking, and many subsystems.
  • URCU — Userspace RCU — Mathieu Desnoyers, “User-Level Implementations of Read-Copy Update” 2012; library widely used in user-space (LTTng, qemu, others).
  • Reference counting — Boost / std::shared_ptr with atomic refcounts. Thread-safe count manipulation, but cycles require weak references. Generally less performant than EBR or hazard pointers for read-heavy concurrent data structures.
  • DRC — Dynamic Region-Based Collection; various academic schemes.

7. High-performance concurrent systems

7.1 LMAX Disruptor

Martin Thompson, Dave Farley, Michael Barker (LMAX Exchange, 2011). “The LMAX Architecture”. A mechanical-sympathy ring buffer with a single-writer principle and producer/consumer sequence barriers. Achieves 100 M+ msgs/s single-threaded on commodity x86 with sub-microsecond latency. Widely copied in financial trading (CME, Deutsche Börse, exchanges generally).

7.2 Standard concurrent libraries

  • Rust crossbeam — Aaron Turon’s crossbeam 2015, now maintained by Stjepan Glavina; provides crossbeam-channel, crossbeam-deque (Chase-Lev work-stealing deque), crossbeam-epoch, lock-free queues, and segmented atomics.
  • Facebook follyMPMCQueue, ProducerConsumerQueue, F14HashMap (cache-friendly hash), ConcurrentHashMap, AtomicHashMap, hazptr, SharedMutex, Future.
  • Intel oneTBB (formerly Threading Building Blocks) — concurrent_hash_map, concurrent_queue, concurrent_vector, work-stealing task scheduler.
  • Java util.concurrent (JSR-166) — Doug Lea, since JDK 5 (2004). ConcurrentHashMap, ConcurrentLinkedQueue, ForkJoinPool, CompletableFuture, StampedLock. The reference design for many other ecosystems.
  • C++ standard parallelismstd::for_each(std::execution::par, ...) in C++17, but real workloads still use TBB or hand-rolled.
  • .NET TPL — Task Parallel Library; work-stealing.
  • Go runtime scheduler — M:N goroutines, work-stealing GC, channels as the primary synchronization primitive.

7.3 Discontinued / niche

  • Cilk Plus — MIT-origin work-stealing extension; Intel acquired and discontinued 2017. Influences live on in OpenMP tasks, TBB, and Go.

8. Kernel-bypass networking — why and how

The socket API has a hard floor on latency: every send/recv is a syscall (~1 μs), the kernel copies the buffer, schedules the NIC, and on receive raises an interrupt. End-to-end one-way kernel TCP/IP localhost is roughly 5-10 μs; same-rack TCP/IP is 20-50 μs.

For HFT (sub-microsecond required), HPC (μs-class collective operations), 5G UPFs, and high-PPS routers, this is unacceptable. Kernel-bypass moves the NIC driver and packet processing into user space.

8.1 DPDK — Data Plane Development Kit

Originated at Intel 2010, open-sourced under the Linux Foundation. Architecture:

  • Poll Mode Drivers (PMDs) — user-space drivers for nearly every modern NIC; the CPU spins polling RX queues instead of being interrupted.
  • Hugepages — 2 MB / 1 GB pages reduce TLB pressure for the packet pool.
  • EAL (Environment Abstraction Layer) — NUMA awareness, lcore affinity, ring buffers, mempools.
  • Vendor PMDs — Intel (igb_uio, vfio-pci ixgbe/i40e/ice), NVIDIA Mellanox (mlx5), Marvell (octeon), Broadcom, Solarflare/AMD.

Used widely in NFV (Open vSwitch DPDK datapath, VPP), 5G mobile cores (UPF, AMF/SMF data plane), F5 BIG-IP newer generations, Aviatrix, AVI Networks. The Linux Foundation’s FD.io VPP (Vector Packet Processing, originating at Cisco) is a major DPDK-based packet processing framework.

8.2 Solarflare ef_vi and OpenOnload

Solarflare (acquired by Xilinx in 2019, in turn acquired by AMD in 2022 as the AMD AOC group) makes ultra-low-latency NICs used by essentially every major HFT desk. Two paths:

  • OpenOnload — kernel-bypass implementation of the socket API. Application uses standard socket()/send()/recv(); library hijacks them. Drop-in.
  • ef_vi — raw verbs-like API for maximum performance.

End-to-end one-way latency is sub-microsecond on the same switch.

8.3 Other kernel-bypass paths

  • Netmap — Luigi Rizzo, ATC 2012. Userspace direct-NIC framework; influences XDP and others.
  • AF_XDP — Linux kernel 4.18 (2018). Userspace ring buffer driven by XDP (eXpress Data Path; eBPF-based) in the kernel. Less invasive than DPDK because the device stays under kernel control; lower performance ceiling but easier to deploy.
  • Snabb / Snabb Switch — Luke Gorrie; Lua-based; influential, less used today.
  • VPP — see Section 8.1.
  • Cloudflare bpfilter / XDP — eBPF used in production for DDoS scrubbing and L4 load balancing.

9. RDMA — Remote Direct Memory Access

RDMA NICs (called HCAs for InfiniBand, RNICs more generally) execute a full transport-layer protocol on the NIC silicon. The application registers a memory region with the NIC and posts work requests. The remote NIC writes (or reads) directly into application memory without waking the remote CPU.

9.1 Transports

  • RC — Reliable Connection — TCP-like; in-order, reliable; the most common. One QP (Queue Pair) per peer pairing.
  • UC — Unreliable Connection — connection-oriented but no reliability.
  • UD — Unreliable Datagram — one QP to many; like UDP. Used for one-to-many setup and metadata.
  • XRC — eXtended Reliable Connection — scales QP count by sharing receive queues among processes; large clusters use it.
  • DCT — Dynamically Connected Transport — Mellanox-specific; further scaling.

9.2 Operations

  • SEND / RECV — two-sided; the receiver must post a receive buffer in advance.
  • WRITE — one-sided; sender writes directly into a registered remote memory region. The remote CPU is not woken (unless the sender requests immediate / completion).
  • READ — one-sided; remote NIC fetches data and DMAs back.
  • ATOMIC — one-sided 8-byte fetch-and-add or compare-and-swap on remote memory. Restricted but very useful for distributed locking and counters.

9.3 Fabrics

  • InfiniBand — Mellanox (now NVIDIA Networking) and historically QLogic/Intel TrueScale. Speeds: SDR 10G → DDR 20G → QDR 40G → FDR 56G → EDR 100G → HDR 200G → NDR 400G → XDR 800G (roadmap). The dominant fabric for HPC Top500 systems and AI training clusters.
  • RoCE — RDMA over Converged Ethernet — v1 (Ethernet L2) / v2 (UDP/IP, routable). Requires PFC (Priority Flow Control) and ECN/DCQCN (Data Center QCN, Zhu et al. SIGCOMM 2015) for the near-lossless fabric RDMA assumes. Microsoft Azure pioneered large-scale RoCEv2 deployment.
  • iWARP — RDMA over TCP; Chelsio championed. Largely supplanted by RoCEv2.

9.4 Latency and throughput

End-to-end half round-trip:

  • Kernel TCP over Ethernet, same rack: 20-50 μs.
  • Kernel TCP localhost: 5-10 μs.
  • DPDK user-space TCP/UDP: 1-3 μs.
  • RDMA RC: 1-3 μs (HDR/NDR class).
  • NVLink (intra-server): ~200 ns.
  • Last-level cache hit: ~10-20 ns.
  • L1 hit: ~1 ns.

9.5 APIs

  • libibverbs — the canonical verbs API. Raw, brutal.
  • librdmacm — RDMA connection management on top of verbs.
  • UCX — Unified Communication X — Mellanox/ORNL/IBM; higher-level; the transport backend for Open MPI, modern MPICH variants.
  • MPI implementations — Open MPI, MPICH, MVAPICH (Dhabaleswar K. Panda, Ohio State), Intel MPI, NVIDIA HPC-X.
  • NCCL — NVIDIA Collective Communications Library; collective ops over NVLink + IB; the default for distributed deep learning.
  • GPUDirect RDMA — NVIDIA’s mechanism for RDMA NICs to DMA directly into GPU memory, bypassing host RAM entirely. Essential for H100/B200 SuperPOD scale.

9.6 Deployments

  • NVIDIA DGX SuperPOD — InfiniBand NDR/XDR backbone.
  • Microsoft Azure HBv4, ND H100 v5, ND H200 v5 — InfiniBand NDR.
  • Meta Grand Teton and Research SuperCluster — RoCEv2 and IB.
  • AWS EFA — Elastic Fabric Adapter — proprietary AWS, AWS-specific SRD (Scalable Reliable Datagram) on top of Nitro. RDMA-like semantics; not standard IBVerbs but accessed through libfabric.
  • GCP A3 / H3 / TPU pods — custom Google interconnect (ICI for TPUs, NVLink + GPUDirect for H100).
  • Top500 supercomputers — vast majority use InfiniBand or HPE Cray Slingshot.

10. Modern interconnects — accelerator era

The 1-3 μs RDMA latency is too slow for tightly-coupled accelerator workloads. The 2020s have produced a wave of coherent and semi-coherent fabrics.

  • NVLink — NVIDIA chip-to-chip and chip-to-switch links. NVLink 4 (Hopper H100): 900 GB/s per GPU. NVLink 5 (Blackwell B200/GB200): 1.8 TB/s per GPU. NVSwitch fabric extends NVLink across racks for SuperPOD scales.
  • NVLink C2C — Grace-Hopper chip-to-chip 900 GB/s coherent link.
  • CXL — Compute Express Link — Intel-initiated 2019, now industry consortium. CXL.io (PCIe), CXL.cache, CXL.mem. Versions 1.0 (2019), 2.0 (2020), 3.0 (2022), 3.1 (2023). Enables coherent device memory, memory expansion, and memory pooling across servers — a foundational technology for composable infrastructure.
  • UALink — Ultra Accelerator Link — AMD-led consortium with Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft (2024). Open competitor to NVLink for accelerator-to-accelerator links.
  • UCIe — Universal Chiplet Interconnect Express — chiplet-to-chiplet, on-package. Backed by all major silicon vendors.
  • HPE Cray Slingshot — proprietary, used in Frontier and El Capitan exascale systems.
  • Tenstorrent Ethernet-based chip-to-chip — uses Ethernet at scale.
  • AWS Trainium/Inferentia NeuronLink — proprietary AWS chip-to-chip.

11. Applications

11.1 Financial trading and exchanges

  • HFT shops: Jane Street, Citadel Securities, Jump Trading, Hudson River Trading, IMC, Optiver, Tower Research. Most run kernel-bypass (Solarflare OpenOnload or DPDK), FPGAs for first-level matching, and aggressive lock-free queues internally.
  • Exchanges: CME Globex, NYSE Pillar, Nasdaq INET, BATS / CBOE, ICE, EuroNext, LMAX (uses Disruptor in production), JPX, HKEX, ASX.
  • Market data: ITCH, OUCH, PITCH, FIX protocols; many decoded by FPGA before reaching software.

11.2 HPC and AI

  • Top500 supercomputers — Frontier (ORNL, HPE Cray, El Capitan), Aurora (Argonne, Intel), Fugaku (Riken, Fujitsu A64FX with Tofu), Lumi, Leonardo.
  • AI training clusters — Meta RSC, Microsoft Azure ND H100 clusters, NVIDIA Eos, xAI Colossus, OpenAI clusters.
  • Frameworks — PyTorch DDP / FSDP, DeepSpeed, Megatron-LM, JAX/XLA, NCCL collectives.

11.3 Distributed databases and storage

  • Google Spanner — paxos + atomic clocks (TrueTime); commodity networking.
  • CockroachDB — Raft + commodity networking.
  • Aerospike, ScyllaDB, SingleStore — heavy lock-free / RDMA optimizations.
  • FoundationDB — Apple; ACID with simulation-based testing.
  • NVMe-oF (NVMe over Fabrics) — RDMA-based remote block storage; underlying many cloud block services.
  • Lustre, GPFS / Spectrum Scale, BeeGFS, WekaFS, VAST Data, DAOS — HPC parallel file systems.

11.4 Caching / key-value

  • Redis, KeyDB, Dragonfly, Garnet (Microsoft 2024) — high-performance KV stores.
  • Memcached — McDipper / Twitter Twemcache variants.
  • mica, FaRM — academic + Microsoft Research RDMA KV stores (Dragojevic et al., NSDI 2014; Lim et al.).

11.5 Networking middleboxes

  • Cilium — eBPF-based service mesh and load balancer.
  • MetalLB, Envoy, BFE, HAProxy with kernel-bypass options.
  • F5 BIG-IP, Citrix ADC, A10 Networks — increasingly DPDK-based.

12. Verification of lock-free code

Concurrent code with relaxed memory is notoriously hard to reason about. The verification tooling:

  • Iris in Coq — separation logic for relaxed memory (Section in formal-verification-and-fuzzing).
  • RustBelt — Jung-Jourdan-Krebbers-Dreyer POPL 2018; proves Rust’s unsafe lock-free primitives are sound.
  • CDSChecker / GenMC / RCMC — model checkers for C/C++ atomics.
  • TLA+ PlusCal — for high-level distributed algorithms.
  • Loom — Rust crate; bounded permutation testing of concurrent code.
  • Linux kernel lockdep, KCSAN, KTSAN — dynamic race detection.
  • Google ThreadSanitizer (TSan) — dynamic data-race detection at runtime.

13. Where to learn

  • Maurice Herlihy and Nir Shavit, The Art of Multiprocessor Programming, 2nd ed. 2020. The textbook.
  • Paul McKenney, Is Parallel Programming Hard, And, If So, What Can You Do About It? — free, open, Linux-kernel oriented.
  • Doug Lea, Concurrent Programming in Java.
  • Anthony Williams, C++ Concurrency in Action.
  • Mara Bos, Rust Atomics and Locks. Modern, focused on Rust’s memory model.
  • Hennessy-Patterson, Computer Architecture: A Quantitative Approach — chapters on memory consistency.
  • DPDK programmer’s guide, RDMA-Aware Programming user manual, NVIDIA UCX documentation.

Conferences: PPoPP, SPAA, PODC, DISC for theory; SIGCOMM, NSDI, OSDI, SOSP for systems; SC (Supercomputing) for HPC.

Adjacent