Concurrency Primitives — Compute Reference

1. At a glance

Concurrency and parallelism are distinct concepts, often conflated. Concurrency (the term as Hoare used it in his 1978 CSP paper, “Communicating Sequential Processes,” CACM) is about structuring a program as multiple independently-executing logical activities — whether or not they actually run simultaneously. Parallelism is about simultaneity: multiple physical compute units (cores, SIMD lanes, GPU threads) doing work at the same wall-clock instant. A single-core machine can be concurrent (interleaved tasks) but not parallel. A vectorized matrix multiply on a GPU is parallel but not concurrent in the CSP sense — there is no interaction between lanes.

Rob Pike’s 2012 talk “Concurrency is not Parallelism” hammered this distinction into Go’s design: concurrency is a way to compose independent computations, parallelism is one implementation strategy for executing them faster.

Why concurrency is hard. Three compounding problems:

  1. Shared mutable state. Two threads reading and writing the same memory without coordination is the root cause of nearly every concurrency bug. The C11 / C++11 standards classify unsynchronized concurrent access as undefined behavior — not a bug with predictable symptoms, but a license for the compiler to emit anything.
  2. Non-determinism. The OS scheduler, hardware reordering, and cache coherence delays mean that the same program can produce different output on different runs, on different machines, or even on the same machine on different days. Reproducing a race is often harder than fixing it.
  3. Observability gap. Debuggers, prints, and assertions all perturb timing. A printf added to diagnose a race can change the schedule enough to mask the bug (“Heisenbug”).

The four dominant models. Every concurrent system uses one or a mix of:

  • Shared memory + locks. Threads share an address space; mutexes, rwlocks, semaphores, and atomics serialize access. The classical POSIX / Java / C++ model. Powerful but error-prone.
  • Message passing. Independent processes (or actors) communicate by sending immutable messages over channels or mailboxes. CSP (Hoare 1978), Actors (Hewitt 1973), Erlang/Elixir, Go, Akka, Pekko.
  • Task-based / async-await. Cooperative scheduling: tasks yield at await points; an event loop or runtime multiplexes them onto OS threads. Python asyncio, JS Promises, Rust futures + tokio, C# Task, Swift async.
  • Data-parallel. Apply the same operation to many data elements in parallel. SIMD (SSE, AVX, NEON, SVE), GPU compute (CUDA, Metal, ROCm), OpenMP parallel for, Rayon par_iter. Often map-reduce in spirit.

The rest of this note walks each primitive bottom-up, then shows the high-level patterns that compose from them.

A note on terminology drift. “Process” in Erlang means an actor; “process” in POSIX means an OS process with its own address space. “Thread” in Java prior to JDK 21 always meant an OS thread; “thread” in Java 21+ might be a virtual thread. “Future” and “promise” are used interchangeably in JavaScript and C++, but in Scala / Java a Future is the read side and a Promise is the write side. “Coroutine” in Kotlin is stackless and one-shot; “coroutine” in Lua is stackful and resumable. Be precise about which language’s vocabulary you’re using when comparing across ecosystems.

The end of single-thread speedup. From the 1970s through about 2005, single-thread performance roughly doubled every 18-24 months (Dennard scaling + microarchitectural improvements). Around 2005 Dennard scaling ended: voltages could no longer scale down, so per-core clock speeds plateaued at ~3-5 GHz. Herb Sutter’s 2005 essay “The Free Lunch Is Over” in Dr. Dobb’s Journal declared that programmers could no longer wait for hardware to make their software faster — they would have to embrace concurrency. That essay marks the cultural turning point where concurrent programming graduated from a specialist niche to a mainstream necessity.

2. Threads (OS-level)

A thread is an independent flow of control that shares its parent process’s address space. The OS scheduler multiplexes threads onto physical cores.

Implementations.

  • POSIX threads (pthreads) — IEEE 1003.1c-1995. The lingua franca on Linux, macOS, BSD. pthread_create, pthread_join, pthread_mutex_lock. NPTL (Native POSIX Thread Library, Ulrich Drepper + Ingo Molnár 2003) replaced LinuxThreads with a 1:1 mapping to kernel tasks via clone(2).
  • Win32 threadsCreateThread, WaitForSingleObject. Different API but the same 1:1 kernel mapping.
  • Java threadsjava.lang.Thread. Since Java 1.3 (2000) backed by native OS threads (green threads were dropped). As of Java 21 (2023), JEP 444 introduced virtual threads (Project Loom) that decouple the Java Thread from the OS thread — an M:N userspace scheduler underneath.
  • Go runtime (M:N) — Goroutines are userspace, scheduled by the Go runtime onto a pool of OS threads (the GOMAXPROCS “M:N” model: M goroutines on N OS threads).

Cost.

  • Stack size: default 1 MB on Linux (configurable via pthread_attr_setstacksize), 8 MB on macOS, 1 MB on Windows. Java defaults to 512 KB - 1 MB depending on platform.
  • Context switch: a kernel-mediated thread context switch costs roughly 1-10 µs on modern x86 — load/save of registers, TLB flush considerations, scheduler bookkeeping. Userspace context switches (goroutines, fibers) cost ~100-500 ns.
  • Scaling: you can comfortably run ~1000-10000 OS threads on Linux before scheduler overhead and memory pressure cause thrash. Beyond that, switch to a userspace model.

Why threads alone fail at high concurrency: a web server with 100k concurrent connections cannot afford one OS thread per connection (100 GB of stack reservation, hours of context-switching per second). That problem drove the rise of event loops and green threads. This is often called the C10K problem (Dan Kegel’s 1999 web page of the same name, updated through the 2000s as ulimit, epoll, kqueue, and IOCP made it tractable on commodity hardware), then succeeded by the C10M problem (Robert Graham, 2013) when 10 million concurrent connections became the new bar.

Thread-local storage (TLS). Each thread can have its own copy of a variable, accessed via __thread / thread_local (C/C++), pthread_key_create / pthread_getspecific (POSIX), ThreadLocal<T> (Java), task_local! in tokio. Useful for per-thread caches, scratch buffers, and tracing context. Caveat: thread-local destructors run at thread exit, and ordering between TLS destructors and process exit is famously delicate (especially in DLLs / shared libraries).

Thread cancellation. POSIX pthread_cancel is largely deprecated as too dangerous — a thread can be cancelled at arbitrary cancellation points, leaving locks held, files unclosed, or invariants broken. The modern equivalent is cooperative cancellation: a stop_token (C++20 std::stop_token), context.Context (Go), CancellationToken (.NET), tokio::select! with an abort channel — the worker periodically checks and exits cleanly. Java’s Thread.interrupt() is the closest the JVM gets, but it only delivers cancellation at certain checkpoints (Thread.sleep, Object.wait, blocking NIO).

3. Green threads, fibers, coroutines

A green thread (also “fiber,” “lightweight thread,” “coroutine”) is a user-space construct scheduled by a runtime rather than the kernel.

  • Go goroutines — start with ~2 KB stack that grows by reallocation. Cheap enough to spawn millions. Scheduled by the Go runtime’s work-stealing scheduler onto OS threads.
  • Rust async tasks — zero-cost futures (compiled to state machines, no per-task heap allocation by default). Scheduled by a runtime (tokio, async-std, smol).
  • Kotlin coroutinessuspend functions compiled to continuation-passing form; scheduled by Dispatchers.Default, IO, Main, etc.
  • Lua coroutines — symmetric coroutine.resume/coroutine.yield. Single-threaded by default; cooperative.
  • Python generators (yield) and asyncio tasks (async def + await).
  • C++20 coroutinesco_await, co_yield, co_return. Stackless, customization-point heavy.
  • Java virtual threads (Loom) — JEP 444, GA in Java 21. Mounted onto carrier (platform) threads; unmount on blocking I/O.

Trade-off: green threads are cheap (low memory, fast switch) but require runtime support and can starve if a single green thread monopolizes a carrier OS thread with a tight CPU loop (no yield points = no preemption, except where the runtime forces it — Go does this with a periodic preemption signal since 1.14).

Stackful vs stackless. Two implementation strategies for green threads:

  • Stackful (Go goroutines, Lua coroutines, Erlang processes, Java virtual threads, fibers in Boost.Context): each task gets its own stack (small but real); a yield/resume saves and restores the stack pointer. Pro: can suspend from any depth, including across foreign C calls. Con: each task has a memory footprint; growth strategies (segmented stacks, copying stacks) add complexity. Go used segmented stacks 1.0-1.2, then switched to contiguous, copying stacks in Go 1.3 (2014) after the “hot split” problem appeared at segment boundaries.
  • Stackless (Rust futures, Kotlin coroutines, C++20 coroutines, Python async def, JavaScript async/await): the compiler transforms the function into a state machine with a finite number of states. Pro: tiny per-task overhead, no stack reservation. Con: cannot suspend from inside a regular (non-async) function call; this is the structural origin of the “function coloring” problem (§10).

The Go runtime scheduler is a work-stealing M:N scheduler: each OS thread (M) has a local run queue (P, “processor”); when its queue empties, it steals from another P. Goroutines yield at function calls, channel ops, GC safe-points, and (since Go 1.14, Carlos Garnacho et al.) asynchronous preemption via OS signals — the latter fixed the long-standing problem of a tight loop with no calls hanging the scheduler.

4. Atomics and memory models

Below mutexes lies the hardware: atomic operations are indivisible memory updates that don’t tear, and memory models specify what reorderings of loads/stores are observable across cores.

4.1 Atomic operations

  • Atomic load / store — never observes a partial value; aligned word-sized accesses on x86 are already atomic by default but the compiler may still reorder them, so std::atomic (C++11) / _Atomic (C11) is still required.
  • Fetch-and-add (fadd) — atomic increment; used for reference counts.
  • Compare-and-swap (CAS, cmpxchg) — atomically compare a memory location to an expected value and, if equal, replace it. Foundation of nearly every lock-free data structure.
  • Load-linked / store-conditional (LL/SC) — ARM ldrex/strex, RISC-V lr.w/sc.w. Store succeeds only if no other core wrote between the LL and SC. Often what x86 CAS desugars to on ARM in compilers.

4.2 Memory orders (C11 / C++11)

The C11 (ISO/IEC 9899:2011) and C++11 (ISO/IEC 14882:2011) standards introduced an explicit memory model with six orderings:

OrderMeaning
memory_order_relaxedAtomicity only; no inter-thread ordering. Use for counters where you don’t need to synchronize other data.
memory_order_consumeData dependency ordering. Effectively unimplemented (compilers promote to acquire).
memory_order_acquireOn a load: subsequent reads/writes can’t be reordered before this load.
memory_order_releaseOn a store: prior reads/writes can’t be reordered after this store.
memory_order_acq_relBoth, for read-modify-write ops (CAS, fetch-add).
memory_order_seq_cstSequential consistency: a single global order observed by all threads. Default for std::atomic.

Acquire/release pairs are the bread and butter: a release-store on a flag publishes everything written before it; a matching acquire-load on the same flag synchronizes-with it and makes those writes visible.

4.3 The Java Memory Model

JSR-133 (Manson, Pugh, Adve 2004) defined Java’s memory model in terms of happens-before: actions in one thread happen-before actions in another iff there’s a chain of synchronization actions linking them. volatile reads/writes act like acquire/release. Final fields, once visible, never appear unset. Before JSR-133, the JLS had hand-wavy semantics that couldn’t be implemented correctly.

4.4 Hardware memory models

  • x86 (Intel + AMD) — Total Store Order (TSO). Loads are not reordered with other loads; stores are not reordered with other stores; loads can be reordered after stores (store buffer). This is strong: almost every algorithm written for sequential consistency works on x86 without explicit fences.
  • ARM (ARMv7, ARMv8) — weakly ordered. Almost any reordering is permitted unless an explicit barrier (dmb, dsb, isb) or acquire/release instruction (ldar, stlr on ARMv8) is used. This is where x86-developed code breaks on Apple Silicon and Graviton.
  • RISC-V — weakly ordered with RVWMO (“RISC-V Weak Memory Order”). Similar in spirit to ARM.
  • POWER / PowerPC — even weaker than ARM in some respects.

The classic illustration is Dekker’s algorithm with mo_relaxed flags: works on x86, deadlocks (or rather: both threads enter the critical section) on ARM because each thread’s flag write is reordered after its read of the other’s flag. Another famous example is the independent reads of independent writes (IRIW) litmus test, on which x86 TSO and ARM differ from sequential consistency.

Fences (memory barriers). Below the named memory orders sit raw fence instructions:

  • x86: mfence (full), lfence (load), sfence (store). x86 TSO means sfence is mostly redundant; mfence is the one you actually need.
  • ARMv8: dmb ish (data memory barrier, inner shareable domain), dmb ishld (load-load only), dmb ishst (store-store only).
  • C11 / C++11: atomic_thread_fence(mo).

Compiler fences (preventing reordering by the compiler without emitting any CPU instruction) are sometimes needed even on x86: asm volatile("" ::: "memory") in GCC/Clang, std::atomic_signal_fence in C++.

4.5 Cache coherence and store buffers

Modern multicore CPUs use a cache coherence protocol to keep per-core L1 caches consistent: MESI (Modified / Exclusive / Shared / Invalid) is the textbook version; MOESI (adding Owned) and MESIF (Forwarding, Intel’s variant) are common extensions.

When core A writes a cache line that core B holds, A sends an invalidate to B; B drops its copy; subsequent reads by B trigger a read-miss that fetches the line from A. This is the reason inter-thread communication is slow: ~50-100 cycles for a cache-line bounce vs ~4 cycles for an L1 hit.

The store buffer holds writes that haven’t yet drained to the cache. A core’s own subsequent reads see its own store buffer, but other cores don’t — this is the source of the “load reordered with prior store” effect on x86 TSO.

5. Mutexes and locks

A mutex (mutual exclusion lock) serializes access to a critical section. The flavors:

  • Spinlock — busy-waits in a tight CAS loop until the lock is free. Use only when expected contention is very short (nanoseconds) and the cost of a kernel call would dwarf the wait. Kernels use spinlocks heavily (spin_lock in Linux); userspace usually doesn’t.
  • Mutex (sleeping) — kernel-assisted; a thread that fails to acquire is put to sleep and woken when the lock is released. On Linux, implemented via the futex (fast userspace mutex, Hubertus Franke + Rusty Russell + Matthew Kirkwood, 2002): the fast path is a userspace atomic CAS; only contention goes into the kernel.
  • Reader-writer lock (rwlock) — multiple concurrent readers OR one exclusive writer. Useful when reads vastly outnumber writes. Watch for writer starvation under sustained read load; most rwlocks bias writers after some threshold. POSIX pthread_rwlock_t, C++17 std::shared_mutex.
  • Recursive mutex (reentrant) — the same thread may lock multiple times without deadlock. POSIX PTHREAD_MUTEX_RECURSIVE, Java synchronized, ReentrantLock. Usually a code smell: needing reentrance means your locking discipline isn’t clear about who owns the lock.
  • Adaptive mutex — spins briefly (a few hundred cycles) before falling back to sleep. The default pthread_mutex_t on glibc since around 2010. Combines spinlock latency for short contention with mutex efficiency for long contention.
  • Ticket lock — FIFO fairness; each thread takes a ticket and waits for its turn. Linux kernel used ticket spinlocks 2008-2014, then replaced with qspinlock (queued spinlock, Waiman Long 2014) which is cache-friendlier under contention.

Priority inversion + inheritance. When a low-priority thread holds a lock that a high-priority thread needs, and a medium-priority thread preempts the low-priority one, the high-priority thread is blocked indefinitely. This bit NASA’s Mars Pathfinder in 1997 (described in detail by Glenn Reeves’s “What Really Happened on Mars” post-mortem). The fix is priority inheritance: while a high-priority thread blocks on a mutex held by a lower-priority thread, the holder temporarily inherits the waiter’s priority. VxWorks fixed Pathfinder by flipping a config flag to enable it. Linux supports priority inheritance via pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT).

Lock poisoning. Rust’s std::sync::Mutex poisons itself if a thread panics while holding the lock; subsequent lock() calls return PoisonError and the caller must explicitly choose to recover. This forces the developer to think about whether the invariants protected by the lock were left in a consistent state after the panic. parking_lot::Mutex deliberately omits poisoning, viewing it as a footgun more often than a feature; that decision is one of the visible philosophical differences between the std and parking_lot worlds.

Scoped locks (RAII). C++ std::lock_guard / std::unique_lock / std::scoped_lock, Rust MutexGuard, Java try-with-resources around explicit Locks: tie lock release to a scope so exceptions and early returns can’t leak a held lock. std::scoped_lock (C++17) also takes multiple mutexes and acquires them deadlock-free via std::lock’s try-and-back-off algorithm.

6. Condition variables

A condition variable (CV) lets threads wait until a predicate becomes true, releasing a mutex while waiting and reacquiring it on wakeup. The canonical use:

pthread_mutex_lock(&m);
while (!predicate) {
    pthread_cond_wait(&cv, &m);   // atomically: unlock m, sleep, relock m on wake
}
// predicate holds; do work
pthread_mutex_unlock(&m);

Three rules every time:

  1. Always check the predicate in a while loop, never if. POSIX (and Java, and basically everyone) explicitly allows spurious wakeups: pthread_cond_wait may return without any signal having been sent. Implementations exploit this for efficiency.
  2. Always hold the mutex while checking the predicate and while signaling. The mutex protects the predicate’s data.
  3. signal wakes one waiter; broadcast wakes all. Use broadcast when the state change might satisfy multiple distinct predicates (different threads waiting for different things on the same CV).

Java’s equivalent is Object.wait() / notify() / notifyAll(); the equivalent in java.util.concurrent is Condition.await()/signal()/signalAll(). C++ has std::condition_variable::wait(lock, pred) which encapsulates the loop.

7. Semaphores

The semaphore is the oldest synchronization primitive, introduced by Edsger Dijkstra in 1965 (“Cooperating Sequential Processes,” EWD-123). A counting integer supporting two atomic operations:

  • P (proberen, “test”) / wait / down — decrement; block if result would be negative.
  • V (verhogen, “increment”) / signal / post / up — increment; wake one waiter if any.

A semaphore initialized to 1 is a mutex (binary semaphore). A semaphore initialized to N bounds concurrency to N. Classical applications:

  • Producer-consumer — two semaphores: empty (counts free buffer slots, init N) and full (counts filled slots, init 0). Producer: P(empty); produce; V(full). Consumer: P(full); consume; V(empty).
  • Dining philosophers (Dijkstra 1965, posed as a teaching problem) — five philosophers, five forks, prevent deadlock by acquiring lower-numbered fork first or by limiting concurrency to 4 with a semaphore.

POSIX semaphores: sem_init, sem_wait, sem_post. Java java.util.concurrent.Semaphore. Go has no built-in; chan struct{} of capacity N is the idiomatic semaphore.

Named vs unnamed semaphores. POSIX distinguishes sem_init (unnamed, lives in process memory or mmap-shared memory for cross-process use) from sem_open (named, kernel-tracked, lives in /dev/shm/sem.<name>). Named semaphores survive process exit unless explicitly sem_unlink-ed, a frequent source of “phantom” semaphores wedging system state. macOS deprecated sem_init (returns ENOSYS) and only supports named semaphores — a portability gotcha.

Semaphores vs CVs. Both can implement producer-consumer, but they differ in what state they encode. A CV is stateless — a wakeup that arrives before a waiter is lost. A semaphore remembers posts (its counter increments), so a post before a wait is not lost. This makes semaphores easier to reason about for simple counting patterns, while CVs are more flexible for complex predicates over shared state.

Why Linux dropped sysv semaphores. System V semaphores (semget, semop) predate POSIX and remain in the kernel for backward compatibility, but they have a famously awkward API (semaphore sets, undo lists, SEM_UNDO flag). POSIX semaphores are the right modern choice on Linux/Unix.

8. Channels and CSP

C.A.R. Hoare’s Communicating Sequential Processes (CACM August 1978) proposed message passing as an alternative to shared memory: processes are isolated; communication happens via named channels; reads and writes on the same channel synchronize.

  • Unbuffered (synchronous, rendezvous) channel — a send blocks until a paired receive arrives, and vice versa. No state in the channel; it’s pure synchronization. Hoare’s original CSP, occam, Go’s make(chan T) (zero capacity).
  • Buffered (asynchronous) channel — fixed capacity; send blocks only when full, receive blocks only when empty. Go’s make(chan T, N).

Go channels are the most widespread modern implementation:

  • ch <- v send, v := <-ch receive.
  • close(ch) signals end-of-stream; receivers see zero value + ok=false after drain.
  • select { case x := <-ch1: ... case ch2 <- y: ... case <-time.After(d): ... } for multi-channel multiplexing; non-deterministic among ready cases. The Go select corresponds directly to CSP’s alternative.

Rust channels:

  • std::sync::mpsc — multi-producer, single-consumer (the receiver isn’t Sync).
  • crossbeam crate (Stjepan Glavina et al.) — crossbeam::channel, multi-producer multi-consumer, with select! macro. Generally faster than the standard mpsc.
  • flume — another popular mpmc channel crate.

Python:

  • asyncio.Queue — async queue for tasks within one event loop.
  • multiprocessing.Queue — for process-to-process (pickled messages).
  • Trio (Nathaniel J. Smith) — MemoryChannel (send/receive halves), the spiritual successor of Go-style channels in Python.

Erlang/Elixir BEAM — no shared memory at all. Each process has a mailbox; Pid ! Message sends; receive ... end selects with pattern matching. This is actor-flavored CSP — the channel is conceptually built into the process pair.

occam and Transputer. Worth a historical note: David May’s occam language (Inmos, early 1980s) implemented CSP directly on the Transputer processor, which had hardware channel instructions. A short-lived but influential demonstration that CSP could be a primitive at the silicon level, not only at the language level. Go and Limbo (Plan 9) inherited the spirit; Rob Pike worked on both.

Channel anti-patterns. Unbounded buffered channels eliminate backpressure — the sender never blocks, so a slow consumer causes the channel to grow without bound until OOM. Always either use unbuffered channels (forces rendezvous) or bound the buffer to a value chosen for the actual latency-throughput trade-off. Closing a channel from the receiver side is also an anti-pattern in Go; the convention is that only the sender closes (otherwise a send on a closed channel panics).

Fan-out / fan-in patterns. A producer feeds N parallel workers via one channel (fan-out); the workers feed a single aggregator via another channel (fan-in). In Go, the canonical merge is a goroutine per source forwarding into one destination channel and sync.WaitGroup closing the destination when all sources finish. The pattern generalizes to any actor system; in async/await it’s tokio::select! over multiple receivers or asyncio.as_completed over futures.

Channel direction types. Go lets you constrain a channel parameter to send-only (chan<- T) or receive-only (<-chan T); the type system then forbids the wrong direction. This is the language-level analog to “the producer should never read from the queue and the consumer should never write” — caught at compile time rather than at code review.

9. Actors

Carl Hewitt’s actor model (with Bishop and Steiger, “A Universal Modular Actor Formalism for Artificial Intelligence,” IJCAI 1973; formalized by Gul Agha in his 1986 MIT PhD thesis “Actors: A Model of Concurrent Computation in Distributed Systems”) is the older sibling of CSP, with three differences in emphasis:

  • An actor has identity (its address); CSP processes communicate via channels, not addresses.
  • Sends are asynchronous; the sender doesn’t wait. Each actor has a mailbox.
  • An actor processes one message at a time. Inside that processing it can: change its own state, send messages, spawn new actors, change its behavior for the next message.

Implementations:

  • Erlang (Joe Armstrong, Robert Virding, Mike Williams at Ericsson, 1986; open-sourced 1998) and Elixir (José Valim 2011) — actors are first-class OS-process-like entities on the BEAM VM. Cheap: ~300 bytes per process, millions per node.
  • Akka (Jonas Bonér, 2009) — actors on the JVM. In September 2022, Lightbend relicensed Akka under BSL (no longer Apache 2.0); the community forked it as Apache Pekko which became a TLP in 2024.
  • Orleans (Microsoft Research 2010+, .NET) — “virtual actors” (grains) that are activated on demand and persisted to storage; built for Halo’s matchmaking and inventory services.
  • Ray (UC Berkeley RISELab 2017, OSS 2018) — Python distributed actors for ML; @ray.remote class defines an actor.
  • Pony (Sylvan Clebsch et al., 2014) — actor language with a reference-capability type system that statically prevents data races.

Let it crash + supervision. Erlang’s OTP introduced the slogan let it crash: don’t try to defensively program against every error inside an actor; let it die and let a supervisor restart it from a known-good state. Supervision trees define restart strategies (one-for-one, one-for-all, rest-for-one). This shifted reliability engineering from “no bugs in the actor” to “bounded blast radius for inevitable bugs.”

Mailbox overflow is the actor-model analog of unbounded channel growth: a fast producer + slow actor → mailbox grows without bound. Akka, Pekko, and Orleans all provide bounded mailboxes with overflow policies (drop-newest, drop-oldest, drop-buffer, fail, suspend producer). Choose explicitly.

Selective receive. Erlang’s receive pattern-matches over the mailbox; messages that don’t match are skipped and remain in the mailbox for later. This is powerful — it lets an actor temporarily ignore irrelevant messages while waiting for a specific reply — but it’s also a frequent source of subtle bugs (O(n) scan per receive if the mailbox grows). The Erlang VM added receive ... after timeouts and OTP-style gen_server:call/cast to channel most everyday use through a structured pattern.

Actor vs CSP, restated. A useful frame from Joe Armstrong: in CSP, channels are first-class and processes are second-class (you name channels and refer to them, processes are anonymous). In the actor model, processes are first-class and channels are second-class (you name actors via their PID and address messages to them; the “channel” is the implicit pair between sender and receiver). The two models are mathematically equivalent in expressiveness but the practical ergonomics differ. CSP’s named channels make pipeline composition easy; actors’ named addresses make request-reply patterns easy.

Distributed actors. Erlang actors are location-transparent: a PID can refer to a process on the same node or another node, with the same send syntax. This is both Erlang’s killer feature (built-in distribution) and a footgun (latency and partial failure are hidden behind a local-looking API — see Waldo et al.’s “A Note on Distributed Computing” 1994 for the canonical warning). Akka faced the same tension and explicitly recommends against treating remote actors as identical to local ones.

Backpressure in actor systems. Push-based mailboxes accumulate work indefinitely; pull-based actors (where the consumer requests N messages) are the actor-system answer to the unbounded-channel problem. Reactive Streams (Kaazing, Lightbend, Netflix, Pivotal, Red Hat, Twitter; 2015 spec, JDK 9 Flow interfaces) standardizes this contract across Akka, RxJava, Reactor, Vert.x, and others. The four method names — onSubscribe, onNext, onError, onComplete plus Subscription.request(n) — are now lingua franca for backpressured async streams across the JVM ecosystem.

10. Async / await

A future (or promise) represents a value that may not yet be computed. Async/await is syntactic sugar over futures: an async function returns a future; await suspends until the future resolves, yielding control to the event loop.

  • JavaScriptPromise (since ES2015), async/await (ES2017). Single-threaded event loop in browsers and Node; parallelism via Web Workers / worker_threads.
  • Pythonasyncio (PEP 3156, Guido van Rossum, Python 3.4 2014); async def / await (PEP 492, Yury Selivanov, Python 3.5 2015).
  • RustFuture trait (RFC 2592); async/.await stabilized in 1.39 (Nov 2019). Compiled to state machines, no runtime allocation per task. A runtime (tokio, async-std, smol, embassy) drives the futures.
  • C#Task<T> + async/await (C# 5, .NET 4.5, 2012). Anders Hejlsberg championed it; arguably the design that mainstreamed the syntax.
  • Kotlinsuspend fun + coroutines (Kotlin 1.3, 2018). Compiled to CPS (continuation-passing style).
  • Swiftasync/await (Swift 5.5, 2021), with actor types added the same year.

The function-coloring problem. Bob Nystrom’s 2015 essay “What Color Is Your Function?” pointed out that async functions form a separate ecosystem from sync functions: an async function can only be awaited from inside another async function; calling a sync function from async is free, but calling async from sync requires either blocking (.get(), .wait(), tokio::block_on) or running an event loop. Library authors face the awkward choice of writing each function twice, picking a color, or providing both (often via macros like maybe_async).

Async vs threads. Async tasks are typically much cheaper than OS threads (a few hundred bytes of state vs a megabyte of stack), and a context switch between async tasks at an await point is essentially a function return. The cost: composing with blocking code is awkward — a time.sleep(1) in a Python asyncio coroutine stalls the whole event loop. Async runtimes provide thread-pool offload (asyncio.to_thread, tokio::task::spawn_blocking) to bridge.

Structured concurrency. Martin Sústrik’s libdill (2016) and Nathaniel Smith’s “Notes on structured concurrency, or: Go statement considered harmful” (2018 blog post) crystallized a principle: every spawned task should belong to a scope that does not exit until all its children have completed or been cancelled. This eliminates orphan tasks (the async equivalent of leaked threads). Implementations: Trio nurseries, Kotlin coroutineScope { }, Swift TaskGroup, Java 21 StructuredTaskScope (JEP 453, preview).

Cancellation propagation. Closely tied to structured concurrency: cancelling a parent task should cancel all its descendants. Done well in Trio (cancel scopes), Kotlin (CoroutineScope cancellation), Swift (Task.cancel), Go (context.Context). Done badly in JS (no cancellation in Promise at all; AbortController is a 2017-era retrofit) and in early Python asyncio (cancel was best-effort).

Timeouts as cancellation. Almost every async API has a timeout variant: tokio::time::timeout, asyncio.wait_for, Promise.race with a timer, withTimeout in Kotlin, task::with_deadline in Swift. The pattern is uniform: spawn the work, race it against a timer, cancel whichever loses. The subtle question is whether cancellation is cooperative (the worker checks and bails) or forceful (the runtime kills the worker mid-operation). Almost always the former; forceful cancellation reintroduces the pthread_cancel footguns.

Backpressure via bounded queues. In an async pipeline, the bounded queue between stages is the simplest backpressure mechanism: when downstream slows, the queue fills, upstream’s send await blocks, propagation reaches the source. No special protocol needed — the await-await chain implements TCP-style flow control through the system.

Async streams. Iteration over an async sequence: async for in Python, Stream in Rust (futures::Stream), Flow in Kotlin, AsyncSequence in Swift, async iterators in JS (for await...of). The async-stream concept generalizes channels: a producer yields values asynchronously and a consumer awaits them, with backpressure typically provided by request-pull semantics rather than channel buffer size.

The reactor and proactor patterns. Doug Schmidt’s Pattern-Oriented Software Architecture vol. 2 (2000) named these:

  • Reactor: the event loop demultiplexes readiness events (epoll_wait, kqueue, WSAPoll) and dispatches synchronously to handlers. The handler then performs the actual I/O. Used by libev, libuv (Node.js), Python’s asyncio default loop, Java NIO Selector.
  • Proactor: the OS completes the I/O and notifies the application. The application initiates an operation and gets a completion event. Windows IOCP, Linux io_uring (Jens Axboe, 2019), Boost.Asio with the IOCP backend. Lower CPU overhead at high throughput; more complex API.

io_uring (Linux 5.1+) is the most significant Linux I/O API addition since epoll: submission and completion queues are mmap-shared between kernel and userspace, eliminating syscalls for batched I/O. The tokio-uring crate, glommio (Glauber Costa), and the experimental Rust async runtime designs all build on it.

11. Parallelism patterns

  • Fork-join. Recursively split work into halves, spawn each, join the results. Cilk (Charles Leiserson et al., MIT 1994) introduced cilk_spawn/cilk_sync. Java’s ForkJoinPool (Doug Lea, Java 7, 2011) uses work-stealing among threads. Rayon (Niko Matsakis, Rust 2015) brings par_iter() data-parallel and join() task-parallel via the same scheduler.
  • Map-reduce. Jeffrey Dean + Sanjay Ghemawat at Google, “MapReduce: Simplified Data Processing on Large Clusters” (OSDI 2004). Apply a map to every element, group by key, apply a reduce per key. Implementations: Google’s internal system, Hadoop (Doug Cutting + Mike Cafarella 2005), Spark (Matei Zaharia, UC Berkeley AMPLab, 2010) with in-memory RDDs replacing Hadoop’s disk-based shuffles.
  • Pipeline. Producer → stage 1 → stage 2 → consumer, each stage on its own thread/goroutine/actor. The LMAX Disruptor (Martin Thompson, Dave Farley, Michael Barker, 2011) is the canonical low-latency variant: a single ring buffer + multiple consumer “barriers” + busy-spin instead of locks. Used in financial exchanges.
  • Worker pool. A bounded queue of tasks and N worker threads pulling from it. The “thread pool” of java.util.concurrent.ExecutorService (newFixedThreadPool), Python’s concurrent.futures.ThreadPoolExecutor, Go’s pattern of N goroutines reading from a channel.
  • Data-parallel. OpenMP (1997, started by Sandia/SGI/etc.) — #pragma omp parallel for decorates loops. SIMD intrinsics_mm256_add_ps and friends on x86 AVX2/AVX-512, NEON on ARM. CUDA (NVIDIA, 2007) for GPU; the canonical massively-data-parallel model.
  • Bulk-synchronous parallel (BSP). Leslie Valiant 1990. Computation proceeds in supersteps: each processor computes locally, then a barrier synchronizes all processors, then communication happens, then another barrier. The model behind Pregel (Malewicz et al., Google 2010), Giraph, GraphX. Predictable cost model trades flexibility for analyzability.
  • Scatter-gather. Distribute (scatter) data fragments to workers, collect (gather) results. MPI’s bread-and-butter (MPI_Scatter, MPI_Gather, MPI_Allreduce). Foundational to scientific computing on clusters.
  • Stencil computation. Update each cell as a function of its neighbors; common in physics simulation, image processing, ML convolutions. Tiling for cache locality is critical; tools like Halide (Ragan-Kelley et al., MIT 2012) auto-schedule stencil loops across CPU/GPU.
  • Embarrassingly parallel. No interaction between work items; e.g., Monte Carlo simulation, hyperparameter search, ray tracing per-pixel. Trivially scales to N workers with near-linear speedup until bandwidth-bound.

12. Lock-free and wait-free

A lock-free data structure guarantees that some thread makes progress at every step, even if others are paused. Wait-free is stronger: every thread makes progress in a bounded number of steps. Both avoid the priority inversion / convoying / deadlock failure modes of locks.

  • Treiber stack (R. Kent Treiber, “Systems Programming: Coping with Parallelism,” IBM RJ 5118, 1986) — a singly-linked stack with CAS on the top pointer. Vulnerable to ABA without mitigation.
  • Michael-Scott queue (MS-queue) — Maged Michael + Michael Scott, “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms” (PODC 1996). Two CAS-based queues — a lock-free one and a two-lock one. The lock-free variant is the basis of java.util.concurrent.ConcurrentLinkedQueue.
  • Hazard pointers — Maged Michael, “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects” (IEEE TPDS 2004). Each thread publishes pointers it is currently dereferencing; reclamation waits until no hazard pointer references the node.
  • RCU (Read-Copy-Update) — Paul McKenney + John Slingwine, “Read-Copy Update” (IPDPS 1998). Readers go lock-free with no memory barriers on the hot path; writers create a new version and wait for a grace period (all pre-existing readers to finish) before reclaiming the old. Pervasive in the Linux kernel — rcu_read_lock, synchronize_rcu, call_rcu. Userspace RCU lib (Mathieu Desnoyers et al.) is also widely used.
  • Disruptor (LMAX, Thompson et al. 2011) — already mentioned under pipelines. A single-writer ring buffer where readers track their sequence; cache-line-padded entries avoid false sharing; busy-spin waits avoid kernel calls. Achieves sub-microsecond message latency on commodity hardware.
  • Concurrent skip lists — Maurice Herlihy et al.; basis of java.util.concurrent.ConcurrentSkipListMap.
  • Lock-free hash maps — Cliff Click’s NonBlockingHashMap (2007), folly’s AtomicHashMap, Rust’s dashmap (Joel Wejdenstal).
  • Flat combining — Hendler, Incze, Shavit, Tzafrir (SPAA 2010). Threads publish their operations to a per-thread slot; one elected “combiner” thread performs all of them locally and writes back results. Trades parallelism for cache locality; often beats fine-grained lock-free designs on simple data structures like stacks and queues at moderate contention.
  • Universal constructions. Herlihy 1991 showed any sequential object can be made wait-free via a CAS-based universal construction (every operation produces a new immutable version, threads CAS the version pointer, helpers complete each others’ operations). Theoretical foundation; rarely used directly because constant factors are large.

ABA in practice. Real fixes:

  • Tagged pointer — pack a version counter into unused bits of the pointer (top 16 bits on x86-64 are unused) and CAS both atomically.
  • DCAS / cmpxchg16b — atomically compare-and-swap a 16-byte pair (pointer + counter). Available on x86-64, ARMv8.1+; emulated awkwardly elsewhere.
  • Hazard pointers — published per-thread pointers; reclamation waits.
  • Epoch-based reclamation — global epochs; reclamation deferred until all threads pass.
  • Quiescent-state-based reclamation (QSBR) — used in URCU; threads periodically declare they hold no references; reclamation waits for a full round.

The bible is Maurice Herlihy + Nir Shavit, The Art of Multiprocessor Programming, 2nd ed. 2020.

Progress hierarchy (formalized by Herlihy 1991, “Wait-Free Synchronization,” ACM TOPLAS):

  • Obstruction-free — a thread running in isolation completes in finite steps. Weakest guarantee.
  • Lock-freesome thread completes in finite steps, system-wide. Stronger.
  • Wait-freeevery thread completes in finite steps. Strongest, often theoretical (most wait-free constructions have impractical constant factors).
  • Bounded wait-free — every thread completes in a bound that depends only on the number of threads, not on contention.

Herlihy’s same paper also introduced the consensus hierarchy: each synchronization primitive has a consensus number — the maximum number of threads for which it can solve wait-free binary consensus. Atomic registers have consensus number 1 (cannot solve consensus among 2+ threads); test-and-set has 2; CAS has infinity. This is why every wait-free algorithm uses CAS or equivalent — weaker primitives are provably insufficient.

Epoch-based reclamation (EBR). An alternative to hazard pointers for safe memory reclamation in lock-free structures. Threads operate in epochs; a node retired in epoch N can be freed once all threads have advanced past epoch N. Used by Linux RCU and by the crossbeam-epoch crate in Rust. Lower per-operation overhead than hazard pointers, at the cost of potentially deferred reclamation.

13. Software transactional memory (STM)

STM treats memory like a database: a block of operations runs as a transaction; the runtime detects conflicts and either commits atomically or retries from the start.

  • Haskell STM — Tim Harris, Simon Marlow, Simon Peyton Jones, Maurice Herlihy, “Composable Memory Transactions” (PPoPP 2005). Haskell’s pure-by-default type system makes STM uniquely clean: the type STM a cannot perform I/O, so retry is always safe. atomically, retry, orElse are the user-facing combinators.
  • Clojure refs — Rich Hickey 2007. ref-set, alter, dosync block; transactions over immutable values.
  • Scala STMscala-stm library (Bronson, Herlihy et al. 2010).
  • Hardware transactional memory — Intel TSX (Transactional Synchronization Extensions, Haswell 2013). Hit critical security bugs (TAA — TSX Asynchronous Abort, 2019; followed by erratum-driven disabling on most CPUs). Intel deprecated TSX on consumer parts; the technology was effectively withdrawn from broad use in 2019-2021. IBM POWER8/POWER9 had a similar HTM implementation; ARM has TME (Transactional Memory Extension) defined in ARMv9 but limited deployment as of 2025.

STM vs locks trade-off. STM excels at composability: combining two atomic operations into one bigger atomic operation is trivial (atomically $ do { op1; op2 }), whereas combining two locked operations requires careful manual lock-ordering analysis. STM struggles at throughput under contention: if transactions abort and retry repeatedly on a hot variable, you can do worse than a simple mutex. Optimistic concurrency control is a great match for read-mostly workloads, a poor match for write-heavy ones.

Lock elision. A related compiler/runtime technique: speculatively run a locked critical section without acquiring the lock, monitoring memory accesses; if no conflict is detected, commit; otherwise fall back to acquiring the lock. Intel called this “Hardware Lock Elision” (HLE) and shipped it alongside TSX; same fate. Software lock elision implementations remain in some research compilers.

STM’s weakness in non-pure languages is the “irrevocable action” problem: if a transaction calls printf, you can’t roll it back.

14. Common concurrency bugs

  • Race condition — broadly, any bug arising from interleavings of concurrent operations. Often a non-atomic read-modify-write: if (cache == null) cache = compute() executed by two threads yields two computations and a discarded result.
  • Data race — a strictly narrower concept: two threads access the same memory, at least one writes, with no synchronization between them. The C11/C++11 standards declare data races to be undefined behavior; the Go and Java memory models declare them to produce undefined values but bounded crashes. Note: a race condition can exist without a data race (atomic increment is data-race-free but still racy if you depend on the order).
  • Deadlock — circular wait on locks: T1 holds A, wants B; T2 holds B, wants A. Detect with the four Coffman conditions (mutual exclusion, hold-and-wait, no preemption, circular wait); prevent by lock ordering (acquire locks in a fixed global order — typically by address or by depth in a hierarchy). Java’s jstack, Linux’s gdb + thread apply, can spot held-lock cycles post mortem.
  • Livelock — threads remain runnable but make no progress. Two polite people stepping aside in a hallway. Fix with randomized backoff (Ethernet binary exponential backoff is the classic).
  • Starvation — a thread is repeatedly denied resources by a non-fair scheduler. Fix with fair locks (ticket locks, fair semaphores, ReentrantLock(fair=true)).
  • Memory ordering bug — code that uses mo_relaxed (or just plain non-atomic accesses) where acquire/release is needed. Tends to “work” on x86 because of its strong TSO and fail on ARM/POWER. Apple Silicon migration in 2020 surfaced thousands of these in the wild.
  • ABA problem — in a CAS loop, thread T1 reads value A, T2 changes A→B→A, T1’s CAS succeeds against A but the world has changed (e.g., a freed-and-reused linked-list node). Fixes: tagged pointers (low bits of the pointer carry a version counter; cmpxchg16b on x86-64 can swing both halves atomically), hazard pointers, epoch-based reclamation (crossbeam-epoch in Rust).
  • False sharing — two threads write to distinct variables that happen to share a cache line; the line bounces between cores like a hot potato. Fix by padding to cache-line size (alignas(64) in C++, #[repr(align(64))] in Rust, @Contended in Java, padding fields by hand).
  • Lost wakeup — calling signal before the waiter calls wait, with no mutex protecting the predicate, can lose the wakeup. The “check predicate while holding mutex; wait on cv with mutex” idiom prevents this.
  • Deadlock-by-recursive-lock — non-reentrant mutex held by T1 re-entered by T1 → deadlock. Either use a recursive mutex (with the caveats above) or restructure to release before recursing.
  • Time-of-check to time-of-use (TOCTOU) — checking a condition and then acting on it without holding a lock that prevents the condition from changing. Classic example: if (!file_exists(p)) create(p); — two threads can both pass the check and both create. Fix: acquire a lock across check + act, or use an atomic primitive (O_EXCL flag, compare_exchange).
  • Double-checked locking, done wrong. The C++03-era idiom if (!instance) { lock(); if (!instance) instance = new T; unlock(); } is broken without atomics — the publication of instance can be reordered such that another thread sees a non-null pointer to partially-constructed memory. Doug Schmidt + Tim Harrison 1996 originally proposed the pattern; the broken-without-fences version was canonical until C++11 made fixing it cheap (use std::atomic<T*> with mo_acquire/mo_release, or just use std::call_once).
  • Convoying — once many threads pile up on a contended lock, a single slow holder causes everyone behind to queue, and the queue persists long after the original delay clears. Backoff and adaptive locks mitigate; lock-free structures avoid entirely.
  • Thundering herdpthread_cond_broadcast (or its equivalent) wakes N threads that all immediately try to acquire the same mutex, only one can proceed, the rest go back to sleep. Wastes CPU. Linux futex FUTEX_REQUEUE exists specifically to relocate waiters from a CV to the mutex without waking them.

15. Tools

Race / deadlock detection:

  • ThreadSanitizer (TSan) — Konstantin Serebryany et al., Google 2009. Compile with -fsanitize=thread; instruments memory accesses; reports data races at runtime with stack traces of both racing accesses. Supported by Clang, GCC, Rust (nightly). The single most useful tool for C/C++/Rust concurrency.
  • Valgrind Helgrind and DRD — slower than TSan but no recompile required.
  • Go -race — built on TSan; go test -race. Mandatory in CI for any nontrivial Go service.
  • Rust Miri — interpreter that catches UB; with -Zmiri-many-seeds and the experimental data-race detector, finds memory model violations.
  • Loom (Carl Lerche, Rust) — model checker for loom::sync/loom::thread; explores all permitted interleavings of small unit tests. Essential for verifying lock-free code.
  • Java Race Detector / RV-Predict — happens-before-based detector for Java.

Stress / replay:

  • rr (Mozilla) — record-and-replay debugger for Linux; deterministic reverse execution; catches Heisenbugs by capturing the schedule.
  • Concuerror — stateless model checker for Erlang.
  • Maelstrom (Jepsen) — a “Jepsen-lite” test harness for distributed systems; same author (Kyle Kingsbury / Aphyr).

Profiling:

  • perf — Linux’s swiss-army profiler; perf record, perf top, perf c2c (cache-to-cache contention).
  • Intel VTune — proprietary; lock contention, false sharing analyses.
  • Linux ftrace — kernel-level tracepoints including scheduler events.
  • eBPF / bcc toolsoffcputime-bpf shows where threads sleep waiting; runqlat-bpf for scheduler latency.
  • Java Flight Recorder + JDK Mission Control — JEP 328, low-overhead production profiling.

Formal verification:

  • TLA+ (Leslie Lamport, 1999 onwards) — temporal logic of actions; the de facto industry tool for specifying concurrent and distributed algorithms. AWS uses it extensively (Chris Newcombe et al., “How Amazon Web Services Uses Formal Methods,” CACM 2015). The companion model checker TLC explores reachable states; TLAPS proves theorems.
  • SPIN (Gerard Holzmann, Bell Labs 1989+) — explicit-state model checker for the Promela language. Older than TLA+, still in active use.
  • CBMC (Edmund Clarke et al., 2004) — bounded model checker for C programs; can find races and assertion violations within a bounded execution depth.
  • Coq + Iris — separation logic for concurrent programs; used to verify portions of the RustBelt project (formalizing Rust’s safety guarantees).
  • CDSChecker, GenMC — stateless model checkers for the C11/C++11 memory model.

Tracing and observability:

  • OpenTelemetry (CNCF, 2019; merger of OpenTracing + OpenCensus) — distributed tracing standard; context propagation via W3C Trace Context (traceparent header). Within a single process, spans encode causal links across thread boundaries; across services, the trace_id stitches the call graph back together.
  • Lock contention profilers: pthread futex stats via /proc/<pid>/sched; Java’s jstack for lock-held analysis; mutex-events in perf lock.
  • Async stack traces. A long-standing pain point: by default, an exception in an async task shows only the local stack, not the chain of await points that led there. Modern runtimes (Java’s CompletableFuture, Python 3.11+ asyncio, Rust’s tokio-console and tracing crate, .NET’s AsyncMethodBuilder) reconstruct a logical async stack. tokio-console (Eliza Weisman et al., 2021) is particularly useful: it shows tokio tasks live, with their poll counts and last-poll latency.

16. Language idioms

  • Java. synchronized blocks/methods (intrinsic locks, biased locking removed in JDK 15+); volatile; java.util.concurrent (Doug Lea, JSR-166, Java 5, 2004) — ReentrantLock, Semaphore, CountDownLatch, CyclicBarrier, Phaser, BlockingQueue, ConcurrentHashMap (Lea’s lock-striped, since Java 5; CAS-based bins since Java 8), Atomic*, CompletableFuture (JDK 8, 2014, stage-based async), and finally virtual threads (JEP 444, JDK 21, 2023, Project Loom by Ron Pressler et al.).
  • Go. Goroutines + channels + sync.Mutex + sync.RWMutex + sync.WaitGroup + sync/atomic + context.Context for cancellation propagation. The community mantra (from Rob Pike’s “Go proverbs”): “Don’t communicate by sharing memory; share memory by communicating.”
  • Rust. Send + Sync traits encode thread-safety in the type system; the borrow checker forbids data races at compile time (modulo unsafe). std::sync::Mutex<T> (poisoning), RwLock<T>, Arc<T> (atomic reference count), std::sync::atomic. Ecosystem: tokio (Carl Lerche; the dominant async runtime), async-std, smol, rayon (data-parallel), crossbeam (lock-free primitives), parking_lot (faster Mutex / RwLock than std).
  • Python. The Global Interpreter Lock (GIL) serializes CPython bytecode execution: one thread runs Python at a time. Threads are fine for I/O-bound work (the GIL releases around blocking syscalls); CPU-bound work needs multiprocessing or a release-the-GIL C extension. concurrent.futures.ThreadPoolExecutor and ProcessPoolExecutor for both. asyncio for I/O concurrency without threads. PEP 703 (Sam Gross, accepted 2023) introduces an optional no-GIL build of CPython, experimental in 3.13 (October 2024) and improving in 3.14.
  • JavaScript. Single-threaded event loop in the browser and in Node. Parallelism via Web Workers (browser) or worker_threads (Node, stable since 12.0, 2019). SharedArrayBuffer + Atomics for shared-memory parallelism between workers. Promise.all, Promise.race, Promise.allSettled for combinators.
  • Erlang/Elixir. Pure message-passing — no shared state across processes (with the exception of ETS tables, which are an explicit escape hatch). Processes are cheap (~300 bytes); spawn a million on a single node without distress. gen_server, gen_statem, supervisors are the OTP behaviors.
  • C# / .NET. lock (intrinsic Monitor), Interlocked (atomics), Task + async/await, Parallel.ForEach, channels (System.Threading.Channels), the TPL Dataflow library.
  • Swift. actor types (Swift 5.5, 2021) — actor-isolated mutable state; structured concurrency with async let, TaskGroup, Task.cancel(); @Sendable checking.
  • Kotlin. suspend fun, CoroutineScope, Job, Deferred<T>, Flow for cold async streams, Channel for hot streams; Mutex from kotlinx.coroutines.
  • C++. std::thread, std::async, std::mutex, std::shared_mutex, std::condition_variable, std::atomic<T>, std::jthread (C++20, joins on destruction), std::stop_token (C++20 cooperative cancellation), std::barrier, std::latch. Coroutines (C++20). std::execution (“senders/receivers”, C++26).

Idiomatic patterns across languages:

  • Once initialization. Run a piece of init code exactly once, even under concurrent calls. POSIX pthread_once, C++11 std::call_once, Java static-initializer-on-class-load (the JVM guarantees this), Rust std::sync::Once / OnceLock. Replaces double-checked-locking in modern code.
  • Latch. A countdown that threads can wait on; counts down to zero exactly once and stays there. Java CountDownLatch, C++20 std::latch. Useful for “wait until N workers have all started.”
  • Barrier. Reusable rendezvous: N threads each call wait; all are released once the Nth arrives; the barrier resets. Java CyclicBarrier / Phaser, C++20 std::barrier, POSIX pthread_barrier_t. Foundation of BSP-style algorithms.
  • Concurrent collections. Almost every modern stdlib ships at least a concurrent map: Java ConcurrentHashMap, C++ uses third-party (Folly’s ConcurrentHashMap, TBB’s concurrent_hash_map, Junction), Go sync.Map (rarely the right choice — most code is better off with a plain map + Mutex), Rust dashmap. Choosing the right concurrent map (lock-striped, CAS-based, RCU-based) is workload-specific.
  • Single-producer single-consumer (SPSC) ring buffer. When you have exactly one writer and one reader, you can use a wait-free SPSC queue with no CAS — just one atomic write index and one atomic read index with acquire/release ordering. rtrb in Rust, boost::lockfree::spsc_queue, the LMAX Disruptor’s degenerate case. Used in audio (one realtime thread + one disk thread) and in NIC drivers.

17. Hardware perspective

  • Cache coherence (MESI / MOESI). Each cache line in L1/L2 is in one of: Modified (this cache has the only dirty copy), Exclusive (only copy, clean), Shared (other caches may have read copies), Invalid. MOESI adds Owned (shared but I’m responsible for write-back). Coherence traffic across cores is the dominant cost of cross-thread communication.
  • False sharing. Two unrelated variables falling on the same cache line cause coherence traffic as if they were related. Diagnostic: perf c2c on Linux. Fix: pad to 64 bytes (most x86 lines; some recent Intel CPUs use 128 with adjacent-line prefetch — the canonical “cache line size” for safety on x86-64 is 2 * 64 = 128 for production code; check std::hardware_destructive_interference_size in C++17, currently 64 on most ABIs).
  • NUMA awareness. On multi-socket systems, each socket has its own memory controller; cross-socket memory access is ~2-3x slower. numactl, libnuma, pthread_setaffinity_np pin threads near their data. Thread pools should be NUMA-aware on big-iron servers.
  • Non-temporal stores. _mm_stream_si128 and friends write directly to memory bypassing the cache — useful when you know you won’t read the data back soon (streaming writes). Avoids polluting cache.
  • Prefetch. __builtin_prefetch, _mm_prefetch — hint the cache to bring a line in before the load misses. Critical for graph traversals and pointer-chasing workloads.
  • Memory ordering instructions on common ISAs:
    • x86-64: mfence, lfence, sfence; LOCK-prefixed RMW is implicitly a full fence; xchg is implicitly locked.
    • ARMv8: ldar (load-acquire), stlr (store-release), dmb ish (full barrier), dmb ishld / ishst (partial).
    • RISC-V: fence rw,rw (full), fence r,r (load-load), variants for the RVWMO model; lr.w / sc.w for LL/SC; amoadd.w etc. for atomics.
    • POWER: sync (full), lwsync (lightweight), isync. Notoriously weak; even lwsync is needed where x86 needs nothing.
  • Read/write set tracking. Many lock-free and STM implementations maintain a per-thread read set (locations consulted) and write set (locations to be written) to detect conflicts at commit time. Hardware Transactional Memory automates this; STM does it in software with version numbers per cache line or per object.
  • Atomic instruction costs (rough x86 numbers). Uncontended atomic op: ~10-20 cycles. Contended CAS on a hot line: 100+ cycles plus coherence traffic. A LOCK-prefixed instruction blocks the bus (modern CPUs use cache-line locking instead but the cost shape is similar).
  • Memory hierarchy latencies (Skylake-era ballpark, the Norvig “latency numbers” tradition). L1: ~1 ns / ~4 cycles. L2: ~3-4 ns / ~12 cycles. L3 (shared): ~10-20 ns. DRAM: ~100 ns. Cross-socket NUMA DRAM: ~200-300 ns. A contended cache line bouncing between cores adds tens to hundreds of nanoseconds per atomic op. These numbers are why locking discipline (and especially false-sharing avoidance) matters: a poorly-laid-out concurrent data structure can underperform a single-threaded one by 10x or more.
  • Branch prediction and speculative execution. Modern out-of-order cores speculate aggressively. The Spectre (Kocher et al., 2018) and Meltdown vulnerabilities exposed that this speculation has microarchitectural side effects observable across security boundaries; mitigation (lfence placement, retpolines, KPTI) adds overhead to many synchronization paths.
  • Hyperthreading / SMT. Two threads share execution units on one physical core; useful for hiding memory latency, but a hot spinlock spinning on one SMT sibling steals execution slots from the other. Disable SMT for latency-critical real-time workloads.

16a. Inter-process communication (IPC) primitives

Within-process primitives have OS-mediated cousins for crossing process boundaries:

  • Shared memory. POSIX shm_open + mmap (Linux/macOS), CreateFileMapping + MapViewOfFile (Windows). Same atomics and memory orders apply; cross-process mutexes need PTHREAD_PROCESS_SHARED attribute. The fastest IPC.
  • Pipes and FIFOs. Byte-stream, in-kernel buffer; pipe(2) (anonymous), mkfifo (named). Standard CSP-style channel between parent and child.
  • Unix domain sockets (UDS). Bidirectional; can pass file descriptors via SCM_RIGHTS ancillary data. Most modern container/sidecar APIs (Docker, Kubernetes kubelet, systemd, gRPC over UDS) use these.
  • Message queues. POSIX mq_open/mq_send/mq_receive (kernel-mediated priority queues); System V msgget. Rarely the right choice today vs UDS or shared memory.
  • eventfd / signalfd / timerfd. Linux-specific file descriptors that integrate with epoll/select. eventfd is a kernel counter you can read/write; ideal for waking an event loop from another thread or process.
  • Signals. POSIX async signals (SIGUSR1 etc.) are the oldest IPC primitive and the most footgun-laden — async-signal-safe function list is tiny (no malloc, no printf, basically just write, _exit, atomic ops on sig_atomic_t). The modern recommendation: convert signals to file-descriptor events via signalfd and handle them in the main event loop.
  • DBus, gRPC, ZeroMQ. Higher-level message-passing libraries that build on the above.

16b. Determinism and reproducibility

Three different things people mean by “deterministic concurrency”:

  1. Result-deterministic — the same output for the same input, even if internal scheduling differs. Pure functional parallel reductions (fold with an associative-commutative operator) qualify; floating-point reductions usually don’t (FP add is not associative).
  2. Schedule-deterministic — given the same input and a fixed scheduler seed, identical thread interleavings every run. Madsim, FoundationDB Flow, dEEP-replay achieve this.
  3. Replay-deterministic — record the schedule on one run and reproduce it on a later run. rr, Pernosco, time-travel debuggers.

Determinism is wonderful for testing and debugging; it usually costs performance (forcing schedules into a canonical order eliminates parallelism). Production systems are usually non-deterministic; their test harnesses are deterministic.

17a. Real-time and deterministic concurrency

Some concurrency contexts demand bounded latency rather than throughput:

  • Hard real-time (avionics DO-178C, automotive ISO 26262 ASIL-D, medical IEC 62304 class C) — missing a deadline is a system failure. Tools: priority-ceiling protocol (Sha, Rajkumar, Lehoczky 1990), rate-monotonic scheduling (Liu + Layland, JACM 1973), earliest-deadline-first (EDF), worst-case execution time (WCET) analysis. RTOSes: VxWorks, QNX, INTEGRITY, FreeRTOS (Richard Barry, 2003), Zephyr (Linux Foundation, 2016).
  • PREEMPT_RT. A long-running patch series for Linux (Ingo Molnár, Thomas Gleixner, 2004-2024) that makes the kernel fully preemptible, converts spinlocks to rt-mutexes (with priority inheritance), and re-targets interrupt handlers to kernel threads. Merged piecewise; the bulk landed in Linux 6.12 (November 2024). Brings sub-millisecond worst-case scheduling latency to mainline Linux.
  • Real-time garbage collection. A traditional disqualifier for managed languages in real-time contexts is GC pause times. Modern collectors push this down: Java’s ZGC (JEP 333, Per Liden, JDK 11+) and Shenandoah (Red Hat, JDK 12+) target sub-millisecond pauses on multi-terabyte heaps; .NET’s server GC has incremental concurrent modes; Go’s collector (since 1.5, 2015) targets ~1 ms pauses via concurrent mark-and-sweep with write barriers. These don’t reach hard-realtime guarantees but suffice for soft-realtime workloads like trading systems.
  • Lock-free in interrupt context. Kernel and driver code that may run in interrupt context must avoid sleeping; that rules out mutexes (which can put the holder to sleep). Spinlocks with interrupts disabled (spin_lock_irqsave on Linux), per-CPU variables, RCU, and seqlocks (Stephen Hemminger 2003 — writer-rare, optimistic-reader, retry-on-version-change) are the standard tools.
  • Soft real-time (audio, video, games) — occasional misses are tolerable but bad UX. Audio threads typically run at SCHED_FIFO with priority above the kernel’s, must never allocate, never lock a mutex contended by a non-realtime thread, never call into anything that might malloc. The realtime_tools and jack ecosystems on Linux codify these rules.
  • Deterministic replay — given the same inputs, the system must produce the same outputs even though concurrency normally introduces non-determinism. Achieved by record-and-replay (rr, Pernosco), by single-threaded simulation of an actor system (Madsim for Rust, FoundationDB’s Flow simulator — the latter let FoundationDB ship a famously bug-free distributed database, see Will Wilson’s “Strange Loop 2014” talk).

17b. Concurrency in databases

Databases are concurrency systems in disguise. Some primitives:

  • Two-phase locking (2PL). Acquire all locks (growing phase), do work, release all locks (shrinking phase). Strict 2PL holds write locks until commit. The classical serializability foundation (Eswaran et al., CACM 1976).
  • Multi-version concurrency control (MVCC). Each write creates a new version; readers see a consistent snapshot from their start time. PostgreSQL, Oracle, MySQL InnoDB, CockroachDB, FoundationDB. Avoids reader-writer blocking entirely.
  • Snapshot isolation. A common practical level: each transaction sees a snapshot of the database as of its start. Susceptible to write skew anomalies (Berenson et al., SIGMOD 1995), which serializable snapshot isolation (SSI, Cahill et al., SIGMOD 2008, in PostgreSQL 9.1) fixes.
  • Optimistic concurrency control (OCC). Kung + Robinson, ACM TODS 1981. Skip locking; validate at commit; abort and retry on conflict. Used by FoundationDB, CockroachDB, and (with twists) by most NoSQL CRDT systems.

The vocabulary maps directly to in-memory concurrency: a mutex is a single-resource 2PL lock; STM is database-style OCC applied to RAM.

18. Cross-references

  • _index — the compute reference family entry.
  • distributed-systems-fundamentals — concurrency is the within-node case; that note picks up at multi-node coordination, consensus, and partial failure.
  • transformer-architecture — data parallelism (DDP), model parallelism, tensor parallelism, pipeline parallelism in training; same patterns scaled across GPUs.
  • consensus-protocols — Paxos, Raft, multi-leader; the distributed analogue of mutex/CV/atomic.
  • build-devops — runtime selection and configuration (GOMAXPROCS, TOKIO_WORKER_THREADS, JVM thread-pool sizing) is a build/devops concern in practice.

18a. Anti-patterns and folklore

A short catalogue of things that frequently appear in concurrent code and are usually wrong:

  • “I’ll just add a lock around it.” Adding locks to fix a race often just papers over a deeper design issue (shared mutable state that didn’t need to be shared). Step back: can the data be made immutable? Owned by exactly one actor? Reorganized so the access is naturally serial?
  • “It works on my machine.” Concurrency bugs are timing-dependent; a 16-core developer laptop can hide bugs that appear on a 64-core production server, or vice versa. Reproducibility requires intentional stress: bind threads to specific cores, run with TSan, perturb timing with sleeps deliberately inserted at suspect points (the “race-condition-fuzzing” technique).
  • “This is faster without the lock.” Often true in microbenchmarks; usually wrong about correctness. The famous example is the volatile-only Singleton in pre-JSR-133 Java that was believed to work for years.
  • Sleeping to fix a race. Thread.sleep(100) after starting a thread to “wait for it to be ready” is a permanent bug looking for a CI machine slow enough to expose it. Use a latch or a channel.
  • Holding a lock across an unbounded operation. Network I/O, disk I/O, third-party callbacks, anything user-input-blocking. Either release the lock first or use a non-blocking primitive.
  • Premature lock-freedom. Lock-free code is famously hard to write correctly; the speedup over a well-engineered mutex is often modest (1.5-3x) unless contention is severe. Measure before going lock-free. Cliff Click’s quip: “Lock-free programming is for people who think writing correct concurrent code with locks is too easy.”
  • The mutex-per-field anti-pattern. Giving each field its own mutex feels granular, but it makes cross-field invariants un-enforceable and creates deadlock-by-lock-ordering risk. Prefer one mutex protecting a coherent group of fields.
  • Treating async as “free parallelism.” await inside a loop sequentially awaits each iteration — exactly the opposite of parallel. asyncio.gather, tokio::join!, Promise.all are the parallel idioms.

18b. When to pick which model

A rough decision guide for new code:

  • Few, long-running, mostly-independent tasks with simple I/O — OS threads. Don’t over-engineer.
  • Many concurrent connections, mostly I/O-bound — async/await (tokio in Rust, asyncio in Python, Node.js, Java virtual threads).
  • Heavy CPU work on multiple cores — thread pools + work-stealing (rayon, ForkJoinPool, OpenMP), or process pools if the GIL applies.
  • Many interacting stateful entities (game NPCs, IoT devices, distributed system components) — actors (Erlang, Akka/Pekko, Orleans, Ray).
  • Pipeline of stages with high throughput — channels (Go) or the Disruptor pattern.
  • Read-mostly shared cache — RwLock or RCU.
  • Many tiny atomic updatesatomic<T> or lock-free structure; check for contention first.
  • Strict serializability across complex operations — STM (if available) or coarse-grained locks.

The right answer is often a hybrid: a service might use async I/O for network, a thread pool for CPU-bound computation, and channels to communicate between the two halves. Hybrid is fine; just be deliberate about the boundaries.

18c. The folklore reading list

A handful of essays, talks, and papers that every serious concurrency engineer eventually reads:

  • Herb Sutter, “The Free Lunch Is Over” (Dr. Dobb’s 2005) — the era-defining essay.
  • Rob Pike, “Concurrency Is Not Parallelism” (Heroku Waza 2012) — the conceptual distinction.
  • Joe Armstrong’s PhD thesis, “Making reliable distributed systems in the presence of software errors” (Swedish Royal Institute of Technology 2003) — the actor/let-it-crash philosophy in long form.
  • Doug Lea, “Concurrent Programming in Java: Design Principles and Patterns,” 2nd ed., Addison-Wesley 1999 — JSR-166’s intellectual foundation.
  • Paul McKenney, “Is Parallel Programming Hard, And, If So, What Can You Do About It?” (perfbook, open online text, continuously updated since 2011) — the deepest free resource on low-level concurrency, written by the maintainer of Linux RCU.
  • Hans Boehm + Sarita Adve’s CACM 2010 piece on memory models — the case for why the C/C++/Java memory model debates mattered.
  • Martin Thompson’s “Mechanical Sympathy” blog — the hardware-aware school of low-latency thinking.
  • Kyle Kingsbury (Aphyr), the “Jepsen” series of distributed-system correctness tests — concurrency at the database level, brutally tested.
  • Bartosz Milewski, “Category Theory for Programmers” — for the algebraic view of composition, including concurrent composition.

These are the canon; reading them is how vocabulary like “happens-before,” “let it crash,” “mechanical sympathy,” “linearizability vs serializability,” and “function coloring” enters working memory.

19. Citations

  • Herlihy, Maurice and Nir Shavit. The Art of Multiprocessor Programming, 2nd ed. Morgan Kaufmann, 2020. ISBN 978-0124159501.
  • Sutter, Herb. “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.” Dr. Dobb’s Journal, March 2005.
  • Hoare, C.A.R. “Communicating Sequential Processes.” Communications of the ACM, 21(8):666-677, August 1978.
  • Hewitt, Carl, Peter Bishop, and Richard Steiger. “A Universal Modular Actor Formalism for Artificial Intelligence.” IJCAI 1973.
  • Agha, Gul. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, 1986 (PhD thesis MIT AI-TR-844).
  • Dijkstra, Edsger W. “Cooperating Sequential Processes.” EWD-123, Technological University, Eindhoven, 1965.
  • Michael, Maged M. and Michael L. Scott. “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms.” PODC 1996.
  • Adve, Sarita V. and Hans-J. Boehm. “Memory Models: A Case for Rethinking Parallel Languages and Hardware.” Communications of the ACM, 53(8):90-101, August 2010.
  • Manson, Jeremy, William Pugh, and Sarita V. Adve. “The Java Memory Model.” POPL 2005 (JSR-133, 2004).
  • ISO/IEC 9899:2011 (C11). International Organization for Standardization, 2011.
  • ISO/IEC 14882:2011 (C++11). International Organization for Standardization, 2011.
  • Harris, Tim, Simon Marlow, Simon Peyton Jones, and Maurice Herlihy. “Composable Memory Transactions.” PPoPP 2005.
  • Treiber, R. Kent. “Systems Programming: Coping with Parallelism.” IBM Research Report RJ 5118, April 1986.
  • Michael, Maged M. “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects.” IEEE Transactions on Parallel and Distributed Systems, 15(6):491-504, June 2004.
  • McKenney, Paul E. and John D. Slingwine. “Read-Copy Update: Using Execution History to Solve Concurrency Problems.” IPDPS 1998.
  • Dean, Jeffrey and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” OSDI 2004.
  • Reeves, Glenn E. “What Really Happened on Mars?” Internal JPL memo, December 1997 (Mars Pathfinder priority inversion post-mortem).
  • Thompson, Martin, Dave Farley, Michael Barker, Patricia Gee, and Andrew Stewart. “Disruptor: High Performance Alternative to Bounded Queues for Exchanging Data Between Concurrent Threads.” LMAX technical paper, May 2011.
  • Franke, Hubertus, Rusty Russell, and Matthew Kirkwood. “Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux.” Ottawa Linux Symposium 2002.
  • Nystrom, Bob. “What Color Is Your Function?” journal.stuffwithstuff.com, February 2015.
  • Pike, Rob. “Concurrency Is Not Parallelism.” Heroku Waza talk, January 2012.
  • Serebryany, Konstantin and Timur Iskhodzhanov. “ThreadSanitizer — Data Race Detection in Practice.” WBIA 2009.
  • Pressler, Ron. JEP 444: “Virtual Threads.” Java SE 21, September 2023.
  • Gross, Sam. PEP 703: “Making the Global Interpreter Lock Optional in CPython.” Accepted October 2023.
  • Herlihy, Maurice. “Wait-Free Synchronization.” ACM TOPLAS, 13(1):124-149, January 1991.
  • Lamport, Leslie. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley, 2002.
  • Newcombe, Chris, Tim Rath, Fan Zhang, Bogdan Munteanu, Marc Brooker, and Michael Deardeuff. “How Amazon Web Services Uses Formal Methods.” Communications of the ACM, 58(4):66-73, April 2015.
  • Holzmann, Gerard J. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley, 2003.
  • Smith, Nathaniel J. “Notes on structured concurrency, or: Go statement considered harmful.” vorpus.org, April 2018.
  • Waldo, Jim, Geoff Wyant, Ann Wollrath, and Sam Kendall. “A Note on Distributed Computing.” Sun Microsystems Laboratories Technical Report TR-94-29, November 1994.
  • Liu, C.L. and James W. Layland. “Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment.” Journal of the ACM, 20(1):46-61, January 1973.
  • Berenson, Hal, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O’Neil, and Patrick O’Neil. “A Critique of ANSI SQL Isolation Levels.” SIGMOD 1995.
  • Cahill, Michael J., Uwe Röhm, and Alan D. Fekete. “Serializable Isolation for Snapshot Databases.” SIGMOD 2008.
  • Kung, H.T. and John T. Robinson. “On Optimistic Methods for Concurrency Control.” ACM TODS, 6(2):213-226, June 1981.
  • Eswaran, Kapali P., Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger. “The Notions of Consistency and Predicate Locks in a Database System.” Communications of the ACM, 19(11):624-633, November 1976.
  • Valiant, Leslie G. “A Bridging Model for Parallel Computation.” Communications of the ACM, 33(8):103-111, August 1990.
  • Malewicz, Grzegorz et al. “Pregel: A System for Large-Scale Graph Processing.” SIGMOD 2010.
  • Ragan-Kelley, Jonathan et al. “Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines.” SIGGRAPH 2012 (Halide).
  • Schmidt, Doug et al. Pattern-Oriented Software Architecture, Volume 2: Patterns for Concurrent and Networked Objects. Wiley, 2000.
  • Kegel, Dan. “The C10K Problem.” kegel.com, 1999 (updated through the 2000s).
  • Wilson, Will. “Testing Distributed Systems w/ Deterministic Simulation.” Strange Loop 2014 (FoundationDB).
  • Pike, Rob. “Go Proverbs.” Gopherfest 2015 keynote.
  • Click, Cliff. “A Lock-Free Wait-Free Hash Table.” Stanford EE Computer Systems Colloquium, 2007.