OS Scheduling, Memory & IPC — Compute Reference
A working reference on operating-system internals as they actually appear in production: how processes and threads are represented, how schedulers decide who runs next, how virtual memory is mapped, how processes talk to each other, and how containers, real-time, and observability layers fit on top. Focused on Linux because that is where most server work happens, with Windows NT, XNU (macOS), and the BSDs called out where they meaningfully differ.
1. At a glance
An operating system mediates between hardware and user programs. The defining boundary is between userspace (unprivileged, runs in CPU ring 3 on x86 or EL0 on ARM) and kernel space (privileged, ring 0 / EL1). User code crosses the boundary only via a controlled transition — a syscall, a fault, or an interrupt — and the kernel does any work that touches hardware, address spaces it does not own, or other processes.
Core kernel responsibilities:
- Process and thread management. Track every running task, swap CPU state on context switch, expose fork, exec, wait, signals.
- Scheduling. Decide which runnable task gets each CPU and for how long, under fairness, latency, and throughput constraints.
- Memory management. Virtualize the address space, map pages on demand, share memory between cooperating processes, evict under pressure.
- IPC. Pipes, sockets, shared memory, semaphores, signals, futexes — the primitives by which processes coordinate.
- Syscalls. The narrow, audited API surface from userspace.
- Drivers, file systems, networking. Concrete implementations on top of the above.
Modern kernels and their architectural lineage:
- Linux (Torvalds 1991, monolithic with loadable modules). The default server and embedded kernel of the 2020s. Distinguished by aggressive mainlining, a fast release cadence (~10 weeks per release), and git-driven development.
- Windows NT (Cutler 1993, hybrid). The lower layer (HAL, Executive, Object Manager, kernel) is monolithic; many subsystems (Win32, POSIX historically, WSL2 today via a Linux kernel) live above it. NTFS, ETW, and ALPC are pillars.
- XNU (Apple, 1996, hybrid). Mach 3.0 microkernel core plus a BSD personality layer (much of FreeBSD’s VFS, sockets, syscall surface) and IOKit (C++ driver framework). Used in macOS, iOS, iPadOS, tvOS, watchOS, visionOS.
- FreeBSD (1993 onward, monolithic). Permissive license, strong network stack (used as basis for Netflix’s CDN, PlayStation’s OS, Junos), capsicum capabilities, ZFS first-class.
- NetBSD (portability-focused), OpenBSD (security-focused — pledge, unveil, doas, LibreSSL, default ASLR/W^X).
- Microkernels. seL4 (formally verified, Heiser et al. NICTA, 2009 verification proof), QNX (real-time, automotive, BlackBerry), Fuchsia Zircon (Google, since 2016; Nest Hub shipped on it 2021). The microkernel design moves drivers and most services into userspace; communication is via IPC.
- Unikernels. MirageOS (OCaml, Madhavapeddy et al., Cambridge), IncludeOS (C++), HermitCore, Nanos. Compile the application and a single-address-space library OS into one bootable image — no userspace/kernel split, no context switches, fast boot.
- Exokernel (Engler, Kaashoek et al., MIT, mid-1990s — XOK, Aegis). Push almost all abstractions to user libraries; the kernel multiplexes raw hardware. Research-only; influenced unikernels and DPDK-style userspace networking.
2. Processes
A process is an instance of an executing program plus its address space, open file descriptors, signal handlers, credentials, working directory, and so on. Each is identified by a PID assigned by the kernel.
Process hierarchy: every process except PID 1 has a parent. PID 1 (init, systemd, launchd, wininit.exe) is started by the kernel during boot and adopts orphans. A zombie is a process that has exited but whose parent has not yet called wait() to reap its exit status — its task_struct lingers so the parent can read WIFEXITED/WEXITSTATUS. An orphan is a process whose parent died first; it is re-parented to PID 1.
Process creation on Linux: classical fork() duplicates the calling process, returning the child PID to the parent and 0 to the child. Internally, glibc fork() wraps clone(SIGCHLD, …). The clone() syscall (Linux-specific, present since the 1990s and now the workhorse) takes a flags bitmap that selects what to share between parent and child — address space, file descriptor table, file system info, signal handlers, namespaces. This is how both threads and containers are built on Linux.
execve() replaces the current process image with a new program, keeping the PID and most fds (subject to O_CLOEXEC). The classic UNIX idiom is fork() + execve() in the child, wait() in the parent.
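The fork/exec/wait idiom can be sketched in a few lines. This is a minimal illustration rather than a production launcher: run_and_wait is a hypothetical helper name, and it uses os.execvp (which searches PATH and inherits the environment) instead of a raw execve().

```python
import os
import sys


def run_and_wait(argv):
    """Run argv in a child process and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the new program.
        # execvp only returns (by raising) if the exec itself failed.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # conventional "could not execute" status
    # Parent: block until the child exits. This also reaps it, so no zombie
    # is left behind.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # negative value: killed by that signal
```

For example, run_and_wait([sys.executable, "-c", "import sys; sys.exit(7)"]) returns the child's exit status to the parent.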
Process groups, sessions, and the controlling terminal. A process group is a set of related processes (pgid) for job control — kill -TERM -1234 sends SIGTERM to every process in group 1234. A session groups process groups for terminal management; one process group is foreground, the rest background. The session leader can have a controlling terminal; signals like SIGHUP propagate through it. setsid() detaches a process from its controlling terminal, the foundation of daemonization.
Linux representation: the kernel keeps a task_struct (in include/linux/sched.h) per task — both processes and threads are tasks (Linux uses 1:1 threading; see §3). Fields include pid, tgid (thread group id, equal to PID of the main thread), state (TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_STOPPED, TASK_ZOMBIE, etc.), pointers to mm_struct (address space), files_struct, fs_struct, signal_struct, and the scheduler entity sched_entity used by CFS/EEVDF.
Windows representation: the EPROCESS structure (Executive Process) holds the PID, parent process id, handle table, working-set info, and a list of ETHREADs. Threads carry the schedulable state on Windows; the dispatcher does not schedule “processes” directly. Job objects (analogous to cgroups) constrain sets of processes.
Inspection: ps, top, htop, pgrep, pidstat, pstree, /proc/<pid>/, lsof. On Windows: Task Manager, Process Explorer (Sysinternals), wmic process, ETW. On macOS: Activity Monitor, ps, vmmap, sample.
Process states and transitions. Linux task states deserve a closer look because they show up everywhere in monitoring:
- R (running or runnable) — on a CPU or in a runqueue.
- S (interruptible sleep) — blocked on a wait that can be interrupted by signals (most common: waiting on a syscall like read, select, epoll_wait, futex).
- D (uninterruptible sleep) — blocked in a kernel path that cannot be interrupted; usually awaiting I/O completion. Persistent D is a strong signal of storage trouble.
- T (stopped) — paused by SIGSTOP / SIGTSTP, debugger, or ptrace.
- Z (zombie) — exited, not yet reaped.
- X (dead) — about to be removed; rarely observed.
The wchan field (ps -o pid,state,wchan,cmd) reveals what kernel function a sleeping task is parked in — a much sharper diagnostic than top alone.
/proc is the interface. Per-pid: status, stat, maps, smaps, fd/, task/, cmdline, environ, cwd, limits, oom_score, cgroup, sched, wchan, io. System-wide: meminfo, vmstat, loadavg, stat, cpuinfo, interrupts, softirqs, pressure/. Reading /proc is a non-trivial cost at very high frequency (kernel formats text on each read); procfs-based tooling can become its own observer effect.
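As a small illustration of reading task state from procfs, the sketch below parses one line of /proc/<pid>/stat. The helper names parse_stat and read_state are hypothetical. The one subtlety is that the command name is wrapped in parentheses and may itself contain spaces or parentheses, so the parser splits on the last closing parenthesis.

```python
def parse_stat(line):
    """Extract pid, comm, state and ppid from one /proc/<pid>/stat line."""
    # Everything after the LAST ')' is a plain whitespace-separated list.
    left, _, right = line.rpartition(")")
    pid_text, _, comm = left.partition(" (")
    fields = right.split()
    return {
        "pid": int(pid_text),
        "comm": comm,
        "state": fields[0],   # R, S, D, T, Z, X, ...
        "ppid": int(fields[1]),
    }


def read_state(pid):
    """Return the one-letter state of a live task (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat(f.read())["state"]
```

Polling read_state in a tight loop runs into the formatting cost noted above, so sample sparingly.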
3. Threads
A thread is a schedulable execution context inside a process. Threads share the address space, file descriptor table, signal dispositions, and most other process resources; each has its own stack, registers, and thread-local storage.
POSIX threads (pthreads): pthread_create() spawns a thread, pthread_join() waits, pthread_detach() releases reaping, pthread_mutex_* for mutual exclusion, pthread_cond_* for condition variables.
Under Linux, pthread_create boils down to clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM | CLONE_SETTLS | …, …) — the flags say “share everything that defines a thread, give me a fresh stack and TLS pointer.” NPTL (Native POSIX Thread Library), shipped with glibc 2.3.2 in 2003 by Drepper and Molnar, replaced the older LinuxThreads with proper 1:1 kernel threads, signal handling, and futex-based synchronization.
Thread-local storage is exposed in C as __thread int counter; (GCC) or thread_local (C11/C++11). Glibc implements it via the TLS ABI defined by Drepper’s 2003 paper; on x86-64 the FS segment register points to the TLS block (32-bit x86 uses GS). pthread_key_create / pthread_getspecific / pthread_setspecific provide a slower, portable, dynamically-allocated alternative.
M:N threading. Historically, some systems multiplexed many user threads onto fewer kernel threads — Solaris LWPs, FreeBSD’s KSE (deprecated), early Java green threads. The hope was lighter context switches, but cooperation with kernel signals, blocking syscalls, and CPU affinity proved hard. Linux NPTL settled the debate for OS-level threading on the 1:1 side.
The pendulum swung back at the language-runtime level: Go goroutines (M:N, since Go 1.0, 2012), Erlang/BEAM processes, Java 21 virtual threads / Project Loom (GA September 2023). These run thousands or millions of user-mode threads on a small kernel-thread pool, with the runtime parking/unparking on I/O. The kernel still sees 1:1 OS threads — the M:N happens above it.
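Async runtimes. Event-loop runtimes (Node.js libuv, Python asyncio, Tokio in Rust) are an alternative to one-thread-per-task designs. The loop is often single-threaded, although Tokio defaults to a multi-threaded scheduler, and threads still appear under the hood for blocking I/O via worker pools.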
4. Scheduling
The scheduler picks which runnable task runs on each CPU at each tick. Modern multicore systems run one scheduler instance per CPU with periodic load balancing.
Goals (always in tension)
- Throughput — work completed per unit time.
- Latency — time from a task becoming runnable to running. Critical for interactivity, UI, network servers.
- Fairness — every runnable task gets a proportional share of CPU.
- Responsiveness — interactive tasks should run soon after a wakeup.
- Energy efficiency — on laptops and phones, prefer to keep CPUs in low-power states.
These trade off. A short RR slice improves fairness and latency but costs throughput through context-switch overhead. A long slice does the opposite. Real schedulers tune the balance heuristically and expose tunables.
Classical algorithms
- First-Come-First-Served (FCFS) — non-preemptive, convoy effect, terrible for interactivity.
- Shortest-Job-First (SJF) / SRTF (preemptive variant) — minimizes average wait; needs to predict burst length.
- Round-Robin (RR) — fixed time slice (quantum); fair for equal-priority tasks.
- Priority scheduling — strict priority; risks starvation without aging.
- Multilevel Feedback Queue (MLFQ) — multiple priority queues, demote CPU-bound tasks, promote I/O-bound tasks. Used in classic UNIX, Solaris, and the foundation of Windows priority scheduling.
Linux: O(1) → CFS → EEVDF
The O(1) scheduler (Ingo Molnar, 2.6.0, 2003) replaced the older O(n) scheduler with per-CPU active and expired arrays and a heuristic that estimated interactivity. It scaled to many CPUs but the interactivity heuristics were complex and tunable in unpleasant ways.
CFS — Completely Fair Scheduler (Molnar, merged in 2.6.23, October 2007). CFS replaced O(1) with a single clean idea: track each task’s accumulated virtual runtime (vruntime), and always pick the task with the smallest vruntime. vruntime advances slower for higher-priority (lower-nice) tasks, so they get proportionally more CPU. Implementation: a red-black tree keyed by vruntime, leftmost node is the next pick, O(log n) insert/remove. Tunables: sched_latency_ns (target preemption window), sched_min_granularity_ns (floor on slice). Group scheduling lets you put tasks into hierarchies (cgroups) and apply fairness at each level.
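The vruntime idea can be shown with a toy simulation. This is a sketch under simplifying assumptions, not the kernel's implementation: a binary heap stands in for the red-black tree, every task is always runnable, slices have a fixed length, and the weight is approximated as 1024 / 1.25^nice (the kernel uses a precomputed table in which nice 0 maps to 1024 and each nice step changes the weight by roughly 25%). The names simulate_fair and weight_for_nice are hypothetical.

```python
import heapq

NICE_0_WEIGHT = 1024


def weight_for_nice(nice):
    # Approximation of the kernel's weight table (assumption, see lead-in).
    return NICE_0_WEIGHT / (1.25 ** nice)


def simulate_fair(nice_by_task, total_ms, slice_ms=1.0):
    """Return CPU milliseconds received by each always-runnable task."""
    # Heap of (vruntime, name): the smallest vruntime is always picked next.
    heap = [(0.0, name) for name in sorted(nice_by_task)]
    heapq.heapify(heap)
    cpu_ms = {name: 0.0 for name in nice_by_task}
    elapsed = 0.0
    while elapsed < total_ms:
        vruntime, name = heapq.heappop(heap)
        cpu_ms[name] += slice_ms
        # vruntime advances more slowly for heavier (lower-nice) tasks.
        vruntime += slice_ms * NICE_0_WEIGHT / weight_for_nice(nice_by_task[name])
        heapq.heappush(heap, (vruntime, name))
        elapsed += slice_ms
    return cpu_ms
```

With one task at nice 0 and one at nice 5, the CPU split converges on the ratio of their weights, about 3:1 under this approximation.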
EEVDF — Earliest Eligible Virtual Deadline First (Stoica and Abdel-Wahab, “Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation,” 1995). EEVDF generalizes CFS by associating each task with a lag (how much over- or under-CPU it has received versus its fair share) and a virtual deadline. A task is eligible when its lag is non-negative; among eligible tasks, the one with the earliest virtual deadline runs next. Slices are derived from a per-task latency goal (sched_setattr sched_runtime and friends), giving low-latency tasks shorter, more frequent slices.
Peter Zijlstra rewrote Linux’s fair-class scheduler around EEVDF; it shipped in mainline Linux 6.6 (October 2023), replacing CFS. The user-visible API is mostly the same, but the policy is sharper at honoring latency requests and at handling sleepers/wakers without ad-hoc bonuses.
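The EEVDF selection rule described above (filter by eligibility, then take the earliest virtual deadline) reduces to a few lines. This sketch assumes lag and virtual deadline have already been computed for each task; the Task type and pick_next are hypothetical names, and the kernel uses an augmented tree rather than a linear scan.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    lag: float               # >= 0: has received no more than its fair share
    virtual_deadline: float


def pick_next(tasks):
    """Return the eligible task with the earliest virtual deadline, or None."""
    eligible = [t for t in tasks if t.lag >= 0]
    if not eligible:
        return None
    return min(eligible, key=lambda t: t.virtual_deadline)
```

A task that has overrun its share (negative lag) is skipped even if its deadline is the earliest, which is what keeps short-slice tasks from monopolizing the CPU.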
Load balancing across CPUs
Each CPU has its own runqueue. The kernel periodically pulls tasks from busy to idle CPUs along scheduling domain boundaries — sibling SMT threads first (cheapest migration, shared L1/L2), then cores on the same LLC, then sockets, then NUMA nodes (most expensive). sched_domain hierarchy is exposed under /sys/kernel/debug/sched/domains/. Wakeup balancing picks a CPU at the moment a task becomes runnable; periodic balancing rebalances when no wakeup happens. WAKE_AFFINE heuristics try to wake the wakee on the waker’s CPU to exploit cache warmth, gated by load and idle state.
Real-time scheduling classes
Linux exposes per-policy classes via sched_setscheduler:
- SCHED_OTHER (a.k.a. SCHED_NORMAL) — the default, served by CFS/EEVDF.
- SCHED_BATCH — for CPU-bound, non-interactive work; cheaper wakeups.
- SCHED_IDLE — runs only when nothing else does.
- SCHED_FIFO — fixed-priority real-time; runs until it blocks or a higher-priority FIFO task preempts. Priority 1–99 (higher = more urgent).
- SCHED_RR — like FIFO but with round-robin within a priority level.
- SCHED_DEADLINE — Earliest Deadline First (EDF) with Constant Bandwidth Server (CBS) admission control. Added by Dario Faggioli et al. in Linux 3.14 (March 2014). Tasks declare (runtime, deadline, period) triples; the kernel admits only feasible sets.
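To make the admission idea concrete, here is a simplified utilization test over (runtime, deadline, period) triples. It is a sketch, not the kernel's exact check: the function name admissible and the cap parameter are assumptions, and the real test also accounts for the bandwidth the system reserves for non-real-time work.

```python
def admissible(tasks, cpus=1, cap=1.0):
    """Simplified test: total runtime/period must fit in cpus * cap.

    tasks is an iterable of (runtime, deadline, period) in one time unit.
    """
    total = 0.0
    for runtime, deadline, period in tasks:
        # Each triple must satisfy runtime <= deadline <= period.
        if not (0 < runtime <= deadline <= period):
            raise ValueError(f"invalid triple: {(runtime, deadline, period)}")
        total += runtime / period
    return total <= cpus * cap
```

Two tasks needing 10 ms and 30 ms out of every 100 ms use 40% of one CPU and are admitted; adding a third that needs 70 ms per 100 ms would not fit on a single CPU.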
Windows
Windows uses a priority-driven preemptive scheduler with 32 priority levels (0–31): 0–15 are dynamic (“variable”), 16–31 are real-time (fixed). The dispatcher maintains the dispatcher database, including per-CPU ready queues. The balance set manager swaps and rebalances. Priority boosts kick in for I/O completion, GUI foreground, and starvation avoidance (anti-starvation boost every ~4 seconds).
Processor groups (introduced in Windows 7) allow >64 logical processors by partitioning CPUs into groups of up to 64. CPU sets (since Windows 10) are a soft-affinity mechanism letting an app reserve CPUs for itself. Job objects with CPU rate control provide cgroup-like resource limits.
macOS / XNU
macOS schedules threads, not tasks, with a Mach-derived multilevel priority scheme and BSD-style timesharing on top. The user-visible knob is Quality of Service (QoS) classes introduced in OS X 10.10 (2014):
- USER_INTERACTIVE — UI, animations.
- USER_INITIATED — user-triggered, blocks UI.
- UTILITY — long, user-visible progress.
- BACKGROUND — invisible, deferrable.
QoS controls priority, scheduling timer coalescing, I/O priority, and (on Apple Silicon) core selection between Performance (P) and Efficiency (E) cores. The kernel cooperates with kernel_task and a thermal pressure feedback loop. Apple’s “Asymmetric Multi-Processing (AMP) scheduler” replaced the older SMP scheduler when M1 shipped in 2020. QoS is propagated across dispatch_async (Grand Central Dispatch) calls and XPC IPC, so a background helper inherits its caller’s priority intent.
CPU affinity and isolation
- taskset / sched_setaffinity — pin a process to a CPU subset.
- cpuset cgroup (cgroup v1 and v2) — restrict groups of tasks to a CPU subset and a memory-node subset.
- isolcpus= kernel parameter — at boot, exclude CPUs from the general scheduler; only tasks explicitly pinned land there.
- nohz_full= — disable the periodic timer tick on the listed CPUs while running a single task. Combined with rcu_nocbs= and busy-poll networking, used by HPC and HFT shops to get sub-microsecond jitter.
- IRQ affinity — /proc/irq/<n>/smp_affinity — pin interrupts off your hot CPUs.
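The smp_affinity file takes a hexadecimal CPU bitmask, which is easy to get wrong by hand. The sketch below converts between a set of CPU numbers and that mask; cpus_to_mask and mask_to_cpus are hypothetical helper names. Masks wider than 32 bits are rendered as comma-separated 32-bit groups, matching how the kernel prints them.

```python
def cpus_to_mask(cpus):
    """Render CPU numbers as a hex bitmask, e.g. {0, 2} -> '5'."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    digits = f"{value:x}"
    # Split into 32-bit (8 hex digit) groups, least significant group last.
    groups = []
    while digits:
        groups.append(digits[-8:])
        digits = digits[:-8]
    return ",".join(reversed(groups))


def mask_to_cpus(mask):
    """Parse a hex bitmask (optionally comma-grouped) into a set of CPUs."""
    value = int(mask.replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}
```

Writing cpus_to_mask({0, 1}) to the IRQ's smp_affinity file would steer that interrupt to CPUs 0 and 1, leaving the remaining CPUs free for pinned work.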
5. Virtual memory
Every process sees a private linear address space; the MMU translates virtual addresses (VAs) to physical addresses (PAs) using page tables.
Page tables
- x86-64: 4-level paging since AMD64 was specified (PML4, PDPT, PD, PT), with 48-bit virtual addresses. 5-level paging (LA57, since Intel Ice Lake server and Linux 4.14 in 2017) adds PML5, extending to 57-bit VAs (128 PiB of virtual address space) and up to 52-bit physical addresses (4 PiB). Page sizes: 4 KiB (standard), 2 MiB (huge), 1 GiB (gigantic).
- ARM AArch64: 4-level translation by default, configurable 39/42/48-bit VAs; supports 4 KiB / 16 KiB / 64 KiB granules. Apple Silicon uses 16 KiB pages.
- Each level holds entries with a physical-frame number and permission/attribute bits (present, writable, user, no-execute, accessed, dirty, PAT/MAIR for caching).
- The base of the table tree lives in CR3 (x86) or TTBRn_EL1 (ARM). A context switch loads a new CR3 — flushing TLB entries unless PCID/ASID is in use.
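To see how an x86-64 virtual address selects one entry per level, the sketch below splits an address into 9-bit table indices plus a 12-bit page offset. It assumes 4 KiB pages, and split_va is a hypothetical helper name. The first index is for the top-level table (PML4 with four levels, PML5 with five).

```python
def split_va(va, levels=4):
    """Split a virtual address into per-level indices and the page offset."""
    offset = va & 0xFFF                        # low 12 bits: byte within page
    indices = []
    for level in range(levels):
        # Top level first: shifts are 39, 30, 21, 12 for 4-level paging.
        shift = 12 + 9 * (levels - 1 - level)
        indices.append((va >> shift) & 0x1FF)  # 9 bits = 512 entries per table
    return indices, offset
```

For the address 0x401234 this gives indices [0, 0, 2, 1] and offset 0x234: entry 0 of the PML4, entry 0 of the PDPT, entry 2 of the PD, entry 1 of the PT.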
TLB
The Translation Lookaside Buffer caches recent VA→PA translations. Misses incur a page-table walk (4–5 memory accesses on x86-64). TLBs are small (hundreds of entries) and per-core. Without PCID (Process-Context ID) on x86 or ASID on ARM, every context switch invalidates the TLB; with them, entries are tagged and can survive switches. Kernel TLB shootdowns happen via IPIs.
Demand paging
Pages are not actually allocated until first touch. The kernel sets up VMAs (Virtual Memory Areas — vm_area_struct on Linux) marking ranges with permissions and backing, then lets the first access fault. The page-fault handler decides: (a) zero a fresh anonymous page, (b) read from a file (file-backed VMA), (c) copy-on-write a shared page, or (d) deliver SIGSEGV if the access was illegal.
Copy-on-write (COW)
fork() does not duplicate every page; instead it marks both parent and child page tables read-only with reference counts, and only on a write does the kernel actually copy. mmap(MAP_PRIVATE) over a file gives a private writable view via the same mechanism.
The Zygote pattern (Android Dalvik/ART, Chrome’s renderer, some Python web frameworks like uWSGI’s preforking) exploits this: warm up an interpreter, load common libraries, then fork() per request. Most pages are shared, only modifications get copied.
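The private-mapping behaviour can be observed from userspace. This sketch relies on Python's mmap.ACCESS_COPY, which corresponds to a MAP_PRIVATE mapping on Unix; the function name is hypothetical. Writes land in a private copy of the page and never reach the file.

```python
import mmap
import tempfile


def private_write_leaves_file_untouched():
    """Write through a copy-on-write mapping; return (mapping, file) bytes."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"original")
        f.flush()
        # ACCESS_COPY: changes affect this mapping only, not the file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as view:
            view[:8] = b"modified"      # triggers a private copy of the page
            in_mapping = bytes(view[:8])
        f.seek(0)
        on_disk = f.read()
    return in_mapping, on_disk
```

The mapping sees the new bytes while the file keeps the old ones, the same rule that lets a forked child modify memory without disturbing its parent.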
Swap
Linux can page out anonymous memory to a swap partition or swapfile under pressure. In dedicated server workloads, swap is often disabled or de-emphasized with a low vm.swappiness (for example 10) — modern services would rather OOM-kill than thrash. On Android and ChromeOS, zRAM swaps to a compressed in-memory device, trading CPU for RAM. Windows uses paging files (pagefile.sys); macOS uses dynamic_pager.
NUMA
Multi-socket servers and modern AMD Epyc/Threadripper chips (chiplet-based) have Non-Uniform Memory Access: each CPU socket (or CCD) has local memory; access to remote memory crosses an interconnect (UPI on Intel, Infinity Fabric on AMD) and is 1.5–3× slower with lower bandwidth.
- numactl --hardware lists nodes and distances.
- numactl --cpunodebind=0 --membind=0 ./app pins both CPU and memory to node 0.
- libnuma exposes numa_alloc_onnode, numa_set_preferred, etc.
- Automatic NUMA balancing (Linux 3.8, February 2013, by Mel Gorman and Peter Zijlstra) periodically samples by un-mapping pages, then on the resulting fault decides whether to migrate the page or the task to keep them co-located.
Huge pages
Larger page sizes reduce TLB pressure and page-table memory.
- Explicit / HugeTLB: reserved at boot or via vm.nr_hugepages, accessed via MAP_HUGETLB or hugetlbfs. Required by Oracle, large JVMs, DPDK.
- Transparent Huge Pages (THP): kernel opportunistically promotes 4 KiB allocations to 2 MiB. Modes: always, madvise, never. THP can cause latency spikes during compaction and khugepaged scans; databases (Redis, PostgreSQL, MongoDB) generally recommend setting it to madvise or never.
Address-space layout
A typical Linux x86-64 process map (/proc/<pid>/maps) reads roughly: text and read-only data at the binary’s load address (PIE-randomized under ASLR), then read-write data, then the heap (brk), then anonymous and file-backed mmap regions growing downward from the top of userspace, then the thread stacks, then the vDSO and vsyscall page just under the kernel boundary. ASLR (Address Space Layout Randomization) is on by default since the mid-2000s; kASLR randomizes the kernel base too (since 3.14).
User/kernel split on x86-64 with 4-level paging: 0x0000_0000_0000_0000-0x0000_7fff_ffff_ffff user (128 TiB), a huge non-canonical hole, then 0xffff_8000_0000_0000-0xffff_ffff_ffff_ffff kernel (128 TiB). With 5-level paging the user half grows to 64 PiB and the kernel to 64 PiB. Linear mapping (direct map, __va/__pa) covers all physical RAM in the kernel half; vmalloc area, module area, and kernel text sit at fixed offsets.
Memory pressure and reclaim
- kswapd runs per-NUMA-node, reclaiming pages when free memory falls below the low watermark.
- PSI — Pressure Stall Information (Linux 4.20, Suren Baghdasaryan / Johannes Weiner, 2018) exposes per-resource (CPU, memory, I/O) stall percentages in /proc/pressure/ and per-cgroup, enabling adaptive load-shedding.
- OOM killer scores tasks by oom_score (a function of RSS and oom_score_adj) and kills the worst offender. cgroup v2 memory.max triggers a per-cgroup OOM.
- MGLRU (Multi-Gen LRU, Yu Zhao at Google, Linux 6.1, 2022) offers a generational alternative to the classic two-list (active/inactive) LRU for better scan behavior under pressure.
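A load-shedding loop driven by PSI mostly comes down to parsing the pressure files. The sketch below parses the text format of a file such as /proc/pressure/memory. parse_psi and should_shed_load are hypothetical names, and the 10% threshold is an arbitrary illustrative value, not a recommendation.

```python
def parse_psi(text):
    """Parse PSI text into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()   # kind is 'some' or 'full'
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # 'total' is a cumulative microsecond counter; avgN are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def should_shed_load(text, threshold_pct=10.0):
    """True when the 10-second 'some' stall average crosses the threshold."""
    return parse_psi(text)["some"]["avg10"] >= threshold_pct
```

A service could poll the file every few seconds and start rejecting low-priority requests while should_shed_load returns True.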
6. Memory allocators
User-space allocators sit between the program (malloc/free, new/delete) and the kernel (mmap, brk).
- glibc ptmalloc2 — derived from Doug Lea’s malloc, with per-thread arenas added by Wolfram Gloger. Default on most Linux distributions.
- musl mallocng — written by Rich Felker; replaced the older musl allocator in mid-2020. Smaller binary, deterministic, hardened.
- tcmalloc (Google, Sanjay Ghemawat, ~2005; “TCMalloc: Thread-Caching Malloc”). Per-thread caches with central free lists; very fast malloc/free paths.
- jemalloc (Jason Evans, FreeBSD 2005, then Facebook 2010, Mozilla 2008). Arenas, size classes, decay-based purging. Default in FreeBSD libc, used heavily by Cassandra, Aerospike, ScyllaDB, Rust until 2019.
- mimalloc (Daan Leijen at Microsoft, 2019). Free-list sharding, page-based heaps, secure mode. Used in .NET 5+ and many Microsoft services.
- rpmalloc (Mattias Jansson). Lock-free, optimized for game engines.
- Hoard (Berger et al., 2000) — academic ancestor of many modern scalable allocators.
Long-running services (especially ones with many threads and bursty allocation) often see memory fragmentation with the default glibc allocator. Mitigations: switch to jemalloc/mimalloc, lower MALLOC_ARENA_MAX (caps the number of per-thread arenas, trading concurrency for footprint), call malloc_trim() periodically.
Linux kernel allocators
- Buddy allocator — page-level allocator, manages physical pages in power-of-two blocks per migrate type (UNMOVABLE, MOVABLE, RECLAIMABLE, CMA).
- SLAB (Bonwick, Solaris 1994, ported to Linux 2.2) — caches of fixed-size objects.
- SLUB (Christoph Lameter, Linux 2.6.22, 2007) — simpler than SLAB, partial-slab lists per CPU, default since 2.6.23 era.
- SLOB — minimal, for very small embedded systems.
- kmalloc / kmem_cache_alloc — SLUB-backed; vmalloc — virtually contiguous but physically scattered (slower, used for large allocations that don’t need physical contiguity).
- Per-CPU caches and magazines (Bonwick + Adams, 2001) keep hot lists local to each CPU, only falling back to shared structures on miss/overflow.
7. Syscalls
A syscall is the user→kernel transition. On x86-64 Linux the calling convention is: number in RAX, arguments in RDI, RSI, RDX, R10, R8, R9, instruction syscall. The CPU jumps to the address in MSR LSTAR, switches privilege, runs the kernel handler, and returns via sysretq. On ARM64 the equivalent is SVC #0.
Linux has ~400 syscalls, numbered per architecture (see arch/x86/entry/syscalls/syscall_64.tbl). The ABI is stable across kernels — userspace built on a 2.6 kernel runs on a 6.x kernel — a core promise of the Linux project.
vDSO
The virtual dynamic shared object is a small ELF mapped into every process by the kernel (/proc/<pid>/maps shows [vdso]). It exports a handful of cheap functions (clock_gettime, gettimeofday, getcpu, time) that read shared kernel state directly without a privilege transition. A clock_gettime call that would cost ~200 ns as a syscall costs ~10 ns through vDSO.
io_uring
io_uring (Jens Axboe, Linux 5.1, May 2019) is a shared-memory ring-buffer interface between userspace and kernel for asynchronous I/O. Submission and completion queues live in mmap’d memory; the application writes SQEs, calls io_uring_enter (or relies on SQPOLL kernel-side polling), and reaps CQEs. It supports almost every I/O syscall — read, write, openat, accept, connect, send, recv, fsync, splice, fallocate, statx, recvmsg — plus linked operations, fixed buffers, fixed files, multi-shot poll, registered eventfds, and zero-copy send. Reduces per-op overhead dramatically vs epoll + per-op syscall, especially at high IOPS. Adopted by ScyllaDB, MongoDB, RocksDB, libuv (Node.js), Tokio (Rust), Netty.
epoll, kqueue, IOCP — the readiness/completion split
Before io_uring, the dominant high-throughput I/O multiplexer on Linux was epoll (Davide Libenzi, 2.5.44, 2002): epoll_create1, epoll_ctl to add/modify/remove fds, epoll_wait to retrieve ready events. Edge-triggered (EPOLLET) mode notifies only on state change, requiring drain-until-EAGAIN — efficient but easy to get wrong. Level-triggered (default) re-notifies as long as the fd is ready.
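The difference between the two trigger modes is easiest to see by polling twice without reading. This Linux-only sketch uses Python's select.epoll wrapper; count_wakeups is a hypothetical helper name.

```python
import select
import socket


def count_wakeups(edge_triggered):
    """Send one byte, poll twice without draining; return events per poll."""
    reader, writer = socket.socketpair()
    ep = select.epoll()
    try:
        flags = select.EPOLLIN | (select.EPOLLET if edge_triggered else 0)
        ep.register(reader.fileno(), flags)
        writer.send(b"x")
        first = ep.poll(1.0)   # data is pending, so both modes report it
        second = ep.poll(0)    # nothing was read and nothing new arrived
        return len(first), len(second)
    finally:
        ep.close()
        reader.close()
        writer.close()
```

Level-triggered mode reports the socket on both polls because the byte is still unread. Edge-triggered mode reports it once, which is why an edge-triggered reader has to keep reading until EAGAIN before polling again.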
Other UNIXes use kqueue (Jonathan Lemon, FreeBSD 4.1, 2000), exposing a richer event-filter model than epoll — files, signals, processes, timers, vnodes all unified. Inherited by macOS and the other BSDs. Windows uses IOCP (I/O Completion Ports) — a completion model rather than readiness, where the kernel notifies after I/O has finished, conceptually closer to io_uring than to epoll/kqueue.
eBPF
eBPF (extended Berkeley Packet Filter) is a sandboxed, verified bytecode VM that runs in the kernel. Programs are loaded via bpf() syscall, verified for termination and memory safety, JITed to native code, and attached to hooks: kprobes, uprobes, tracepoints, perf events, XDP (early packet processing), TC (traffic control), cgroup, sock_ops, LSM hooks. They communicate with userspace via BPF maps (hash, array, perf-event ring buffer, ringbuf).
Originating with Steven McCanne and Van Jacobson’s BPF (1992); extended by Alexei Starovoitov and Daniel Borkmann from 2014 onward. Major consumers:
- bcc and bpftrace — tracing front-ends.
- Cilium — Kubernetes networking and security via eBPF and XDP (Isovalent).
- Falco — runtime security (Sysdig).
- Pixie — observability (New Relic).
- Tetragon — kernel-level runtime enforcement (Isovalent).
- Katran — Facebook’s L4 load balancer.
- systemd-networkd, Cloudflare’s L4Drop, etc.
8. IPC
How processes talk to each other.
Pipes
- Anonymous pipes — pipe(int fds[2]) returns a reader and writer fd, inherited across fork. The pattern behind shell |.
- Named pipes (FIFOs) — mkfifo creates a special file in the file system; any process with read/write permission can open it.
Both are byte streams, kernel-buffered (default 64 KiB on Linux, tunable with fcntl(F_SETPIPE_SZ)), with EPIPE/SIGPIPE on broken pipe.
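The byte-stream and broken-pipe behaviour can be tried in a single process. This is a minimal sketch with hypothetical helper names. The payload has to fit in the pipe buffer because nothing is reading concurrently. Python ignores SIGPIPE at startup, so a broken pipe surfaces as EPIPE (BrokenPipeError) rather than killing the process.

```python
import os


def pipe_roundtrip(payload):
    """Write payload into a pipe, close the writer, read until EOF."""
    r, w = os.pipe()
    os.write(w, payload)   # payload must fit in the pipe buffer
    os.close(w)            # closing the last writer lets the reader see EOF
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:      # empty read means end of stream
            break
        chunks.append(chunk)
    os.close(r)
    return b"".join(chunks)


def write_after_reader_closed():
    """Return True if writing to a pipe with no reader fails with EPIPE."""
    r, w = os.pipe()
    os.close(r)
    try:
        os.write(w, b"x")
        return False
    except BrokenPipeError:
        return True
    finally:
        os.close(w)
```

In a real pipeline the two ends live in different processes after fork, and each side closes the end it does not use so that EOF and EPIPE are delivered correctly.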
UNIX domain sockets
AF_UNIX sockets are local and faster than TCP loopback. They support stream (SOCK_STREAM), datagram (SOCK_DGRAM), and sequenced-packet (SOCK_SEQPACKET) semantics, and can pass file descriptors via SCM_RIGHTS and credentials via SCM_CREDENTIALS. Used by Docker, containerd, systemd, X11, Wayland, PostgreSQL local connections, NGINX-FastCGI/uWSGI bridges. Abstract namespace (@name) avoids needing a path on disk.
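Descriptor passing can be sketched with the socket.send_fds and socket.recv_fds helpers (Python 3.9 or later, Unix only), which wrap the SCM_RIGHTS ancillary message. The function name is hypothetical, and both ends live in one process here purely to keep the example short.

```python
import os
import socket


def pass_fd_over_socketpair():
    """Send a pipe's read end over an AF_UNIX socket and read through it."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()
    try:
        # The descriptor travels as SCM_RIGHTS ancillary data next to b"fd".
        socket.send_fds(left, [b"fd"], [r])
        msg, fds, _flags, _addr = socket.recv_fds(right, 16, 1)
        received = fds[0]   # a new descriptor number for the same pipe
        os.write(w, b"via passed fd")
        data = os.read(received, 64)
        os.close(received)
        return msg, received != r, data
    finally:
        os.close(r)
        os.close(w)
        left.close()
        right.close()
```

The receiver gets its own descriptor number referring to the same open file description, which is how a privileged broker can open a resource and hand it to an unprivileged worker.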
Shared memory
- POSIX shared memory: shm_open creates a file under /dev/shm, ftruncate sizes it, mmap maps it. Survives the creator.
- System V shared memory: shmget + shmat; older, key-based, persists until shmctl(IPC_RMID) even on no attachers (a footgun).
- memfd_create (Linux 3.17) — anonymous in-memory fd, supports sealing (F_SEAL_WRITE, F_SEAL_SHRINK).
Shared memory is the fastest IPC but requires explicit synchronization (mutex/futex/atomic) and careful schema/version handling.
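A minimal sketch of memory shared across processes follows. It uses an anonymous shared mapping inherited over fork rather than shm_open or memfd_create, but the mapping mechanics are the same. The function name is hypothetical, and the only synchronization is waiting for the child to exit; concurrent access would need a lock or atomics.

```python
import mmap
import os
import struct


def child_writes_parent_reads(value):
    """Child stores a 64-bit value in shared memory; parent reads it back."""
    # Anonymous mapping; on Unix, mmap defaults to MAP_SHARED.
    shared = mmap.mmap(-1, mmap.PAGESIZE)
    pid = os.fork()
    if pid == 0:
        try:
            struct.pack_into("<Q", shared, 0, value)
        finally:
            os._exit(0)    # never return into the parent's code path
    os.waitpid(pid, 0)     # the child's exit is the synchronization point
    (seen,) = struct.unpack_from("<Q", shared, 0)
    shared.close()
    return seen
```

Unlike the copy-on-write pages a child normally gets from fork, this mapping is shared, so the parent observes the child's store.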
Message queues
- POSIX: mq_open, mq_send, mq_receive. Fixed max message size, priority-ordered.
- System V: msgget, msgsnd, msgrcv. Typed messages, older API.
Semaphores and futexes
- POSIX semaphores: sem_open (named) and sem_init (unnamed); counting semaphores.
- System V semaphores: semget, semop; arrays of semaphores, with SEM_UNDO for crash safety (and ample footguns).
- futex — Fast Userspace muTEX, the foundation of nearly all modern Linux synchronization. futex(addr, FUTEX_WAIT, val, timeout) sleeps if *addr == val; FUTEX_WAKE wakes waiters. The fast path is entirely userspace atomic CAS — kernel is only entered on contention. Designed by Hubertus Franke, Rusty Russell, and Matthew Kirkwood; in mainline since Linux 2.5.7 (2002). Built on futex: pthread mutexes, condvars, rwlocks, and semaphores in glibc and musl; Java’s LockSupport.park; Rust’s std::sync::Mutex (and parking_lot).
Signals
Asynchronous notifications: SIGINT (Ctrl-C), SIGTERM, SIGKILL (uncatchable), SIGSEGV, SIGPIPE, SIGCHLD, SIGUSR1/2, etc. Delivered via signal handlers (set up with sigaction). The set of async-signal-safe functions you can call from a handler is small (see signal-safety(7)). signalfd (Linux 2.6.22) exposes signals as a readable fd, easier to integrate with event loops than handlers. eventfd is a counter fd for wakeups; timerfd is an fd-driven timer.
Memory ordering and atomics
IPC and synchronization sit on top of the CPU’s memory model. x86-64 is TSO (Total Store Order): loads do not reorder past loads, stores do not reorder past stores, but stores may be buffered past later loads (StoreLoad reordering). ARM and POWER are weak: any pair can be reordered absent a barrier. C11 / C++11 std::atomic provides portable memory orders — relaxed, consume (rarely correctly used), acquire, release, acq_rel, seq_cst. Linux kernel uses its own model (READ_ONCE, WRITE_ONCE, smp_mb, smp_rmb, smp_wmb, smp_load_acquire, smp_store_release, RCU). Get this wrong and you get rare, hardware-dependent corruption that only shows up on ARM servers or under stress.
Higher-level IPC
- D-Bus — desktop Linux session/system bus for inter-application messaging (NetworkManager, systemd-logind, GNOME apps). Newer kdbus/BUS1 efforts to push it into the kernel were abandoned.
- Mach ports (macOS XNU) — capability-style typed message ports. Foundation of macOS IPC and XPC, Apple’s high-level IPC framework used between app extensions, helpers, and daemons.
- Windows IPC:
- Named pipes — \\.\pipe\name, both local and remote.
- Mailslots — one-way broadcast; rarely used today.
- ALPC (Advanced Local Procedure Call) — the modern, internal Windows IPC, used between subsystems, services, and RPC over local transport. Officially undocumented (but well reverse-engineered).
- COM/DCOM, WCF, gRPC, named pipes for SQL Server, etc., layer on top.
9. File systems
Brief tour — the topic deserves its own reference.
- Linux:
- ext4 — workhorse default of most distributions; journal, extents, delayed allocation. Direct descendant of ext2/ext3.
- XFS — SGI 1994, ported to Linux 2001, default on RHEL 7+ (since 2014). Excellent at large files and high parallelism.
- Btrfs — copy-on-write, snapshots, RAID-like profiles, subvolumes. Default on openSUSE, Fedora workstation (since 33), and SUSE Enterprise.
- ZFS — copy-on-write, snapshots, end-to-end checksumming, send/receive, native encryption. Out-of-tree on Linux (OpenZFS, CDDL vs GPL licensing). Default on TrueNAS, Proxmox option, FreeBSD root option.
- bcachefs — Kent Overstreet’s COW FS; merged in Linux 6.7 (2024).
- F2FS — flash-friendly, used in Android.
- tmpfs — RAM-backed; /tmp, /dev/shm, /run.
- Windows:
- NTFS — journaling, ACLs, alternate data streams, USN journal, MFT.
- ReFS — resilient FS, integrity streams, designed for Storage Spaces; server-focused.
- FAT32 / exFAT — interop / removable media.
- macOS:
- APFS — replaced HFS+ in 2017 (macOS 10.13 High Sierra). COW, cloning, snapshots, encryption, container/volume model.
- Distributed / parallel:
- Ceph — RADOS object store with RGW (S3), RBD (block), CephFS (POSIX).
- GlusterFS — distributed POSIX FS; declining use.
- Lustre — HPC parallel FS; many top-500 sites.
- BeeGFS — HPC parallel FS, easier ops than Lustre.
- GPFS / IBM Storage Scale — proprietary HPC and enterprise FS.
- HDFS — Hadoop FS, append-mostly, large blocks.
- Cloud object stores (not POSIX):
- S3, GCS, Azure Blob, R2, B2. Eventual or strong read-after-write consistency, HTTP API, lifecycle policies.
- POSIX FUSE gateways exist (s3fs, gcsfuse, Mountpoint for S3) but should not be treated as full file systems.
10. Container fundamentals
Linux containers are not a single feature but a combination of several independent kernel mechanisms: namespaces, cgroups, seccomp, capabilities, and layered filesystems.
Namespaces
Each namespace virtualizes one slice of the global system, so processes inside see only their slice.
- mnt (mount points; 2.4.19, 2002) — first namespace.
- pid (process IDs; 2.6.24, 2008).
- net (network stack — interfaces, routes, sockets; 2.6.24).
- ipc (System V IPC, POSIX message queues; 2.6.19).
- uts (hostname, domain name; 2.6.19).
- user (UID/GID mapping; 3.8, 2013).
- cgroup (cgroup view; 4.6, 2016).
- time (CLOCK_MONOTONIC / BOOTTIME offsets; 5.6, 2020).
Created via unshare() or clone(CLONE_NEW*); an existing namespace is joined via setns(). macOS and Windows take other paths to containerization: macOS doesn’t have namespaces in the Linux sense (Docker on macOS uses a Linux VM); Windows has Server Containers and Hyper-V containers with its own kernel-level isolation, plus WSL2.
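Namespace membership can be inspected from userspace: each entry under /proc/<pid>/ns is a symlink whose target looks like pid:[4026531836], and two processes share a namespace when the inode numbers match. A minimal sketch of that comparison follows; the helper names are hypothetical, and namespaces_of assumes Linux with /proc mounted.

```python
import os
import re

# Symlink targets under /proc/<pid>/ns look like "net:[4026531992]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")

def parse_ns_link(target):
    """Split a namespace symlink target into (kind, inode)."""
    m = NS_LINK.match(target)
    if m is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return m.group("kind"), int(m.group("inode"))

def namespaces_of(pid="self"):
    """Return {entry_name: inode} for one process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result

def shared_namespaces(a, b):
    """Entry names whose inode is identical in both maps."""
    return sorted(k for k in a.keys() & b.keys() if a[k] == b[k])
```

Comparing namespaces_of("self") against namespaces_of(1) from inside a container would typically show few or no shared entries, while two ordinary host processes share all of them.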
cgroups
Control groups account for and limit resources per group of tasks: CPU, memory, I/O, PIDs, network, devices, hugetlb, RDMA.
- cgroup v1 (since 2.6.24, 2008) had one hierarchy per controller; mounts ended up incoherent.
- cgroup v2 (since 4.5, 2016, by Tejun Heo) unifies into a single hierarchy with per-cgroup cpu.max, memory.max, io.max, pids.max. Default in RHEL 9 (2022), Ubuntu 22.04, Debian 11+, Fedora 31+.
- PSI integration; better memory accounting via memory.events, memory.pressure.
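The cgroup v2 interface files are plain text, so they are easy to generate and parse. The sketch below renders a cpu.max value (quota and period in microseconds, or the literal max for no limit) and parses the PSI format used by memory.pressure. The function names and the sample text are illustrative assumptions; the 100000 µs period is the kernel default.

```python
def format_cpu_max(cpus, period_us=100_000):
    """Render a cpu.max value: '<quota_us> <period_us>', or 'max <period_us>'."""
    if cpus is None:
        return f"max {period_us}"
    return f"{int(round(cpus * period_us))} {period_us}"

def parse_pressure(text):
    """Parse PSI text into {'some': {...}, 'full': {...}} with float values."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return out

# Made-up sample in the documented PSI layout.
sample = (
    "some avg10=1.25 avg60=0.40 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=42\n"
)
pressure = parse_pressure(sample)
```

Here format_cpu_max(1.5) yields the string a runtime would write to cap a group at one and a half CPUs, and pressure["some"]["avg10"] is the share of the last ten seconds in which at least one task stalled on memory.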
seccomp-bpf
Filters syscalls per process. Originally seccomp-strict (only read/write/exit/sigreturn), then seccomp-bpf (Will Drewry, 3.5, 2012), which lets you install a BPF program that decides per-syscall: allow, deny (errno), trap, kill, trace, log. Docker ships a curated default profile that blocks roughly 44 of the 300+ syscalls (mount, kexec_load, bpf, add_key, etc.); Kubernetes applies the runtime’s default profile only when a pod requests RuntimeDefault or the kubelet is configured to default to it. Custom profiles tighten further; libseccomp provides a higher-level API.
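Container runtimes usually take seccomp policy as a JSON document rather than raw BPF. The sketch below builds a small deny-list profile in the JSON layout Docker accepts via --security-opt seccomp=<file>; the helper name and the chosen syscalls are illustrative assumptions, and a production profile would more likely be an allow-list with SCMP_ACT_ERRNO as the default action.

```python
import json

def deny_profile(denied, default_action="SCMP_ACT_ALLOW"):
    """Allow everything except the named syscalls, which fail with an errno."""
    return {
        "defaultAction": default_action,
        "syscalls": [
            {"names": sorted(set(denied)), "action": "SCMP_ACT_ERRNO"},
        ],
    }

profile = deny_profile(["mount", "kexec_load", "bpf", "add_key"])
profile_json = json.dumps(profile, indent=2)
print(profile_json)
```

A deny-list is easier to read but weaker than an allow-list, because any syscall added to the kernel later is permitted by default.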
Capabilities
Linux capabilities (POSIX.1e-derived) split root into ~40 fine-grained bits: CAP_NET_ADMIN, CAP_SYS_ADMIN (the kitchen-sink one), CAP_SYS_PTRACE, CAP_CHOWN, CAP_NET_BIND_SERVICE, etc. Containers usually drop most and add only what is needed. The setcap/getcap utilities manage file capabilities.
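The capability sets of a process appear as hex bitmasks (CapInh, CapPrm, CapEff, CapBnd) in /proc/<pid>/status. A minimal decoder, assuming the bit numbering from linux/capability.h up to CAP_CHECKPOINT_RESTORE (bit 40); the function name is hypothetical, and the sample mask a80425fb is the value commonly reported for Docker's default capability set.

```python
# Index in this list == capability bit number in linux/capability.h.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]

def decode_caps(hex_mask):
    """Turn a CapEff-style hex string into a list of capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]

container_caps = decode_caps("00000000a80425fb")
```

The decoded list keeps CAP_NET_BIND_SERVICE and CAP_CHOWN but not CAP_SYS_ADMIN, which illustrates the drop-most, keep-few approach; capsh --decode from libcap does the same job on the command line.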
Filesystem layering
OverlayFS (Miklos Szeredi, mainlined 3.18, 2014) provides union mounts — a writable upper directory over a read-only lower directory. Docker images stack layers via OverlayFS. idmapped mounts (5.12, 2021) shift UIDs at mount time, simplifying user-namespace + bind-mount scenarios.
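An overlay mount is described by three options: lowerdir (one or more read-only layers, topmost first, colon-separated), upperdir (the writable layer), and workdir (scratch space on the same filesystem as upperdir). The sketch below assembles that option string from image layers listed bottom to top; the function name and the paths are hypothetical.

```python
def overlay_options(layers_bottom_to_top, upperdir, workdir):
    """Build the -o string for an overlay mount.

    The kernel stacks lowerdir entries with the leftmost on top, so the
    bottom-to-top layer list is reversed before joining.
    """
    if not layers_bottom_to_top:
        raise ValueError("need at least one lower layer")
    lower = ":".join(reversed(layers_bottom_to_top))
    return f"lowerdir={lower},upperdir={upperdir},workdir={workdir}"

opts = overlay_options(
    ["/layers/base", "/layers/app"],   # hypothetical image layers
    "/containers/c1/upper",
    "/containers/c1/work",
)
```

The resulting string would be passed as mount -t overlay overlay -o <opts> <merged-dir>; files in /layers/app shadow same-named files in /layers/base, and writes land in the upper directory.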
Runtimes
- runc: reference OCI runtime, written in Go, spun out of Docker’s libcontainer.
- crun — C reimplementation, faster, used by Podman.
- gVisor (Google) — userspace kernel via ptrace/KVM for stronger isolation.
- Kata Containers — lightweight VM per container.
- containerd — higher-level runtime; bundled with Docker, used directly by Kubernetes.
- CRI-O — Kubernetes-only runtime.
- Docker / Podman / nerdctl — CLIs on top.
Rootless containers and userns gotchas
Rootless containers (Podman, rootless Docker) run the container runtime as an unprivileged user via newuidmap / newgidmap (subuid / subgid mappings in /etc/subuid and /etc/subgid). The container’s “root” typically maps to the invoking user’s own UID, and the remaining container UIDs map into that user’s high-numbered subordinate range; Kubernetes user-namespaced pods instead map container root itself to a high-numbered host UID. Side-effects: some operations are restricted (binding to ports <1024 needs net.ipv4.ip_unprivileged_port_start lowered or CAP_NET_BIND_SERVICE via file caps), and bind mounts need to be UID-shifted via idmapped mounts to avoid ownership confusion.
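The translation that a user namespace applies can be sketched as a lookup over the triples in /proc/<pid>/uid_map (ID inside the namespace, ID outside, range length). The map values below are hypothetical; real ranges come from /etc/subuid, and the function names are illustrative.

```python
def parse_uid_map(text):
    """Parse uid_map text into (inside_start, outside_start, length) tuples."""
    return [tuple(int(x) for x in line.split()) for line in text.strip().splitlines()]

def to_host_uid(container_uid, uid_map):
    """Translate a UID inside the namespace to the host UID, or None if unmapped."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None

# Hypothetical map: container root is the invoking user (UID 1000),
# container UIDs 1..65536 come from a subordinate range starting at 100000.
uid_map = parse_uid_map("0 1000 1\n1 100000 65536\n")
```

With this map, a file owned by container UID 33 shows up on the host as UID 100032, which is why bind-mounted host directories often appear to have the wrong owner.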
11. Kernel modules and drivers
A kernel module is a relocatable object (.ko) loaded into the running kernel at runtime, exporting symbols and registering with subsystems. Driver classes:
- Character devices: byte-stream-like; serial ports, TTYs, /dev/random, GPIO.
- Block devices: fixed-size blocks; disks, SSDs, NVMe.
- Network devices: netdev interface; NICs, virtual interfaces.
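From userspace, the first two classes are visible in the file type bits of a /dev node, together with the major/minor numbers that select the driver and instance; network devices have no /dev node at all. A small sketch using only the standard library (the function names are hypothetical):

```python
import os
import stat

def device_class(mode):
    """Classify an st_mode value as a character device, block device, or neither."""
    if stat.S_ISCHR(mode):
        return "char"
    if stat.S_ISBLK(mode):
        return "block"
    return "not a device node"

def describe(path):
    """Return (class, major, minor) for a path such as /dev/null."""
    st = os.stat(path)
    return device_class(st.st_mode), os.major(st.st_rdev), os.minor(st.st_rdev)
```

On a typical Linux system describe("/dev/null") reports a character device, while a disk node such as /dev/sda or /dev/nvme0n1 reports a block device.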
Toolchain: Kbuild Makefiles, insmod/modprobe to load, rmmod to unload, lsmod and /proc/modules to list, modinfo to introspect, dmesg for kernel log. /sys/ (sysfs, since 2.6) exposes kernel objects as a filesystem; udev rules under /etc/udev/rules.d/ react to events (USB inserted, disk appeared) and create /dev/ nodes or run scripts. DKMS (Dynamic Kernel Module Support) rebuilds out-of-tree modules against each installed kernel — used for NVIDIA, ZFS, V4L2 webcam drivers.
Modern kernel-bypass alternatives:
- DPDK: userspace network drivers + ring buffers, igb_uio/vfio-pci bound NICs, polled, no syscalls. Used by Open vSwitch, VPP, Pktgen.
- SPDK: userspace NVMe/iSCSI/NBD/VFIO drivers, polled.
- VFIO — IOMMU-backed userspace device access; foundation of PCIe passthrough, QEMU PCI passthrough.
Block layer and I/O schedulers
Above raw device drivers sits the Linux block layer, which has been rewritten twice in the last decade. The legacy single-queue layer used per-device elevators (CFQ, deadline, noop) and one request queue with one spinlock. blk-mq (Jens Axboe and Christoph Hellwig, mainlined 3.13, 2014; default for all devices since 5.0, 2019) introduced per-CPU software queues and per-device hardware queues to keep up with NVMe parallelism (multiple submission queues, MSI-X per queue). Schedulers in the mq world:
- none — pass-through; default for NVMe, where queueing and ordering in the device are sufficient.
- mq-deadline — read/write deadlines, simple, low overhead. Default for SATA SSDs on many distros.
- kyber — latency-targeted, adjusts queue depths dynamically. Facebook origin.
- bfq (Budget Fair Queueing) — fairness-oriented, good for desktop interactivity. Paolo Valente’s work.
/sys/block/<dev>/queue/scheduler shows current; /sys/block/<dev>/mq/ exposes per-hardware-queue stats.
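The scheduler file lists every available scheduler on one line and marks the active one with square brackets, for example [mq-deadline] kyber bfq none. A minimal parser for that format (the function name is hypothetical):

```python
def parse_scheduler(text):
    """Return (active, available) from the contents of queue/scheduler."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]   # strip the brackets marking the active choice
            active = name
        available.append(name)
    return active, available

active, available = parse_scheduler("[mq-deadline] kyber bfq none\n")
```

Writing one of the available names back to the same file (as root) switches the scheduler for that device.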
NVMe and the storage stack
NVMe (NVM Express, 1.0 spec 2011) replaced AHCI/SATA for SSDs with a PCIe-native, queue-based command interface — up to 64K submission/completion queues, each up to 64K deep. NVMe over Fabrics (NVMe-oF) runs the same protocol over RDMA, TCP, or Fibre Channel. Zoned Namespaces (ZNS) SSDs expose append-only zones, pushing wear-leveling and GC up to the host. Filesystems like F2FS, Btrfs, and modern XFS have ZNS-aware modes.
12. Power management
- ACPI (Advanced Configuration and Power Interface): Intel/Microsoft/Toshiba 1996; the standard for power, thermal, and configuration tables exposed by firmware. Sleep states: S0 (working, with sub-states S0ix on modern laptops, “Modern Standby”), S3 (suspend to RAM), S4 (suspend to disk / hibernate), S5 (soft off).
- CPU frequency scaling (cpufreq) — governors decide CPU frequency:
- performance: max always.
- powersave: min always.
- ondemand: ramp on demand.
- conservative: gentler ramp.
- schedutil (since Linux 4.7, 2016): scheduler-driven, uses CFS utilization signals; default on most modern setups.
- intel_pstate: Intel’s HWP-aware driver, bypasses cpufreq governor on supported CPUs.
- C-states: CPU idle states; deeper = lower power, higher wake latency. cpuidle framework picks. intel_idle driver on Intel.
- P-states: performance / frequency states selected within C0.
- EAS — Energy-Aware Scheduling — initially developed by ARM and Linaro for big.LITTLE asymmetric multi-core (Cortex-A53 + A57, etc.), mainlined in Linux 5.0 (2019). The scheduler uses a per-CPU energy model to place tasks on the most energy-efficient CPU that meets the performance need. On Apple Silicon, the AMP scheduler plays a similar role between P-cores and E-cores, informed by QoS.
13. Tracing and observability
Knowing what the kernel and your processes are doing.
- perf (Linux, since 2.6.31, 2009): hardware performance counters (cycles, instructions, cache-misses, branch-misses), software events, sampling profiler, kprobes, uprobes. Front-end for many other facilities. perf top, perf record + perf report, perf stat, perf sched, perf lock, flame graphs (Brendan Gregg’s stack-folding scripts).
- ftrace: function tracer, function-graph tracer, event tracer. /sys/kernel/tracing/ interface or trace-cmd / kernelshark front-end. Steven Rostedt’s work.
- eBPF tools: the modern go-to.
- bcc (BPF Compiler Collection, IO Visor): Python-driven, 100+ scripts (execsnoop, opensnoop, tcpconnect, biolatency, cachestat, runqlat, funccount).
- bpftrace: awk/dtrace-like one-liner DSL, JITed via LLVM.
- libbpf + CO-RE (Compile Once, Run Everywhere): production-grade BPF programs as standalone binaries (Andrii Nakryiko’s work).
- LTTng (Mathieu Desnoyers, Linux Trace Toolkit Next Generation) — low-overhead static tracing for kernel and userspace; CTF traces analyzed in babeltrace, Trace Compass.
- SystemTap — script language compiled to kernel modules; predates eBPF’s tracing capabilities and is being supplanted by it.
- DTrace — Sun, 2003 (Cantrill, Shapiro, Leventhal). Pervasive on Solaris, then FreeBSD and macOS. A Linux port exists but eBPF won the Linux side. Still indispensable on Illumos and BSD.
- Windows:
- ETW (Event Tracing for Windows) — the everything-bus for kernel and userland events.
- xperf / WPA (Windows Performance Analyzer) — interpret ETW traces.
- PerfView — Microsoft .NET-focused.
- Application-level:
- OpenTelemetry — tracing, metrics, logs SDK.
- Prometheus + Grafana — metrics scraping and dashboards.
- Jaeger, Tempo, Zipkin — distributed tracing back-ends.
Security mitigations born from Spectre/Meltdown
The 2018 disclosure of Meltdown (Lipp et al., Graz) and Spectre (Kocher et al., Google Project Zero / academic teams) — speculative execution side-channels reading data across protection boundaries — forced large, painful kernel changes:
- KPTI / KAISER (Kernel Page Table Isolation) — separate page tables for user and kernel mode, mitigating Meltdown on Intel; carries a measurable syscall-cost hit, partially clawed back by PCID.
- Retpoline and later IBRS / IBPB / STIBP — indirect-branch mitigations for Spectre v2.
- SSBD (Speculative Store Bypass Disable) for v4 / Spectre-NG.
- L1TF, MDS, TAA, RIDL, ZombieLoad: successive variants and their flush-on-VM-switch / MD_CLEAR mitigations.
- Retbleed (2022), Downfall (2023 Intel), Inception / SRSO (2023 AMD): each prompted new microcode and kernel work.
Side-channel mitigations and Spectre-class fixes have a real performance cost. mitigations=off (boot parameter) disables them for trusted single-tenant workloads; never use on multi-tenant boxes.
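The kernel reports the state of each known issue under /sys/devices/system/cpu/vulnerabilities, one file per vulnerability, with contents such as Not affected, Vulnerable, or Mitigation: followed by the mechanism. A sketch that buckets those strings, useful for checking what a given boot configuration actually left enabled; the function names and bucket labels are hypothetical.

```python
import os

VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"

def classify(status):
    """Bucket one status string from the vulnerabilities directory."""
    s = status.strip()
    if s == "Not affected":
        return "not_affected"
    if s.startswith("Mitigation:"):
        return "mitigated"
    if s.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"

def scan(vuln_dir=VULN_DIR):
    """Return {vulnerability_name: bucket}; empty if the directory is absent."""
    report = {}
    if not os.path.isdir(vuln_dir):
        return report
    for name in sorted(os.listdir(vuln_dir)):
        with open(os.path.join(vuln_dir, name)) as f:
            report[name] = classify(f.read())
    return report
```

Any entry that comes back as vulnerable on a multi-tenant host is worth investigating before it reaches production.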
14. Real-time and embedded
A real-time OS guarantees worst-case response time, not just average throughput.
- PREEMPT_RT — Ingo Molnar, Thomas Gleixner, Steven Rostedt, et al., started 2004 as an out-of-tree patch making nearly every kernel critical section preemptible (replacing spinlocks with mutex-based equivalents, making interrupts threaded). Mainlined progressively; the last big pieces — RT-specific locking primitives and printk rework — landed in Linux 6.12 (November 2024), completing a 20-year mainlining effort. Used in industrial automation, robotics, audio (JACK), VoIP.
- Xenomai — dual-kernel approach: a real-time co-kernel (“Cobalt”) sits beneath Linux and handles RT tasks, with Linux as a low-priority secondary domain. Versions 3 and 4 (Cobalt/EVL).
- RT-Preempt vs Xenomai trade-off — single-kernel (PREEMPT_RT) is simpler and getting close to dual-kernel latency for most workloads.
- RTEMS — Real-Time Executive for Multiprocessor Systems; aerospace heritage (used by NASA, ESA).
- FreeRTOS — Richard Barry, 2003; very small, very widespread microcontroller RTOS. AWS-stewarded since 2017.
- Zephyr — Linux Foundation project (2016 onward); modern, modular, secure-by-design RTOS for MCUs and small SoCs.
- NuttX — Greg Nutt; POSIX-like RTOS, used in PX4 drone autopilot and Sony audio devices.
- ThreadX / Microsoft Azure RTOS — long-running commercial RTOS (Express Logic, acquired by Microsoft 2019, open-sourced 2024).
- VxWorks (Wind River) — long-standing commercial RTOS in aerospace and defense.
- QNX Neutrino — commercial microkernel RTOS in automotive (BlackBerry).
15. Common pitfalls
- Premature optimization for cache locality before profiling. Modern CPUs are deeply non-intuitive; measure with perf stat for cache misses and perf c2c for false sharing before redesigning data structures.
- Spinlock in user space. Don’t. The kernel can deschedule the lock holder mid-critical-section, and the spinner burns its quantum waiting. Use a futex-backed mutex (pthread_mutex, std::sync::Mutex, etc.). Adaptive mutexes that briefly spin before parking are fine; glibc offers this as the PTHREAD_MUTEX_ADAPTIVE_NP mutex type.
- Python GIL surprises. CPython’s Global Interpreter Lock serializes bytecode execution; threading is great for I/O concurrency, useless for CPU-bound parallelism. Use multiprocessing, concurrent.futures.ProcessPoolExecutor, or release the GIL in C extensions. PEP 703 (the free-threaded, no-GIL build) is available experimentally in 3.13+ but is not yet the default.
- Process forking with file-descriptor leak. Fds are inherited by default. Open with O_CLOEXEC (or set FD_CLOEXEC via fcntl) so execve closes them. Otherwise child processes hold open the parent’s database connections, sockets, log files, etc.
- Memory fragmentation in long-running services. glibc’s per-thread arenas plus a workload of many sizes leads to growing RSS that never shrinks. Lower MALLOC_ARENA_MAX, switch to jemalloc/mimalloc, or call malloc_trim(0) periodically. Periodic process restarts (uWSGI/passenger style) are a blunt but effective fallback.
- NUMA imbalance. Automatic balancing helps but is not magic. For latency-sensitive services on multi-socket boxes, pin to one NUMA node (numactl), or run one instance per node with a load balancer in front. Database engines (Cassandra, MongoDB, Postgres on large boxes) need attention here.
- THP latency spikes. As noted, set THP to madvise or never for database-class workloads.
- Signal handlers calling non-async-signal-safe functions. Calling printf, malloc, or pthread routines from a signal handler can deadlock the process. The safe move is to write a byte to a self-pipe or eventfd from the handler (or block the signal and read it via signalfd) and process in the main loop.
- CPU pinning without TLB/cache awareness. Pinning to a CPU that shares L1/L2 with a noisy neighbor can be worse than not pinning. Use lstopo (hwloc) to see the topology, and taskset --cpu-list with care.
- Ignoring PSI. PSI gives you actionable pressure signals before OOM kills or full thrash. Wire it into your autoscaler.
- Mixing fork() with multithreading. After fork() in a multithreaded program, only the calling thread survives in the child; mutexes held by other threads stay locked. Use pthread_atfork or, better, posix_spawn / vfork + execve if you only want to exec.
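The file-descriptor leak item comes down to one flag bit per descriptor. The sketch below reads and toggles FD_CLOEXEC through fcntl; the helper names are hypothetical. Note that CPython (since 3.4, PEP 446) already creates its own descriptors non-inheritable, so in Python the explicit flag matters mainly for descriptors that arrive from C extensions or from the parent process.

```python
import fcntl
import os

def is_cloexec(fd):
    """True if the descriptor will be closed automatically on execve."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

def set_cloexec(fd, enabled=True):
    """Set or clear FD_CLOEXEC without disturbing other descriptor flags."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if enabled:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)

# Open with O_CLOEXEC so there is no window between open() and fcntl()
# in which another thread could fork+exec and leak the descriptor.
fd = os.open(os.devnull, os.O_RDONLY | os.O_CLOEXEC)
opened_cloexec = is_cloexec(fd)
set_cloexec(fd, False)            # deliberately make it inheritable
after_clear = is_cloexec(fd)
set_cloexec(fd, True)
after_set = is_cloexec(fd)
os.close(fd)
```

Setting the flag atomically at open time is preferable to a later fcntl call in multithreaded programs, for the race described in the comment.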
15. Common pitfalls (continued — high-frequency operational issues)
- Conntrack table full. Servers behind iptables/nftables NAT or with stateful firewall accumulate connection-tracking entries; once nf_conntrack_max is hit, new connections are dropped with nf_conntrack: table full, dropping packet in dmesg. Tune nf_conntrack_max, nf_conntrack_buckets, and TCP timeouts; or bypass conntrack with notrack rules for known-stateless paths.
- TIME_WAIT exhaustion. Short-lived outbound connections from a small ephemeral-port range exhaust ports while sockets sit in TIME_WAIT (2*MSL; 60s on Linux). Mitigations: increase net.ipv4.ip_local_port_range, enable tcp_tw_reuse, use connection pooling, or use long-lived HTTP/2 connections.
- vm.max_map_count too low. Java and some databases (Elasticsearch) mmap many files; the default limit of 65530 mappings overflows. Bump to 262144 or higher.
- fs.file-max and nofile rlimit too low. A single high-traffic server can need hundreds of thousands of fds. Raise both system-wide (fs.file-max) and per-process (ulimit -n, LimitNOFILE= in systemd unit).
- Inode exhaustion. A disk with free bytes but no free inodes returns ENOSPC on create. df -i checks; common with millions of small files (mail spools, build caches). XFS has dynamic inode allocation; ext4 needs mkfs.ext4 -N at format time.
- Forgetting MSG_NOSIGNAL on send(). The kernel raises SIGPIPE when the peer has closed, killing the process unless ignored. Either ignore SIGPIPE process-wide, or pass MSG_NOSIGNAL on every send.
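The TIME_WAIT item is easy to quantify. Assuming every closed connection holds its local port for the full TIME_WAIT period and no reuse is enabled, the sustainable rate of new outbound connections to a single destination address and port is the size of the ephemeral range divided by the hold time. A back-of-envelope sketch using the usual Linux default range of 32768 to 60999 (the function name is hypothetical):

```python
def max_new_connections_per_second(port_low=32768, port_high=60999, time_wait_s=60):
    """Upper bound on new connections/s to one destination (IP, port) pair."""
    ports = port_high - port_low + 1
    return ports / time_wait_s

default_rate = max_new_connections_per_second()
widened_rate = max_new_connections_per_second(1024, 65535)
```

The default works out to roughly 470 connections per second per destination, and widening the range to 1024 through 65535 a little more than doubles it, which is why pooling or long-lived connections are the more durable fix.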
15a. Worked examples
A few small scenarios that tie the layers together.
A web server under load. Nginx accepts connections via accept4; with the reuseport listen option each worker gets its own SO_REUSEPORT listener, so the kernel hashes new connections across them, avoiding the thundering-herd accept storm. Each worker uses epoll to multiplex thousands of connections; io_uring event loops exist only in experimental patches, not in stock builds. When a request needs a file, sendfile (or splice) hands the file directly from page cache to socket buffer without a userspace bounce. CPU pinning via systemd CPUAffinity= plus IRQ affinity for the NIC (one queue per worker CPU, set_irq_affinity.sh) keeps each connection on one core end-to-end. PSI watchdogs in the cgroup signal the autoscaler when memory pressure rises.
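The sendfile step can be tried in isolation. The sketch below streams a temporary file into one end of a local socket pair, which stands in for a client connection; it assumes Linux, where os.sendfile takes (out_fd, in_fd, offset, count). The function name and payload are illustrative.

```python
import os
import socket
import tempfile

def send_file(sock, path):
    """Stream a file to a connected socket with sendfile(2); returns bytes sent."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:   # file shrank underneath us
                break
            offset += sent
    return offset

# Demo: a small file, so it fits in the socket buffer without a reader thread.
payload = b"hello from the page cache\n" * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
server_side, client_side = socket.socketpair()
sent = send_file(server_side, tmp.name)
server_side.close()

received = b""
while True:
    chunk = client_side.recv(65536)
    if not chunk:
        break
    received += chunk
client_side.close()
os.unlink(tmp.name)
```

The file contents never pass through a Python bytes object on the sending side; the kernel copies from page cache to the socket buffer directly, which is the saving the paragraph above describes.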
A database flushing a checkpoint. Postgres writes dirty buffers via pwrite, then fsync to force them to durable storage. The kernel’s writeback (flush-* kworkers) and the I/O scheduler (BFQ, mq-deadline, Kyber, or none for NVMe) decide ordering. On NUMA boxes, the checkpoint worker should be pinned to the same node as its shared buffers — numastat and numa_maps confirm. THP off, vm.dirty_background_ratio and vm.dirty_ratio tuned to prevent IO stalls when the page cache flushes en masse.
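The write-then-fsync pattern at the core of a checkpoint can be sketched with positional writes. This is a simplified illustration rather than how Postgres itself is structured: the file name and helper are hypothetical, and the 8 KiB page size mirrors the Postgres default block size.

```python
import os
import tempfile

PAGE = 8192  # bytes per page in this sketch

def durable_pwrite(path, data, offset):
    """Write data at a fixed offset, then force it to stable storage."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        done = 0
        while done < len(data):            # pwrite may write fewer bytes than asked
            done += os.pwrite(fd, data[done:], offset + done)
        os.fsync(fd)                       # returns only after the device acknowledges
    finally:
        os.close(fd)
    return done

workdir = tempfile.mkdtemp()
page_file = os.path.join(workdir, "relation.dat")   # hypothetical data file
durable_pwrite(page_file, b"A" * PAGE, 0 * PAGE)    # page 0
durable_pwrite(page_file, b"B" * PAGE, 1 * PAGE)    # page 1
with open(page_file, "rb") as f:
    contents = f.read()
```

A real checkpointer batches many pwrite calls per fsync, since the fsync is where the latency lives; calling it per page as above is the simplest correct form, not the fastest.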
A container starting up. runc calls clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC | CLONE_NEWUTS | CLONE_NEWUSER | CLONE_NEWCGROUP), mounts the OverlayFS image layers, pivots the root, applies the seccomp filter, drops capabilities, and execves the entrypoint. The cgroup limits (cpu.max, memory.max) are written before execve. systemd-nspawn, podman, and runc all do essentially these steps; Docker adds containerd’s image management on top.
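The flag word passed to clone in that sequence is a bitwise OR of per-namespace constants. The values below are copied from linux/sched.h; the dictionary and function names are hypothetical, and the constants are defined locally because os.CLONE_* only exists in recent Python versions.

```python
# Namespace flags from linux/sched.h.
CLONE_FLAGS = {
    "mnt":    0x00020000,  # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts":    0x04000000,  # CLONE_NEWUTS
    "ipc":    0x08000000,  # CLONE_NEWIPC
    "user":   0x10000000,  # CLONE_NEWUSER
    "pid":    0x20000000,  # CLONE_NEWPID
    "net":    0x40000000,  # CLONE_NEWNET
}

def clone_flags(kinds):
    """OR together the flags for the requested namespace kinds."""
    flags = 0
    for kind in kinds:
        flags |= CLONE_FLAGS[kind]
    return flags

def decode_flags(flags):
    """Inverse of clone_flags: list the namespace kinds present in a flag word."""
    return sorted(kind for kind, bit in CLONE_FLAGS.items() if flags & bit)

container_flags = clone_flags(["pid", "net", "mnt", "ipc", "uts", "user", "cgroup"])
```

Decoding a flag word this way is handy when reading strace output of a runtime, where the clone or unshare argument may be shown as a raw number.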
16. Cross-references
- _index
- concurrency-primitives
- networking-foundations
- kubernetes-deep
- database-internals
- distributed-systems-fundamentals
16a. Related concepts inside the vault
When this note references topics covered elsewhere in depth, prefer those pages for full treatment:
- Lock-free data structures, RCU, hazard pointers, ABA problem: see [[Compute/concurrency-primitives]].
- TCP, BBR, QUIC, kernel-bypass DPDK/XDP: see [[Compute/networking-foundations]].
- LSM trees, B-trees, WAL, MVCC: see [[Compute/database-internals]].
- Raft, Paxos, ZooKeeper, etcd: see [[Compute/consensus-protocols]].
- Kubernetes scheduling, CRI, CSI, CNI: see [[Compute/kubernetes-deep]].
17. Citations
- Bovet, Daniel P. and Marco Cesati. Understanding the Linux Kernel, 3rd edition. O’Reilly, 2005. (Covers up through 2.6; structurally still useful for fundamentals.)
- Love, Robert. Linux Kernel Development, 3rd edition. Addison-Wesley, 2010.
- Tanenbaum, Andrew S. and Herbert Bos. Modern Operating Systems, 5th edition. Pearson, 2022.
- Russinovich, Mark, David Solomon, Alex Ionescu, and Andrea Allievi. Windows Internals, 7th edition, Part 1 (2017) and Part 2 (2021). Microsoft Press.
- Singh, Amit. Mac OS X Internals: A Systems Approach. Addison-Wesley, 2006. (Dated but still the deepest XNU reference.)
- Levin, Jonathan. MacOS and iOS Internals, vols. I–III. Technologeeks, 2017–2018.
- McKusick, Marshall Kirk, George V. Neville-Neil, and Robert N. M. Watson. The Design and Implementation of the FreeBSD Operating System, 2nd edition. Addison-Wesley, 2014.
- Stoica, Ion and Hussein Abdel-Wahab. “Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation.” Old Dominion University TR 95-22, 1995.
- Molnar, Ingo. “[ANNOUNCE] CFS Scheduler” lkml post, 13 April 2007.
- Zijlstra, Peter. “sched/fair: Latency-aware EEVDF scheduler” patch series, 2023; merged in Linux 6.6.
- Faggioli, Dario et al. “SCHED_DEADLINE” merge, Linux 3.14, 2014.
- Drepper, Ulrich. “ELF Handling for Thread-Local Storage.” Red Hat, 2003.
- Drepper, Ulrich and Ingo Molnar. “The Native POSIX Thread Library for Linux.” Red Hat whitepaper, 2003.
- Franke, Hubertus, Rusty Russell, and Matthew Kirkwood. “Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux.” OLS 2002.
- Axboe, Jens. “Efficient IO with io_uring.” Kernel.org, 2019.
- Starovoitov, Alexei and Daniel Borkmann. eBPF mainlining commits, Linux 3.18+, 2014 onward.
- Gorman, Mel. Understanding the Linux Virtual Memory Manager. Prentice Hall, 2004.
- Bonwick, Jeff. “The Slab Allocator: An Object-Caching Kernel Memory Allocator.” USENIX Summer 1994.
- Bonwick, Jeff and Jonathan Adams. “Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources.” USENIX 2001.
- Lameter, Christoph. “SLUB: The Unqueued Slab Allocator.” LWN.net article, 2007.
- Evans, Jason. “A Scalable Concurrent malloc(3) Implementation for FreeBSD.” BSDCan 2006.
- Ghemawat, Sanjay and Paul Menage. “TCMalloc: Thread-Caching Malloc.” Google whitepaper.
- Leijen, Daan, Benjamin Zorn, and Leonardo de Moura. “Mimalloc: Free List Sharding in Action.” Microsoft Research, APLAS 2019.
- Berger, Emery, Kathryn McKinley, Robert Blumofe, and Paul Wilson. “Hoard: A Scalable Memory Allocator for Multithreaded Applications.” ASPLOS 2000.
- Gregg, Brendan. BPF Performance Tools. Addison-Wesley, 2019.
- Gregg, Brendan. Systems Performance, 2nd edition. Addison-Wesley, 2020.
- Klein, Gerwin et al. “seL4: Formal Verification of an OS Kernel.” SOSP 2009.
- Madhavapeddy, Anil et al. “Unikernels: Library Operating Systems for the Cloud.” ASPLOS 2013.
- Engler, Dawson R., M. Frans Kaashoek, and James O’Toole. “Exokernel: An Operating System Architecture for Application-Level Resource Management.” SOSP 1995.
- The Linux man-pages project (Michael Kerrisk, maintainer), particularly sched(7), cgroups(7), namespaces(7), seccomp(2), bpf(2), io_uring(7), futex(2), unix(7), signal-safety(7), mmap(2).
- LWN.net (Jonathan Corbet et al.): ongoing kernel coverage; specific articles cited in context for CFS, EEVDF, MGLRU, io_uring, PREEMPT_RT, eBPF.