Distributed Systems Fundamentals — Compute Reference

1. At a glance

A distributed system is a collection of independent computers that appears to its users as a single coherent system (Tanenbaum + van Steen 2017). More operationally: a set of processes running on physically separate machines, communicating only by passing messages over an unreliable network, with no shared memory and no globally synchronized clock. Lamport’s pragmatic definition is sharper: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” (Lamport, attributed circa 1987).

Why it matters: every system of meaningful scale is distributed. A single machine has finite CPU, memory, disk, and crucially, finite reliability — MTBF on commodity servers is measured in months, not decades. Hyperscale infrastructure (Google, Amazon, Meta) reports thousands of disk failures per day across their fleets. Distribution is therefore not an architectural choice but a survival requirement once load, fault tolerance, or geographic latency exceed single-machine bounds.

The fundamental challenges that distinguish distributed from local computation:

  • Partial failure. In a local program, a fault either crashes the whole process or it doesn’t. In a distributed system, any subset of nodes or links can fail independently while the rest continue. A request sent to a remote node may complete, fail entirely, fail after partially executing, or — most pernicious — succeed at the remote node while the response is lost. The sender cannot in general distinguish these cases.
  • Asynchrony. There is no upper bound on message delivery time in an asynchronous network. A “slow” message is indistinguishable from a “lost” message until you wait long enough, but waiting long enough is itself unbounded. Synchronous models (bounded message delay and processing time) make consensus tractable but are unrealistic; asynchronous models are realistic but provably limit what is solvable (see FLP, section 2).
  • No shared clock. Physical clocks on different nodes drift at different rates (typical crystal oscillator drift ~10 ppm, or roughly 1 second per day uncorrected). NTP synchronization narrows this to tens of milliseconds in good conditions but can be far worse under load or with adversarial paths. Without a shared notion of “now,” concepts like “which write happened first” require careful definition (see section 6 on clocks).
  • Network unreliability. Packets are dropped, delayed, duplicated, and reordered. TCP papers over some of this but cannot eliminate it — TCP connections themselves time out and reset. At the application layer, partitions (a subset of nodes unable to communicate with another subset for some duration) are a routine event. Cloudflare’s 2022 post-mortems show that even within a single datacenter, intra-rack partitions of 30–120 seconds occur weekly.

Core vocabulary used throughout this note:

  • Node. An independent process participating in the system. Often used interchangeably with “server,” “replica,” or “machine,” though strictly a node is a logical participant — one machine can host multiple nodes.
  • Message. The unit of inter-node communication. Messages may be lost, delayed, duplicated, or reordered by the network. Messages are not corrupted in well-engineered systems (TCP checksums + TLS MACs make corruption astronomically improbable, leaving Byzantine concerns to malicious actors rather than benign noise).
  • Replica. A node holding a copy of some piece of state. Replication provides durability (survive node loss) and availability (continue serving requests from surviving replicas).
  • Partition. Two meanings, both used. (1) A network partition is a temporary inability for some subset of nodes to communicate with another subset. (2) A data partition (or shard) is a horizontal slice of the dataset assigned to a particular node or replica group. Context disambiguates; this note uses “partition” for network unless qualified.
  • Fault vs failure. A fault is a deviation from correct behavior (a crashed process, a dropped message). A failure is an externally observable incorrect outcome (a request returning the wrong answer, a timeout where there should be a response). Distributed systems tolerate faults by masking them so they do not become failures.
  • Quorum. A subset of replicas large enough that any two such subsets intersect. With N replicas, a majority quorum is ⌈(N+1)/2⌉. Quorums underpin most consistency mechanisms (see section 7).
  • Coordinator. A node (often the one the client first contacts) responsible for orchestrating a request across replicas — fanning out reads/writes, collecting quorum responses, surfacing the final answer. In leaderless systems the coordinator can be any node; in leader-based systems it is the partition leader.
  • Liveness vs safety. Two orthogonal correctness properties. Safety is “nothing bad ever happens” — invariants are never violated (e.g., no two values for the same committed key). Liveness is “something good eventually happens” — progress is made (e.g., every issued request eventually completes). FLP (section 2) says one cannot have both in fully asynchronous models with crashes; production systems compromise on liveness during pathological conditions but never on safety.
  • Idempotence. An operation is idempotent if applying it more than once has the same effect as applying it once. Critical at the distributed layer because retries are unavoidable — the network can lose either request or response, the caller cannot distinguish, so the only safe retry strategy is idempotent operations or exactly-once protocols layered on top of at-least-once delivery.

2. First principles

The behavior of distributed systems is constrained by a handful of foundational theorems. Knowing them is not optional — every system you design or operate is some point in the trade-off space they define.

FLP impossibility (Fischer, Lynch, Paterson 1985)

In the canonical asynchronous model — no bound on message delay, no bound on process speed, processes can crash but not lie — there is no deterministic protocol that solves consensus in the presence of even a single crash failure. The paper “Impossibility of Distributed Consensus with One Faulty Process,” published in Journal of the ACM 32(2), proves this by adversary argument: the scheduler can always delay messages just enough to keep the system in a “bivalent” state (one where both decisions remain possible) indefinitely.

Consensus here means a protocol satisfying three properties — agreement (no two correct processes decide differently), validity (the decision is some input value, not invented), and termination (every correct process eventually decides). FLP shows that termination cannot be guaranteed in a purely asynchronous model with crashes; the other two are achievable, but a protocol may run forever during adversarial scheduling.

The practical reading is not “consensus is impossible.” Real systems achieve consensus all the time. The impossibility is escaped by relaxing one of the assumptions:

  • Add randomness. Ben-Or 1983 (“Another Advantage of Free Choice”) and Rabin 1983 give randomized protocols that terminate with probability 1 (expected termination in O(2^n) rounds for Ben-Or, O(1) for Rabin under coin-flip primitives). Used in modern BFT systems (HoneyBadger, AlgoRand).
  • Add timing assumptions. Partial synchrony (Dwork, Lynch, Stockmeyer 1988, “Consensus in the Presence of Partial Synchrony,” JACM 35(2)) — the network is asynchronous for some unknown initial period, then becomes synchronous (bounded delays) — is sufficient. This is the model Raft, Paxos, Zab, and essentially every production consensus protocol assumes. The Global Stabilization Time (GST) is the unknown moment after which the network behaves; protocols guarantee progress only after GST.
  • Add failure detectors. Chandra + Toueg 1996 (“Unreliable Failure Detectors for Reliable Distributed Systems,” JACM 43(2)) show that an “eventually strong” failure detector (◇S) is the weakest oracle sufficient to solve consensus. This is what heartbeat-and-timeout mechanisms approximate in practice.
  • Accept non-termination as a possibility. Paxos can fail to terminate during contention (dueling proposers) but always preserves safety. Production deployments use leader election to reduce contention to near-zero in the common case.

CAP theorem

Conjectured by Eric Brewer at the 2000 PODC keynote and proven by Gilbert + Lynch in ACM SIGACT News 33(2) 2002. The statement: in the presence of a network Partition, a distributed system must choose between Consistency (every read sees the most recent write, formally linearizability) and Availability (every request receives a non-error response).

CAP is widely misquoted as “pick two of three.” The accurate reading: partitions will happen — they are a property of the physical network, not a design choice — so the real choice is C-vs-A during a partition. A system that is “CA” is one that doesn’t survive partition at all; this is essentially every single-datacenter relational database, where a partition that splits the cluster causes a panic / shutdown rather than divergent answers.

Brewer’s own 2012 retrospective in IEEE Computer 45(2) cautions that CAP’s binary framing obscures the real engineering: most systems are C-leaning or A-leaning, with graceful degradation and partial availability under partition. The real-world choice is rarely all-or-nothing — a partitioned system may serve cached reads, reject writes, or fall back to a regional replica, none of which are cleanly “C” or “A” in the strict CAP sense.

A subtler point that CAP elides: the “C” in CAP is specifically linearizability, the strongest single-object consistency. Weaker consistency models (causal, session, eventual) can be both consistent-in-their-sense and available during partition. Mahajan, Alvisi, Dahlin 2011 prove that causal consistency is the strongest model achievable with availability under arbitrary partition — a far more useful frontier than “consistent or available.”

PACELC (Abadi 2012)

Daniel Abadi’s extension, published in IEEE Computer 45(2): if Partition, A vs C; else, Latency vs Consistency. PACELC captures what CAP misses — that even in the steady state (no partition), strongly consistent reads cost latency. Cross-region linearizable writes require coordination among geographically distant replicas; the speed of light alone bounds this at ~70 ms round-trip US-east-to-EU. Eventually consistent systems trade that latency for staleness.

PACELC classifications: Spanner is PC/EC (consistent during partition, consistent at cost of latency otherwise). Dynamo is PA/EL (available during partition, low-latency at cost of consistency otherwise). MongoDB (default config) is PA/EC. Cassandra is PA/EL. CockroachDB is PC/EC. DynamoDB is PA/EL by default with optional PC/EC via strong reads.

The L-vs-C trade-off in the non-partitioned case is often the more interesting one in practice — partitions are rare, but every read in a globally replicated system pays the speed-of-light tax for strong consistency. A linearizable read in Spanner crossing US-EU requires at least one ~70 ms round trip; the same read in DynamoDB eventually consistent mode is local (~1 ms). Across millions of requests per second, that’s the difference between “feasible” and “not feasible” for many workloads.

End-to-end argument (Saltzer, Reed, Clark 1984)

Published in ACM TOCS 2(4). The argument: a function should be implemented at the endpoints of a communication system rather than in intermediate stages, because only the endpoints have the information needed to do it correctly. Reliable delivery, for instance, requires application-level acknowledgment — the network layer can retransmit lost packets but cannot know whether the application has processed them. A common corollary in distributed systems: do not trust the network or any single intermediate node to enforce correctness properties that the application can verify itself.

Two Generals’ Problem

A classic thought experiment (Akkoyunlu, Ekanadham, Huber 1975 in slightly different form; popularized by Gray 1978). Two generals must agree on an attack time, communicating only via messengers that may be captured. No finite exchange of messages suffices to guarantee both agree. The lesson: deterministic agreement over an unreliable channel is impossible. All real protocols (TCP three-way handshake, two-phase commit) are best-effort — they reduce the probability of disagreement but cannot eliminate it, and at some level the application must tolerate the residual uncertainty.

The practical implication for application architects: assume that for every external call, you might not know whether it succeeded. Design APIs so that “did this happen?” is answerable — give every operation a server-generated ID that can be queried later. Pay particular attention to side effects (payments, emails, shipments) where double-execution is costly. Either make the operation idempotent on the server (the right answer) or quarantine it behind a saga / outbox pattern that ensures at-most-once via a durable record of completion.

The Byzantine Generals Problem (Lamport, Shostak, Pease 1982) extends this to faulty generals rather than faulty messengers, and is the foundation of Byzantine fault tolerance. With f malicious generals and n total, agreement is achievable iff n ≥ 3f+1 (in the worst case, with no signatures); with cryptographic signatures, n ≥ 2f+1 suffices. See [[Compute/consensus-protocols]].

3. Consistency models

Consistency models specify which interleavings of reads and writes a system permits. They form a partial order from strongest (most restrictive, most expensive) to weakest (most permissive, cheapest). See [[Compute/database-internals]] for transactional isolation levels, which are the database-side counterpart.

Linearizability (Herlihy + Wing 1990)

The strongest single-object consistency model. Published in ACM TOPLAS 12(3). An execution is linearizable if there exists a total order of operations such that: (1) each operation appears to take effect instantaneously at some point between its invocation and its response, and (2) the order respects real-time precedence — if op A completed before op B began (wall-clock), then A precedes B in the order.

Linearizability composes: any composition of linearizable objects is linearizable. This is the property that makes it the gold standard — you can reason about a linearizable distributed register essentially as if it were a single non-distributed atomic register. The cost is high — every linearizable operation requires coordination with a quorum, and cross-datacenter linearizability is bounded below by speed-of-light round-trips (Attiya + Welch 1994 lower bound: 2× one-way network delay for a read, when interleaved with writes).

Linearizable systems must handle reads carefully. A naive “read from the leader” is not linearizable if the leader has been deposed but doesn’t know it yet — a stale leader serving reads from old state will violate real-time order. Production implementations either round-trip a quorum on reads (Raft default), use leases (etcd lease-reads — leader holds a time-bounded lease and can serve reads locally only within it), or use read indexes (Raft optimization — leader confirms it’s still leader via a single round of heartbeats before serving a batch of reads).

Sequential consistency (Lamport 1979)

Published in IEEE Transactions on Computers C-28(9). Operations appear in some total order, and that order is consistent with each process’s program order. Critically, sequential consistency does not require the order to respect real-time — operations from different processes can be reordered freely so long as each process’s own operations remain in order.

Sequential consistency is the memory model underlying many older multiprocessor textbooks but is rarely the target of distributed systems, because it allows globally observable reordering that is surprising in practice. Most production systems either go stronger (linearizable) or weaker (causal+).

A concrete example of the gap: under sequential consistency, process A writes x=1 then writes y=1; process B reads y=1 then reads x=0. This is forbidden by linearizability (real-time order from A says x=1 happened before y=1, so any read of y=1 must follow a read of x=1) but permitted by sequential consistency (the total order [A’s program order, B’s program order] could place B’s reads anywhere consistent with B’s own order). For most user-facing systems this anomaly is unacceptable, hence the dominance of linearizability or eventual+session-guarantees.

Causal consistency

Operations that are causally related (Lamport’s happens-before, 1978) must be seen in causal order by all processes. Operations not causally related can be observed in any order. Causal consistency is the strongest model that remains available during network partitions (Mahajan, Alvisi, Dahlin 2011, “Consistency, Availability, and Convergence”). It is therefore a popular target for geo-distributed systems that want stronger-than-eventual guarantees.

Implementation: track causal dependencies via vector clocks or version vectors; deliver each write only after its dependencies are visible at the receiver. COPS (Lloyd et al. SOSP 2011) and Eiger (Lloyd et al. NSDI 2013) are research systems demonstrating practical causal+ at scale. Eiger extends COPS with column-family operations and read-only transactions, narrowing the gap between causal consistency and snapshot isolation.

The practical adoption of causal consistency outside research is modest — most production systems either accept eventual consistency (Dynamo, Cassandra) or pay for linearizability (Spanner, CockroachDB). Causal consistency’s middle ground is theoretically attractive but the bookkeeping overhead (per-write metadata, dependency checks) eats much of the latency benefit it promises. AntidoteDB (SyncFree project, 2014–) is the most complete causal+CRDT production system.

Eventual consistency

The minimum useful guarantee: if no new writes are made to an object, all replicas eventually return the same value. “Eventually” is unquantified — could be milliseconds, could be hours. Eventual consistency is what Dynamo (Amazon 2007) and Cassandra provide by default. To make it useful in application code, systems pair it with conflict resolution:

  • Last-Writer-Wins (LWW). Compare timestamps; latest wins. Simple, but discards concurrent writes silently. Requires synchronized clocks; even so, two writes with identical timestamps need a tiebreaker (typically replica ID). LWW is the default in Cassandra, DynamoDB, and many key-value stores; the silent-loss property is a footgun unless writes are idempotent or rare.
  • Application-level merge. Return all concurrent values (siblings) to the client; let the application decide. Used by Dynamo’s shopping cart — merge two concurrent cart versions by union of items. Effective when the merge logic is natural (set union, max counter); painful for arbitrary application state.
  • CRDTs (Conflict-free Replicated Data Types). Shapiro, Preguiça, Baquero, Zawirski INRIA 2011. Data types with merge functions that are commutative, associative, and idempotent — concurrent updates can be merged in any order and produce the same result. Two flavors:
    • State-based (CvRDT): replicas exchange full state; merge is element-wise join in a join-semilattice. Simple but bandwidth-heavy.
    • Operation-based (CmRDT): replicas exchange operations; operations must be commutative; relies on causal-broadcast delivery.
    • Examples: G-counter (grow-only counter), PN-counter (positive-negative), G-set (grow-only set), OR-set (observed-remove set, handles add+remove), LWW-register, MV-register (multi-value), RGA sequence (replicated growable array). Used in Riak (built-in counters/sets/maps), Redis CRDT modules, Automerge, Yjs (collaborative editing), AntidoteDB.
  • Operational transformation (OT). Predecessor to CRDTs for collaborative editing (Ellis + Gibbs 1989, Google Wave 2009). Transform concurrent operations against each other so each replica converges. Famously hard to get right in the general case — Wave’s complexity contributed to its discontinuation. Modern collaborative editors (Figma, Notion) lean CRDT.

Session guarantees (Terry et al. 1994, Bayou)

Weaker than causal but stronger than eventual; framed from a single client’s perspective:

  • Read-your-writes. A client always sees its own past writes. Implemented via session affinity (sticky routing) or by attaching a write-version vector to the client session.
  • Monotonic reads. Successive reads from the same client never go backward in time.
  • Monotonic writes. Writes from the same client are applied in order.
  • Writes-follow-reads. If a client reads value v then writes v’, the write is ordered after the read at all replicas.

These are inexpensive to provide and dramatically reduce user-visible anomalies compared to bare eventual consistency. Most production “eventually consistent” systems offer them as a default or as a tunable. MongoDB’s “causal consistency” mode (since 3.6) is essentially read-your-writes + monotonic-reads via cluster-time tokens passed in session metadata. DynamoDB provides read-your-writes via “strongly consistent reads” on the primary node; eventual reads may not.

Snapshot isolation + serializability

Database-level. Snapshot isolation (Berenson et al. 1995, “A Critique of ANSI SQL Isolation Levels,” SIGMOD) gives each transaction a consistent snapshot of the database at its start time; writes don’t conflict unless they touch the same key. Vulnerable to write-skew anomaly — two transactions read overlapping data, write disjoint data based on what they read, and the combined effect violates a constraint that neither violates alone.

Serializability (Eswaran, Gray, Lorie, Traiger 1976, “The Notions of Consistency and Predicate Locks in a Database System,” CACM) is the gold-standard transactional model — outcomes equivalent to some serial execution of the same transactions. Strict serializability = serializable + linearizable across transactions, i.e., the equivalent serial order respects real-time. Spanner provides strict serializability globally; most systems offer serializability within a shard. Implementations: 2PL (two-phase locking, classic but contention-prone), SSI (Serializable Snapshot Isolation, Cahill, Röhm, Fekete 2008, used by PostgreSQL), OCC (optimistic concurrency control, used by FoundationDB, CockroachDB transactions in the optimistic case). Treated in depth in [[Compute/database-internals]].

4. Replication strategies

Replication is keeping the same data on multiple nodes, for durability (survive disk/node loss) and availability (serve reads when some replicas are down) and throughput (parallel reads). Every replicated system answers two questions: who handles writes, and how do replicas converge.

Single-leader (primary-secondary, master-replica)

One designated leader receives all writes; followers replicate from the leader. The simplest and most common model — used by PostgreSQL streaming replication, MySQL binlog replication, MongoDB (with replica sets), Redis Sentinel, Kafka per-partition. The leader is the single point of write serialization, which provides linearizability for free (with proper read handling) but caps write throughput at one machine’s capacity.

Replication lag. Followers replay the leader’s log; under load or large transactions, followers fall behind. Lag is the most common operational pain in primary-secondary systems: applications that read from followers (to offload the primary) see stale data; failover after a leader crash may require waiting for followers to catch up. Monitoring replication lag is foundational ops hygiene; alerting thresholds vary from seconds (OLTP) to minutes (analytics replicas).

Synchronous vs asynchronous replication. Sync: leader waits for at least one follower to acknowledge before responding to client. Strong durability but a stuck follower stalls writes. Async: leader responds immediately; followers catch up in the background. Lower latency but a leader crash can lose recent writes. Semi-synchronous (one sync follower + N async) is a common compromise — MySQL semi-sync, PostgreSQL synchronous_standby_names with majority quorum.

Failover. When the leader fails, a new leader must be elected. This is where single-leader systems get genuinely hard: detecting failure correctly (heartbeats can be false-positive under load), choosing a successor (which follower has the most recent state?), and fencing the old leader (preventing split-brain). Many production outages trace to bad failover — GitHub’s October 2018 24-hour outage was triggered by a 43-second network partition causing a MySQL failover with state divergence.

Manual failover is the conservative choice for systems with weak built-in consensus (PostgreSQL, MySQL without Group Replication, Redis without Sentinel). Operators see the alert, verify the leader is genuinely down (not just slow), and promote a follower. Slow but safe. Automatic failover — driven by an external orchestrator (Patroni for Postgres, Orchestrator for MySQL, Sentinel for Redis) — is faster but a major outage source. The orchestrator itself must be highly available, must have a consistent view of cluster state, and must fence the old leader (typically via STONITH or storage-level fencing). The hard cases: orchestrator partitioned from cluster but cluster is healthy; orchestrator healthy but cluster brain-split; cascading failover storms when an orchestrator overreacts to transient latency.

Multi-leader

Multiple nodes accept writes; replicate to each other. Used in geo-distributed deployments where each region needs local-write latency (CouchDB, MySQL Group Replication, BDR for Postgres, Galera Cluster). Writes apply locally first then propagate; conflicts are inevitable when the same key is written in two regions concurrently. Resolution: LWW (lose data), application-level merge (complex), CRDTs (constrained schema). Multi-leader is rare in modern green-field systems — most teams prefer single-leader per region with cross-region async replication and explicit conflict avoidance via partitioning (each region owns a subset of keys).

The “calendar of conflicts” pattern: even in multi-leader, you can often arrange that the common case is conflict-free. E.g., a user’s data is written primarily in their home region; only when they travel does another region accept writes. By the time replication catches up, conflict probability is near zero. This is the design ethos of Apple iCloud’s data layer and many SaaS multi-region deployments.

Leaderless (Dynamo-style)

Coined by Amazon’s Dynamo paper (DeCandia et al. SOSP 2007). Clients (or a coordinator) write to N replicas; require W acknowledgments to consider the write durable. Reads query N replicas, accept R responses, return the most-recent (by version vector or timestamp). The invariant R + W > N guarantees read-after-write within a single replica set — every read overlaps every write at at least one node.

Common configurations with N=3: W=R=2 (quorum, default Cassandra), W=3 R=1 (read-optimized), W=1 R=3 (write-optimized). The quorum is not strict — sloppy quorums (Dynamo) allow temporary nodes to accept writes when the canonical replicas are unreachable, with hinted handoff to deliver later.

Leaderless systems trade leader-election complexity for write-time complexity. There is no single point of failure for writes; any replica can accept any write. But the coordination work — finding R or W reachable replicas, resolving concurrent versions, propagating updates — happens on the read/write path rather than during failover. Tail latency in leaderless systems is sensitive to slow replicas; hedging (parallel requests to extra replicas) is common to bound it.

State machine replication (SMR)

A foundational pattern (Schneider, ACM Computing Surveys 1990, “Implementing Fault-Tolerant Services Using the State Machine Approach”). Treat the application as a deterministic state machine; replicate the input log; each replica applies inputs in the same order and arrives at the same state. The hard part is agreeing on the log order — that’s the consensus problem (Raft, Paxos, Zab, Viewstamped Replication). Covered in detail in [[Compute/consensus-protocols]].

SMR is the foundation of every strongly-consistent system: etcd, ZooKeeper, Consul, CockroachDB, TiDB, Spanner all replicate via SMR over consensus. The replicated log is the source of truth; the state is a deterministic projection.

Determinism is non-trivial. Anything that varies per-replica — time.Now(), random numbers, iteration order over hash maps, floating-point on heterogeneous CPUs, external API calls — breaks SMR. Production state machines either capture these as inputs in the log (the leader generates the timestamp/random, replicates it) or quarantine non-determinism outside the SMR core. ZooKeeper’s design notes (Hunt et al. 2010) and Raft thesis (Ongaro 2014) both discuss this.

5. Partitioning (sharding)

Partitioning splits the dataset across nodes for horizontal scalability. Each piece of data has a primary partition; replicas of that partition live on a subset of nodes.

Partitioning schemes

  • Range partitioning. Keys are assigned to ranges (e.g., [a..f], [g..m], …). Each range goes to a node. Used by HBase, Bigtable, CockroachDB, MongoDB (shardKey).
    • Pros: range scans are efficient (one shard for sequential keys); easy to reason about; works well with composite keys where the high-cardinality prefix is the partition column.
    • Cons: hot spots when access is skewed — appending sequential keys (timestamps, autoincrement IDs) hits one shard. Mitigation: bucketize timestamps or reverse them, prepend hash prefixes, or use a uuid-suffix scheme.
  • Hash partitioning. Apply a hash to the key; assign by hash range or modulo. Distributes uniformly even for skewed key distributions but destroys range-scan locality. Used by Cassandra (Murmur3 hash), Dynamo, Riak.
    • Pros: even load distribution by default; insensitive to key skew.
    • Cons: range queries become scatter-gather across all shards; cannot efficiently scan “all keys with prefix X.”
  • Consistent hashing. Karger et al. 1997 (“Consistent Hashing and Random Trees,” STOC). Hash nodes and keys onto the same ring; a key is owned by the next node clockwise. Adding or removing a node moves only 1/N of keys, vs full reshuffling with naive modulo. Refined with virtual nodes (vnodes) to balance load across heterogeneous nodes — Cassandra defaults to 256 vnodes per physical node. Used by Dynamo, Riak, Cassandra, Memcached clients (libketama).
    • Rendezvous hashing (Thaler + Ravishankar 1996) is an alternative with similar properties but simpler implementation — compute hash(key, node) for each node, route to the node with the maximum.
  • Lookup service / directory-based. A separate metadata service maps key (or shard) → node. Used by Vitess (topology service in etcd/ZK/Consul), Citus (coordinator), MongoDB config servers, HDFS NameNode. Decouples shard placement from key hash but adds an indirection and a metadata service to operate.
    • The metadata service must itself be highly available (typically via consensus) and cached aggressively by clients. The cache invalidation problem during rebalancing is non-trivial — clients with stale routing info send requests to the wrong nodes. Typical mitigation: nodes that receive a misrouted request respond with a “not your shard” error including the new owner, prompting client cache refresh.
    • This indirection layer is also the natural point for cross-cutting concerns: access control, audit logging, rate limiting per shard. Many lookup services evolve into proxy gateways over time.

Rebalancing

When nodes are added, removed, or fail, partitions must move. Strategies:

  • Fixed number of partitions. Cassandra-style — partitions are created at cluster setup (e.g., 1024) and assigned to nodes. Adding a node steals a slice of partitions from existing nodes. Simple, predictable; the partition count is an upper bound on horizontal scalability.
  • Dynamic partitioning. HBase, Bigtable — partitions split when they exceed a size threshold (e.g., 10 GB) and merge when small. Adapts to skew but introduces split/merge events that affect latency; splitting a hot partition during peak load is risky.
  • Proportional to nodes. Some systems (Couchbase) keep partition count proportional to node count; adding a node creates partitions. Trades implementation complexity for elasticity.

Rebalancing is expensive — it involves copying data over the network, often gigabytes per partition. Operators almost always want manual control over when rebalancing happens, especially in production: automatic rebalancing on a transient node loss is the wrong default because it triggers data movement just when capacity is reduced. Cassandra defaults to manual nodetool repair/bootstrap; HBase auto-balances but with throttling.

Hot-spot mitigation

A single hot key — e.g., a celebrity’s user ID — can overwhelm the shard owning it regardless of partition strategy. Mitigations:

  • Salting / write-fanning. Append a random suffix to the partition key (e.g., user_id:k for k in 0..N-1) so writes scatter across multiple shards. Reads fan out and aggregate. Trades write contention for read amplification.
  • Caching. Aggressive caching at the read path (CDN, Redis, application memory) keeps the hot key from hitting the storage tier. Works only if the read pattern dominates writes; combine with write-through or invalidation.
  • Special-case routing. Manually pin known-hot keys to dedicated nodes with extra capacity. Operationally messy but pragmatic — Twitter’s “Manhattan” famously did this for celebrities.
  • Async aggregation. For pure counter-style hot keys (like-counts, view-counts), maintain per-shard counters and aggregate periodically rather than on every read. Trades real-time accuracy for write scalability.

Secondary indexes

A primary partition is on one key (the partition key). Queries on other attributes need secondary indexes, which interact awkwardly with partitioning:

  • Local (document-partitioned) indexes. Each partition maintains its own index over its own data. Writes are local and fast. Reads must scatter-gather across all partitions and merge results — expensive at high partition counts. Used by MongoDB, Cassandra secondary indexes, Riak.
  • Global (term-partitioned) indexes. The index itself is partitioned (by indexed attribute). Reads hit one shard. Writes require updating a possibly remote index shard — distributed transaction or async update. Used by DynamoDB GSI (eventually consistent), Spanner global indexes (transactional).
  • Hybrid. Some systems combine: local for high-cardinality attributes (where scatter cost is amortized), global for low-cardinality lookups. Elasticsearch operates this way conceptually — each shard has its own inverted index, but cross-shard search merges scores at query time.

Cross-partition transactions

Required when one logical operation touches multiple shards. Options: two-phase commit (2PC, blocking and notoriously fragile under failure), Percolator-style (Google 2010) with a centralized timestamp oracle, or avoid by design — most “NoSQL” systems force application-level coordination instead. Spanner and CockroachDB do distributed transactions using 2PC over Raft groups with TrueTime/HLC for ordering.

The 2PC protocol (Gray 1978) is: in phase 1, the coordinator asks all participants to prepare (vote yes/no); if all vote yes, in phase 2 the coordinator tells everyone to commit, otherwise to abort. 2PC is blocking — if the coordinator crashes after a prepare-yes vote but before sending commit/abort, participants are stuck holding locks until the coordinator recovers, since they cannot unilaterally decide without risking divergence. Three-phase commit (3PC) addresses this in synchronous networks but is rarely used in practice because real networks aren’t synchronous. Modern systems (Spanner, CockroachDB) replicate the coordinator’s state via consensus so a coordinator failure can be recovered by electing a new coordinator that sees the same prepared state.

6. Failure detection + clocks

Two intertwined fundamentals: how do we tell time across machines, and how do we tell whether a machine is alive.

Physical clocks

Every machine has a hardware clock — typically a crystal oscillator driving a counter. Crystal oscillators drift; nominal accuracy is ~10–100 ppm, so unsynchronized clocks diverge by seconds per day. Linux exposes two clocks per process: CLOCK_REALTIME (wall clock, can step backward via NTP adjustment) and CLOCK_MONOTONIC (monotonic, never goes backward but does not represent wall time). For measuring elapsed time, always use monotonic; for logging timestamps, use realtime; never compare realtime values from different machines as a primary ordering mechanism.

Synchronization protocols:

  • NTP (Network Time Protocol, RFC 5905). Synchronizes machine clocks to a hierarchy of reference clocks (stratum 0 = GPS/atomic, stratum 1 = directly attached, etc.). Achievable accuracy in datacenter conditions: 1–10 ms, but degrades to 50–100 ms or worse under load, congestion, or virtualization. NTP can also step the clock backward — a recipe for monotonic-counter bugs.
  • PTP (Precision Time Protocol, IEEE 1588). Hardware-assisted, microsecond-accurate (~1 µs) within a LAN with PTP-capable switches. Used in HFT, telecom, and now hyperscale datacenters.
  • TrueTime (Google Spanner). Combination of GPS + atomic clocks at every datacenter, plus a TT.now() API that returns an interval [earliest, latest] representing the bounded uncertainty (typically 7 ms, kept under 10 ms via aggressive bounding). Spanner uses TrueTime to wait out the uncertainty interval and provide externally consistent (linearizable across the globe) transactions. Described in Corbett et al. OSDI 2012.

The general rule: do not trust physical clocks for ordering events. Use them for timeouts, expiration, and log rate-limiting. For ordering, use logical clocks.

Logical clocks

  • Lamport timestamps (Lamport 1978, “Time, Clocks, and the Ordering of Events in a Distributed System,” CACM 21(7)). Each node maintains a counter; increment on every local event; on send, attach the counter; on receive, set counter = max(local, received) + 1. Lamport timestamps give a total order consistent with causality but cannot detect concurrency — two events with different timestamps may have been concurrent.
  • Vector clocks (Fidge 1988, Mattern 1989). Each node maintains a vector of counters, one per node. Increment own counter on local events; on send, attach vector; on receive, element-wise max + increment own. Two events are causally ordered iff one vector dominates the other componentwise; otherwise concurrent. Vector clocks faithfully capture happens-before but scale linearly with cluster size — used in Dynamo, Riak (sometimes with truncation/pruning).
  • Hybrid logical clocks (HLC; Kulkarni, Demirbas, Madappa, Avva, Leone 2014). Combine physical-clock-like wall time with a logical counter. Each node tracks (pt, l, c) where pt is physical time, l is the latest seen logical wall-time, c is a counter for events sharing the same l. Bounds the divergence between HLC and physical time by clock skew while preserving causal ordering. Used by CockroachDB, MongoDB (since 3.6), YugabyteDB.

Failure detectors

The fundamental challenge: in an asynchronous network, you can never definitively say “node X is down” — only “I haven’t heard from X in time T.” Algorithms:

  • Heartbeats. Each node periodically pings (or is pinged); after K missed heartbeats, declare suspect. Simple, used everywhere. Tuning is hard — too aggressive triggers false positives during GC pauses or load spikes; too lenient delays detection.
  • Phi-accrual failure detector (Hayashibara, Defago, Yared, Katayama 2004). Instead of a binary alive/dead, output a continuous suspicion level (phi) based on the distribution of inter-arrival times. Applications choose their own threshold. Used by Akka cluster, Cassandra. More robust to network jitter than fixed-timeout heartbeats.
  • SWIM (Das, Gupta, Motivala 2002). Scalable membership protocol: nodes ping a random peer per round; on no response, ask K other nodes to indirectly probe. Catches transient losses. Used by Consul, HashiCorp Serf, Memberlist.
  • Gossip-based failure detection. Each node periodically gossips its view of cluster membership with random peers; failures propagate via convergence. Used by Cassandra (combined with phi-accrual), Riak.

A subtle but important point: a failure detector cannot distinguish a crashed node from a slow node or a partitioned-but-alive node. The detector’s job is to convert “I haven’t heard from X” into a binary signal, but that signal is always provisional. Robust systems treat failure detection as advisory and rely on fencing (section 6, partition handling) to ensure that even an incorrect declaration of failure does not lead to incorrect outcomes.

Network partition handling

When the failure detector says “I lost contact with half the cluster,” what happens? Options:

  • Split-brain prevention via quorum. Only the partition containing a majority can make progress. The minority side either rejects writes (Raft, Paxos) or runs read-only.
  • Fencing. When a new leader is elected, issue a fencing token (monotonically increasing) that must be presented to shared resources (storage, locks). Stale leaders’ tokens are rejected. Critical to prevent “zombie leader” data corruption — Burrows’ Chubby paper (Google 2006) covers this.
  • Lease-based leadership. Leader holds a time-bounded lease; must renew before expiry. If leader gets partitioned, it loses leadership when the lease expires. Vulnerable to clock skew between leader and lease grantors — typically mitigated by leaving a safety margin (lease duration > 2× max clock skew) and using bounded-skew clocks. ZooKeeper, etcd, Chubby all use leases.

Health checks: liveness vs readiness

A subtle distinction codified by Kubernetes but predating it: a liveness probe answers “is this process still alive (should I restart it)?”; a readiness probe answers “is this process ready to serve traffic (should I send requests to it)?“. Conflating them causes subtle outages — e.g., a starting-up node fails liveness, gets restarted, never finishes starting. Liveness probes should be deeply conservative (only fail if the process is truly stuck); readiness probes can be more aggressive (drain on slowness, brief blip is fine because traffic just routes elsewhere). Many cascading outages have been traced to over-aggressive liveness probes restarting healthy nodes during transient load.

7. Common architectural patterns

A small set of patterns recur across distributed systems.

Quorum systems

A read-write quorum system on N replicas chooses W (write quorum size) and R (read quorum size) such that R + W > N (every read intersects every write) and W > N/2 (writes are linearizable across themselves). For N=3, the canonical config is W=R=2 — survive one failure with consistency. For N=5, W=R=3 gives two-failure tolerance.

Quorums don’t by themselves give linearizability — Dynamo-style quorum reads can still return stale values during concurrent writes. True linearizability requires either single-leader serialization or full consensus per operation (ABD register, Attiya-Bar-Noy-Dolev 1995).

Vector clocks for causality

Dynamo (and Riak) attach a vector clock to each value. Concurrent writes (incomparable vector clocks) produce siblings — multiple values stored under one key, returned together to the client for merge. Application is responsible for merging. Works well for commutative data (shopping carts, counters), painful for arbitrary blobs.

Read repair + anti-entropy + Merkle trees

Mechanisms to bring divergent replicas back into sync.

  • Read repair. When a read returns mismatched values from R replicas, the coordinator writes the latest back to stale replicas synchronously (foreground — blocks the read until repaired) or asynchronously (background — read returns, repair happens later). Foreground repair improves consistency for the next read; background lowers tail latency.
  • Hinted handoff. When a replica is unreachable, the coordinator stores a “hint” — a pending write to deliver when the replica returns. Bounded duration (typically 3 hours in Cassandra) to prevent indefinite memory growth. After expiry, anti-entropy is the recovery path.
  • Anti-entropy via Merkle trees. Replicas periodically build Merkle trees (hash trees, Merkle 1979) over their data and compare roots with peers; if roots match, replicas are consistent; if not, recurse into mismatched subtrees to find the divergent keys. Originally Dynamo / Cassandra; reused in DynamoDB, Riak, ZFS, Git, blockchain. Bandwidth scales with the number of divergences, not total dataset size — efficient when replicas are mostly consistent.

Sagas (Garcia-Molina + Salem 1987)

A long-running transaction modeled as a sequence of local transactions T1, T2, …, Tn, each with a compensating action C1, …, Cn. If Tk fails, compensate in reverse: Ck-1, Ck-2, …, C1. Sagas trade atomicity for liveness — they don’t hold locks across long durations but are not strictly atomic (an observer between steps sees a partial state). Used widely in microservices for cross-service business transactions (book flight → reserve hotel → charge card; if charge fails, cancel hotel, cancel flight).

Two coordination styles: orchestration (a central saga coordinator drives the steps) vs choreography (each service emits events that trigger the next). Orchestration is easier to reason about and debug; choreography decouples services but the system-wide behavior becomes implicit in the event topology. Production saga frameworks: Temporal (workflow engine, descended from Cadence at Uber), AWS Step Functions, Camunda BPMN engines, Netflix Conductor, Apache Airflow (workflow-ish, more batch-oriented).

Sagas have a subtle correctness requirement: compensating actions must be semantically reversible, not just literally reversible. Refunding a charge is a valid compensation for charging; deleting an email after it was sent is not. When pure reversal is impossible, sagas either avoid the operation in question (delay sending until the saga has confirmed all steps will succeed) or escalate to manual intervention.

Event sourcing + CQRS

Event sourcing: persist a log of immutable events as the source of truth; derive current state by replaying events. Pioneered in finance and trading (every trade is an event; positions are derived). Benefits: audit log built-in, time travel debugging, multiple read models from one write model. Costs: schema evolution of events is hard, debugging requires replaying.

CQRS (Command Query Responsibility Segregation; Greg Young, circa 2010): separate the write model (commands → events) from the read model (queries against projected views). Often paired with event sourcing but independent. Allows the read model to be denormalized, optimized for queries, and eventually consistent w.r.t. the write model.

Idempotency keys + exactly-once semantics

True exactly-once delivery is impossible (corollary of Two Generals’). What systems actually provide is effectively exactly-once = at-least-once delivery + idempotent processing. The standard mechanism: every request carries an idempotency key (UUID); the server records the (key → result) mapping in a deduplication store; on retry with the same key, the server returns the cached result without re-executing. Stripe’s idempotency-keys API (2015) popularized this pattern; Kafka exactly-once semantics (KIP-98, 2017) implements it at the broker level with producer IDs and sequence numbers.

The deduplication store’s retention is a tunable. Too short and retries past the window cause duplicate execution; too long and the store grows unboundedly. Typical window: 24 hours for user-facing APIs, hours-to-days for backend systems.

8. Common failures + edge cases

The failure modes that surprise engineers (and cause real outages).

Network partitions

The worst category because they are silent and ambiguous. From inside a partitioned node, a partition is indistinguishable from “the other side is slow” or “the other side is crashed.” Famous incidents: GitHub Oct 2018 (43-second partition → 24-hour data divergence repair), AWS US-East 2017 networking event (4 hours of partial connectivity), Cloudflare June 2022 (BGP misconfig partitioning their own network from itself for ~30 min).

Worse, partitions can be asymmetric: A can send to B but B cannot send to A. Or transitive: A↔B and B↔C work but A↔C does not. Most consensus protocols handle symmetric partitions cleanly but degrade or pathologically fail on asymmetric ones — Jepsen tests (Aphyr, Kyle Kingsbury) have exposed many such bugs in production systems.

A network link with 10x normal latency is operationally worse than a failed link. The failure detector hasn’t fired, so the system still tries to use the link, and request latencies spike. Mitigations: hedging (Dean + Barroso, “The Tail at Scale,” CACM 2013) — send the same request to two replicas; use the first response, cancel the slower. Adds load but bounds tail latency.

Byzantine failures

Nodes that lie, corrupt, or behave maliciously. The standard distributed-systems model assumes crash-only (or crash-recovery) failures — nodes either work correctly or stop. Byzantine fault tolerance (BFT) extends to nodes that arbitrarily misbehave; requires 3f+1 replicas to tolerate f failures (Lamport, Shostak, Pease 1982 “The Byzantine Generals Problem”). Production Byzantine systems are rare outside of blockchains, aerospace, and high-assurance contexts (NASA’s SAFEbus, Honeywell’s TTP). Covered separately in [[Compute/consensus-protocols]].

Concurrent writes / write skew

Two transactions read overlapping data, then write disjoint data based on what they read. Under snapshot isolation, both commit — but the combined effect violates a constraint. Classic example: on-call scheduling, where two doctors both check “are at least 2 doctors on call?” (yes), then both go off-call. Serializable isolation prevents write skew; snapshot isolation does not. Cockroach, Spanner, FoundationDB provide serializable by default.

Clock skew + leap seconds

Wall-clock comparisons across nodes are unsafe. Worse, NTP can step clocks backward, breaking monotonicity assumptions. Leap seconds — inserted irregularly by IERS to keep UTC aligned with Earth’s rotation — have caused real outages: 2012 leap second crashed Reddit, LinkedIn, Mozilla, Qantas; 2015 leap second hit Cloudflare. Modern systems (Google, AWS, Facebook since 2020) “smear” leap seconds — distribute the extra second over many hours — to avoid the discontinuity.

GC pauses

Stop-the-world garbage collection can pause a JVM process for seconds. Cassandra historically (CMS GC, pre-2018) saw 5–10 s pauses on heaps over 32 GB. During a pause, the node misses heartbeats, gets marked dead, and on wake-up tries to act as if nothing happened — including holding stale locks or leadership. Mitigations: smaller heaps, low-pause GCs (G1, Shenandoah, ZGC), generous failure-detection timeouts, fencing tokens to invalidate stale actions.

”Stop-the-world” infrastructure outages

Cloud-provider-scale events that affect entire regions or services. Notable:

  • AWS S3 outage Feb 2017. Typo during a routine command removed too many servers from S3 index subsystem in us-east-1; cascading impact for ~4 hours; took down a substantial fraction of the public web.
  • Cloudflare June 2022. BGP route change in 19 datacenters effectively disconnected them from the network; ~30 min outage.
  • Fastly June 2021. Configuration push triggered a latent bug; 49-min global edge outage.
  • Google Cloud Networking June 2019. Capacity-management bug took down GCP networking for ~4 hours.
  • Facebook BGP October 2021. A backbone-config push withdrew BGP routes to FB’s DNS servers; FB was effectively unreachable for 6 hours, including for the engineers trying to fix it (locked out of internal tools).

The lesson recurring across these post-mortems: blast radius matters more than per-component reliability. Multi-region failover, gradual rollouts, and out-of-band access paths exist precisely because no single subsystem is reliable enough.

Metastable failures (Bronson et al. 2021)

Recent work (“Metastable Failures in Distributed Systems,” HotOS 2021) identifies a class of incidents where a system enters a degraded state that persists even after the triggering condition resolves — a positive feedback loop in load. Example: a brief spike causes timeouts; clients retry; retries increase load; load causes more timeouts. Even after the original spike passes, the retry storm sustains the degradation. Mitigations: load shedding, retry budgets, exponential backoff with jitter, circuit breakers (Nygard 2007 Release It!), and admission control that explicitly drops requests rather than serving them slowly. Many of the famous outages cited above had metastable phases — the trigger was brief but recovery took hours because the system couldn’t escape the bad equilibrium without operator intervention.

Gray failures (Huang et al. 2017)

“Gray Failure: The Achilles’ Heel of Cloud-Scale Systems” (HotOS 2017) describes a failure mode where the system is partially working — some clients succeed, others fail; some operations work, others hang — without crashing outright. Gray failures evade simple alive/dead failure detectors. Examples: a degraded NIC dropping 10% of packets; a disk performing reads but failing writes intermittently; a TLS certificate expired on one of N service instances. Detection requires application-level health checks (synthetic transactions, error-rate metrics) rather than just network heartbeats.

Tail latency amplification

Dean + Barroso 2013 (“The Tail at Scale”) show that even when individual servers have rare slow responses, a request that fans out to many servers will almost always hit at least one slow one. If each of 100 backends has 1% chance of >1 s latency, the probability that a request touching all 100 sees at least one slow backend is 63%. P99.9 latency thus becomes the bottleneck for fan-out architectures. Standard mitigations: hedged requests (issue duplicate after p95 timeout), tied requests (cancel the loser), prioritization of straggler operations, and shrinking fan-out width via more local computation.

The math: if each request has independent success probability p, the probability that all N requests are fast is p^N. For p=0.99 and N=100, p^N = 0.366 — so 63.4% of fan-out requests touch at least one slow backend. The tail-amplification problem is why hyperscale systems invest heavily in p99 stability rather than mean-latency optimization. Reducing variance is more valuable than reducing average.

Heisenbugs and once-only repro

Distributed bugs often depend on a specific ordering of messages, processes, and timer fires. They reproduce maybe once per million runs in production but never in a controlled environment. Tools that help: deterministic-replay frameworks (FoundationDB’s simulation, RR debugger), chaos engineering (Netflix Chaos Monkey, Gremlin) to force fault scenarios, formal methods (TLA+, P, Coq) to verify properties at the spec level rather than testing the implementation. The lesson: testing alone cannot find these bugs; the system must be designed to be debugged.

Pat Helland’s observation deserves repeating: in distributed systems, “the past is incomplete and the future is uncertain.” You never have full information about what has happened (network in flight, asynchronous replicas, stale caches) and never have certainty about what will happen (which nodes will fail, which messages will be lost). Designing systems with explicit handling for this incompleteness — rather than pretending it doesn’t exist — is the difference between systems that operate at scale and systems that have surprise outages every few months.

9. Production systems map

A snapshot of how widely-deployed systems sit in this design space.

  • Dynamo (Amazon 2007). DeCandia et al. SOSP 2007. Original leaderless eventual-consistency Key-Value store. Consistent hashing, vector clocks, sloppy quorum, hinted handoff, Merkle anti-entropy. Internal to Amazon; not directly available.
  • Riak (Basho 2009–2017). Open-source Dynamo descendant. Vector-clock-based; CRDT support; multi-datacenter. Company defunct since 2017 but code lives at riak/riak.
  • Cassandra (Facebook → Apache 2008). Lakshman + Malik. Leaderless, tunable consistency (ANY / ONE / QUORUM / ALL / LOCAL_QUORUM / EACH_QUORUM), wide-column data model, gossip + phi-accrual failure detection. Strong at high write throughput; weak at heavy contended updates. Hot at trillion-row scale across hyperscale users (Apple, Netflix, Discord historically).
  • DynamoDB (AWS 2012). Managed evolution of Dynamo: multi-tenant, leaderless under the hood, but exposes strong (single-region linearizable per item) and eventual reads at the API. Per-item transactions added 2018. Global tables (multi-region active-active) since 2017.
  • Spanner (Google 2012). Corbett et al. OSDI 2012. Globally distributed, externally consistent (strictly serializable) database. TrueTime + Paxos groups per tablet + 2PC across tablets. The first system to provide global linearizability at scale.
  • CockroachDB (Cockroach Labs 2014). Spanner-inspired but uses HLC instead of TrueTime (no GPS/atomic hardware). Range-partitioned, Raft per range, 2PC for cross-range. Open-source core + commercial.
  • TiDB / TiKV (PingCAP 2016). Same lineage. Raft groups over range-partitioned data; HLC; Percolator-style transactions.
  • etcd (CoreOS → CNCF 2013). Strongly-consistent KV store; Raft; gRPC API; foundational to Kubernetes (cluster state lives in etcd). Small-data, high-consistency.
  • Consul (HashiCorp 2014). Service mesh + KV; Raft for KV consensus; gossip for service-discovery membership; multi-datacenter via WAN gossip.
  • ZooKeeper (Yahoo → Apache 2008). Hunt et al. USENIX ATC 2010. Zab consensus (Zookeeper Atomic Broadcast — Paxos variant). Hierarchical KV (znodes), watches, ephemeral nodes for liveness. Widely used as coordination service (Kafka pre-KRaft, HBase, Hadoop, Solr).
  • Kafka (LinkedIn → Apache 2011). Distributed log; per-partition leader; replicates via in-sync replica set (ISR); originally relied on ZK for metadata, transitioned to KRaft (self-managed Raft) starting 2.8 (2021), default in 3.3+.
  • MongoDB (10gen → MongoDB Inc 2009). Primary-secondary replica sets; Raft-like leader election since 3.2 (2015) and full Raft-like log replication since 3.6 (2017). Sharding via config-server-managed range partitions.
  • Redis Cluster (Redis 2015). 16384 hash slots distributed across master nodes; per-master async replicas; no strong consistency (acknowledged writes can be lost on master crash). Strong with sentinel-managed single-master for non-clustered Redis.
  • Vitess (YouTube → CNCF 2010). MySQL sharding middleware; range or hash partitioning via VSchema; cross-shard query routing; topology service (etcd/ZK/Consul). YouTube and Slack run substantial fractions of their traffic on Vitess.
  • Citus (Postgres extension; Microsoft 2019). Shards Postgres tables across worker nodes; coordinator routes queries. Distributed transactions via 2PC.
  • YugabyteDB (Yugabyte 2019). Spanner-inspired open-source; HLC; PostgreSQL-compatible SQL layer + Cassandra-compatible NoSQL layer.
  • FoundationDB (Apple 2009, open-sourced 2018). Distributed ordered KV with strict serializability via optimistic concurrency control. A single logical resolver (replicated via Paxos) orders transactions. Storage layer is multiple read-replicas of a transaction log. Used as the metadata layer for Snowflake, iCloud, and others. Notable for an extensive deterministic-simulation test framework that runs the cluster code under a synthetic scheduler against thousands of fault patterns.
  • FaunaDB (2017). Calvin-style deterministic transactions (Thomson + Abadi SIGMOD 2012) — order transactions globally before execution, then execute deterministically on each replica. Eliminates the need for runtime coordination but assumes transaction inputs are known up front.
  • Aurora (AWS 2014). MySQL/Postgres-compatible managed DB; single writer; storage layer is a 6-way replicated log on quorum (4-of-6 write, 3-of-6 read). Compute is stateless; the “the log is the database” design pushes redo-log replication into shared storage. Aurora Multi-Master (2018) added multi-writer but with constrained use cases.
  • BigTable (Google 2006). Chang et al. OSDI 2006. Wide-column store; tablets (range-partitioned); Chubby for coordination; GFS for storage. Predecessor of HBase, Cassandra (data model), Cloud Bigtable.
  • HBase (Apache 2008). Open-source Bigtable clone; HDFS for storage; ZooKeeper for coordination; master-driven region assignment.
  • Apache Pulsar (Yahoo → Apache 2016). Distributed pub-sub; segments stored in BookKeeper (a separate distributed log service with quorum writes). Decoupled compute (brokers) and storage (bookies) — broker failover does not move data.
  • NATS / NATS JetStream (Synadia 2011). Lightweight pub-sub; JetStream layer adds persistence via Raft per stream. Optimized for low-latency edge messaging more than throughput.
  • RabbitMQ (Pivotal 2007). AMQP broker; quorum queues since 3.8 (Raft-based, 2019). Classic mirrored queues are async-replicated and have well-known consistency issues under partition (Jepsen findings, 2014).
  • Tigris (open-source 2021). Built on FoundationDB; offers document-database semantics with strict serializability.
  • ScyllaDB (2015). C++ rewrite of Cassandra by Avi Kivity et al.; same data model and protocol; 10x throughput per node via shared-nothing per-core sharding (Seastar framework).
  • MinIO (2014). S3-compatible object store; erasure-coded across N drives; quorum reads/writes; rolling restart for upgrades.
  • Ceph (2007, RedHat 2014). Unified storage (object, block, file); CRUSH algorithm for deterministic placement without metadata lookups; Paxos for cluster-map consensus.
  • Riak TS (2016, discontinued). Time-series variant of Riak; example of how time-series workloads strain general-purpose distributed stores.

10. Cross-references

  • [[Compute/consensus-protocols]] — Paxos, Multi-Paxos, Raft, Zab, Viewstamped Replication, PBFT, Tendermint, HotStuff. Picks up where this note stops — once you accept FLP and want practical consensus, this is the next stop.
  • [[Compute/database-internals]] — storage engines (LSM, B-tree), MVCC, write-ahead logs, query planners, isolation levels in depth, two-phase commit, distributed query execution.
  • [[Compute/networking-fundamentals]] (TBD) — OSI layers, TCP, UDP, QUIC, BGP, anycast, load balancers, service mesh, congestion control.
  • [[Compute/storage-systems]] (TBD) — block vs file vs object storage, distributed filesystems (HDFS, Ceph, GlusterFS), object stores (S3, GCS), erasure coding.
  • [[Compute/observability]] (TBD) — metrics, traces, logs, distributed tracing (Dapper, OpenTelemetry), SLI/SLO/SLA, error budgets.
  • [[Engineering/Tier3/standards-bodies]] — IEEE (1588 PTP), IETF (RFC 5905 NTP, RFC 7540 HTTP/2, QUIC RFCs).
  • [[Compute/cloud-native-architecture]] (TBD) — Kubernetes, service mesh, microservices, sidecar pattern, GitOps.
  • [[Compute/streaming-systems]] (TBD) — Kafka deep dive, Flink, Spark Streaming, exactly-once semantics, windowing, watermarks.

11. Worked example — quorum math

Walking through a concrete example to make quorum sizing tangible. Suppose N=5 replicas, W=3, R=3. Verify R + W > N (3 + 3 = 6 > 5) and W > N/2 (3 > 2.5) for write-write linearizability.

A write succeeds when 3 of 5 replicas acknowledge — tolerates 2 failed replicas. A read queries 3 of 5; any read overlaps any write at at least 1 replica, guaranteeing the read sees the latest committed write (modulo concurrent in-flight writes).

Trade-offs of the choices:

  • N=3, W=2, R=2. Minimum quorum; tolerates 1 failure. Standard for cost-conscious deployments.
  • N=5, W=3, R=3. Tolerates 2 failures; modest overhead; standard for “production-critical.”
  • N=5, W=4, R=2. Write-heavy guarantee — even if 1 replica is down, 4 must ack. Reads are cheap. Tolerates 1 write failure, 3 read failures.
  • N=5, W=2, R=4. Write-optimized — writes succeed with 2 acks. Reads pay the cost. Tolerates 3 write failures, 1 read failure.
  • N=7, W=4, R=4. Tolerates 3 simultaneous failures; common for “five-nines” availability targets where the operational expectation is f=3 failures during a quorter.

Geographic awareness: in multi-region deployments, replicas in the same datacenter are not independent — they can all fail together (rack power, datacenter network). The effective N is the number of independent failure domains. A common pattern: 3 datacenters × 2 replicas each = 6 physical, but only 3 “fault zones.” Configure quorum at the fault-zone level (Cassandra’s NetworkTopologyStrategy does exactly this with EACH_QUORUM / LOCAL_QUORUM).

12. Worked example — clock-skew bounds

Spanner’s commit-wait illustrates the cost of physical-clock-based ordering. To make a transaction T externally consistent:

  1. T acquires its commit timestamp s from TT.now().latest — the upper bound of TrueTime’s uncertainty interval.
  2. T commits all writes and releases locks only at TT.now().earliest > s.

The wait time = uncertainty interval ε. In Spanner this is ~7 ms typical, peaks at ~10 ms. Every read-write transaction pays this cost. The trade-off: pay a few ms per transaction for global linearizability.

CockroachDB cannot rely on GPS+atomic hardware, so HLC is used instead. The skew bound is configurable (default 500 ms) and CockroachDB will retry transactions that hit “uncertainty intervals” between read and write — effectively trading explicit wait for retry-on-conflict. Both are valid approaches to the same fundamental problem.

13. Anti-patterns to avoid

A short catalog of patterns that look attractive but break in production. Cited because they recur in many architecture interviews and naive design docs.

  • Distributed locks across services. A single lock acquired by service A and released by service B across the network seems convenient but is provably unsafe without fencing. The classic example: process holds a lock, garbage-collects for 30 s, lease expires, lock acquired by another process, first process wakes and “still has the lock” — collision. Martin Kleppmann’s 2016 “How to do distributed locking” essay shows even Redis Redlock is unsafe without fencing tokens. Use leases + fencing tokens, or avoid the design.
  • Distributed transactions across many services. 2PC across 10 microservices is asking for cascading failures. Each service becomes a hard dependency of every other. Prefer sagas, eventual consistency, or restructure ownership so a single service owns the transactional state.
  • “Eventually consistent will be fine” without bounds. Eventually consistent systems need either an SLO on staleness (e.g., 99.9% of reads within 100 ms of latest write) or application logic that tolerates arbitrary staleness. “Eventually” with no bound is operationally untestable.
  • Synchronous calls in the request path that can be async. Every synchronous dependency multiplies your effective failure rate. Push to the background what can be pushed: emails, analytics, audit logs, search-index updates. Use a durable queue.
  • Reading from replicas without considering consistency. Offloading reads to followers is fine if reads tolerate staleness. If they don’t, you’re introducing user-visible bugs (the classic: user updates profile, refreshes page, sees old data because read hit a lagging replica).
  • Custom consensus protocols. Almost always wrong. Use Raft, Paxos, or a battle-tested implementation (etcd, ZooKeeper, Consul). Roll-your-own consensus is a category of bug that Jepsen has caught repeatedly in real systems.
  • Treating timeouts as success. A request that times out has indeterminate outcome — it may have completed or not. Treating timeout as “definitely failed” leads to retry-induced duplicate effects; treating it as “definitely succeeded” leads to lost work. The only safe handling is idempotent operations + retry with idempotency key.
  • Unbounded fan-out. A request that touches 1000 backends has multiplicatively worse availability than each backend’s individual availability. P99 of each becomes near-certain on the aggregate. Cap fan-out width; aggregate at multiple tiers.
  • Sharing a database between services. Two services sharing a single database table become tightly coupled at the schema level — a “shared-database microservices” architecture. Schema changes require coordinated deploys; the database becomes the integration bus. Prefer service-owned databases with explicit APIs.

14. Operational rules of thumb

A non-exhaustive list of empirical guidance for operating distributed systems:

  • The 9s lie. A claimed “99.99% availability” is meaningful only in the steady state and only for the components it’s measured over. Add a cross-region dependency, a DNS resolver, a TLS cert, an ops process — each subtracts 9s. Real end-to-end availability is the product of all dependency availabilities; designing for “five nines” without controlling for tail dependencies is wishful thinking.
  • Failure is not an event but a rate. Don’t design for zero failures; design for a steady-state failure rate and ensure it’s tolerable. A 10000-node cluster will see node failures every day; the system must absorb that without operator intervention.
  • Operator error dominates. Post-mortems consistently show that ~70% of major outages are triggered by humans — bad pushes, mistyped commands, capacity miscalculation. Invest in tooling that makes the right thing easy and the wrong thing hard. Two-person commits, staged rollouts, tested rollback paths.
  • Backpressure or burn. Every component must have a mechanism to refuse work it cannot handle. A queue with no upper bound is a memory leak. A connection pool with no admission control is a death spiral. Cap, shed, return errors fast.
  • Retries are dangerous. Exponential backoff is the floor, not the ceiling. Add jitter, add retry budgets (e.g., max 10% retry rate across a tier), add circuit breakers. The system that retries forever turns transient faults into outages.
  • Timeouts everywhere. Every network call must have a timeout. The default ought to be tight (seconds, not minutes). Hung operations are worse than failed operations because they hold resources.
  • Observability before optimization. You cannot fix what you cannot see. Invest in distributed tracing (OpenTelemetry, Jaeger, Honeycomb), structured logs, SLI metrics with per-percentile breakdowns. Before paging on CPU or memory, page on user-facing SLO violations.
  • The simplest topology that works. Multi-region active-active is operationally expensive; primary-secondary with regional failover is cheaper and sufficient for most. Pay the complexity tax only when the requirements force it.
  • Document the failure model. Every distributed system has a model — set of assumed failure types, network model, clock assumptions. Write it down. The system is only safe within that model; outside it (network is reliable, clocks are perfectly synced, all nodes are honest) all bets are off. Many catastrophic outages happen when reality steps outside the model and nobody noticed.
  • Test in production-like conditions. Tests that pass in low-load conditions do not validate behavior at scale. Use shadow traffic, dark launches, gradual rollouts, and game-day fault injection. Chaos engineering at Netflix (Chaos Monkey 2010, Chaos Kong 2016) is a template for organizational practice.
  • Capacity planning matters. Most distributed systems degrade ungracefully when overloaded. Run regular load tests; know the breaking point of every tier; have headroom (typically 2x peak) on every critical resource. Cloud autoscaling helps but is not a substitute for understanding the math.

15. Common interview / design questions

A catch-list of canonical problems that exercise this material. Each has multiple defensible solutions; the discussion is the point.

  • Design a global key-value store. Probes: consistency model? Replication strategy? Partition scheme? Failover? Multi-region?
  • Design a distributed counter. Probes: contention handling, conflict resolution (PN-counter CRDT vs centralized leader), tolerance for inaccuracy.
  • Design a URL shortener at scale. Probes: ID generation (snowflake vs uuid vs hash), partitioning, caching, analytics fan-out.
  • Design a rate limiter. Probes: per-user vs global, sliding window vs token bucket, distributed counter accuracy, eventual consistency tolerance.
  • Design a job scheduler / cron-at-scale. Probes: leader election, fencing, deduplication if scheduler restarts, partial execution.
  • Design a real-time leaderboard. Probes: write-heavy aggregation, sorted-set partitioning, eventual vs strong consistency for ranks.
  • Design distributed tracing. Probes: trace ID propagation, sampling, span aggregation, cardinality control.
  • Design a queue / log system. Probes: durability, ordering guarantees per partition, consumer groups, exactly-once vs at-least-once semantics.

16. Reading order for newcomers

A suggested progression through the literature, in roughly the order it makes sense to read:

  1. Kleppmann, Designing Data-Intensive Applications (2017). The practitioner introduction. Chapters 5–9 cover the same territory as this note but at book length with worked examples and current production references. Start here even if you only finish the first half.
  2. Lamport “Time, Clocks” (1978). The foundational paper on logical time. Short, readable, transformative.
  3. Fischer, Lynch, Paterson “Impossibility” (1985). The proof; not long; the right mental model for asynchronous systems.
  4. Gilbert + Lynch “Brewer’s Conjecture” (2002). The CAP proof; pairs with the original 2000 Brewer talk and the 2012 retrospective.
  5. Dean + Barroso “The Tail at Scale” (2013). The operational reality of fan-out architectures.
  6. DeCandia et al. “Dynamo” (2007). The seminal leaderless-system paper; introduces or popularizes consistent hashing, vector clocks, sloppy quorum, hinted handoff, read repair, Merkle anti-entropy.
  7. Corbett et al. “Spanner” (2012). The seminal globally-consistent paper; introduces TrueTime.
  8. Ongaro + Ousterhout “Raft” (2014). The pragmatic consensus protocol; understandable in one sitting.
  9. Helland “Life beyond Distributed Transactions” (2007). The right mental shift away from “just use a transaction” for cross-service work.
  10. Jepsen reports (ongoing). Concrete examples of what goes wrong when theory meets implementation; pick any few systems you use and read their reports.

After this corpus, the remaining literature is largely depth on specific topics (consensus variants, CRDT design, transaction protocols, streaming semantics) which is appropriate to engage as the work demands.

17. Citations

  • Kleppmann, Martin. Designing Data-Intensive Applications. O’Reilly, 2017. The single best practitioner-oriented reference for the material in this note. Especially chapters 5 (replication), 6 (partitioning), 7 (transactions), 8 (trouble with distributed systems), 9 (consistency and consensus).
  • Tanenbaum, Andrew S., and Maarten van Steen. Distributed Systems. 3rd ed., 2017. Self-published / Pearson. The classic textbook; more theoretical than Kleppmann.
  • Brewer, Eric. “Towards Robust Distributed Systems.” PODC 2000 keynote. The CAP conjecture as originally posed.
  • Brewer, Eric. “CAP Twelve Years Later: How the ‘Rules’ Have Changed.” IEEE Computer 45(2), 2012. Retrospective and refinement.
  • Gilbert, Seth, and Nancy Lynch. “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.” ACM SIGACT News 33(2), 2002. The formal CAP proof.
  • Fischer, Michael J., Nancy A. Lynch, and Michael S. Paterson. “Impossibility of Distributed Consensus with One Faulty Process.” Journal of the ACM 32(2), 1985.
  • Lamport, Leslie. “Time, Clocks, and the Ordering of Events in a Distributed System.” Communications of the ACM 21(7), 1978.
  • Lamport, Leslie. “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.” IEEE Transactions on Computers C-28(9), 1979. Sequential consistency.
  • Lamport, Leslie, Robert Shostak, and Marshall Pease. “The Byzantine Generals Problem.” ACM TOPLAS 4(3), 1982.
  • Herlihy, Maurice P., and Jeannette M. Wing. “Linearizability: A Correctness Condition for Concurrent Objects.” ACM TOPLAS 12(3), 1990.
  • Saltzer, Jerome H., David P. Reed, and David D. Clark. “End-to-End Arguments in System Design.” ACM Transactions on Computer Systems 2(4), 1984.
  • Abadi, Daniel. “Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story.” IEEE Computer 45(2), 2012. PACELC paper.
  • Dwork, Cynthia, Nancy Lynch, and Larry Stockmeyer. “Consensus in the Presence of Partial Synchrony.” Journal of the ACM 35(2), 1988.
  • Schneider, Fred B. “Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial.” ACM Computing Surveys 22(4), 1990.
  • DeCandia, Giuseppe, et al. “Dynamo: Amazon’s Highly Available Key-Value Store.” SOSP 2007.
  • Corbett, James C., et al. “Spanner: Google’s Globally-Distributed Database.” OSDI 2012.
  • Hayashibara, Naohiro, Xavier Defago, Rami Yared, and Takuya Katayama. “The φ Accrual Failure Detector.” SRDS 2004.
  • Karger, David, et al. “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web.” STOC 1997.
  • Kulkarni, Sandeep, Murat Demirbas, Deepak Madappa, Bharadwaj Avva, and Marcelo Leone. “Logical Physical Clocks.” OPODIS 2014. HLC paper.
  • Lloyd, Wyatt, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. “Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS.” SOSP 2011.
  • Mahajan, Prince, Lorenzo Alvisi, and Mike Dahlin. “Consistency, Availability, and Convergence.” UT Austin TR 2011. Proves causal+ is the strongest model achievable with availability under partition.
  • Terry, Douglas B., et al. “Session Guarantees for Weakly Consistent Replicated Data.” PDIS 1994. Bayou.
  • Shapiro, Marc, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. “A Comprehensive Study of Convergent and Commutative Replicated Data Types.” INRIA RR-7506, 2011. CRDT foundations.
  • Garcia-Molina, Hector, and Kenneth Salem. “Sagas.” SIGMOD 1987.
  • Dean, Jeffrey, and Luiz André Barroso. “The Tail at Scale.” Communications of the ACM 56(2), 2013.
  • Hunt, Patrick, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. “ZooKeeper: Wait-free Coordination for Internet-scale Systems.” USENIX ATC 2010.
  • Burrows, Mike. “The Chubby Lock Service for Loosely-Coupled Distributed Systems.” OSDI 2006.
  • Akkoyunlu, E. A., K. Ekanadham, and R. V. Huber. “Some Constraints and Tradeoffs in the Design of Network Communications.” SOSP 1975. Earliest Two Generals’ formulation.
  • Gray, Jim. “Notes on Data Base Operating Systems.” Operating Systems: An Advanced Course, Springer LNCS 60, 1978. Two Generals’ Problem named; two-phase commit.
  • Chandra, Tushar Deepak, and Sam Toueg. “Unreliable Failure Detectors for Reliable Distributed Systems.” Journal of the ACM 43(2), 1996.
  • Attiya, Hagit, Amotz Bar-Noy, and Danny Dolev. “Sharing Memory Robustly in Message-Passing Systems.” Journal of the ACM 42(1), 1995. ABD algorithm for atomic register.
  • Attiya, Hagit, and Jennifer Welch. “Sequential Consistency versus Linearizability.” ACM TOCS 12(2), 1994. Lower-bound proof on linearizable read latency.
  • Bronson, Nathan, et al. “Metastable Failures in Distributed Systems.” HotOS 2021.
  • Huang, Peng, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, Randolph Yao. “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems.” HotOS 2017.
  • Fidge, Colin J. “Timestamps in Message-Passing Systems That Preserve the Partial Ordering.” Australian Computer Science Conference, 1988. Vector clocks.
  • Mattern, Friedemann. “Virtual Time and Global States of Distributed Systems.” Workshop on Parallel and Distributed Algorithms, 1989. Vector clocks (independent).
  • Merkle, Ralph C. “A Certified Digital Signature.” Stanford PhD thesis, 1979 (Merkle trees first publication; later applied to anti-entropy).
  • Berenson, Hal, et al. “A Critique of ANSI SQL Isolation Levels.” SIGMOD 1995. Snapshot isolation defined.
  • Eswaran, Kapali P., Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger. “The Notions of Consistency and Predicate Locks in a Database System.” Communications of the ACM 19(11), 1976.
  • Cahill, Michael J., Uwe Röhm, and Alan D. Fekete. “Serializable Isolation for Snapshot Databases.” SIGMOD 2008. SSI; foundation of PostgreSQL’s serializable mode.
  • Thomson, Alexander, and Daniel J. Abadi. “The Case for Determinism in Database Systems.” SIGMOD 2012. Calvin deterministic transactions.
  • Ongaro, Diego. “Consensus: Bridging Theory and Practice.” Stanford PhD thesis, 2014. Raft and SMR engineering at length.
  • Chang, Fay, et al. “Bigtable: A Distributed Storage System for Structured Data.” OSDI 2006.
  • Peng, Daniel, and Frank Dabek. “Large-scale Incremental Processing Using Distributed Transactions and Notifications.” OSDI 2010. Percolator.
  • Nygard, Michael T. Release It! Design and Deploy Production-Ready Software. Pragmatic Bookshelf, 2007 (2nd ed. 2018). Circuit breakers, bulkheads, operational patterns.
  • Aphyr (Kingsbury, Kyle). Jepsen reports, 2013–present (https://jepsen.io). Empirical analysis of distributed systems under partition and skew; has found correctness bugs in essentially every system tested.
  • Ben-Or, Michael. “Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols.” PODC 1983.
  • Schwarzkopf, Malte. “The remarkable utility of dataflow computing.” PhD thesis, 2015 (and Pat Helland’s “Heisenberg Was on the Write Track” CIDR 2015). On the broader dataflow / event-driven view of distributed state.
  • Helland, Pat. “Life beyond Distributed Transactions: An Apostate’s Opinion.” CIDR 2007. Foundational essay on why distributed transactions cannot be the unit of all coordination at scale.
  • Helland, Pat. “Immutability Changes Everything.” CIDR 2015 / CACM 2016. On append-only data architectures.
  • Bailis, Peter, and Ali Ghodsi. “Eventual Consistency Today: Limitations, Extensions, and Beyond.” ACM Queue 11(3), 2013. A pragmatic survey of eventual consistency in practice.
  • Bailis, Peter, et al. “Highly Available Transactions: Virtues and Limitations.” VLDB 2014. Characterizes which transactional guarantees are achievable with availability.
  • Brooker, Marc. AWS blog posts on Aurora, DynamoDB, and EBS architecture. A useful complement to academic papers for production-system reasoning.
  • Verbitski, Alexandre, et al. “Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases.” SIGMOD 2017.
  • DeCandia + Hastorun et al. (Dynamo paper, already cited above) — re-read in conjunction with Brooker’s DynamoDB writeups for the evolution from the 2007 design to 2024 multi-tenant managed.
  • Cooper, Brian F., et al. “PNUTS: Yahoo!‘s Hosted Data Serving Platform.” VLDB 2008. Geographically replicated record store with timeline consistency.
  • Lloyd, Wyatt, et al. “Stronger Semantics for Low-Latency Geo-Replicated Storage.” NSDI 2013. Eiger; causal+ at scale.
  • Junqueira, Flavio P., Benjamin C. Reed, and Marco Serafini. “Zab: High-performance Broadcast for Primary-Backup Systems.” DSN 2011.
  • Ongaro, Diego, and John Ousterhout. “In Search of an Understandable Consensus Algorithm.” USENIX ATC 2014. The Raft paper.
  • Bishop, Steven, et al. “Engineering with Logic: HOL Specification and Symbolic-Evaluation Testing for TCP Implementations.” POPL 2006. On formal verification at the network protocol layer.
  • Newcombe, Chris, et al. “How Amazon Web Services Uses Formal Methods.” CACM 58(4), 2015. TLA+ in production engineering.
  • Lampson, Butler W. “Hints for Computer System Design.” SOSP 1983. General-purpose design heuristics that age well; relevant to distributed systems by extension.
  • Helland, Pat, and Dave Campbell. “Building on Quicksand.” CIDR 2009. On the inherent uncertainty of distributed state.
  • Kleppmann, Martin. “How to do distributed locking” (2016 blog, jepsen-adjacent). Critique of Redis Redlock; canonical example of why fencing tokens are non-negotiable.
  • Bailis, Peter, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. “Highly Available Transactions: Virtues and Limitations” (already cited) — survey of which weak transactional semantics are availability-friendly.
  • Birman, Kenneth P. “Reliable Distributed Systems: Technologies, Web Services, and Applications.” Springer, 2005. A second textbook with more focus on group communication and gossip systems.
  • Coulouris, George, Jean Dollimore, Tim Kindberg, and Gordon Blair. Distributed Systems: Concepts and Design. 5th ed., 2011. Long-running textbook; broader survey than Tanenbaum + van Steen.
  • Burns, Brendan. Designing Distributed Systems. O’Reilly, 2018. Pattern catalog: sidecar, ambassador, adapter, leader election, work queue, scatter-gather. Light but useful for taxonomy.
  • The Morning Paper (Adrian Colyer, 2014–2021). Daily paper summaries; pre-eminent venue for “what’s the SIGMOD/SOSP/OSDI takeaway.” Indexed at https://blog.acolyer.org.