Consensus Protocols — Compute Reference

1. At a glance

Distributed consensus is the problem of getting a set of replicas — each with its own clock, its own memory, and its own ability to fail — to agree on a single value, or on the order of a sequence of values, despite arbitrary failures of some subset of the participants and the message-passing network between them. It is the foundational primitive that turns a collection of unreliable machines into a single coherent system. Almost every interesting distributed property in production today — a Kubernetes cluster electing a single API-server leader, etcd persisting a config change durably, Kafka ordering produces into a partition log, CockroachDB linearizing transactions across continents, Bitcoin agreeing on the next block — reduces to a consensus protocol underneath.

The shape of the problem looks deceptively simple: nodes propose values, exchange messages, and eventually decide on one. The difficulty is hidden in the fault model. In an asynchronous network with even a single crashing node, no deterministic consensus algorithm can guarantee both safety and liveness — this is the FLP impossibility result (Fischer, Lynch, Paterson 1985, JACM). Every real protocol therefore makes a trade: assume partial synchrony (Raft, Paxos, Zab, PBFT), assume bounded message delay during steady state (most leader-based protocols), or trade strict safety for probabilistic safety with economic incentives (Bitcoin, Ethereum).

Why it matters: consensus is the backbone of the replicated state machine (RSM) abstraction (Schneider 1990, ACM Computing Surveys). If every replica receives the same log of commands in the same order and applies them deterministically, every replica ends in the same state. RSM is how etcd, ZooKeeper, Spanner, CockroachDB, and TiKV give you a database that survives node loss without lost writes. Without consensus you have eventual consistency at best; with it you have linearizable reads and writes, leader-election semantics that hold under partition, and durable cluster metadata that one bad disk cannot destroy.

2. Problem statement

Formally, a consensus protocol must satisfy three properties (Lamport 1978; Dwork, Lynch, Stockmeyer 1988):

  • Agreement — no two non-faulty nodes decide on different values.
  • Validity — the decided value must be a value that some node proposed (no inventing values out of thin air).
  • Termination — every non-faulty node eventually decides.

Agreement is a safety property: “nothing bad ever happens” — even under arbitrary delays, network partitions, or adversarial scheduling, two replicas will never disagree. Termination is a liveness property: “something good eventually happens” — given enough time and a network that recovers, the system makes progress.

The FLP impossibility theorem (1985) proves you cannot have both in a fully asynchronous model with even one crash failure. Real protocols handle this by:

  1. Assuming partial synchrony — message delays are bounded eventually, even if not always. This is the GST (Global Stabilization Time) model (Dwork, Lynch, Stockmeyer 1988). Paxos, Raft, Zab all live here.
  2. Using randomization (Ben-Or 1983; Rabin 1983) — flipping coins to break symmetry in the worst case. Used in Bitcoin’s nonce-mining as a side-effect.
  3. Tolerating periods of unavailability — under a partition, the minority side stops accepting writes (sacrificing availability per CAP) so safety is preserved.

Fault models

The protocol’s fault model dictates how much it costs to run:

  • Crash-stop (fail-stop) — nodes fail by halting; they never send wrong messages. Paxos, Raft, Zab, Viewstamped Replication all assume this. Tolerates f failures with 2f+1 nodes (a 5-node cluster tolerates 2 failures).
  • Crash-recovery — nodes can crash and later recover with their durable state intact. All modern protocols handle this; it forces the protocol to fsync state before responding.
  • Byzantine — failed nodes can do anything: send conflicting messages, lie, collude. Requires 3f+1 nodes to tolerate f failures. PBFT, HotStuff, Tendermint live here. Blockchain protocols are Byzantine by definition (anyone can join, no trust).
  • Omission / network failures — messages lost or duplicated. Handled by retries and idempotent message IDs in all protocols.

Practical sizing: production etcd, ZooKeeper, Consul clusters are almost always 3 or 5 nodes (tolerates 1 or 2 failures). PBFT-class systems start at 4 nodes (tolerates 1 Byzantine) and rarely exceed 100 because of quadratic message overhead.

3. Paxos

Leslie Lamport introduced Paxos in a 1989 technical report (DEC SRC), formally published as “The Part-Time Parliament” in ACM Transactions on Computer Systems (1998). The paper described consensus as the workings of a fictional Greek-island parliament where legislators wandered in and out of the chamber. It was so opaque that it took nine years from submission to publication, and Lamport later wrote “Paxos Made Simple” (2001, SIGACT News) as a remedial explanation. He still maintained, with some justification, that the protocol itself was simple — the metaphor was the problem.

Roles

  • Proposer — proposes a value. May be any node; in practice one elected leader.
  • Acceptor — votes on proposals. A majority of acceptors must accept a proposal for it to be chosen. Acceptors persist their state.
  • Learner — learns the chosen value (often the same nodes as proposers/acceptors in practice).

Single-decree Paxos (Synod algorithm)

The protocol uses ballot numbers (round numbers) totally ordered across all proposers (typically a tuple (round, node-id)). Each proposer attaches a unique, monotonically increasing ballot to its proposal.

Phase 1 (Prepare / Promise):

  1. Proposer chooses ballot n and sends prepare(n) to a majority of acceptors.
  2. Each acceptor that has not promised a higher ballot responds with promise(n), including the highest-numbered proposal it has already accepted (if any).
  3. If the proposer sees promises from a majority, it proceeds to Phase 2; otherwise it bumps the ballot and retries.

Phase 2 (Accept / Accepted):

  1. The proposer picks the value: if any acceptor returned a previously accepted value, the proposer must use the one with the highest ballot. Otherwise it uses its own value.
  2. Proposer sends accept(n, value) to acceptors.
  3. Each acceptor accepts iff it has not promised a higher ballot. It responds with accepted(n, value).
  4. Once a majority of acceptors have accepted, the value is chosen. Learners are notified.

The key safety invariant: once a value is chosen at ballot n, every higher ballot will see and re-propose the same value, because any majority intersects any other majority in at least one acceptor that remembers the accepted value.

Multi-Paxos

Single-decree Paxos chooses one value. For replicated logs (the actual workload), you run a Paxos instance for each log slot. Multi-Paxos observes that if you keep the same leader, Phase 1 (prepare) can be done once for an entire range of slots — only Phase 2 (accept) needs to happen per command. This collapses commit latency to one round-trip from leader to a majority.

The leader is itself elected by a degenerate Paxos round. Failure of the leader triggers a new election.

Why engineers hate it

Three reasons it gets a bad rap:

  1. Implicit leader. Lamport’s papers treat the leader as an optimization, not a protocol element, so implementers are left to figure out election themselves.
  2. Log compaction, membership changes, and reconfiguration are not in the original paper. Every production system reinvents them differently.
  3. The metaphor. Reading “Paxos Made Simple” cold, most engineers come away unable to implement it.

Google built Chubby (Burrows 2006, OSDI) on Multi-Paxos for distributed locking; the lessons-learned paper “Paxos Made Live” (Chandra, Griesemer, Redstone 2007, PODC) is more useful than the originals for understanding what a real implementation requires. Spanner uses Multi-Paxos per tablet.

4. Raft

Diego Ongaro and John Ousterhout introduced Raft in 2014 (USENIX Annual Technical Conference, best-paper award) with the explicit design goal of understandability. The original paper is titled “In Search of an Understandable Consensus Algorithm” and the abstract acknowledges that Paxos is correct and efficient but pedagogically a disaster. Raft is provably equivalent in safety and liveness, but structured around three orthogonal subproblems instead of one monolithic algorithm: leader election, log replication, and safety. The algorithm fits on one page; the paper fits on 16; a competent engineer can implement a working version in a week.

Cluster state

Every node is in one of three states: follower, candidate, or leader. Time is divided into terms — monotonically increasing integers — and each term has at most one leader. A term begins with an election. If no leader is elected (split vote), the term ends and a new election starts in the next term.

Each node persists three things to disk: its currentTerm, the votedFor candidate in that term, and its log entries.

Leader election

  1. Followers expect a heartbeat from the leader within a randomized timeout (the election timeout, typically 150-300 ms).
  2. If a follower’s timer expires, it transitions to candidate, increments its term, votes for itself, and sends RequestVote RPCs to all other nodes.
  3. A candidate wins if it receives votes from a majority. Each node votes for at most one candidate per term, on a first-come-first-served basis, subject to a log-up-to-date check.
  4. The winner sends heartbeats (AppendEntries with no entries) to assert leadership.
  5. If two candidates split the vote, both time out and try again with new randomized timeouts. Randomization makes a second split unlikely.

The election-timeout range matters: too short and you get spurious elections under load, too long and failover is slow. Production etcd uses 1000 ms by default with 500 ms heartbeats; CockroachDB tunes per-range.

Log replication

  1. Clients send commands to the leader.
  2. The leader appends the command to its log (with the current term and the next index) and sends AppendEntries RPCs to followers in parallel.
  3. Each follower appends to its log and acknowledges.
  4. Once a majority (including the leader) has the entry, the leader marks it committed and applies it to its state machine.
  5. The leader piggybacks the new commit index on the next AppendEntries; followers apply entries up to the commit index to their state machines.

Safety: the log-matching property

Raft’s correctness rests on two invariants:

  • Log matching: if two logs contain an entry with the same index and term, they are identical up to that index.
  • Leader completeness: if an entry is committed in term T, that entry will be present in the logs of all leaders in terms > T.

These are maintained by the election restriction: a candidate only receives a vote if its log is at least as up-to-date (by (term, index)) as the voter’s. A candidate with a stale log cannot win, so any leader’s log contains all previously committed entries.

If a leader crashes after appending but before replicating, the entry may exist on some followers. The next leader’s log either contains the entry (and it survives) or does not (and the new leader will overwrite the followers’ tails with AppendEntries).

Snapshots and log compaction

The log grows without bound. Each node periodically takes a snapshot of its state machine, including the last included index and term, and truncates the log up to that point. Snapshots are shipped to lagging followers via InstallSnapshot RPC when the leader has already discarded the relevant log prefix.

Membership changes

Naively adding or removing nodes can split the cluster into two majorities with different views. Raft offers two approaches:

  • Joint consensus (original paper) — transition through an intermediate config that requires majorities in both old and new configs.
  • Single-server changes (later thesis) — add or remove one node at a time, since Cold and Cnew differ by exactly one node and must share a majority.

Most production Raft libraries (etcd’s, HashiCorp’s) implement single-server changes for simplicity.

5. Zab — ZooKeeper Atomic Broadcast

Apache ZooKeeper has used Zab since its 2008 release; the protocol was formally written up by Junqueira, Reed, and Serafini at Yahoo in 2011 (“Zab: High-performance broadcast for primary-backup systems,” DSN 2011). Zab predates Raft by three years and shares much of its structure — a leader-based atomic-broadcast protocol with terms (called epochs) and a replicated log.

What Zab solves

Zab is described as atomic broadcast rather than consensus, because the actual contract ZooKeeper exposes is total order on a stream of state updates. Each operation is broadcast to all replicas in the same order. This is equivalent to consensus on each slot but framed around the broadcast primitive.

Zab guarantees:

  • Reliable delivery — if one server delivers a message, all correct servers deliver it.
  • Total order — all servers deliver messages in the same order.
  • Causal order — messages from the same primary are delivered in the order sent.

Protocol phases

  1. Discovery — followers report their last accepted epoch and last accepted transaction to the prospective leader.
  2. Synchronization — the leader computes the longest accepted prefix and ships it to followers, bringing them up to date.
  3. Broadcast — the leader assigns a zxid (a 64-bit (epoch, counter) identifier) to each new transaction, sends it to followers as a PROPOSAL, collects ACKs from a quorum, then sends COMMIT.

The zxid ordering is the foundation of ZooKeeper’s strong consistency guarantees — clients can read their own writes by waiting for a state machine to reach a given zxid.

Where it’s used

ZooKeeper itself is the deployment surface — and ZooKeeper is the metadata-and-coordination backbone for Apache HBase, Apache Solr (SolrCloud), early Kafka (pre-2.8), Hadoop YARN HA, Apache Storm, Apache Mesos, Pulsar (BookKeeper), and many in-house systems. The Yahoo team built Zab specifically because Paxos’s failure-handling didn’t match their throughput target for primary-backup workloads — the recovery phase had to be cheap to keep up with Yahoo’s tens of thousands of clients.

6. PBFT — Practical Byzantine Fault Tolerance

Miguel Castro and Barbara Liskov introduced Practical Byzantine Fault Tolerance at OSDI 1999, with an expanded TOCS journal version (Castro and Liskov, “Practical Byzantine Fault Tolerance and Proactive Recovery,” 2002). It was the first BFT consensus protocol fast enough for real workloads — earlier protocols (Lamport’s Byzantine Generals, 1982) required exponential message complexity.

Setup

PBFT operates with n = 3f + 1 replicas, tolerating up to f Byzantine failures. The intuition: a quorum of 2f + 1 is required for any decision, and two such quorums must intersect in at least f + 1 nodes, which guarantees at least one honest node in the intersection (the f + 1 exceeds the f malicious nodes).

A primary is designated by view number (v mod n). Views change when the primary is suspected of being faulty.

Three-phase protocol

For each client request:

  1. Pre-prepare — the primary assigns a sequence number n and broadcasts <PRE-PREPARE, v, n, request> to all backups, signed.
  2. Prepare — each backup verifies the pre-prepare and broadcasts <PREPARE, v, n, digest> to everyone. A replica becomes prepared when it has the pre-prepare and 2f matching prepares (so 2f + 1 total replicas agree on (v, n, digest)).
  3. Commit — once prepared, each replica broadcasts <COMMIT, v, n, digest>. When a replica has 2f + 1 commits, it executes the request and replies to the client.

The client waits for f + 1 matching replies before accepting the result.

View changes

When backups suspect the primary (request times out), they multicast a VIEW-CHANGE message. The new primary (next in rotation) collects 2f + 1 view-change messages and reconstructs the unfinished operations using their prepared certificates, then broadcasts a NEW-VIEW to restart.

Cost

Each operation requires O(n^2) messages because of the all-to-all broadcasts in prepare and commit. This limits PBFT to small clusters — typically under 30 nodes. Cryptographic signatures (RSA in the original; MACs as an optimization for known-set membership) dominate CPU under load. PBFT is the basis for Hyperledger Fabric (versions through 1.x) and many permissioned-blockchain systems.

7. HotStuff

HotStuff (Yin, Malkhi, Reiter, Gueta, Abraham; PODC 2019) is a BFT consensus protocol that achieves linear message complexity per decision under the steady state and linear view-change cost. It is the consensus algorithm Diem (formerly Libra) was built on, and it influenced Aptos, Sui, and several other modern blockchain systems.

The chained / pipelined structure

HotStuff replaces PBFT’s all-to-all broadcasts with a star topology: the leader collects votes and produces a quorum certificate (QC) — a threshold signature aggregating 2f + 1 votes into a single constant-size object. Followers don’t talk to each other; they only talk to the leader. This drops per-decision cost from O(n^2) to O(n).

The protocol has three phases — prepare, pre-commit, commit — each ending in a QC. To get steady-state throughput, HotStuff pipelines them: the prepare-phase of decision k+1 rides on the same vote message as the pre-commit phase of decision k, and so on. This is the chained HotStuff variant most implementations use.

Why it matters

The combination of (a) linear cost per decision and (b) linear-cost view change (no view-change certificate explosion) made BFT viable for high-throughput blockchains for the first time. Before HotStuff, BFT was practical only for small permissioned consortia; with HotStuff, hundreds-of-validators networks become feasible.

8. Tendermint and BFT-SMR

Tendermint (Kwon 2014) is a BFT consensus protocol that powers the Cosmos blockchain ecosystem. It is a leader-based, round-based BFT-SMR (State Machine Replication) protocol with two phases per round: pre-vote and pre-commit. It tolerates f < n/3 Byzantine validators.

Each round has a designated proposer (round-robin weighted by stake). The proposer broadcasts a block; validators pre-vote (locking onto the proposal if 2f + 1 pre-votes match), then pre-commit. A block is committed when 2f + 1 pre-commits land. If the round times out, the next round begins with a new proposer.

Tendermint provides instant finality — once a block is committed, it is final (no probabilistic confirmation depth needed). This is the basis of IBC (Inter-Blockchain Communication), the Cosmos protocol for cross-chain messaging: IBC can trust a block immediately because Tendermint never produces forks.

9. Proof-of-Work and Proof-of-Stake

Nakamoto consensus, the original mechanism in the Bitcoin whitepaper (Satoshi Nakamoto, “Bitcoin: A Peer-to-Peer Electronic Cash System,” 2008), is structurally different from the classical protocols above. It is probabilistic and economically secured rather than guaranteed safe under a bounded fault model.

Proof-of-Work (PoW)

Miners race to find a nonce such that SHA256(SHA256(block_header)) is below a difficulty target. The first to find one broadcasts the block; other nodes verify and append it. Longest-chain rule picks the canonical chain when forks occur. Safety is probabilistic — after k confirmations, the probability of reorganization decreases exponentially in k. Bitcoin’s typical confirmation depth is 6 blocks (~60 minutes); for large-value transactions, more.

PoW makes Byzantine attacks economically infeasible: an attacker needs to control >50% of network hash rate (currently hundreds of millions of USD per hour for Bitcoin) to fork the chain. The protocol tolerates arbitrary numbers of byzantine non-mining nodes; what matters is hash-rate share, not node count.

FLP is sidestepped because consensus is probabilistic, not deterministic — there is always a non-zero chance of reorganization, but it decays rapidly.

Proof-of-Stake (PoS)

Ethereum transitioned from PoW to PoS in September 2022 (“The Merge”). The Ethereum PoS protocol is Gasper — a combination of LMD-GHOST (a fork-choice rule, Buterin et al.) and Casper FFG (a finality gadget by Buterin and Griffith, 2017). Validators stake 32 ETH each and are randomly selected to propose and attest to blocks. Every 32 blocks (an “epoch”), validators vote on finality; once two-thirds of stake votes for a checkpoint, it is finalized — economically irreversible barring a coordinated slashing attack that burns one-third of staked ETH.

Other PoS designs include Cardano (Ouroboros, Kiayias et al. 2017), Algorand (Chen and Micali 2017, using verifiable random functions), and Solana (Proof-of-History + PoS).

The blockchain world generally collapses two distinct problems together: Sybil resistance (who is allowed to vote?) and consensus (how do they agree?). Classical consensus assumes a known validator set; blockchain consensus has to admit anyone willing to pay (PoW work, or PoS stake) and then reach agreement among them.

10. Raft details deep

This section drills into the details that matter for implementing or operating Raft, since it is the protocol most production systems are built on.

Election timeout tuning

The randomized election timeout (often 150-300 ms in textbook descriptions, scaled up to 500-1500 ms in production) is the dominant source of failover latency. Two constraints fight each other:

  • Spurious-election prevention — the timeout must be much longer than typical network RTT and heartbeat interval, or transient delays trigger unnecessary leader changes. Rule of thumb: electionTimeout >= 10 * heartbeatInterval and electionTimeout >> broadcast time.
  • Failover speed — the timeout sets the maximum delay before a dead leader is detected. A 1-second timeout means up to 1 second of unavailability after a crash.

Production etcd defaults: 1000 ms election timeout, 100 ms heartbeat (10:1 ratio). Across geo-distributed clusters the timeout must accommodate cross-region latency; CockroachDB tunes per-range based on observed RTT.

AppendEntries RPC

The leader’s workhorse RPC carries:

term            // leader's term
leaderId
prevLogIndex    // index of log entry immediately preceding new ones
prevLogTerm     // term of prevLogIndex entry
entries[]       // new entries (empty for heartbeat)
leaderCommit    // leader's commitIndex

The follower’s consistency check: if my log at prevLogIndex has term prevLogTerm, append the new entries (truncating any conflicting suffix). Otherwise reject — the leader will retry with a smaller prevLogIndex until it finds a match. This is the log-matching property in action.

Commit index advancement

The leader tracks matchIndex[i] for each follower — the highest log index known to be replicated on follower i. The leader’s commitIndex is the highest index n such that:

  1. A majority of matchIndex[]n.
  2. log[n].term == currentTerm (the critical safety check — leaders cannot commit entries from prior terms by majority count alone; doing so allows a known unsafe scenario described in Figure 8 of the Raft paper).

The second condition is why Raft leaders commit a no-op entry immediately on election: it advances currentTerm-aligned commits and lets older entries piggyback to safety.

Snapshot and chunked transfer

When a follower falls so far behind that the leader has compacted its log past the follower’s last index, InstallSnapshot ships the snapshot in chunks. Chunks are sized to fit the network MTU and the snapshot is reassembled by the follower into a temporary file, then atomically renamed. The follower discards its log and resumes from the snapshot’s last included index.

Production tuning: snapshot creation must run on a background thread or in a forked process (CockroachDB uses Pebble’s incremental snapshots; etcd uses BoltDB then BBolt). A foreground snapshot will stall the apply loop and trigger spurious elections.

Membership changes — single-server change

To add node D to cluster {A, B, C}:

  1. Leader receives AddServer(D) request.
  2. Leader appends a configuration log entry that includes D.
  3. Leader replicates this entry. As soon as it is committed (under the new majority rule — {A, B, C, D} requires 3-of-4), D is a full member.
  4. The leader catches D up via normal AppendEntries (or InstallSnapshot if needed).

The single-server invariant: Cold and Cnew differ by one node, so they share a majority — there is no time when two disjoint majorities can form.

Removing a leader is the subtle case: the leader removes itself, replicates the change, and steps down once it commits the new config. The new majority elects a fresh leader.

11. Real implementations

etcd

etcd (CoreOS, 2013; now CNCF and Red Hat) is the canonical Raft implementation in Go. It is the Kubernetes control-plane store, holding all cluster state (Pods, Services, Nodes, Secrets) in a single linearizable key-value database backed by Raft. Every kubectl apply writes through etcd. The library go.etcd.io/etcd/raft is also used directly by TiKV, CockroachDB (its early versions), and many in-house systems.

Storage backend: BoltDB historically, BBolt (a fork) currently, with a planned move to Pebble. Each commit fsyncs to disk before the leader acks.

Typical deployment: 3 or 5 nodes per cluster, dedicated machines, low-latency dedicated network. Performance: tens of thousands of writes per second on commodity hardware.

Apache ZooKeeper

ZooKeeper (Yahoo, 2007; Apache top-level project) uses Zab. It is in production at every large hyperscaler and at virtually every Hadoop / Solr / HBase / Kafka deployment as the coordination layer. Despite its age and well-known operational sharp edges (the “JVM with split-brain GC pauses” reputation), it remains the reference distributed-coordination system.

ZooKeeper is currently being displaced — Kafka has removed it, HBase has Quorum alternatives, Solr is decoupling — but it still runs critical infrastructure at most companies that adopted Hadoop in the 2010s.

Apache Kafka KRaft

Kafka shipped KRaft mode (Kafka Raft) as a preview in Kafka 2.8 (April 2021) and made it the default in 3.3 (October 2022). It replaces ZooKeeper with a native Raft implementation embedded in Kafka brokers, eliminating the operational and deployment complexity of running a separate ZooKeeper ensemble. Kafka 4.0 (2025) removed ZooKeeper support entirely.

KRaft holds cluster metadata (topics, partitions, broker membership, ACLs) in a Raft-replicated metadata topic. The controller quorum is a small set of dedicated Kafka brokers (3-5 nodes) acting as Raft voters. The metadata log is then replicated to other brokers as a passive replica via Kafka’s normal fetch protocol — every broker has a current view of metadata without itself voting.

HashiCorp Consul and Vault

Consul (service discovery + KV store) and Vault (secrets management) both use the HashiCorp Raft Go library (github.com/hashicorp/raft). The library is independent of etcd’s and predates it; it pioneered single-server membership changes and several operational features like non-voting members (read replicas).

Vault clusters typically run 3 or 5 nodes with Raft as the integrated storage backend. Vault 1.4+ defaults to integrated Raft storage instead of Consul.

TiKV

TiKV (the storage layer of TiDB, originally PingCAP, now CNCF) implements Multi-Raft: the keyspace is sharded into many regions (~96 MB each by default), and each region is an independent Raft group with its own log. A 100-node TiKV cluster might run hundreds of thousands of Raft groups concurrently.

The Raft implementation is in Rust (raft-rs, a port and evolution of etcd’s Go library). Region splits and merges are coordinated by a Placement Driver (PD) — itself an etcd cluster.

CockroachDB

CockroachDB is the prototypical Multi-Raft SQL database. The keyspace is split into ranges (~512 MiB each by default), each its own Raft group. Distributed SQL queries fan out to ranges, and transactions span multiple Raft groups using two-phase commit on top of Raft. Originally built on etcd’s Go Raft library, CockroachDB now uses its own fork specialized for the workload (parallel commits, quotas, latches).

Geographic distribution is a first-class concern: ranges can be configured with replicas in specific zones, and the lease holder can be pinned to minimize cross-region latency.

YugabyteDB

YugabyteDB uses Raft per tablet (analogous to CockroachDB’s per-range Raft). It exposes both a Postgres-wire SQL surface (YSQL) and a Cassandra-compatible API (YCQL) on top of the same Raft-replicated storage layer. Multi-region active-active deployments rely on Raft’s strong-consistency guarantees per tablet plus higher-level conflict resolution.

Google Spanner

Spanner (Corbett et al., 2012, OSDI) uses Multi-Paxos per tablet (Paxos group), with the global ordering provided by TrueTime — Google’s bounded-uncertainty clock service backed by GPS and atomic clocks. Each Paxos group elects a leader; the leader serializes writes for that tablet. TrueTime allows Spanner to assign timestamps that respect global causality despite clock skew, enabling external consistency at planet scale.

This is the largest-scale Paxos deployment in production: Spanner runs in every Google data center and underpins F1, Google’s primary SQL store for AdWords and many other internal services.

MongoDB

MongoDB replaced its legacy primary-elect protocol with a Raft-like “replica-set protocol version 1” in MongoDB 3.2 (December 2015), and made it the default in 3.6 (November 2017). The protocol is faithful to Raft semantics — terms, vote-once-per-term, log-up-to-date election restriction — with MongoDB-specific tweaks for chained replication (replication via the secondary closest in network distance) and read concerns.

Other notable implementations

  • NATS JetStream uses Raft for cluster metadata and stream replication. Each stream is a separate Raft group, similar to TiKV/CockroachDB’s multi-Raft approach but at a coarser granularity.
  • Apache BookKeeper uses a different model (write to a quorum of bookies, not classical consensus) but is often paired with ZooKeeper for metadata. The BookKeeper protocol is similar in spirit to chain replication but with parallel ack from a quorum.
  • FoundationDB uses Active Disk Paxos (Chand 2002) for the configuration log and a higher-throughput custom replication for the data plane. The choice reflects the workload split: rare config changes need strong consensus, while data writes need throughput.
  • MariaDB Galera Cluster uses certification-based replication with a virtual-synchrony group communication system — not Raft, not Paxos, but in the same problem space. Transactions execute optimistically on the local node, then certify against the group order at commit time. Conflicts cause rollback rather than two-phase commit.
  • Apache Pulsar uses ZooKeeper for metadata and BookKeeper for the actual log replication. Each topic partition’s messages live in a sequence of BookKeeper ledgers, with the metadata about which ledgers belong to which topic in ZooKeeper.
  • Apache Cassandra historically used a gossip + tunable consistency model rather than classical consensus, but added Paxos for lightweight transactions (LWT) — IF-NOT-EXISTS conditional inserts and the like — in Cassandra 2.0 (2013). LWT operations are orders of magnitude slower than normal writes because every operation runs a full four-round Paxos exchange.
  • Microsoft Azure Service Fabric uses a custom variant of Paxos for its ring topology. Microsoft Cosmos DB uses its own version of Paxos derivatives across regions.

Smaller-scale and embedded implementations

  • Hashicorp Nomad — uses the same Raft library as Consul and Vault for scheduling decisions.
  • Patroni — a Postgres HA tool that uses etcd, Consul, or ZooKeeper as the consensus substrate (it doesn’t implement Raft itself; it leases leadership through one of the above).
  • rqlite — a SQLite distributed via Raft, useful for embedded or single-machine HA scenarios.
  • dqlite (Canonical) — distributed SQLite via Raft, used by LXD and MicroK8s.

The diversity of implementations is itself a story: Raft has won as the default choice for new systems since 2014, but legacy Paxos deployments (Spanner, Chubby, Cosmos DB’s variants) continue to scale, and BFT protocols have a permanent niche in adversarial environments.

12. Performance characteristics

Commit latency

In all leader-based protocols (Paxos Multi-Paxos, Raft, Zab, PBFT), the steady-state commit latency for a single operation is dominated by one round-trip from the leader to the slowest member of the committing majority. For a 3-node Raft cluster, this is the slower of two follower RTTs. For a 5-node cluster, the slower of the third-fastest follower.

This explains why production clusters tend to be 3 or 5 nodes — adding more nodes does not reduce commit latency (you still need a majority) but increases the chance of a slow node.

Throughput

Throughput is bounded by the leader’s ability to drive the protocol. Two optimizations dominate real implementations:

  • Pipelining — the leader does not wait for one entry’s ack before sending the next. AppendEntries with entries=[k], then entries=[k+1], then entries=[k+2] overlap in flight. Commit index advances as acks come back in any order.
  • Batching — multiple client requests are coalesced into a single AppendEntries / accept message. This amortizes per-RPC overhead (TLS handshakes, serialization, syscalls).

etcd routinely runs tens of thousands of ops/sec on commodity hardware. CockroachDB scales linearly with the number of ranges (each range is independent), so cluster-wide throughput is millions of ops/sec at scale.

Read latency

Linearizable reads are surprisingly expensive — a naive linearizable read requires the leader to confirm it is still the leader by exchanging a round of heartbeats with a quorum (otherwise it might be a stale leader serving from cached state under partition). Optimizations:

  • Leader leases — the leader holds a time-bounded lease (extended on each heartbeat). Reads served during the lease are safe without a quorum check. CockroachDB and TiKV both use leases.
  • Read index — etcd’s mechanism: the leader records its current commit index, confirms leadership via heartbeats, then serves the read once the local state machine has applied through that index.
  • Follower reads / replica reads — relaxed-consistency reads served from followers. CockroachDB exposes these for analytical queries that tolerate staleness in exchange for parallelism.

13. Pitfalls and operational concerns

Split-brain prevention

Two leaders simultaneously serving writes is the apocalypse scenario. Raft prevents it by requiring a quorum for election (only one majority can exist at a time) and by terms (a stale leader’s appendEntries will be rejected by any node with a higher term).

However, a leader can be partitioned from a majority but still believe itself the leader, accepting writes that will never commit. The fix is leader leases plus lease epochs / fencing tokens: when a new leader takes over, it bumps an epoch number, and any subsequent write from the old leader is rejected by storage based on stale epoch. Etcd uses this; so does Spanner (epoch-based mastership).

For external services (e.g., GCS, S3) that the cluster talks to, fencing requires the external service to also reject stale epochs. ZooKeeper exposes its zxid for this purpose; Chubby exposes sequence numbers in lock claims.

Disk fsync ordering

Raft’s safety proof depends on a critical invariant: when a node acknowledges a log entry, that entry is persisted to stable storage. If the node acks before fsync and then crashes, it may re-vote on a different entry at the same index after restart — violating log matching.

This means every AppendEntries response is bounded by disk fsync latency. On consumer SSDs this is 50-200 µs; on enterprise NVMe it can be sub-50 µs; on cloud-attached storage (EBS, GP3) it can spike to milliseconds. Production etcd clusters require dedicated low-latency SSDs precisely for this reason.

The WAL (write-ahead log) structure is straightforward: append entries to a file, fsync, respond. The trickiness is in batching — multiple in-flight entries should share one fsync to amortize.

Bandwidth

Raft replicates the full log to all followers. A cluster running a large state machine (say, a database with terabytes of state) ends up shipping terabytes of log over time. Mitigations:

  • Compaction via snapshots — periodic state-machine snapshots truncate the log.
  • Incremental snapshots — ship only the diff from the follower’s current snapshot. Pebble’s snapshots in CockroachDB do this.
  • Chunked transfer — ship snapshots in MTU-sized chunks with checksumming and resumption.

For huge state, the snapshot transfer itself is the bottleneck during catch-up. CockroachDB observed snapshot transfers taking hours for slow-network deployments; the fix was rate-limiting plus prioritizing certain ranges.

Membership-change pitfalls

The classic mistake: adding three nodes at once to a three-node cluster. If the three new nodes are added through three separate single-server changes but the first one is interrupted mid-config-change, the second can apply with a wrong baseline. Always serialize membership operations through a single coordinator.

Removing the leader is also subtle: the leader must replicate its own removal, commit it, then step down. If it crashes between replication and commit, the next leader inherits a config that may or may not include the old leader, with corresponding edge cases.

Time and clocks

Raft is mostly clock-free — it uses local timers for the election timeout but does not require synchronized clocks. This is by design. However, leader leases (an optimization above bare Raft) do assume bounded clock drift. Misconfigured NTP or VM-paused clocks have caused real incidents: a leader’s lease expires while it thinks it is still valid, and a new leader gets elected, resulting in two leaders briefly. CockroachDB mitigates with a max-offset configuration that causes nodes to suicide rather than serve under suspected clock skew.

Spanner sidesteps this with TrueTime: every clock read returns an interval [earliest, latest] and operations wait out the uncertainty bound before committing. This is unique to Google’s infrastructure (GPS + atomic clocks per datacenter).

Operator dangers

  • 3-node clusters tolerating zero failures while a node is being upgraded. Standard procedure: upgrade one node at a time, wait for full re-replication, then move on.
  • Asymmetric network partitions — node A can talk to B but not C, while C can talk to B but not A. These can cause flapping elections; some Raft libraries add PreVote (Ongaro thesis) — a candidate first checks if it could win before incrementing term, preventing partitioned nodes from disrupting a healthy cluster on rejoin.
  • Disk-full conditions on the leader — fsync starts failing, the leader cannot commit, followers time out and call an election, the new leader has the same problem. Proper monitoring and capacity planning is essential.

14. Cross-references

  • [[Compute/distributed-systems-fundamentals]] — CAP theorem, FLP impossibility, partial synchrony, the broader theoretical context that consensus protocols live in.
  • [[Compute/database-internals]] — replication, WAL, snapshot semantics from the database side; how consensus protocols plug into a storage engine.
  • [[Compute/kubernetes-deep]] — etcd as the K8s control-plane store; how Raft latency affects API-server responsiveness; control-plane failure modes.
  • [[Engineering/networking-stack]] — TCP, TLS, RTT — the substrate consensus runs on.
  • [[Compute/storage-systems]] — fsync semantics, WAL design, snapshots; the disk-side correctness story.

15. Citations and further reading

Foundational papers

  • Lamport, L. (1998). “The Part-Time Parliament.” ACM Transactions on Computer Systems 16(2), 133-169. The original Paxos paper, with the Greek-island metaphor.
  • Lamport, L. (2001). “Paxos Made Simple.” ACM SIGACT News 32(4), 51-58. The simplified explanation. Required reading paired with the 1998 paper.
  • Ongaro, D., and Ousterhout, J. (2014). “In Search of an Understandable Consensus Algorithm.” USENIX Annual Technical Conference, 305-319. The Raft paper. Best-paper award.
  • Ongaro, D. (2014). “Consensus: Bridging Theory and Practice.” Stanford PhD thesis. The definitive Raft reference; covers PreVote, single-server changes, log compaction, leader transfer.
  • Junqueira, F., Reed, B., and Serafini, M. (2011). “Zab: High-performance broadcast for primary-backup systems.” DSN 2011. The Zab paper.
  • Castro, M., and Liskov, B. (1999). “Practical Byzantine Fault Tolerance.” OSDI 1999, 173-186. The PBFT paper.
  • Castro, M., and Liskov, B. (2002). “Practical Byzantine Fault Tolerance and Proactive Recovery.” ACM Transactions on Computer Systems 20(4), 398-461. The journal version with proofs and extensions.
  • Yin, M., Malkhi, D., Reiter, M., Gueta, G., and Abraham, I. (2019). “HotStuff: BFT Consensus with Linearity and Responsiveness.” PODC 2019, 347-356. The HotStuff paper.

Impossibility and theory

  • Fischer, M., Lynch, N., and Paterson, M. (1985). “Impossibility of Distributed Consensus with One Faulty Process.” Journal of the ACM 32(2), 374-382. The FLP result.
  • Dwork, C., Lynch, N., and Stockmeyer, L. (1988). “Consensus in the Presence of Partial Synchrony.” Journal of the ACM 35(2), 288-323. Partial-synchrony model; GST.
  • Lamport, L., Shostak, R., and Pease, M. (1982). “The Byzantine Generals Problem.” ACM TOPLAS 4(3), 382-401. The original Byzantine-failure formalism.
  • Schneider, F. (1990). “Implementing fault-tolerant services using the state machine approach: A tutorial.” ACM Computing Surveys 22(4), 299-319. The state-machine-replication abstraction.

Production-system papers

  • Burrows, M. (2006). “The Chubby Lock Service for Loosely-Coupled Distributed Systems.” OSDI 2006. The original Multi-Paxos production system, with operational lessons.
  • Chandra, T., Griesemer, R., and Redstone, J. (2007). “Paxos Made Live — An Engineering Perspective.” PODC 2007. The most practically-useful Paxos paper.
  • Corbett, J. et al. (2012). “Spanner: Google’s Globally-Distributed Database.” OSDI 2012. Multi-Paxos at Google scale with TrueTime.
  • Hunt, P., Konar, M., Junqueira, F., and Reed, B. (2010). “ZooKeeper: Wait-free coordination for Internet-scale systems.” USENIX ATC 2010. ZooKeeper architecture.

Blockchain and BFT extensions

  • Nakamoto, S. (2008). “Bitcoin: A Peer-to-Peer Electronic Cash System.” Whitepaper. The Nakamoto-consensus origin.
  • Buterin, V., and Griffith, V. (2017). “Casper the Friendly Finality Gadget.” arXiv:1710.09437. Ethereum’s PoS finality.
  • Kiayias, A. et al. (2017). “Ouroboros: A Provably Secure Proof-of-Stake Blockchain Protocol.” CRYPTO 2017. Cardano’s PoS.
  • Kwon, J. (2014). “Tendermint: Consensus without Mining.” Whitepaper. Cosmos’s BFT-SMR protocol.
  • Chen, J., and Micali, S. (2017). “Algorand: A secure and efficient distributed ledger.” Theoretical Computer Science. Algorand’s VRF-based protocol.

Implementation references

  • etcd Raft library — github.com/etcd-io/raft (formerly go.etcd.io/etcd/raft).
  • HashiCorp Raft library — github.com/hashicorp/raft. Used by Consul, Vault, Nomad.
  • TiKV raft-rsgithub.com/tikv/raft-rs. Rust port of etcd’s Raft.
  • Kafka KIP-500 — the proposal to remove ZooKeeper and adopt KRaft. Background for understanding modern Kafka.

Operational and pedagogical resources

  • Heidi Howard’s blog (hh360.github.io) — modern Paxos / Raft research, including Flexible Paxos.
  • Murat Demirbas’s blog (muratbuffalo.blogspot.com) — clear write-ups of distributed-consensus papers as they come out.
  • Aleksey Charapko’s blog (charap.co) — consensus protocols and replication research, including Compartmentalized Paxos and EPaxos.
  • The Raft Refloated paper (Howard, Schwarzkopf, Madhavapeddy, Crowcroft 2015) — an OCaml reimplementation that surfaced subtleties in the original Raft description.