Consistency Models — Cross-System Comparison
This note compares the consistency model offered by every database, datastore, and distributed-data system referenced in the Compute library. Each section pins the model on a single axis — linearizability ↓ eventual — then maps real systems to it. Use the tables to look up “what does System X actually promise”; use the decision tree to pick a model for a new workload.
See also
- database-internals
- databases-internals-deep
- distributed-systems-fundamentals
- consensus-protocols
- crdts-and-distributed-data-types
- sql-nosql-design
- database-engine-taxonomy
1. The consistency hierarchy
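Strongest to weakest:
linearizable → sequential → causal+ → causal → RYW → MR → MW → SE → eventual
| Model | Example systems |
|---|---|
| linearizable | Spanner, Calvin, etcd/Raft, ZooKeeper (writes; reads are sequential) |
| sequential | traditional single-node in-mem DB |
| causal+ | COPS, Eiger |
| causal | Bayou |
| RYW / MR / MW | session guarantees: read-your-writes sessions, monotonic reads, monotonic writes |
| SE (strong-eventual) | CRDTs, Riak |
| eventual | DynamoDB, Cassandra (default) |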
- Linearizable (Herlihy-Wing 1990) — each operation appears to take effect atomically at some instant between its invocation and its response, so the global order respects real time. Strongest.
- Sequential (Lamport 1979) — operations appear in some total order, agreed by all processes; does not need to be real-time.
- Causal+ (Lloyd-Freedman-Kaminsky-Andersen, COPS, SOSP 2011) — causal consistency + convergent conflict handling. The strongest model achievable under partition + low latency.
- Causal (Hutto-Ahamad 1990) — operations that are causally related are seen in causal order; concurrent ops can be observed in different orders.
- Read-your-writes (RYW) — a process always sees its own writes (session model).
- Monotonic read — successive reads see ≥ previous read’s version.
- Monotonic write — process’s own writes are applied in issue-order.
- Strong-eventual (SEC) (Shapiro-Preguiça-Baquero-Zawirski 2011) — convergent state given any delivery order (CRDT-style).
- Eventual (Vogels 2008 ACM Queue) — if no new writes, all replicas converge.
- PRAM / FIFO (Pipelined RAM, Lipton-Sandberg 1988) — writes by one process seen in issue order by all others; concurrent writes can interleave.
2. CAP, PACELC — the framing
CAP (Brewer 2000 PODC keynote; Gilbert-Lynch 2002 proof) — under network partition, a system must choose C (consistency = linearizability) or A (availability = every request answered). Never both.
PACELC (Daniel Abadi 2010) — extends CAP. If a Partition: A vs C. Else (no partition): Latency vs Consistency. The L vs C tradeoff is the practical one — partitions are rare; latency-vs-consistency is paid every single operation.
| System | Partition behavior | Else behavior | Classification |
|---|---|---|---|
| Spanner | CP | EC (commit wait adds write latency; stale reads are opt-in) | PC/EC |
| Cassandra (CL=ONE) | AP | EL (QUORUM reads + writes trade latency and availability for consistency) | PA/EL |
| DynamoDB (strong) | CP | EC | PC/EC |
| DynamoDB (eventual) | AP | EL | PA/EL |
| MongoDB (default w=majority) | CP-ish (configurable) | EC or EL depending on readConcern | PC/EC |
| Cosmos DB | CP/AP per consistency level | configurable (5 levels) | configurable |
| FoundationDB | CP (strict serializable) | EC | PC/EC |
| CockroachDB | CP (serializable) | EC | PC/EC |
| TiDB | CP (snapshot isolation by default) | EC | PC/EC |
| YugabyteDB | CP | EC | PC/EC |
| Riak | AP | EL | PA/EL |
| Aerospike | AP (default) or CP (strong-consistency mode) | EL (AP mode) or EC (SC mode) | PA/EL or PC/EC |
| Redis (single-node) | n/a | n/a (linearizable single-instance) | linearizable |
| Redis Cluster | AP | EL (async replication) | PA/EL |
| etcd (Raft) | CP | EC (linearizable reads) | PC/EC |
| ZooKeeper (ZAB) | CP | EC (linearizable writes, sequentially consistent reads) | PC/EC |
3. Database family → consistency model
Mapping every system in database-internals and database-engine-taxonomy.
3.1 Relational OLTP
| System | Default isolation | Strictest available | Replication mode | Notes |
|---|---|---|---|---|
| PostgreSQL | Read Committed | Serializable (SSI; Cahill-Röhm-Fekete 2008, PostgreSQL implementation by Ports-Grittner 2012) | sync (synchronous_commit) or async streaming | SSI is the gold standard for predicate-aware serializability |
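| MySQL InnoDB | REPEATABLE READ (snapshot reads; locking reads and DML see the latest committed rows, so mixed transactions can observe phantom-like rows) | SERIALIZABLE | async binlog, semi-sync, group replication (MGR, since 5.7.17) | InnoDB’s RR is MVCC w/ snapshot at first read |
| MySQL Galera (MariaDB Galera Cluster) | REPEATABLE READ per node | roughly snapshot isolation across the cluster (first-committer-wins certification); SERIALIZABLE is not enforced cluster-wide | synchronous certification | virtually synchronous; group-commit |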
| SQL Server | Read Committed (default) | Serializable | always-on availability groups (sync/async) | also has Snapshot isolation |
| Oracle | Read Committed | Serializable | Active Data Guard (sync/async) | does NOT have true SERIALIZABLE (SSI-like) historically; only snapshot |
| SQLite | Serializable (one writer at a time per database file) | Serializable | n/a (embedded) | writes are serialized by file locking; WAL mode lets readers proceed during a write |
| Aurora (Postgres / MySQL) | inherits engine | inherits engine | 6 storage copies across 3 AZs; write quorum 4/6, read quorum 3/6 | storage layer is decoupled; compute layer single-master (until Aurora DSQL 2024 multi-region multi-master) |
3.2 NewSQL (geo-distributed)
| System | Default isolation | Replication | Mechanism |
|---|---|---|---|
| Google Spanner | external consistency (strict serializability) | Paxos across replicas | TrueTime (Marzullo-style intervals from GPS + atomic clocks; ε of roughly 1–7 ms in the 2012 paper) |
| Google Spanner stale reads | bounded staleness (configurable) | Paxos | TrueTime + read timestamp |
| CockroachDB | serializable | multi-Raft per range | HLC (hybrid logical clock, Kulkarni-Demirbas-Madappa 2014) |
| YugabyteDB | serializable + snapshot isolation | Raft per tablet | HLC-based |
| TiDB | snapshot isolation (default) | multi-Raft per region | PD-issued global timestamp via TSO oracle |
| FoundationDB | strict serializable | Paxos + state-machine replication | resolver-coordinated transaction read-version |
| Calvin (Yale, Thomson-Diamos-Weng-Ren-Shao-Abadi 2012) | strict serializable | deterministic order across replicas | sequencer pre-orders before execution |
| Amazon Aurora DSQL (re:Invent 2024) | strong snapshot isolation (PostgreSQL Repeatable Read) | active-active multi-region | journal-based commit with optimistic concurrency control |
3.3 Key-value / wide-column
| System | Default | Strongest available | Replication |
|---|---|---|---|
| DynamoDB | eventually consistent reads | strongly consistent reads (per item, within one Region) | replicated across three AZs; TransactWriteItems / TransactGetItems for atomic multi-item operations |
| Cassandra | tunable (ONE/QUORUM/ALL) | QUORUM read+write → quorum-consistency | gossip + LWT (Paxos-based) for compare-and-set |
| ScyllaDB | tunable (Cassandra-compatible) | QUORUM, LWT | seastar-based, low-latency Cassandra clone |
| Riak KV | tunable (R+W>N for “strong”) | eventual; with Dotted Version Vectors (Almeida et al 2014) | active anti-entropy; AAE merkle trees |
| Aerospike (default AP mode) | eventual | per-record consistency | replication factor; strong-consistency mode (SC) since 4.0 (Paxos) |
| Aerospike (SC mode) | session consistency per record (default SC read mode) | linearizable single-record read/write | Roster-based; sync replication |
| etcd | linearizable reads (default) + lease-based watches | linearizable | Raft |
| ZooKeeper | sequential consistency | sequential + linearizable writes | ZAB |
| Hazelcast IMap | tunable (read-from-backup vs leader) | linearizable | CP subsystem (Raft) since 3.12 |
| Apache HBase | strong per-row, eventual cross-row | strong per-row | HDFS-backed; HMaster + RegionServer |
| Bigtable | strong per-row | strong per-row + atomic | Colossus-backed |
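The R+W>N rule in the Riak row, and QUORUM reads plus writes in Cassandra, rest on set arithmetic: if every read quorum intersects every write quorum, a read always contacts at least one replica that acknowledged the latest write. The Python sketch below checks that claim by brute force for small clusters. The function names are illustrative and do not come from any client library. Overlap alone is not linearizability; that still needs ordering, read repair, or a consensus round such as LWT.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Enumerate every read set of size r and write set of size w
    over n replicas and check that each pair shares a replica."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


def overlap_rule(n: int, r: int, w: int) -> bool:
    """The closed-form rule from the table: R + W > N."""
    return r + w > n


# N=3 with QUORUM (2) on both sides overlaps; ONE/ONE does not.
print(quorums_always_overlap(3, 2, 2))  # True
print(quorums_always_overlap(3, 1, 1))  # False
```

The brute-force check and the closed-form rule agree for every small configuration, which is why tuning R and W is enough to move a Dynamo-style store along the consistency/latency axis.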
3.4 Document
| System | Default consistency | Sessions / causal | Notes |
|---|---|---|---|
| MongoDB | majority writeConcern (default ≥ 5.0) | causal-consistency session w/ afterClusterTime | logical clocks (cluster time) |
| Couchbase | tunable (per-bucket durability levels: NONE/MAJORITY/MAJORITY-AND-PERSIST-ACTIVE/PERSIST-MAJORITY) | session consistency via N1QL hints | RYW per session |
| CouchDB | eventual | n/a | revision-tree merge resolution |
| AWS DocumentDB | strong (Aurora-like storage) | n/a | distinct from Mongo despite wire-compat |
3.5 Time-series
| System | Consistency | Notes |
|---|---|---|
| InfluxDB | eventual (open source) | tags + fields + retention buckets |
| InfluxDB Enterprise / Cloud | tunable | hinted handoff anti-entropy |
| TimescaleDB | inherits Postgres (Serializable available) | hypertables on Postgres |
| Prometheus | local-only (no replication) | scrape-based; Thanos / Cortex / Mimir add replication |
| Mimir / Thanos / Cortex | configurable | typically eventual via S3 backing |
| Druid | eventual (segment-level) | indexer + historical pattern |
| ClickHouse | tunable; default async; insert_quorum setting for sync | RANS / Atomic / Replicated engines |
3.6 Graph
| System | Consistency | Notes |
|---|---|---|
| Neo4j | read committed by default on a single instance; causal cluster adds RYW via bookmarks | Raft for causal clustering |
| TigerGraph | strong per-machine, eventual cluster | partition-tolerant by design |
| ArangoDB | tunable | RocksDB engine; resilient single-server replication |
| JanusGraph | depends on backend (Cassandra → tunable; HBase → strong per-row; Bigtable → strong per-row) | |
| Amazon Neptune | strong reads (read replicas eventual w/ option for read-after-write) | shared storage like Aurora |
3.7 Vector
| System | Consistency | Notes |
|---|---|---|
| Pinecone | strong index reads; freshness window for writes | “freshness ~30s by default” |
| Weaviate | tunable, similar to Cassandra (consistency-level per operation) | Raft-based metadata, async vector replication |
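| Milvus | bounded staleness (default level Bounded); Strong, Session, and Eventually also selectable | consistency level chosen per collection or per request |
| Qdrant | eventual; configurable | Raft for cluster metadata; point writes replicated with a configurable write consistency factor |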
| pgvector (in Postgres) | inherits Postgres (Serializable available) | depends on cluster setup |
| Chroma | single-node (trivially linearizable) | cluster mode emerging |
3.8 In-memory caches
| System | Consistency | Notes |
|---|---|---|
| Redis (standalone) | linearizable single-instance | single-threaded command loop |
| Redis Cluster | AP / eventual | async replication; built-in replica promotion on failure (Sentinel serves non-clustered deployments) |
| Redis Enterprise | tunable; Active-Active with CRDT semantics | per-key CRDT; “EAC” (eventual active-active consistency) |
| Memcached | best-effort; client-side hashing | no replication |
| Hazelcast (CP subsystem) | linearizable | Raft |
3.9 Streaming / log
| System | Consistency | Notes |
|---|---|---|
| Kafka | per-partition total order; configurable producer acks (0, 1, or all; -1 is an alias for all) | ISR-based; min.insync.replicas for durability |
| Kafka transactional | exactly-once across partitions | producer ID + epoch; idempotent producers |
| Pulsar | per-partition strong; cross-partition eventual | BookKeeper-backed ledger w/ quorum write |
| RabbitMQ Streams | per-stream strong | replicated log (streams since 3.9); quorum queues use Raft (since 3.8) and replace classic mirrored queues |
| AWS Kinesis | per-shard order | shard iterator; trimming horizon |
| Redpanda | Kafka-compatible; Raft per partition | low-latency engine, single binary |
4. CRDTs — the strong-eventual family
crdts-and-distributed-data-types details these. CRDTs achieve strong-eventual consistency by structuring operations to be:
- Commutative — order doesn’t matter (e.g., set-union, max counter).
- Associative — grouping doesn’t matter.
- Idempotent — applying twice = once.
This means any delivery order converges to the same state. No coordination, no consensus — but the data type’s algebra is constrained.
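The three properties are easiest to see on a state-based G-Counter, where merge is an element-wise max over per-replica counts. The following is a minimal Python sketch; the class and method names are illustrative and are not Riak's or any other library's API.

```python
class GCounter:
    """Grow-only counter: one monotonically increasing slot per replica."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, replica_id, amount=1):
        if amount < 1:
            raise ValueError("a G-Counter only grows")
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Element-wise max: commutative, associative, and idempotent."""
        keys = self.counts.keys() | other.counts.keys()
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )

    def __eq__(self, other):
        return isinstance(other, GCounter) and self.counts == other.counts


a, b, c = GCounter(), GCounter(), GCounter()
a.increment("a", 2)
b.increment("b", 3)
c.increment("c", 1)

print(a.merge(b) == b.merge(a))                    # True (commutative)
print(a.merge(b).merge(c) == a.merge(b.merge(c)))  # True (associative)
print(a.merge(a) == a)                             # True (idempotent)
print(a.merge(b).merge(c).value())                 # 6
```

Because max has all three properties, replicas can exchange state in any order, any number of times, and still agree on the total.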
| CRDT | Use | Where it ships |
|---|---|---|
| G-Counter (grow-only) | counters | Riak |
| PN-Counter (positive + negative) | counters supporting decrement | Riak, Redis Enterprise |
| LWW-Register | last-write-wins single value | Cassandra columns, Cosmos DB multi-region writes (default conflict resolution) |
| Multi-Value Register | concurrent writes preserved | Riak’s MVRegister |
| OR-Set (observed-remove set) | sets w/ add + remove | Riak, Akka Distributed Data, Redis Enterprise |
| LWW-Element-Set | set w/ add + remove timestamp tie-break | Riak |
| RGA (Replicated Growable Array) | ordered list | collaborative text (Yjs, Automerge) |
| Logoot / LSEQ | ordered list | collaborative text |
| Yjs Y.Doc | tree-of-CRDTs | Liveblocks, Tldraw, Tiptap, JupyterLab |
| Automerge | document JSON CRDT | Local-first apps |
| Causal-tree (Grishchenko) | text editing | early CRDT text work |
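Many modern collaborative apps (Figma, Linear, Notion-text, Replit, Tldraw) use CRDTs or CRDT-inspired sync designs for offline-first / multi-user editing; the details vary, and Figma, for example, describes its multiplayer system as CRDT-inspired with a central server. The flip side is that CRDTs cannot enforce global invariants: you cannot keep “bank balance ≥ 0” in a pure CRDT without coordination.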
5. The “session” subset — practical mid-tier
Session models (Terry et al, “Session Guarantees for Weakly Consistent Replicated Data” 1994) include:
- Read-your-writes (RYW) — every read sees the requester’s own previous writes.
- Monotonic reads (MR) — successive reads of the same data never go backwards in time.
- Monotonic writes (MW) — writes from the same session apply in issue order.
- Writes-follow-reads (WFR) — a session’s write is ordered after every write whose effect that session has already read.
Each guarantee on its own is weaker than causal consistency (all four combined come close to it within a session), and in practice they are sufficient for most user-facing UIs: a user types, sees their own change, refreshes, and still sees it. MongoDB’s “causal consistency” (introduced 3.6, 2017) and Cosmos DB’s “session” consistency are session models in Terry’s sense.
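One common way to get read-your-writes and monotonic reads is for the client session to carry a version floor and to refuse replicas that are behind it. The Python sketch below assumes a single global version counter per replica, which is a deliberate simplification; Replica and Session are hypothetical names, and real systems use session tokens or cluster times instead.

```python
class Replica:
    """A replica that has applied all writes up to `version`."""

    def __init__(self):
        self.version = 0
        self.data = {}

    def apply(self, version, key, value):
        self.data[key] = value
        self.version = version


class Session:
    """Tracks the highest version this session has written or read."""

    def __init__(self):
        self.floor = 0

    def write(self, primary, key, value):
        new_version = primary.version + 1
        primary.apply(new_version, key, value)
        self.floor = max(self.floor, new_version)  # read-your-writes

    def read(self, replicas, key):
        for replica in replicas:
            if replica.version >= self.floor:
                self.floor = replica.version       # monotonic reads
                return replica.data.get(key)
        raise RuntimeError("no replica is fresh enough for this session")


primary, lagging = Replica(), Replica()
alice, bob = Session(), Session()

alice.write(primary, "name", "Ada")
print(alice.read([lagging, primary], "name"))  # Ada (lagging replica skipped)
print(bob.read([lagging, primary], "name"))    # None (bob has no floor yet)
```

The guarantee is per session: bob may still see stale data, which is exactly the gap between session models and linearizability.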
6. Postgres isolation levels — the SQL reality
Postgres accepts all four SQL-standard levels but implements only three distinct behaviors:
| Requested | Actually delivered |
|---|---|
| Read Uncommitted | Read Committed (Postgres has no dirty reads anyway) |
| Read Committed | Read Committed (default) |
| Repeatable Read | Snapshot Isolation |
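| Serializable | Serializable Snapshot Isolation (SSI; Cahill-Röhm-Fekete 2008, PostgreSQL implementation by Ports-Grittner 2012) |
synchronous_commit = off/local/remote_write/on/remote_apply tunes the durability/latency trade-off; remote_apply is the strongest synchronous-replication mode, because the commit returns only after the synchronous standbys have replayed it and made it visible to their queries.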
7. MySQL InnoDB isolation — the practical gotcha
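InnoDB’s “REPEATABLE READ” is snapshot-based for plain SELECTs, but locking reads (SELECT…FOR UPDATE) and UPDATE/DELETE operate on the latest committed rows. A transaction that mixes the two can therefore see rows its snapshot did not contain (phantom-like behavior) and can lose updates. The actually-serializable mode is SERIALIZABLE, which turns plain reads into shared-lock reads (high contention). Most MySQL applications run at REPEATABLE READ.
Galera Cluster (MariaDB Galera Cluster) provides virtually synchronous replication via certification: at commit, each transaction’s write set is broadcast to all nodes and certified against concurrent write sets; if certification fails, the transaction aborts. Because only write sets are certified, the cluster-wide guarantee is close to snapshot isolation (first committer wins), not serializability.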
8. Consensus protocols — the substrate
| Protocol | Used by | Notes |
|---|---|---|
| Paxos (Lamport 1998) | foundation for Multi-Paxos deployments (Spanner, Chubby) | hard to implement correctly |
| Multi-Paxos | Spanner, Chubby, Cassandra LWT | optimization over single-decree Paxos |
| Raft (Ongaro-Ousterhout 2014) | etcd, Consul, CockroachDB, YugabyteDB, TiKV, MongoDB replica sets, Aerospike SC, NATS JetStream, ScyllaDB Raft (since 5.0) | designed for understandability |
| ZAB (ZooKeeper Atomic Broadcast) | ZooKeeper | totally ordered broadcast; ZooKeeper exposes linearizable writes and sequentially consistent reads |
| EPaxos (Egalitarian Paxos, Moraru-Andersen-Kaminsky 2013) | research / experimental | leaderless, lower latency in WAN |
| Flexible Paxos (Howard-Schwarzkopf-Madhavapeddy-Crowcroft 2017) | derivative tooling | weakens quorum size requirements |
| Calvin (Yale 2012) | Calvin DB | sequencer pre-orders, deterministic execution |
| Tendermint BFT (Buchman 2016) | Cosmos (the blockchain ecosystem, not Cosmos DB), blockchain | PBFT-derived Byzantine-fault-tolerant consensus |
| PBFT (Castro-Liskov 1999) | Hyperledger Fabric v0.6, some blockchain | f Byzantine faults tolerated out of 3f+1 |
| HotStuff (Yin-Malkhi-Reiter-Gueta-Abraham 2019) | Diem (formerly Libra), several blockchain | linear view-change cost |
consensus-protocols details each.
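The cluster-size arithmetic behind the table is short enough to write down. Crash-fault protocols such as Paxos and Raft need 2f+1 nodes to survive f failures, while Byzantine protocols such as PBFT need 3f+1. The helper names in this Python sketch are illustrative only.

```python
def majority(n: int) -> int:
    """Smallest quorum in which any two quorums of n nodes intersect."""
    return n // 2 + 1


def tolerated_crash_faults(n: int) -> int:
    """Paxos/Raft: n = 2f + 1, so f = (n - 1) // 2."""
    return (n - 1) // 2


def tolerated_byzantine_faults(n: int) -> int:
    """PBFT-style: n = 3f + 1, so f = (n - 1) // 3."""
    return (n - 1) // 3


for n in (3, 4, 5, 7):
    print(n, majority(n), tolerated_crash_faults(n), tolerated_byzantine_faults(n))
# 3 2 1 0
# 4 3 1 1
# 5 3 2 1
# 7 4 3 2
```

Note that a fourth node adds no crash tolerance over three, which is why Raft and Paxos clusters are usually sized at odd numbers.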
9. The “external consistency” guarantee (Spanner)
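Spanner gives external consistency, which is strict serializability: transactions are serializable, and if T1 commits before T2 begins (in real time), then T1’s commit timestamp is smaller than T2’s and T1’s effects are visible at T2’s snapshot. It is linearizability lifted to multi-object transactions, and it is stronger than plain serializability, which permits a serial order that contradicts real time.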
The trick: TrueTime, a clock API that returns an interval guaranteed to contain true time, with bounded uncertainty ε (up to ~7 ms in Google’s datacenters via GPS + atomic clocks). A commit waits out the uncertainty window so that its timestamp is guaranteed to be in the past before the commit becomes visible. ε is the price; the win is that read-only transactions can run lock-free at a snapshot timestamp.
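Commit wait can be sketched against a simulated clock. The Python below is not Spanner's implementation: SimulatedTrueTime is a hypothetical stand-in whose now() returns an interval of half-width ε around a manually advanced true time, and commit() picks the interval's upper bound as the timestamp and then waits until the lower bound has passed it.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Hypothetical clock: true time is advanced by hand, error is +/- epsilon."""

    def __init__(self, epsilon_ms: float):
        self.epsilon_ms = epsilon_ms
        self.true_now_ms = 0.0

    def now(self) -> Interval:
        return Interval(self.true_now_ms - self.epsilon_ms,
                        self.true_now_ms + self.epsilon_ms)

    def advance(self, ms: float) -> None:
        self.true_now_ms += ms


def commit(tt: SimulatedTrueTime, step_ms: float = 1.0):
    """Choose a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = tt.now().latest
    waited_ms = 0.0
    while tt.now().earliest <= commit_ts:  # timestamp might still be in the future
        tt.advance(step_ms)
        waited_ms += step_ms
    return commit_ts, waited_ms


tt = SimulatedTrueTime(epsilon_ms=7.0)
first_ts, waited = commit(tt)
second_ts, _ = commit(tt)
print(first_ts, waited)      # 7.0 15.0
print(second_ts > first_ts)  # True
```

The wait is roughly 2ε, which is why shrinking clock uncertainty translates directly into lower write latency; any transaction that starts after the wait completes necessarily picks a larger timestamp.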
Spanner-style systems without TrueTime use Hybrid Logical Clocks (HLC, Kulkarni-Demirbas-Madappa 2014), which combine physical NTP time with a Lamport-counter tiebreaker. CockroachDB, YugabyteDB, and MongoDB use HLC variants; TiDB/TiKV instead takes timestamps from a central oracle (PD’s TSO).
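A Python sketch of the HLC update rules follows. It tracks a pair (l, c): l is the largest physical time seen so far and c is a counter that breaks ties when l does not advance. The class name and the injected physical_clock callable are assumptions for illustration; production systems pack the pair into a single integer and bound the allowed clock skew.

```python
class HybridLogicalClock:
    def __init__(self, physical_clock):
        self._physical_clock = physical_clock  # callable returning an int
        self.l = 0  # max physical time observed
        self.c = 0  # logical counter for ties on l

    def now(self):
        """Timestamp a local or send event."""
        prev = self.l
        self.l = max(prev, self._physical_clock())
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def update(self, remote):
        """Merge the timestamp carried by a received message."""
        remote_l, remote_c = remote
        prev = self.l
        self.l = max(prev, remote_l, self._physical_clock())
        if self.l == prev and self.l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0
        return (self.l, self.c)


wall_a, wall_b = [10], [5]           # node B's wall clock lags node A's
a = HybridLogicalClock(lambda: wall_a[0])
b = HybridLogicalClock(lambda: wall_b[0])

sent = a.now()                       # (10, 0)
received = b.update(sent)            # (10, 1): ordered after the send
later = b.now()                      # (10, 2): still monotonic on B
print(sent < received < later)       # True
```

Tuple comparison gives the ordering: causally later events always get larger timestamps even when the receiving node's wall clock is behind.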
10. Decision tree — pick a consistency model
What's the workload?
├─ Financial transactions, ledger, bank account, inventory
│ → Linearizable / external consistency
│ → Spanner, CockroachDB, FoundationDB, Aurora DSQL
│ → Trade off: 5–50 ms write latency in geo-distributed setup
├─ E-commerce cart / order placement (single region)
│ → Serializable (SSI in Postgres, lock-based in InnoDB)
│ → Postgres, MySQL InnoDB SERIALIZABLE
│ → 1–5 ms writes
├─ Session state, shopping cart (cross-region, low-latency)
│ → Read-your-writes (session model)
│ → Cosmos DB Session, MongoDB causal session, DynamoDB strong+sessioned
├─ Collaborative document editor (Figma, Notion, Linear)
│ → Strong-eventual via CRDT
│ → Yjs, Automerge, Liveblocks
│ → No global invariants — conflicts merge automatically
├─ Social-feed timeline
│ → Eventual
│ → Cassandra, DynamoDB eventual, MongoDB read=local
│ → Sub-ms reads in exchange for stale data window
├─ Time-series ingest (metrics, logs)
│ → Eventual; per-series order
│ → InfluxDB, Prometheus, ClickHouse, TimescaleDB
├─ Distributed lock / leader-election / config
│ → Linearizable
│ → etcd, ZooKeeper, Hazelcast CP subsystem, Consul
├─ Event log / event sourcing source-of-truth
│ → Per-partition total order
│ → Kafka, Pulsar, Kinesis, Redpanda
│ → Use exactly-once semantics if cross-partition matters
├─ Caching layer
│ → Eventual (stale OK by definition)
│ → Redis Cluster, Memcached
├─ Vector search (RAG, embeddings)
│ → Eventual with freshness window
│ → Pinecone, Weaviate, Milvus, Qdrant, pgvector
├─ Multi-region with strong needs
│ → External consistency, accept higher latency
│ → Spanner, CockroachDB, Aurora DSQL
└─ Multi-region with low-latency needs
→ Causal / RYW + per-region linearizable
→ MongoDB causal session, Cosmos DB Bounded Staleness
→ DynamoDB Global Tables (LWW eventual)
11. Anti-patterns — the four common mistakes
- Mistaking InnoDB “REPEATABLE READ” for a full snapshot guarantee — locking reads and DML see the latest committed rows, so phantom-like rows and lost updates are possible. Use SERIALIZABLE or lock explicitly with SELECT…FOR UPDATE.
- Trusting DynamoDB “strongly consistent” reads across partitions — strong consistency is per-item only. Cross-item ACID requires TransactWriteItems/TransactGetItems.
- Assuming MongoDB writes are durable on default config — historically w:1 was the default; only w:majority is durable across the replica set. As of MongoDB 5.0, the default is w:majority, but check.
- Using CRDTs for invariants that need coordination — bank balance ≥ 0, unique username, exactly-once seat reservation. CRDTs converge; they do not coordinate. You need a consensus protocol on top.
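The CRDT anti-pattern can be reproduced in a few lines. In this Python sketch (PNCounter and withdraw are illustrative names, not a library API), two replicas each pass a local “balance ≥ amount” check, withdraw concurrently, and then merge into a negative balance.

```python
class PNCounter:
    """State-based PN-Counter: per-replica increment and decrement totals."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.increments = {}
        self.decrements = {}

    def increment(self, amount):
        self.increments[self.replica_id] = self.increments.get(self.replica_id, 0) + amount

    def decrement(self, amount):
        self.decrements[self.replica_id] = self.decrements.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.increments.values()) - sum(self.decrements.values())

    def merge(self, other):
        """Element-wise max on both maps, applied in place."""
        for mine, theirs in ((self.increments, other.increments),
                             (self.decrements, other.decrements)):
            for replica_id, count in theirs.items():
                mine[replica_id] = max(mine.get(replica_id, 0), count)


def withdraw(account, amount):
    """The invariant check only sees this replica's local state."""
    if account.value() < amount:
        return False
    account.decrement(amount)
    return True


def concurrent_withdrawals():
    a, b = PNCounter("a"), PNCounter("b")
    a.increment(100)
    b.merge(a)                       # both replicas see a balance of 100
    ok_a = withdraw(a, 80)           # passes locally
    ok_b = withdraw(b, 80)           # also passes locally, concurrently
    a.merge(b)
    b.merge(a)
    return ok_a, ok_b, a.value(), b.value()


print(concurrent_withdrawals())  # (True, True, -60, -60)
```

Both replicas converge, as a CRDT promises, but they converge on a state that violates the invariant. Preventing it requires a coordinated decision, for example routing withdrawals through a single leader or a consensus group.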
12. The 2024–2026 frontier
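- Spanner-class on commodity — CockroachDB 24.x, YugabyteDB 2024+, TiDB 8.x, and FoundationDB (Apple) ship strongly consistent distributed transactions on commodity hardware without TrueTime, using HLC (CockroachDB, YugabyteDB) or centrally issued timestamps (TiDB’s TSO, FoundationDB’s sequencer). Guarantees differ: TiDB defaults to snapshot isolation rather than serializability.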
- Aurora DSQL (re:Invent 2024) — active-active multi-region Postgres with strong consistency, journal-based.
- Distributed SQLite (rqlite, Turso/libSQL, Litestream) — embedded DB scaled out via Raft (rqlite) or log replication (Turso libSQL Server).
- ScyllaDB Raft tables (5.x) — tables can now be strongly consistent (Raft-backed) alongside Cassandra-style eventual tables.
- Postgres logical replication for active-active — pglogical, Bucardo, BDR (2ndQuadrant / EDB), now Postgres 16 logical decoding improvements.
- PostgreSQL serializable snapshot isolation (SSI) — production-proven at scale (Heroku Postgres, Crunchy Bridge, etc.).
- CRDTs in production — Figma’s multiplayer system (CRDT-inspired, with a central server), Liveblocks (Yjs as a service, ~$25M Series A 2024), Replicache (sync layer for local-first), Replit’s collaborative editor.
Adjacent
- Math foundations — markov-chains-and-hmm for distributed clock theory; probability-fundamentals for failure-rate modeling underlying availability.
- Cryptography — cryptography-fundamentals for the digital-signature primitives underlying BFT consensus.
- Distributed systems theory — distributed-systems-fundamentals for FLP impossibility, async/sync model, consensus theorems.
- Storage engines — database-internals / databases-internals-deep / database-engine-taxonomy for the engines underneath.
- Service architecture — _compare_service-architectures for how service decomposition interacts with consistency boundaries.
- CRDTs — crdts-and-distributed-data-types for full coverage.
- Microservices — microservices-patterns for saga, event sourcing, outbox, distributed transactions across services.
When to pick what
The fastest narrowing: money / inventory → linearizable; session UI state → causal or RYW; collaborative editing → CRDT (strong-eventual); feed / cache / metrics → eventual; locks / config → linearizable consensus (etcd/ZK). Geography flips the dial: multi-region linearizable pays 50+ ms per write (Spanner, CockroachDB across continents), and regional + eventual cross-region is the most common modern pattern. The most commonly overlooked risk is the default isolation level, so pick it deliberately, document it, and test under partition with chaos engineering (Jepsen-style tests; Kyle Kingsbury’s (Aphyr) analyses cover many of the systems above).