Consistency Models — Cross-System Comparison

This note compares the consistency models offered by the databases, datastores, and distributed-data systems referenced in the Compute library. Each section places a model on a single axis, from linearizable down to eventual, then maps real systems to it. Use the tables to look up “what does System X actually promise”; use the decision tree to pick a model for a new workload.

1. The consistency hierarchy

strongest                                                              weakest
  |                                                                       |
linearizable → sequential → causal+ → causal → RYW → MR → MW → SE → eventual
  Spanner       traditional   COPS    Bayou   sessions  monotonic  CRDTs   DynamoDB
  Calvin        single-node   Eiger             reads    writes    Riak    Cassandra
  etcd/Raft     in-mem DB                                                  (default)
  ZooKeeper
  • Linearizable (Herlihy-Wing 1990): every operation appears to take effect atomically at some instant between its invocation and its response, so the resulting order respects real time. Strongest.
  • Sequential (Lamport 1979): operations appear in some total order agreed by all processes and consistent with each process’s program order; the order need not match real time.
  • Causal+ (Lloyd-Freedman-Kaminsky-Andersen, COPS, SOSP 2011): causal consistency plus convergent conflict handling. The COPS authors argue it is the strongest model achievable while staying available under partition with low latency.
  • Causal (Hutto-Ahamad 1990): operations that are causally related are seen in causal order; concurrent ops can be observed in different orders.
  • Read-your-writes (RYW): a process always sees its own writes (session model).
  • Monotonic read: successive reads by a process never return a version older than one it has already read.
  • Monotonic write: a process’s own writes are applied in issue order.
  • Strong-eventual (SEC) (Shapiro-Preguiça-Baquero-Zawirski 2011): replicas that have received the same set of updates are in the same state, whatever the delivery order (CRDT-style).
  • Eventual (Vogels 2008 ACM Queue): if no new writes arrive, all replicas eventually converge.
  • PRAM / FIFO (Pipelined RAM, Lipton-Sandberg 1988): writes by one process are seen in issue order by all others; writes from different processes may be seen in different orders. It sits below causal and is not shown on the axis above.
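
The gap between the two strongest models is easiest to see on a tiny history. The following is a minimal sketch, assuming a single shared register with initial value 0 and processes that issue one operation at a time; the `Op` record and the checker names are illustrative, and the brute-force search over permutations is only practical for a handful of operations.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    proc: str     # issuing process
    kind: str     # "w" for write, "r" for read
    value: int    # value written, or value the read returned
    start: float  # invocation time
    end: float    # response time


def _legal(order, initial=0):
    """Every read returns the value of the most recent preceding write."""
    current = initial
    for op in order:
        if op.kind == "w":
            current = op.value
        elif op.value != current:
            return False
    return True


def _program_order(order):
    """Each process's operations appear in the order it issued them."""
    last_start = {}
    for op in order:
        if op.start < last_start.get(op.proc, float("-inf")):
            return False
        last_start[op.proc] = op.start
    return True


def _real_time(order):
    """If one op finished before another started, it must come first."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def is_sequentially_consistent(history):
    return any(_legal(p) and _program_order(p) for p in permutations(history))


def is_linearizable(history):
    return any(_legal(p) and _real_time(p) for p in permutations(history))


# P1's write completes at t=1; P2 then reads at t=2..3 and still sees 0.
stale_read = [Op("P1", "w", 1, 0, 1), Op("P2", "r", 0, 2, 3)]
print(is_sequentially_consistent(stale_read), is_linearizable(stale_read))
```

The stale read is acceptable under sequential consistency, because the read can be ordered before the write, but it is rejected under linearizability, because the write had already completed in real time.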

2. CAP, PACELC — the framing

CAP (Brewer 2000 PODC keynote; Gilbert-Lynch 2002 proof) — under network partition, a system must choose C (consistency = linearizability) or A (availability = every request answered). Never both.

PACELC (Daniel Abadi 2010) extends CAP: if there is a Partition, choose A vs C; Else (no partition), choose Latency vs Consistency. The L vs C trade-off is the practical one: partitions are rare, but latency vs consistency is paid on every operation.

System | Partition behavior | Else behavior | Classification
Spanner | CP | EC (TrueTime commit wait; bounded-staleness reads are opt-in) | CP/EC
Cassandra (QUORUM) | AP | EL (eventual on low latency) | AP/EL
DynamoDB (strong) | CP | EC | CP/EC
DynamoDB (eventual) | AP | EL | AP/EL
MongoDB (default w=majority) | CP-ish (configurable) | EC or EL depending on readConcern | CP/EC
Cosmos DB | CP/AP per consistency level | configurable (5 levels) | configurable
FoundationDB | CP (strict serializable) | EC | CP/EC
CockroachDB | CP (serializable) | EC | CP/EC
TiDB | CP (snapshot isolation by default) | EC | CP/EC
YugabyteDB | CP | EC | CP/EC
Riak | AP | EL | AP/EL
Aerospike | AP (default) or CP (strong-consistency mode) | EL | AP/EL or CP/EC
Redis (single-node) | n/a | n/a (linearizable single-instance) | linearizable
Redis Cluster | AP | EL (async replication) | AP/EL
etcd (Raft) | CP | EC (linearizable reads) | CP/EC
ZooKeeper (ZAB) | CP | sequential consistency | CP/EC (sequential, not linearizable)

3. Database family → consistency model

Mapping every system in database-internals and database-engine-taxonomy.

3.1 Relational OLTP

System | Default isolation | Strictest available | Replication mode | Notes
PostgreSQL | Read Committed | Serializable (SSI; Cahill-Röhm-Fekete 2008, Postgres implementation by Ports-Grittner 2012) | sync (synchronous_commit) or async streaming | SSI is the gold standard for predicate-aware serializability
MySQL InnoDB | REPEATABLE READ (snapshot-based; locking reads can still observe phantom-like rows) | SERIALIZABLE | async binlog, semi-sync, group replication (MGR, since 5.7.17) | InnoDB’s RR is MVCC w/ snapshot at first read
MySQL Galera (MariaDB Galera Cluster) | REPEATABLE READ on each node | snapshot isolation across nodes (first committer wins) | synchronous certification | virtually synchronous; group-commit
SQL Server | Read Committed (default) | Serializable | Always On availability groups (sync/async) | also has Snapshot isolation
Oracle | Read Committed | Serializable | Active Data Guard (sync/async) | its SERIALIZABLE level is snapshot isolation, not true (SSI-like) serializability
SQLite | Serializable (one writer at a time) | Serializable | n/a (embedded) | trivially linearizable on disk via WAL
Aurora (Postgres / MySQL) | inherits engine | inherits engine | 6 storage copies across 3 AZs; 4/6 write quorum, 3/6 read quorum | storage layer is decoupled; compute layer is single-writer (Aurora DSQL, 2024, is a separate multi-region active-active offering)

3.2 NewSQL (geo-distributed)

System | Default isolation | Replication | Mechanism
Google Spanner | external consistency (strict serializability) | Paxos across replicas | TrueTime (Marzullo’s algorithm over GPS + atomic clocks; ε of roughly 1–7 ms reported in the 2012 paper)
Google Spanner stale reads | bounded staleness (configurable) | Paxos | TrueTime + read timestamp
CockroachDB | serializable | multi-Raft per range | HLC (hybrid logical clock, Kulkarni-Demirbas-Madappa 2014)
YugabyteDB | serializable + snapshot isolation | Raft per tablet | HLC-based
TiDB | snapshot isolation (default) | multi-Raft per region | PD-issued global timestamp via TSO oracle
FoundationDB | strict serializable | Paxos + state-machine replication | resolver-coordinated transaction read-version
Calvin (Yale, Thomson-Diamond-Weng-Ren-Shao-Abadi 2012) | strict serializable | deterministic order across replicas | sequencer pre-orders before execution
Amazon Aurora DSQL (re:Invent 2024) | strong (linearizable) | active-active multi-region | journal-based; per-row Raft-ish

3.3 Key-value / wide-column

System | Default | Strongest available | Replication
DynamoDB | eventually consistent reads | strongly consistent reads (within a single Region) | replicas across three AZs; transactional APIs (TransactWriteItems / TransactGetItems) for atomic multi-item operations
Cassandra | tunable (ONE/QUORUM/ALL) | QUORUM read+write → quorum-consistency | gossip + LWT (Paxos-based) for compare-and-set
ScyllaDB | tunable (Cassandra-compatible) | QUORUM, LWT | Seastar-based, low-latency Cassandra-compatible reimplementation
Riak KV | tunable (R+W>N for “strong”) | eventual; with Dotted Version Vectors (Almeida et al 2014) | active anti-entropy; AAE merkle trees
Aerospike (default AP mode) | eventual | per-record consistency | replication factor; strong-consistency mode (SC) since 4.0 (Paxos)
Aerospike (SC mode) | strict serializable | linearizable single-record read/write | roster-based; sync replication
etcd | linearizable reads (default) + lease-based watches | linearizable | Raft
ZooKeeper | sequential consistency | sequential + linearizable writes | ZAB
Hazelcast IMap | tunable (read-from-backup vs leader) | linearizable | CP subsystem (Raft) since 3.12
Apache HBase | strong per-row, eventual cross-row | strong per-row | HDFS-backed; HMaster + RegionServer
Bigtable | strong per-row | strong per-row + atomic | Colossus-backed
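
The R+W>N rule in the table is plain arithmetic on replica counts. Here is a minimal sketch with illustrative helper names; overlap is a necessary condition for reads to observe the latest acknowledged write, but it does not by itself make a Dynamo-style store linearizable (sloppy quorums and concurrent writers still matter).

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas, as used by QUORUM-style levels."""
    return n // 2 + 1


def read_overlaps_write(n: int, r: int, w: int) -> bool:
    """R + W > N: every read quorum shares at least one replica with every write quorum."""
    return r + w > n


def writes_overlap(n: int, w: int) -> bool:
    """2W > N: any two write quorums share at least one replica."""
    return 2 * w > n


# Replication factor 3 with QUORUM reads and writes (R = W = 2).
print(read_overlaps_write(3, quorum(3), quorum(3)))
# Replication factor 3 with ONE/ONE (R = W = 1): a read can miss the latest write.
print(read_overlaps_write(3, 1, 1))
# Aurora-style storage from section 3.1: 6 copies, 4/6 write, 3/6 read.
print(read_overlaps_write(6, 3, 4), writes_overlap(6, 4))
```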

3.4 Document

System | Default consistency | Sessions / causal | Notes
MongoDB | majority writeConcern (default ≥ 5.0) | causal-consistency session w/ afterClusterTime | logical clocks (cluster time)
Couchbase | tunable (per-bucket durability levels: NONE/MAJORITY/MAJORITY-AND-PERSIST-ACTIVE/PERSIST-MAJORITY) | session consistency via N1QL hints | RYW per session
CouchDB | eventual | n/a | revision-tree merge resolution
AWS DocumentDB | strong (Aurora-like storage) | n/a | distinct from Mongo despite wire-compat

3.5 Time-series

System | Consistency | Notes
InfluxDB | eventual (open source) | tags + fields + retention buckets
InfluxDB Enterprise / Cloud | tunable | hinted handoff anti-entropy
TimescaleDB | inherits Postgres (Serializable available) | hypertables on Postgres
Prometheus | local-only (no replication) | scrape-based; Thanos / Cortex / Mimir add replication
Mimir / Thanos / Cortex | configurable | typically eventual via S3 backing
Druid | eventual (segment-level) | indexer + historical pattern
ClickHouse | tunable; default async; INSERT_QUORUM for sync | RANS / Atomic / Replicated engines

3.6 Graph

System | Consistency | Notes
Neo4j | read committed by default on a single instance; causal-cluster RYW | Raft for causal clustering
TigerGraph | strong per-machine, eventual cluster | partition-tolerant by design
ArangoDB | tunable | RocksDB engine; resilient single-server replication
JanusGraph | depends on backend | Cassandra → tunable; HBase → strong per-row; Bigtable → strong per-row
Amazon Neptune | strong reads (read replicas eventual w/ option for read-after-write) | shared storage like Aurora

3.7 Vector

System | Consistency | Notes
Pinecone | strong index reads; freshness window for writes | “freshness ~30s by default”
Weaviate | tunable, similar to Cassandra (consistency-level per operation) | Raft-based metadata, async vector replication
Milvus | tunable per request: Strong / Bounded / Session / Eventually | Bounded staleness is the documented default; Strong is the strictest level
Qdrant | eventual; configurable | gossip cluster
pgvector (in Postgres) | inherits Postgres (Serializable available) | depends on cluster setup
Chroma | single-node (trivially linearizable) | cluster mode emerging

3.8 In-memory caches

System | Consistency | Notes
Redis (standalone) | linearizable single-instance | single-threaded command loop
Redis Cluster | AP / eventual | async replication; failover is built into the cluster (Sentinel serves non-cluster deployments)
Redis Enterprise | tunable; Active-Active with CRDT semantics | per-key CRDT; “EAC” (eventual active-active consistency)
Memcached | best-effort; client-side hashing | no replication
Hazelcast (CP subsystem) | linearizable | Raft

3.9 Streaming / log

System | Consistency | Notes
Kafka | per-partition total order; configurable producer acks (0 / 1 / all, where all = -1) | ISR-based; min.insync.replicas for durability
Kafka transactional | exactly-once across partitions | producer ID + epoch; idempotent producers
Pulsar | per-partition strong; cross-partition eventual | BookKeeper-backed ledger w/ quorum write
RabbitMQ Streams | per-stream strong | replicated log; quorum queues use Raft (since 3.8), streams arrived in 3.9
AWS Kinesis | per-shard order | shard iterator; trimming horizon
Redpanda | Kafka-compatible; Raft per partition | low-latency engine, single binary
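
How Kafka’s producer `acks` setting interacts with `min.insync.replicas` can be modelled in a few lines. This is a simplified sketch of the documented behaviour, not broker code; the function name and the returned strings are illustrative.

```python
def produce_outcome(acks: str, isr_size: int, min_insync_replicas: int) -> str:
    """Model what a producer can conclude for one record, given the current ISR size."""
    if acks == "0":
        # The producer does not wait for any broker response.
        return "not acknowledged (fire and forget)"
    if acks == "1":
        # Only the partition leader has written the record when the ack is sent.
        return "acknowledged by leader only"
    if acks in ("all", "-1"):
        # min.insync.replicas is only enforced for acks=all.
        if isr_size < min_insync_replicas:
            return "rejected: not enough in-sync replicas"
        return f"acknowledged after all {isr_size} in-sync replicas have the record"
    raise ValueError(f"unknown acks value: {acks!r}")


# Replication factor 3, min.insync.replicas=2, one follower has dropped out of the ISR.
print(produce_outcome("all", isr_size=2, min_insync_replicas=2))
# A second follower drops out: acks=all writes are refused, acks=1 writes still succeed.
print(produce_outcome("all", isr_size=1, min_insync_replicas=2))
print(produce_outcome("1", isr_size=1, min_insync_replicas=2))
```

The last two lines show the trade-off: `acks=all` gives up availability to keep durability, while `acks=1` keeps accepting writes that a leader failure could lose.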

4. CRDTs — the strong-eventual family

crdts-and-distributed-data-types details these. CRDTs achieve strong-eventual consistency by structuring operations to be:

  • Commutative — order doesn’t matter (e.g., set-union, max counter).
  • Associative — grouping doesn’t matter.
  • Idempotent — applying twice = once.

This means any delivery order converges to the same state. No coordination, no consensus — but the data type’s algebra is constrained.
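
A grow-only counter makes the three properties concrete. The sketch below assumes a state-based G-Counter stored as a dict from replica id to count; the function names are illustrative and not taken from any particular library.

```python
from functools import reduce
from itertools import permutations


def increment(state: dict, replica: str, amount: int = 1) -> dict:
    """Return a new state in which `replica` has counted `amount` more events."""
    new_state = dict(state)
    new_state[replica] = new_state.get(replica, 0) + amount
    return new_state


def merge(a: dict, b: dict) -> dict:
    """Join two states by taking the per-replica maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(state: dict) -> int:
    """The counter's value is the sum of all per-replica counts."""
    return sum(state.values())


x = increment({}, "a", 3)   # replica a counted 3
y = increment({}, "b", 2)   # replica b counted 2
z = increment(x, "c", 5)    # replica c saw x, then counted 5

# Commutative + associative: every delivery order yields the same merged state.
outcomes = {tuple(sorted(reduce(merge, order, {}).items()))
            for order in permutations([x, y, z])}
print(len(outcomes), value(reduce(merge, [x, y, z], {})))

# Idempotent: re-delivering a state changes nothing.
print(merge(x, x) == x)
```

Per-replica maximum is what makes the merge safe to repeat and reorder; a merge that summed the two states would double-count on redelivery.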

CRDT | Use | Where it ships
G-Counter (grow-only) | counters | Riak
PN-Counter (positive + negative) | counters supporting decrement | Riak, Redis Enterprise
LWW-Register | last-write-wins single value | Cassandra columns, Cosmos DB multi-region writes (LWW conflict resolution)
Multi-Value Register | concurrent writes preserved | Riak’s MVRegister
OR-Set (observed-remove set) | sets w/ add + remove | Riak, Akka Distributed Data, Redis Enterprise
LWW-Element-Set | set w/ add + remove timestamp tie-break | Riak
RGA (Replicated Growable Array) | ordered list | collaborative text (Automerge’s list type is RGA-based; Yjs uses the related YATA algorithm)
Logoot / LSEQ | ordered list | collaborative text
Yjs Y.Doc | tree-of-CRDTs | Liveblocks, Tldraw, Tiptap, JupyterLab
Automerge | document JSON CRDT | Local-first apps
Causal-tree (Grishchenko) | text editing | early CRDT text work

Many modern collaborative apps (Figma, Linear, Notion-text, Replit, Tldraw) use CRDTs or CRDT-inspired sync engines for offline-first / multi-user editing. The flip side is that CRDTs cannot enforce global invariants: you cannot guarantee “bank balance ≥ 0” in a pure CRDT without coordination.
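
The balance example can be reproduced with a PN-Counter. The sketch below is illustrative: the `PNCounter` class and its `withdraw` check are assumptions made for the demonstration, showing that a purely local check cannot protect a global invariant.

```python
class PNCounter:
    """State-based PN-Counter: one grow-only map for credits, one for debits."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.credits = {}
        self.debits = {}

    def value(self) -> int:
        return sum(self.credits.values()) - sum(self.debits.values())

    def deposit(self, amount: int) -> None:
        self.credits[self.replica_id] = self.credits.get(self.replica_id, 0) + amount

    def withdraw(self, amount: int) -> bool:
        # Local check only: this replica cannot see concurrent remote withdrawals.
        if self.value() < amount:
            return False
        self.debits[self.replica_id] = self.debits.get(self.replica_id, 0) + amount
        return True

    def merge(self, other: "PNCounter") -> None:
        for mine, theirs in ((self.credits, other.credits), (self.debits, other.debits)):
            for key, count in theirs.items():
                mine[key] = max(mine.get(key, 0), count)


a, b = PNCounter("a"), PNCounter("b")
a.deposit(100)
b.merge(a)                     # both replicas now agree the balance is 100

ok_a = a.withdraw(80)          # concurrent withdrawals, each valid locally
ok_b = b.withdraw(80)

a.merge(b)
b.merge(a)
print(ok_a, ok_b, a.value(), b.value())
```

Both replicas converge to the same value, as strong-eventual consistency promises, but that value is negative: convergence and invariant preservation are separate properties.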

5. The “session” subset — practical mid-tier

Session models (Terry et al, “Session Guarantees for Weakly Consistent Replicated Data” 1994) include:

  • Read-your-writes (RYW): every read in a session sees that session’s own previous writes.
  • Monotonic reads (MR): successive reads of the same data never go backwards in time.
  • Monotonic writes (MW): writes from the same session apply in issue order.
  • Writes-follow-reads (WFR): a write is ordered after the writes whose effects the session had already read.

Each guarantee on its own is weaker than causal consistency; enforcing all four together within a session is generally understood to give causal consistency for that session. In practice they are sufficient for most user-facing UIs: a user submits a change, sees it immediately, refreshes, and still sees it. MongoDB’s “causal consistency” (introduced in 3.6, 2017) and Cosmos DB’s “session” consistency are session models in Terry’s sense.
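
Read-your-writes and monotonic reads are commonly implemented with a session token: the client remembers the highest version it has written or read and refuses to read from a replica that is behind it. The sketch below assumes a toy single-primary cluster with a version-numbered log; all class names are illustrative, and the token plays a role loosely comparable to MongoDB’s afterClusterTime.

```python
class Replica:
    def __init__(self):
        self.version = 0   # highest log position applied so far
        self.data = {}

    def apply(self, version, key, value):
        self.data[key] = value
        self.version = version


class Cluster:
    def __init__(self, secondaries=1):
        self.primary = Replica()
        self.secondaries = [Replica() for _ in range(secondaries)]
        self.log = []      # entry i holds version i + 1

    def write(self, key, value):
        version = len(self.log) + 1
        self.log.append((version, key, value))
        self.primary.apply(version, key, value)
        return version

    def replicate(self, secondary):
        """Asynchronous catch-up: apply every log entry the secondary has not seen."""
        for version, key, value in self.log[secondary.version:]:
            secondary.apply(version, key, value)


class Session:
    def __init__(self, cluster):
        self.cluster = cluster
        self.token = 0     # highest version this session has written or read

    def write(self, key, value):
        self.token = max(self.token, self.cluster.write(key, value))

    def read(self, key, replica):
        if replica.version < self.token:
            replica = self.cluster.primary   # replica is too stale for this session
        self.token = max(self.token, replica.version)
        return replica.data.get(key)


cluster = Cluster()
lagging = cluster.secondaries[0]
session = Session(cluster)
session.write("name", "Ada")
print(lagging.data.get("name"))        # the raw replica has not caught up yet
print(session.read("name", lagging))   # the session is redirected and sees its write
```

Falling back to the primary is one design choice; waiting for the replica to catch up is another, which trades latency for load on the primary.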

6. Postgres isolation levels — the SQL reality

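Postgres accepts all four SQL-standard isolation levels but implements only three distinct behaviors, mapping each requested level to one that is at least as strong: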

Requested | Actually delivered
Read Uncommitted | Read Committed (Postgres never exposes dirty reads)
Read Committed | Read Committed (default)
Repeatable Read | Snapshot Isolation
Serializable | Serializable Snapshot Isolation (SSI; Cahill-Röhm-Fekete 2008, Postgres implementation by Ports-Grittner 2012)

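synchronous_commit = off/local/remote_write/on/remote_apply tunes the durability/latency trade-off; remote_apply is the strongest synchronous-replication mode because the commit waits until the synchronous standbys have applied the transaction, not merely received or flushed it.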
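
The practical difference between Repeatable Read (snapshot isolation) and Serializable is write skew. The following is a toy in-memory model of snapshot isolation with first-committer-wins conflict detection, not Postgres internals; the `SnapshotDB` and `Txn` classes are illustrative.

```python
class SnapshotDB:
    def __init__(self, data):
        self.data = dict(data)
        self.commit_seq = 0
        self.last_write = {key: 0 for key in data}   # commit number of each key's last write

    def begin(self):
        return Txn(self)


class Txn:
    def __init__(self, db):
        self.db = db
        self.snapshot = dict(db.data)    # private snapshot taken at start
        self.start_seq = db.commit_seq
        self.writes = {}

    def read(self, key):
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort only on write-write conflicts; reads are never validated.
        for key in self.writes:
            if self.db.last_write.get(key, 0) > self.start_seq:
                return False
        self.db.commit_seq += 1
        for key, value in self.writes.items():
            self.db.data[key] = value
            self.db.last_write[key] = self.db.commit_seq
        return True


def go_off_call(txn, doctor):
    """Application rule: a doctor may leave only if someone else stays on call."""
    on_call = [name for name in ("alice", "bob") if txn.read(name)]
    if len(on_call) >= 2:
        txn.write(doctor, False)


db = SnapshotDB({"alice": True, "bob": True})
t1, t2 = db.begin(), db.begin()
go_off_call(t1, "alice")
go_off_call(t2, "bob")
print(t1.commit(), t2.commit(), db.data)
```

Both transactions commit because their write sets are disjoint, leaving nobody on call. Under Postgres Serializable, SSI tracks the read-write dependencies and aborts one of the two with a serialization failure.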

7. MySQL InnoDB isolation — the practical gotcha

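InnoDB’s “REPEATABLE READ” is snapshot-based MVCC rather than the lock-based level the SQL standard describes. Plain SELECTs read from a snapshot taken at the transaction’s first read, while locking reads (SELECT…FOR UPDATE) and UPDATE/DELETE statements act on the latest committed rows, so a transaction that mixes the two can observe phantom-like rows that its snapshot does not contain. The actually-serializable mode is SERIALIZABLE, which implicitly turns plain SELECTs into shared-lock reads (high contention). Most MySQL applications run at REPEATABLE READ.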

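Galera Cluster (MariaDB Galera Cluster) provides virtually synchronous replication via certification: at commit, each transaction’s write set is broadcast to all nodes in a single total order and certified against concurrent write sets; if certification fails, the transaction aborts. Because only write sets are compared, the cross-node guarantee is closer to snapshot isolation (first committer wins) than to full serializability.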
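
Certification is deterministic: every node sees the same write sets in the same total order and runs the same test, so all nodes reach the same verdict without a second round of messages. Below is a minimal sketch of that test, with an illustrative `CertifyingNode` class that is an assumption for the example rather than Galera’s actual data structures.

```python
class CertifyingNode:
    """Replays the totally ordered stream of write sets and certifies each one."""

    def __init__(self):
        self.seqno = 0
        self.committed = []   # list of (seqno, frozenset of keys written)

    def begin(self) -> int:
        """A transaction remembers the last seqno it had seen when it started."""
        return self.seqno

    def certify(self, start_seqno: int, write_keys) -> bool:
        keys = frozenset(write_keys)
        for seqno, other_keys in self.committed:
            # Conflict: a write set committed after we started touches the same key.
            if seqno > start_seqno and keys & other_keys:
                return False
        self.seqno += 1
        self.committed.append((self.seqno, keys))
        return True


node = CertifyingNode()
t1_start = node.begin()
t2_start = node.begin()
t3_start = node.begin()

print(node.certify(t1_start, {"accounts:1"}))   # first in the total order: commits
print(node.certify(t2_start, {"accounts:1"}))   # concurrent write to same row: aborts
print(node.certify(t3_start, {"accounts:2"}))   # disjoint write set: commits
```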

8. Consensus protocols — the substrate

Protocol | Used by | Notes
Paxos (Lamport 1998) | Spanner (as Multi-Paxos), Cassandra LWT; the basis for most later protocols | hard to implement correctly
Multi-Paxos | Spanner, Chubby | optimization over single-decree Paxos
Raft (Ongaro-Ousterhout 2014) | etcd, Consul, CockroachDB, YugabyteDB, TiKV, MongoDB replica sets (Raft-like), NATS JetStream, ScyllaDB Raft (since 5.0) | designed for understandability
ZAB (ZooKeeper Atomic Broadcast) | ZooKeeper | primary-backup atomic broadcast; writes are totally ordered, reads are only sequentially consistent
EPaxos (Egalitarian Paxos, Moraru-Andersen-Kaminsky 2013) | research / experimental | leaderless, lower latency in WAN
Flexible Paxos (Howard-Schwarzkopf-Madhavapeddy-Crowcroft 2017) | derivative tooling | weakens quorum size requirements
Calvin (Yale 2012) | Calvin DB | sequencer pre-orders, deterministic execution
Tendermint BFT (Buchman 2016) | Cosmos (the blockchain network, not Cosmos DB), other blockchains | Byzantine-fault-tolerant, PBFT-style
PBFT (Castro-Liskov 1999) | early Hyperledger Fabric (v0.6), some blockchains | tolerates f Byzantine faults among 3f+1 replicas
HotStuff (Yin-Malkhi-Reiter-Gueta-Abraham 2019) | Diem (formerly Libra), several blockchains | linear view-change cost

consensus-protocols details each.
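
The quorum sizes behind the table follow from simple counting. A minimal sketch with illustrative function names: majority quorums for crash-fault protocols (Paxos, Raft, ZAB), the 3f+1 bound for Byzantine protocols (PBFT and its descendants), and the intersection condition Flexible Paxos relaxes to.

```python
def majority_quorum(n: int) -> int:
    """Votes needed so that any two quorums of n nodes intersect."""
    return n // 2 + 1


def crash_faults_tolerated(n: int) -> int:
    """Crash-fault protocols stay live while a majority is up."""
    return (n - 1) // 2


def byzantine_faults_tolerated(n: int) -> int:
    """PBFT-style protocols need n >= 3f + 1 replicas to tolerate f traitors."""
    return (n - 1) // 3


def byzantine_quorum(f: int) -> int:
    """With n = 3f + 1 replicas, PBFT waits for 2f + 1 matching messages."""
    return 2 * f + 1


def flexible_paxos_ok(n: int, phase1_quorum: int, phase2_quorum: int) -> bool:
    """Flexible Paxos: only phase-1 and phase-2 quorums must intersect."""
    return phase1_quorum + phase2_quorum > n


for n in (3, 5, 7):
    print(n, majority_quorum(n), crash_faults_tolerated(n), byzantine_faults_tolerated(n))

# Six nodes: replicate to only 2 in steady state, at the cost of a 5-node leader election.
print(flexible_paxos_ok(6, phase1_quorum=5, phase2_quorum=2))
```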

9. The “external consistency” guarantee (Spanner)

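Spanner gives external consistency, which is equivalent to strict serializability: linearizability applied to whole transactions. If T1 commits before T2 begins (in real time), then T1’s commit timestamp is smaller than T2’s and T1’s effects are visible at T2’s snapshot. This is stronger than plain serializability, which permits any equivalent serial order even when it contradicts the real-time order seen by independent clients.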

The trick: TrueTime, a globally synchronized clock API that returns an interval with bounded error ε (reported as roughly 1–7 ms in Google’s 2012 paper, using GPS and atomic clocks). A commit waits out the uncertainty window so that its timestamp is guaranteed to be in the past for any operation that starts afterwards. ε is the price; the win is that read-only transactions can run lock-free at a chosen timestamp, on any replica that is sufficiently up to date, without coordinating with writers.
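
Commit wait can be simulated with a fake clock. This is a sketch under simplifying assumptions: a single simulated clock whose interval is always true time ± ε, and a polling loop in place of a real sleep. `FakeTrueTime` and `commit_with_wait` are illustrative names, not Spanner APIs.

```python
class FakeTrueTime:
    """Simulated TT.now(): an interval guaranteed to contain the true time."""

    def __init__(self, epsilon_ms: float, true_time_ms: float = 0.0):
        self.epsilon_ms = epsilon_ms
        self.true_time_ms = true_time_ms

    def now(self):
        """Return (earliest, latest) bounds on the current true time."""
        return (self.true_time_ms - self.epsilon_ms,
                self.true_time_ms + self.epsilon_ms)

    def advance(self, ms: float) -> None:
        self.true_time_ms += ms


def commit_with_wait(tt: FakeTrueTime, step_ms: float = 1.0):
    """Pick the commit timestamp, then wait until it is definitely in the past."""
    _, commit_ts = tt.now()            # s = TT.now().latest
    waited = 0.0
    while tt.now()[0] <= commit_ts:    # until TT.now().earliest > s
        tt.advance(step_ms)
        waited += step_ms
    return commit_ts, waited


tt = FakeTrueTime(epsilon_ms=7.0, true_time_ms=100.0)
commit_ts, waited = commit_with_wait(tt)
later_ts = tt.now()[1]                 # timestamp any later transaction would pick
print(commit_ts, waited, later_ts > commit_ts)
```

In this model the wait is about 2ε, which is why shrinking ε directly shortens write latency.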

Spanner-style systems without TrueTime often use Hybrid Logical Clocks (HLC, Kulkarni-Demirbas-Madappa 2014), which combine physical NTP time with a Lamport-style counter as a tiebreaker. CockroachDB, YugabyteDB, and MongoDB use HLC variants; TiDB/TiKV instead takes timestamps from a centralized oracle (TSO) in PD.
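
The HLC update rules from the Kulkarni et al. paper fit in a small class. This is a minimal sketch: the physical clock is injected as a callable so that skew can be simulated, and timestamps are plain `(l, c)` tuples compared lexicographically. Production implementations add bounds on drift and pack both parts into one integer.

```python
class HLC:
    def __init__(self, physical_clock):
        self.physical_clock = physical_clock   # callable returning physical time
        self.l = 0                             # highest physical time seen so far
        self.c = 0                             # logical counter for ties on l

    def now(self):
        """Timestamp a local or send event."""
        l_old = self.l
        self.l = max(l_old, self.physical_clock())
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def update(self, message_ts):
        """Timestamp a receive event, merging in the sender's timestamp."""
        m_l, m_c = message_ts
        l_old = self.l
        self.l = max(l_old, m_l, self.physical_clock())
        if self.l == l_old and self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        return (self.l, self.c)


# Node B's physical clock runs 10 units behind node A's.
clock_a = {"t": 100}
clock_b = {"t": 90}
a = HLC(lambda: clock_a["t"])
b = HLC(lambda: clock_b["t"])

sent = a.now()               # A stamps a message
received = b.update(sent)    # B receives it despite its slower clock
after = b.now()              # B's next local event
print(sent, received, after)
```

Even though B’s physical clock is behind, every event that causally follows the send gets a larger timestamp, and `l` never runs ahead of the largest physical clock value in the system.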

10. Decision tree — pick a consistency model

What's the workload?
├─ Financial transactions, ledger, bank account, inventory
│    → Linearizable / external consistency
│    → Spanner, CockroachDB, FoundationDB, Aurora DSQL
│    → Trade off: 5–50 ms write latency in geo-distributed setup
├─ E-commerce cart / order placement (single region)
│    → Serializable (SSI in Postgres; lock-based in InnoDB)
│    → Postgres SERIALIZABLE, MySQL InnoDB SERIALIZABLE
│    → 1–5 ms writes
├─ Session state, shopping cart (cross-region, low-latency)
│    → Read-your-writes (session model)
│    → Cosmos DB Session, MongoDB causal session, DynamoDB strong+sessioned
├─ Collaborative document editor (Figma, Notion, Linear)
│    → Strong-eventual via CRDT
│    → Yjs, Automerge, Liveblocks
│    → No global invariants — conflicts merge automatically
├─ Social-feed timeline
│    → Eventual
│    → Cassandra, DynamoDB eventual, MongoDB read=local
│    → Sub-ms reads in exchange for stale data window
├─ Time-series ingest (metrics, logs)
│    → Eventual; per-series order
│    → InfluxDB, Prometheus, ClickHouse, TimescaleDB
├─ Distributed lock / leader-election / config
│    → Linearizable
│    → etcd, ZooKeeper, Hazelcast CP subsystem, Consul
├─ Event log / event sourcing source-of-truth
│    → Per-partition total order
│    → Kafka, Pulsar, Kinesis, Redpanda
│    → Use exactly-once semantics if cross-partition matters
├─ Caching layer
│    → Eventual (stale OK by definition)
│    → Redis Cluster, Memcached
├─ Vector search (RAG, embeddings)
│    → Eventual with freshness window
│    → Pinecone, Weaviate, Milvus, Qdrant, pgvector
├─ Multi-region with strong needs
│    → External consistency, accept higher latency
│    → Spanner, CockroachDB, Aurora DSQL
└─ Multi-region with low-latency needs
     → Causal / RYW + per-region linearizable
     → MongoDB causal session, Cosmos DB Bounded Staleness
     → DynamoDB Global Tables (LWW eventual)

11. Anti-patterns — the four common mistakes

  1. Mistaking InnoDB “REPEATABLE READ” for a guarantee against phantoms: plain SELECTs read a snapshot, but locking reads and writes see the latest committed rows. Use SERIALIZABLE, or take locks explicitly with SELECT…FOR UPDATE.
  2. Trusting DynamoDB “strongly consistent” reads across partitions: the guarantee is per item and within one Region. Cross-item ACID requires TransactWriteItems / TransactGetItems.
  3. Assuming MongoDB writes are durable on default config: historically w:1 was the default, and only w:majority is durable across the replica set. As of MongoDB 5.0 the default is w:majority, but check your deployment.
  4. Using CRDTs for invariants that need coordination: bank balance ≥ 0, unique username, exactly-once seat reservation. CRDTs converge; they do not coordinate. You need a consensus protocol on top.

12. The 2024–2026 frontier

  • Spanner-class on commodity: CockroachDB 24.x, YugabyteDB 2024+, TiDB 8.x, and Apple’s FoundationDB ship Spanner-inspired distributed transactions on commodity hardware without TrueTime (HLC in CockroachDB and YugabyteDB, a timestamp oracle in TiDB, sequencer-assigned versions in FoundationDB). Their real-time guarantees vary by system and do not always match Spanner’s external consistency.
  • Aurora DSQL (re:Invent 2024): active-active multi-region Postgres with strong consistency, journal-based.
  • Distributed SQLite (rqlite, Turso/libSQL, Litestream): embedded DB scaled out via Raft (rqlite), log replication (Turso libSQL Server), or WAL streaming to object storage (Litestream).
  • ScyllaDB Raft tables (5.x): tables can now be strongly consistent (Raft-backed) alongside Cassandra-style eventual tables.
  • Postgres logical replication for active-active: pglogical, Bucardo, BDR (2ndQuadrant / EDB), plus the logical replication improvements in Postgres 16.
  • PostgreSQL serializable snapshot isolation (SSI): production-proven at scale (Heroku Postgres, Crunchy Bridge, etc.).
  • CRDTs in production: Figma’s CRDT-inspired multiplayer system, Liveblocks (Yjs as a service, ~$25M Series A 2024), Replicache (sync layer for local-first), Replit’s collaborative editor.

When to pick what

The fastest narrowing: money / inventory → linearizable; session UI state → causal or RYW; collaborative editing → CRDT (strong-eventual); feed / cache / metrics → eventual; locks / config → linearizable consensus (etcd/ZK). Geography flips the dial: multi-region linearizable pays 50+ ms per write (Spanner, CockroachDB across continents), so regional strong consistency with eventual cross-region replication is the most common modern pattern. The most common hidden cost is an unexamined default isolation level: pick it deliberately, document it, and test under partition with chaos engineering (Jepsen-style tests; Kyle Kingsbury’s (Aphyr) Jepsen analyses cover many of the systems above).