Database Internals Deep — Storage, Concurrency, Distribution
A database system implements durable, queryable, and concurrent state. Modern systems compose four major subsystems: a storage engine that persists data with crash safety and efficient access; a transaction manager that provides isolation guarantees among concurrent operations; a query processor that translates declarative requests into physical plans; and, for distributed systems, a replication and consensus layer that maintains agreement across machines despite failures. This note goes deeper than introductory coverage, focusing on the engines and algorithms that distinguish PostgreSQL, MySQL/InnoDB, Oracle, SQL Server, RocksDB, WiredTiger, DuckDB, ClickHouse, Spanner, CockroachDB, Yugabyte, FoundationDB, Aurora, and similar systems in production.
See also
- database-internals — Tier-1 companion overview.
- sql-nosql-design — surface model and schema patterns.
- distributed-systems-fundamentals — distribution foundations the section 6 layer depends on.
- consensus-protocols — Raft and Paxos used for replication.
- crdts-and-distributed-data-types — alternative to consensus for some workloads.
- observability-stack — query and storage observability.
- cpu-cache-performance — vectorized execution rests on cache-aware code.
- networking-foundations — RDMA and HTAP communication costs.
- database-engine-taxonomy — engines catalog.
1. Storage engines
The storage engine maps logical pages or rows to durable media (HDD, SSD, NVMe, NVRAM) and serves point and range reads with predictable latency. The dominant on-disk structures are the B+ tree (and variants) and the log-structured merge tree (LSM).
1.1 B+ trees
A B+ tree is a balanced multi-way search tree where all values are stored in leaves and internal nodes hold only routing keys. Nodes correspond to disk pages — typically 4 KiB (the OS page size in Linux), 8 KiB (PostgreSQL default), 16 KiB (InnoDB default), or 32 KiB. Tree height grows logarithmically with O(log_B N) levels for a B-branching tree on N keys; with B ~ 100–500 routes per node, even billion-row tables fit in 4–5 levels.
Leaf pages are linked sideways into a doubly linked list to support range scans without re-descent. Insertions split full pages; deletions may merge or redistribute keys. Bayer-McCreight 1972 original B-tree; Comer 1979 ACM Computing Surveys review; Knuth Vol 3.
Production refinements:
- PostgreSQL B-tree (Lehman-Yao 1981). Right-link pointers permit concurrent split-aware traversal; no lock coupling. The “btree-gin” and “btree-gist” variants extend B-trees to more types.
- InnoDB clustered index. Primary key determines physical row order; secondary indexes carry the primary key as the row pointer. The clustered storage means index lookups by secondary key require a second probe through the clustered index (Mohan 1992; Antognini Troubleshooting Oracle Performance 2nd ed.).
- Oracle index-organized tables (IOTs). Equivalent of InnoDB clustering as an explicit option per table.
- SQL Server clustered and nonclustered indexes. Same physical-vs-logical pointer distinction.
Concurrency on B+ trees:
- Lock coupling (crab latching). Acquire latch on child before releasing parent. Simple but limits concurrency.
- Lehman-Yao right-links + optimistic concurrency. Allow concurrent splits; traverse right-link if a search key now belongs to the right sibling.
- Bw-tree (Levandoski-Lomet-Sengupta 2013 ICDE). Latch-free B-tree via mapping table and delta records; used in Microsoft Hekaton, FASTER. Delta consolidation reclaims memory.
- OLFIT (Optimistic Lock-Free Indexing using Tokens, Cha-Hwang-Kim-Kwon 2001 VLDB). Versioned latches with optimistic readers.
- Bε-tree. B-tree with buffered batch insertions to amortize write cost; used by TokuDB/TokuMX/PerconaFT.
1.2 LSM trees
Log-structured merge trees (O’Neil-Cheng-Gawlick-O’Neil 1996 Acta Informatica 33) buffer writes in memory (the memtable) and periodically flush sorted runs (sstables) to disk; background compaction merges overlapping levels. LSMs optimize for write throughput at the cost of read amplification.
- LevelDB (Ghemawat-Dean 2011). Original Google embedded LSM. Inspired most modern LSMs.
- RocksDB (Dong-Callaghan-Galanis-Borthakur-Savor-Strum 2017 FAST). Facebook fork of LevelDB; configurable compaction strategies (leveled, universal, FIFO), column families, transaction support, prefix Bloom filters. Powers MyRocks (Facebook), CockroachDB Pebble fork-ancestor, ScyllaDB, TiKV (TiDB), ArangoDB, EventStoreDB. Default compression: Zstd levels 1–22.
- Pebble (Cockroach Labs 2018+). Pure-Go RocksDB-compatible LSM optimized for CockroachDB; sstables read-compatible with RocksDB.
- WiredTiger. MongoDB’s default since 3.2; B-tree by default but supports LSM mode. Snappy compression standard.
- HBase, Cassandra, ScyllaDB, AerospikeDB. LSM-based KV / wide-column stores.
- Bitcask (Riak). Hash-indexed log-structured store with in-memory key directory; recovery via hint files.
Compaction strategies:
- Leveled compaction. L_i has total size ~10 × L_{i-1}; each level is internally sorted with non-overlapping ranges; flush from memtable → L0 (overlapping permitted), then partial merges between adjacent levels.
- Universal/tiered compaction (RocksDB universal, Cassandra “size-tiered”). Larger merge units; lower write amplification but higher read amplification and larger space amplification.
- FIFO compaction. Time-series; drop oldest sstables when size limit hit. Used by RocksDB for log workloads.
- Tiered + leveled hybrid (Dostoevsky, Dayan-Athanassoulis-Idreos 2018 SIGMOD). Optimizes for given workload.
Read amplification, write amplification, space amplification (RUM conjecture, Athanassoulis-Kester-Maas-Stoica-Idreos-Ailamaki-Callaghan 2016 EDBT) cannot all be minimized simultaneously — engines tune to workload.
Bloom filters and tiered Bloom filters reduce point-read amplification by filtering out sstables that cannot contain a key. RocksDB partitioned filters and ribbon filters (Dillinger-Walzer 2021) shrink filter memory.
1.3 Heap files and slotted pages
Some engines (PostgreSQL, Oracle, SQL Server heaps) store rows in unsorted heap files and reference rows through tuple IDs (ctid in PostgreSQL, ROWID in Oracle, RID in SQL Server). Pages use a slotted page structure (also called PAX or NSM-on-page): a fixed header, an array of slot pointers, and tuple bodies growing from the page end. New tuples are placed in any page with sufficient free space; deletions mark slots free and rely on vacuum to compact.
Slotted pages serve as the physical layout under all major row-store engines including PostgreSQL, InnoDB (within page), Oracle, SQL Server, Db2.
1.4 Columnar storage formats
Analytical databases store data column-by-column rather than row-by-row, enabling compression and CPU-friendly access for aggregation queries that touch few columns.
- Apache Parquet (Twitter-Cloudera 2013). On-disk columnar format. Each row group (default ~128 MiB) contains column chunks; chunks contain pages with optional dictionary encoding, RLE, bit-packing, delta encoding. Compression: Snappy default, Zstd/Gzip/LZ4/Brotli optional. Metadata footer for predicate pushdown via min/max statistics per page.
- Apache ORC (Hortonworks 2013). Optimized Row Columnar; similar to Parquet with row groups (stripes), built-in bloom filters, integrated indexing. Default in Hive.
- Apache Arrow (Wes McKinney 2016+). In-memory columnar format; cross-language zero-copy interchange; Arrow Compute for vectorized kernels; Arrow Flight for high-throughput RPC. ADBC for database connectivity.
- Apache Iceberg, Apache Hudi, Delta Lake. Open table formats layering ACID transactions, schema evolution, time travel, and partition evolution over Parquet/ORC files in object storage. Iceberg uses a manifest-list of manifest files referring to data files; Delta Lake uses a transaction log of JSON commits. Iceberg V2 (~2022) and V3 (2024) add row-level deletes and equality deletes for streaming workloads.
- DuckDB native format. Custom columnar storage with FOR (frame-of-reference) compression, bit-packing.
- ClickHouse MergeTree. Per-column files, granule-based indexing, sparse primary index. Compression default LZ4, optional Zstd, Delta+LZ4 for monotonic columns.
2. Transaction management
Concurrency control ensures correct execution under concurrent transactions. The four ANSI SQL isolation levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable) are defined by anomalies they exclude (dirty reads, non-repeatable reads, phantom reads). The 1995 Berenson-Bernstein-Gray-Melton-O’Neil-O’Neil critique (“A Critique of ANSI SQL Isolation Levels”) showed the standard is ambiguous and identified missing anomalies (lost update, read skew, write skew, phantoms-by-predicate). Modern engines provide multiple isolation levels with engine-specific semantics.
2.1 Two-phase locking
Eswaran-Gray-Lorie-Traiger 1976 Communications of the ACM — acquire all locks before releasing any. Strict 2PL holds exclusive locks until commit (no cascading aborts); rigorous 2PL holds all locks until commit. Lock modes: shared (S), exclusive (X), update (U, Oracle/SQL Server, lock for read+possible write to avoid deadlock); intent locks (IS, IX, SIX) for hierarchical locking (Gray-Lorie-Putzolu-Traiger 1976). Multi-granularity locking allows table-level, page-level, row-level coexistence.
Deadlock handling: wait-die / wound-wait (Rosenkrantz-Stearns-Lewis 1978) timestamp ordering, deadlock detection via wait-for graph cycle search (most production systems; periodic check), or timeout. Phantom locks via predicate locks (Eswaran et al.) or next-key locks (InnoDB).
2.2 Multiversion concurrency control
MVCC keeps multiple versions of each row visible to different transactions, enabling readers to proceed without blocking writers and vice versa. Snapshot Isolation (Berenson et al. 1995) — each transaction reads a consistent snapshot at its start time; writes conflict only on update-update.
Implementations:
- PostgreSQL. Each row has xmin (creating txn ID) and xmax (deleting/updating txn ID). Versions are added inline as new tuples; old versions become dead and are reclaimed by vacuum. Tuple visibility is determined by comparing xmin/xmax to the transaction’s snapshot. Dead-tuple bloat is the principal operational cost of PostgreSQL MVCC; autovacuum and (since v13) parallel vacuum mitigate. PG 14+ adds optimisations like the “snapshot too old” detection and improvements to subtransaction cache. PG 16/17 introduce logical replication conflict resolution improvements.
- Oracle. Undo segments hold pre-images; current rows are mutated in place. Readers reconstruct prior versions by chasing undo records. Avoids vacuum-style bloat but requires sufficient undo retention; ORA-01555 “snapshot too old” surfaces when undo is recycled.
- SQL Server (READ_COMMITTED_SNAPSHOT and SNAPSHOT_ISOLATION). Row versions in tempdb; in-place update with version chain in version store.
- InnoDB (MySQL/MariaDB). Undo log holds prior versions; rollback segments. Repeatable Read default isolation.
- MongoDB (WiredTiger). Per-document MVCC; readers see snapshot as of operation start.
- Spanner, CockroachDB, Yugabyte. Hybrid logical clocks + MVCC per row.
2.3 Serializable isolation
Snapshot Isolation permits write skew anomalies (concurrent transactions update disjoint rows whose combination violates an invariant). Strong serializability requires excluding them.
- Serializable Snapshot Isolation (SSI) — Cahill-Röhm-Fekete 2008 SIGMOD, refined Ports-Grittner 2012. Add SIREAD predicate-aware tracking on top of SI; detect dangerous structures (rw-antidependency patterns) and abort one of the participating transactions. PostgreSQL since 9.1 implements SSI as the SERIALIZABLE level.
- Two-phase locking + index/next-key locks. InnoDB SERIALIZABLE; SQL Server SERIALIZABLE. Acquires range locks on indexes to prevent phantoms.
- Optimistic concurrency control (OCC, Kung-Robinson 1981). Validate at commit time. Used by FoundationDB, HekatonOCC, Wiredtiger optimistic path. Friendly to read-heavy workloads with low conflict rates.
- MVCC + timestamp ordering (Spanner, CockroachDB, Yugabyte). Each row write/read has timestamp; serializable order is timestamp order. TrueTime in Spanner; HLC in CockroachDB.
2.4 Write-ahead logging and ARIES
Mohan-Haderle-Lindsay-Pirahesh-Schwarz 1992 TOPLAS 17 ARIES is the canonical recovery algorithm:
- WAL. Force log to disk before flushing dirty data pages (write-ahead). Each log record has LSN (log sequence number).
- Repeating history during redo. Replay log from the last checkpoint, restoring exact state at crash.
- Compensation log records (CLRs). Logged during undo; allow incomplete undo to be re-applied without infinite regress.
- Steal/no-force buffer pool. Steal: dirty pages may be evicted before commit. No-force: committed pages need not be flushed at commit time. Maximizes throughput.
Production: PostgreSQL pg_wal, InnoDB redo log, Oracle redo logs, SQL Server transaction log, MongoDB journal/oplog. Group commit batches multiple commits into one fsync to amortize disk latency.
2.5 Snapshot management and garbage collection
PostgreSQL VACUUM (autovacuum default) reclaims dead tuples; VACUUM FULL rewrites table to compact (acquires AccessExclusiveLock). InnoDB undo purge (purge_threads workers). Oracle undo retention. SQL Server version store cleanup.
Long-running transactions block GC because their snapshots may still need old versions; the “xmin horizon” in PostgreSQL pins the oldest visible XID and prevents dead-tuple removal. Operational pathology: long ANALYZE or BI query holds back vacuum on busy OLTP system, accumulating bloat.
2.6 Hekaton — in-memory OLTP
Larson-Blanas-Diaconu-Freedman-Patel-Zwilling 2011 VLDB; SQL Server’s “Hekaton” engine. Native compilation of stored procedures to C; lock-free Bw-tree and hash indexes; MVCC with optimistic concurrency; no buffer pool (data permanently in memory) but durable via WAL. Modern in-memory engines (HANA, MemSQL/SingleStore, VoltDB, HyPer) share architectural elements.
3. Query processing
A query processor parses SQL, type-checks against the catalog, transforms the parse tree to a logical plan, optimizes to a physical plan, and executes.
3.1 Query optimizers
Two architectural families:
- System R / Volcano (Selinger-Astrahan-Chamberlin-Lorie-Price 1979 SIGMOD + Graefe 1993). Bottom-up dynamic programming over join orderings; cost models based on cardinality estimates and per-operator cost functions. Used by SQL Server’s classic optimizer, PostgreSQL, MySQL.
- Cascades / top-down memoization (Graefe 1995 Bulletin of the IEEE TCDE). Rule-driven transformation framework with a memo (DAG of equivalent expressions); applies transformation and implementation rules until no improvement. SQL Server’s modern QO, Microsoft Synapse, GPORCA (Greenplum), Apache Calcite, CockroachDB’s optimizer (cost-based, internally Cascades-influenced), Snowflake.
Critical optimizer subproblems:
- Cardinality estimation. Histograms (PostgreSQL pg_stats), TopK lists, sketches (HyperLogLog for distinct counts). Multi-column statistics (PostgreSQL CREATE STATISTICS, SQL Server filtered/multi-column stats) for correlated predicates. Persistent under-estimation of join cardinalities remains the dominant cause of bad plans; ML-based estimators (Kipf-Kemper-Ng-Mantilla-Halvorsen-Tatbul 2019; Wang-Ding-Zhang-Wu-Wang-Yu-Wang 2021) explore neural cardinality estimators.
- Join enumeration. Left-deep tree (System R) for limited search space; bushy tree for more flexibility. Dynamic programming O(3^n) for n tables (DPSize) or O(n × 2^n) (DPCCP, Moerkotte-Neumann 2006). Heuristic and genetic algorithms for very large queries (PostgreSQL GEQO).
- Cost model. Operator-level cost (I/O cost, CPU cost, memory cost). Calibration against the buffer pool fit and table sizes. Per-row vs per-page costing.
- Plan hints / outlines. PostgreSQL pg_hint_plan (Tatsuo Ishii et al.), Oracle hints, SQL Server query store + forced plans. Used to override the optimizer when statistics or cost model misleads.
3.2 Query execution models
- Volcano iterator model. Each operator pulls next() from its inputs; rows flow up the tree. Simple but high per-tuple overhead (a function call per row).
- Compiled query execution. Generate machine code (LLVM, Cranelift) per query. HyPer (Neumann 2011 VLDB) introduced data-centric compilation that materializes pipelines as tight loops with operator code inlined. Umbra (Neumann-Freitag 2020 CIDR) extends with adaptive execution. Spark Tungsten (Whole-Stage Codegen) and Snowflake compile queries to JVM bytecode/JIT.
- Vectorized execution. Process batches (typically 1024–8192 rows) per call rather than individual tuples; amortize call overhead. MonetDB X100 (Boncz-Zukowski-Nes 2005 CIDR “MonetDB/X100”), VectorWise (later Actian Vector), DuckDB, ClickHouse, Velox (Meta), Photon (Databricks). The vectorization vs codegen debate remains open: Kersten-Leis-Kemper-Neumann-Pavlo-Boncz 2018 VLDB “Everything You Always Wanted to Know about Compiled and Vectorized Queries but Were Afraid to Ask” argues each wins in different regimes.
3.3 Specific operators
- Hash join. Build phase reads smaller (build) input into a hash table on join keys; probe phase scans larger (probe) input, looking up matches. Linear-probing hash tables are cache-friendlier than chained. Grace hash join (Kitsuregawa-Tanaka-Moto-oka 1983 NGCS 1) partitions inputs by hash to handle build sides larger than memory; recursive partitioning when partitions exceed memory.
- Sort-merge join. Sort both inputs on join keys; merge. Useful when inputs are already sorted (e.g., on indexed keys) or when supporting non-equi joins.
- Nested-loop join. Used for small outer + indexed inner, or for non-equi joins where hash/merge are inapplicable. Block nested-loop reads outer in pages to amortize.
- Aggregation. Hash aggregation (group-by builds hash table on grouping keys); sort aggregation (already-sorted input streams through aggregate). HashAgg with spill to disk for large group cardinalities.
- Index intersection / bitmap heap scan. PostgreSQL bitmap index scans build a bitmap per qualifying index, AND/OR-combine, then fetch rows from heap in order. SQL Server index intersection.
3.4 Adaptive query processing
Runtime feedback adjusts plans:
- Adaptive Query Execution (AQE, Spark since 3.0). Re-plan after shuffle based on actual statistics; coalesce partitions, switch join strategy, handle skew.
- Photon adaptive execution (Databricks). Vectorized execution with runtime adaptation.
- HyPer adaptive compilation (Kohn-Leis-Neumann 2018 ICDE). Choose between interpreted, vectorized, and compiled execution per pipeline based on size and cost.
- PostgreSQL JIT (since 11). LLVM-based JIT compilation of expressions and tuple deforming for large queries.
4. Columnar engines and vectorization
Modern OLAP engines fully integrate columnar storage and vectorized execution.
4.1 DuckDB
Raasveldt-Mühleisen 2019 SIGMOD demo. Embedded analytic DBMS, columnar, vectorized. Predicate pushdown, late materialization, parallelism via morsel-driven scheduling (Leis-Boncz-Kemper-Neumann 2014 SIGMOD). C++ codebase; widely used as the Python/R analytic backend.
4.2 ClickHouse
Yandex 2016. Columnar MergeTree engine. Primary key sparse index, granule-based skipping, secondary skipping indexes (min-max, set, bloom), data parts merged in background like LSMs (despite being columnar). Strong on time-series and immutable log analytics.
4.3 Velox (Meta)
Pedreira-Erling-Aravind-Anantha-Behera-Borkar-Brost-Calheiros-Carosi-Choi-Cui-Du-Garlick-Gindel-Goldfeder-Hong-Hua-Karp-Kasturi-Kobazev-Kollipara-Kostarev-Kotrabhavi-Lawler-Le-Liu-Marwah-Mathew-Mathur-Meng-Mishra-Mistry-Mukherjee-Neumann-Niedzwiedz-Ouyang-Paalanen-Padalkar-Pal-Pandey-Park-Patwardhan-Peng-Pirahesh-Pohorile-Prasad-Prasad-Premkumar-Qiao-Raman-Rao-Ren-Rocha-Sastry-Sengupta-Shah-Sinha-Sridharan-Subbaiah-Tao-Vakharia-Walisko-Walters-Wang-Wang-Xie-Yang-Yang-Yu-Zatterer-Zhang-Zhang-Zhao-Zhao 2022 VLDB. C++ vectorized execution library used by Presto, Spark (Velox-Spark), F3 (Meta’s analytics), and PyTorch DataLoader. Designed for embedding.
4.4 Apache Arrow + DataFusion + Polars
DataFusion (Rust, Apache) is a query engine on Arrow; Polars uses similar foundations as a DataFrame library with lazy/streaming evaluation. Both compete with pandas + DuckDB for in-process analytics.
4.5 Apache Druid, Pinot, ClickHouse, Doris, StarRocks
OLAP / real-time analytics systems with various trade-offs between freshness, latency, and SQL completeness.
4.6 Vectorized kernel libraries
Folly (Meta), Highway (Google), Intel ISPC, oneDAL, Apache Arrow Compute, Velox VectorOps. Provide SIMD-accelerated primitives for filtering, hashing, aggregation, sorting.
5. Distributed databases
Distribution multiplies the design space.
5.1 Replication models
- Single-leader replication. All writes go to leader; leader streams log to followers. PostgreSQL streaming replication, MySQL binlog replication, MongoDB replica set.
- Multi-leader (active-active). Each replica accepts writes; conflicts resolved by last-write-wins, version vectors, or CRDTs. Postgres BDR, MySQL Group Replication, CouchDB.
- Leaderless. Quorum reads/writes; Cassandra, ScyllaDB, DynamoDB. Tunable consistency (ONE, QUORUM, ALL). Hinted handoff and read-repair for eventual consistency.
- Consensus-based (Raft, Paxos). Strong consistency. CockroachDB, Yugabyte, Spanner, TiDB, FoundationDB, etcd, Consul.
5.2 Spanner and TrueTime
Corbett-Dean-Epstein-Fikes-Frost-Furman-Ghemawat-Gubarev-Heiser-Hochschild-Hsieh-Kanthak-Kogan-Li-Lloyd-Melnik-Mwaura-Nagle-Quinlan-Rao-Rolig-Saito-Szymaniak-Taylor-Wang-Woodford 2012 OSDI — Spanner is Google’s globally distributed SQL database. Each Spanner shard runs a Paxos group across multiple zones. The key innovation is TrueTime: an API exposing time as an interval [earliest, latest] synchronized via GPS receivers and atomic clocks. Spanner commits at a timestamp t and waits until TT.now() > t before acknowledging — guaranteeing external consistency. Read-only transactions execute at a chosen safe timestamp without locking.
5.3 Calvin
Thomson-Diamond-Weng-Ren-Shao-Abadi 2012 SIGMOD — Calvin sequences transactions through a global ordering layer (a deterministic agreement protocol), then executes deterministically at replicas without further coordination. Used by FaunaDB.
5.4 CockroachDB
Taft-Sharif-Matei-Mohan-Mirzaie-Mubarak-Cao-Chen-Fan-Loo-Manhas-Patil-Pillai-Rocha-Sankalp-Su-Wong-Wu-Yokota-Yulianus-Yermolaev-Yuan-Zhang-Zhou-Zhou-Zhou 2020 SIGMOD. Distributed SQL inspired by Spanner; uses hybrid logical clocks (HLC, Kulkarni-Demirbas-Madappa-Avva-Leone 2014) instead of GPS-synchronized TrueTime. Range-based partitioning of the keyspace; each range is a Raft group. Online schema changes via the F1 schema-change protocol (Rae-Rollins-Shute-Sodhi-Vingralek 2013 VLDB).
5.5 Yugabyte
YugabyteDB layers PostgreSQL frontend on a Raft-replicated DocDB storage. Multi-region with read replicas, tablet-level Raft groups.
5.6 TiDB and TiKV
PingCAP. TiKV is a Raft-replicated RocksDB-backed KV store; TiDB layers a MySQL-compatible SQL engine. TiFlash provides columnar replicas for HTAP via the same Raft groups.
5.7 FoundationDB
Zhou-Bao-Bond-Chen-Cubukcu-Doswell-Doyle-Helfgott-Jia-Khan-Klein-Klosterhalfen-Lan-Liu-Lloyd-Lochner-McKusick-Patton-Pulapaka-Ratiu-Rini-Sankaran-Schmidt-Tomic-Toler-Wuolijoki 2021 SIGMOD. Ordered key-value store with ACID transactions via OCC; deterministic simulation testing. Used by Apple (CloudKit), Snowflake (metadata), and others. Layered architecture: any data model is implementable as a layer on top of the KV core.
5.8 Aurora
Verbitski-Gupta-Saha-Brahmadesam-Gupta-Mittal-Krishnamurthy-Maurice-Kharatishvili-Bao 2017 SIGMOD (and 2018 SIGMOD multi-master extension). AWS-only PostgreSQL/MySQL fork with separated storage. Log records are shipped to a multi-tenant storage layer that maintains 6 copies across 3 AZs; quorum 4-of-6 writes, 3-of-6 reads. Compute instances are stateless; failover is fast because storage outlives them. Aurora Limitless Database (Re:Invent 2023) adds sharding.
5.9 PlanetScale, Neon
Serverless / branchable Postgres and MySQL. Neon (2022) separates compute from storage with a Pageserver/Safekeeper architecture; supports branching via copy-on-write. PlanetScale (Vitess-based MySQL) provides online schema changes via ghost-table technique.
5.10 HTAP — hybrid transactional/analytical
A single system serving both OLTP and OLAP. Approaches:
- Single engine. SAP HANA, MemSQL/SingleStore, Oracle In-Memory.
- Co-located row + column replicas. TiDB + TiFlash; SQL Server columnstore.
- Decoupled with consistent replication. Snowflake Unistore, Cockroach + change data capture to data warehouse.
5.11 Shared-disk vs shared-nothing
- Shared-disk. Oracle RAC, IBM PureScale, Aurora, Snowflake. Storage is shared; compute coordinates via distributed lock manager (cache fusion in Oracle RAC).
- Shared-nothing. Postgres-XL, CockroachDB, Yugabyte, Spanner. Each node owns its data partitions.
5.12 Vitess
Slawski-Ushakov-Naik-… Originally YouTube’s MySQL sharding layer; later open-source; powers PlanetScale, GitHub, Slack, Square, Shopify (in part), Pinterest. Provides query routing, connection pooling, online resharding via VReplication.
6. Time-series, graph, vector databases
Specialised engines optimised for non-relational workloads:
- Time-series. InfluxDB IOx (Apache Arrow + DataFusion), TimescaleDB (PostgreSQL extension), QuestDB, Prometheus TSDB, M3DB, ClickHouse for high-cardinality time-series, VictoriaMetrics.
- Graph. Neo4j (Cypher), Memgraph, JanusGraph, Amazon Neptune, TigerGraph, Dgraph. RDF stores: Apache Jena, Stardog, GraphDB.
- Vector / similarity. pgvector (Postgres extension), Pinecone, Milvus, Weaviate, Qdrant, Vespa, ScaNN, FAISS. See rag-embeddings-vector-search for the full taxonomy.
- Document. MongoDB, Couchbase, ArangoDB, RavenDB.
- Wide-column. Cassandra, ScyllaDB (C++ rewrite of Cassandra protocol), HBase, Google Bigtable.
- Key-value. Redis, Memcached, etcd, Aerospike.
- Embedded. SQLite (still by far the most-deployed database), DuckDB, RocksDB-as-library.
7. Cloud-native warehouses
- Snowflake. Multi-cluster, shared-data architecture. Storage on S3; compute is “virtual warehouses” (independent clusters per workload). Micro-partitions of FDN (Snowflake’s columnar format). Time travel via undrop and time travel periods (1–90 days). Cloning is metadata-only.
- BigQuery. Dremel architecture (Melnik-Gubarev-Long-Romer-Shivakumar-Tolton-Vassilakis 2010 VLDB; updated 2020 paper). Capacitor columnar storage on Colossus; Borg/Dremel query execution; serverless.
- Redshift. Originally PostgreSQL fork with ParAccel columnar; recently rearchitected as Redshift RA3 with managed storage. Spectrum for external Parquet/ORC.
- Databricks SQL / Photon. Spark + Photon vectorized execution; Delta Lake storage.
- Synapse, Microsoft Fabric (Lakehouse). Azure analytics platform; Polaris distributed engine.
- Firebolt, ClickHouse Cloud, MotherDuck. Smaller-vendor entrants.
Decoupled storage and compute is the defining architecture: data lives in cheap object storage (S3, GCS, Azure Blob), compute is elastic, often spot-priced, and stateless. The shift away from disk-attached MPP (Vertica, Greenplum, Netezza) toward separated storage is largely complete in greenfield deployments.
8. Open table formats and the lakehouse
- Apache Iceberg (Netflix-Apple-Apple-Tabular). Manifest-list + manifest-files structure tracks all data and delete files. Snapshot isolation, schema evolution, partition evolution (no rewrite). Row-level deletes via equality-delete-files and positional-delete-files. Iceberg V3 (2024) adds variant data type, default values, and more.
- Delta Lake (Databricks). Transaction log with JSON commits; checkpoint files every 10 commits. Optimised for Spark workloads; recent Delta Universal Format (UniForm) bridges to Iceberg metadata.
- Apache Hudi (Uber). Copy-on-write or merge-on-read. Primary key with upsert semantics.
- Apache Paimon (Flink CDC origin). Streaming-first lakehouse table format.
Catalog services: Iceberg REST Catalog, AWS Glue Data Catalog, Unity Catalog (Databricks), Polaris Catalog (Snowflake open-sourced 2024).
9. Replication, change data capture, streaming
- Logical replication. PostgreSQL logical decoding (since 9.4) emits decoded WAL changes; pglogical and Debezium consumers. MySQL binlog replication.
- Debezium. Apache Kafka Connect CDC pipeline supporting Postgres, MySQL, MongoDB, Oracle, SQL Server, Cassandra, Db2.
- Apache Flink CDC + Apache Iceberg + Apache Paimon. Streaming lakehouse stack.
- Kafka, Apache Pulsar, AWS Kinesis, Google Pub/Sub. Log-based brokers underlying CDC pipelines.
- Materialized views and incremental view maintenance. Materialize (Differential Dataflow, McSherry-Murray-Isaacs-Isard 2013 SOSP); Snowflake Dynamic Tables; PostgreSQL pg_ivm extension.
10. Operational concerns
- Backups. Physical backups (pg_basebackup + WAL archiving for PITR; xtrabackup for MySQL); logical (pg_dump, mysqldump). Backup verification (restore drills) is the practical guarantee. Object-storage-based backup tools: pgBackRest, WAL-G, Barman.
- Schema migrations. Online vs blocking. Tools: gh-ost (GitHub), pt-online-schema-change (Percona), pg_repack, Lever (Stripe), Reshape (Fabian), Atlas (Ariga), Sqitch, Liquibase, Flyway. Avoiding locks: shadow table + trigger-based copy + rename swap.
- Connection pooling. pgbouncer, Odyssey, PgCat for Postgres; ProxySQL, MaxScale for MySQL; RDS Proxy. Pooling modes: session, transaction, statement.
- Index maintenance. Bloat detection, REINDEX CONCURRENTLY (Postgres), online index rebuild (Oracle, SQL Server).
- Statistics maintenance. ANALYZE (Postgres), UPDATE STATISTICS (SQL Server, Oracle), autostats jobs.
- Observability. pg_stat_statements, pg_stat_activity (Postgres); performance_schema, sys.* (MySQL); v$views (Oracle); QueryStore (SQL Server); MongoDB profiler. eBPF-based tools: dbtop, mongocat. APM integration: Datadog DBM, New Relic, Honeycomb.
11. Hardware trends and storage
- NVMe SSDs. Sub-100µs read latencies; 1–7 GB/s throughput. Storage-class memory (Intel Optane, defunct since 2022 product line cancellation) blurred RAM and disk; CXL.mem may reincarnate the architecture.
- Persistent memory (PMEM). Direct CPU-mapped persistent storage with byte-addressable load/store; engines specialising on PMEM include FaRM (Microsoft), Pelican, OurDB. Intel Optane discontinuation slowed adoption.
- RDMA networking. Sub-2µs RPC across InfiniBand or RoCE. Database systems exploiting RDMA: FaRM, FaSST, NAM-DB, Tell, RemusDB.
- Disaggregated memory and CXL. Compute Express Link 2.0/3.0 enables cache-coherent memory pooling; databases (Aurora, Snowflake-style) anticipate using CXL fabrics for shared buffer pools.
- Custom hardware. Oracle Exadata smart-scan offloading to storage cells; Aurora cooked logical storage; Azure SQL hyperscale with page server tier; AWS Aurora I/O-Optimized.
12. Benchmarking and validation
- TPC benchmarks. TPC-C (OLTP), TPC-H (decision support), TPC-DS (decision support, more queries), TPC-DI (data integration), TPCx-AI (AI workloads). Submit-and-audit process publishes verified results.
- YCSB (Yahoo Cloud Serving Benchmark, Cooper-Silberstein-Tam-Ramakrishnan-Sears 2010 SoCC). Standard for KV store benchmarking.
- Sysbench, HammerDB. OLTP benchmarks; HammerDB implements TPC-C-like and TPC-H-like.
- JOB (Join Order Benchmark, Leis-Gubichev-Mirchev-Boncz-Kemper-Neumann 2015 VLDB). IMDB-based queries to stress optimizer cardinality estimates and join orderings.
- STAR Schema Benchmark (SSB), LDBC Social Network Benchmark.
- CRDT correctness (Jepsen). Kingsbury (Kyle Kingsbury) has documented isolation, consensus, and consistency bugs in MongoDB, Cassandra, CockroachDB, Postgres, FoundationDB, etcd, ZooKeeper, MongoDB, Yugabyte, TiKV, FaunaDB, and many others. Jepsen reports are the de facto correctness audit for distributed databases.
13. Buffer pools and caching
The buffer pool caches database pages in DRAM; cache replacement and write-back policies determine OLTP throughput.
13.1 Replacement policies
- LRU (Least Recently Used). Standard baseline. Susceptible to scan resistance — a large sequential scan can flush hot pages.
- LRU-K (O’Neil-O’Neil-Weikum 1993 SIGMOD). Track K-most-recent accesses; LRU-2 is common.
- 2Q (Johnson-Shasha 1994 VLDB). Two queues protect against scan pollution.
- Clock / Clock-Pro. Approximate LRU with one-bit references; widely used (PostgreSQL clock-sweep, InnoDB Adaptive Hash + LRU).
- ARC (Adaptive Replacement Cache, Megiddo-Modha 2003 FAST). Self-tuning between recency and frequency; previously default in IBM DB2, now in Apple file systems.
- CAR / CART. Clock variants of ARC.
PostgreSQL uses clock-sweep with shared_buffers typically 25% of RAM; relies on OS page cache for additional buffering (one of few systems doing this). InnoDB uses LRU with a “young/old” sublist split (default 37% of pool old) to resist scan pollution.
13.2 Direct I/O vs buffered I/O
InnoDB uses O_DIRECT to bypass OS cache and avoid double caching; PostgreSQL uses buffered I/O. The OS-cache approach is simpler but loses fine-grained control over write-back ordering — a hazard for crash consistency that PG mitigates via fsync() and O_DSYNC. PostgreSQL 16 added I/O concurrency tunables; PG 17 introduced asynchronous I/O.
13.3 Write-back and checkpointing
Dirty pages are eventually flushed via checkpointer (Postgres background writer + checkpointer) or cleaner threads (InnoDB page cleaner). Aggressive checkpointing reduces recovery time but creates write storms; gentle checkpointing spreads I/O. The checkpoint timeout / max_wal_size (Postgres) and innodb_io_capacity (MySQL) tune the trade-off.
13.4 Compression in the buffer pool
ZSTD and LZ4 compression at the page or column-chunk level; sometimes only on disk, sometimes also in memory. Reduces effective working set at CPU cost.
14. Indexes beyond B-trees
14.1 Hash indexes
O(1) point lookups; no range support. PostgreSQL hash indexes (WAL-logged since 10), Redis hash maps. Linear hashing (Litwin 1980) and extendible hashing (Fagin-Nievergelt-Pippenger-Strong 1979) avoid full rehash on growth.
14.2 GiST, GIN, BRIN, SP-GiST (PostgreSQL)
- GiST (Generalized Search Tree, Hellerstein-Naughton-Pfeffer 1995 VLDB). Pluggable tree for spatial, full-text, range types. Underlies PostGIS R-trees, full-text trigrams.
- GIN (Generalized Inverted Index). Inverted index for arrays, jsonb, full-text. Maintains a B-tree of keys, each pointing to a posting list.
- BRIN (Block Range INdex). Tiny summary per page range (typically min-max per 128 pages); only effective on physically-ordered data. Used for time-series.
- SP-GiST (Space-Partitioned GiST). k-d trees, quad trees, radix trees for non-balanced spatial data.
14.3 R-trees and spatial indexes
R-tree (Guttman 1984), R*-tree (Beckmann-Kriegel-Schneider-Seeger 1990) — overlapping bounding boxes; standard for spatial queries. R+-tree and Hilbert R-tree variants reduce overlap. PostGIS uses GiST-wrapped R-trees. Spatial-disjoint Z-order and Hilbert curves provide alternatives.
14.4 Trie and radix structures
Adaptive Radix Tree (ART, Leis-Kemper-Neumann 2013 ICDE) used in HyPer, DuckDB, Umbra, MemSQL. Path compression and adaptive node fanout. HOT (Height Optimized Trie, Binna-Zangerle-Pichl-Specht-Leis 2018 SIGMOD).
14.5 Learned indexes
Kraska-Beutel-Chi-Dean-Polyzotis 2018 SIGMOD — replace B-tree with cumulative-distribution-function model (e.g., piecewise linear or neural network); lookup returns predicted position plus error bound. Performance impressive in static workloads; updates more challenging. Subsequent work: ALEX (Ding-Minhas-Zhang-Yan-Kraska-Chaudhuri-Chandramouli 2020 SIGMOD), PGM-Index (Ferragina-Vinciguerra 2020 VLDB), RadixSpline.
14.6 Vector indexes
HNSW (Hierarchical Navigable Small World, Malkov-Yashunin 2018 IEEE TPAMI), IVF, IVF-PQ, ScaNN, SPTAG, DiskANN. Approximate nearest neighbor search. See rag-embeddings-vector-search for the full taxonomy. pgvector (PostgreSQL extension) integrates HNSW and IVF into a standard SQL database.
15. Streaming, materialized views, and incremental computation
Materialized views precompute query results; the key challenge is maintaining them under base-table changes.
15.1 Incremental view maintenance
- Counting algorithm (Gupta-Mumick-Subrahmanian 1993 SIGMOD). Maintain count of derivations per result row; update on delta.
- Recursive Datalog (DReD — Delete-Rederive-Insert). Handles recursion at cost of redundant work.
- Differential dataflow (McSherry-Murray-Isaacs-Isard 2013 SOSP). Streaming computation with logical timestamps; supports recursion. Powers Materialize.
- Higher-order incremental computation (DBSP, Budiu-Chajed-McSherry-Ryzhyk-Tannen 2023 VLDB). Mathematical foundation; circuits over differential streams.
15.2 Materialize, ksqlDB, Snowflake Dynamic Tables, Apache Pinot real-time tables, BigQuery materialized views
Production systems implementing IVM with different consistency, latency, and SQL-feature trade-offs.
15.3 Stream processing engines
Apache Flink, Apache Beam, Spark Structured Streaming, Apache Kafka Streams, Apache Storm (legacy). Distinct semantics around event-time vs processing-time, watermarks, exactly-once vs at-least-once guarantees. Akidau-Bradshaw-Chambers-Chernyak-Fernández-Moctezuma-Lax-McVeety-Mills-Perry-Schmidt-Whittle 2015 VLDB “The Dataflow Model” formalised event-time stream processing.
16. Geo-distribution patterns
16.1 Read replicas and follower reads
Asynchronously replicated read-only replicas serve geographically distant clients. Stale reads accept eventual consistency for latency. CockroachDB Bounded Staleness Reads, Spanner read-only transactions, AWS Aurora Global Database (with sub-second lag claims).
16.2 Multi-region writes
Spanner multi-region tables with leader placed in a configured region. CockroachDB MULTIREGION SURVIVE REGION FAILURE. Yugabyte tablespaces. Trade-offs: write latency floor by Paxos round-trips, durability vs availability under regional partition.
16.3 Edge databases
LiteFS (Fly.io), Turso (libSQL), Cloudflare D1 — SQLite replication at edge with primary-region writes. Targeting low-latency read at point of presence.
17. Open problems
- Optimizer cardinality estimation. Persistent gap between estimated and actual cardinalities in complex queries; ML-based estimators promising but not widely deployed.
- Adaptive vs compiled vs interpreted. Right answer depends on workload; runtime adaptation between modes is incompletely automated.
- Distributed transactions across heterogeneous engines. No standard. XA exists but is rarely used in microservice architectures; Saga pattern (Garcia-Molina-Salem 1987 SIGMOD) is the modern alternative but provides only eventual consistency.
- Schema evolution at scale. Online schema changes remain risky; rollback under load is hard.
- Vector database integration. Where vector indexes fit relative to standard B-tree/hash indexes — within the same transactional engine vs separate systems — is still being worked out.
- Hybrid OLAP/OLTP cost model. Mixing analytical and transactional in a single optimizer is not solved.
- Cost models for cloud-priced execution. Sky computing — choosing the cheapest cloud per query — is research-stage.
Further reading
- Hellerstein, J. M., M. Stonebraker, and J. Hamilton 2007. “Architecture of a Database System.” Foundations and Trends in Databases 1.
- Garcia-Molina, H., J. D. Ullman, and J. Widom 2008. Database Systems: The Complete Book (2nd ed.).
- Kleppmann, M. 2017. Designing Data-Intensive Applications.
- Petrov, A. 2019. Database Internals.
- Bernstein, P. A. and E. Newcomer 2009. Principles of Transaction Processing (2nd ed.).
- Mohan, C. et al. 1992. “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.” TOPLAS 17.
- Selinger, P. G. et al. 1979. “Access path selection in a relational database management system.” SIGMOD.
- Berenson, H. et al. 1995. “A critique of ANSI SQL isolation levels.” SIGMOD.
- Graefe, G. 1993. “Query evaluation techniques for large databases.” ACM Computing Surveys 25.
- Cahill, M. J., U. Röhm, and A. Fekete 2008. “Serializable isolation for snapshot databases.” SIGMOD.
- Neumann, T. 2011. “Efficiently compiling efficient query plans for modern hardware.” VLDB.
- Boncz, P. A., M. Zukowski, and N. Nes 2005. “MonetDB/X100: Hyper-pipelining query execution.” CIDR.
- Kemper, A. and T. Neumann 2011. “HyPer: A hybrid OLTP & OLAP main memory database system based on virtual memory snapshots.” ICDE.
- Corbett, J. C. et al. 2012. “Spanner: Google’s globally-distributed database.” OSDI.
- Verbitski, A. et al. 2017. “Amazon Aurora: Design considerations for high throughput cloud-native relational databases.” SIGMOD.
- Taft, R. et al. 2020. “CockroachDB: The resilient geo-distributed SQL database.” SIGMOD.