Database Internals — Compute Reference

1. At a glance

Every general-purpose database management system, OLTP or OLAP, is built from the same small set of architectural layers. A client process opens a connection and sends SQL (or a similar query language) over the wire. A parser tokenizes the input and produces an abstract syntax tree. A binder resolves names against the catalog. A query planner / optimizer chooses physical operators and an execution order using table statistics. The executor pulls or pushes tuples through that operator tree. Beneath the executor, the storage engine materializes pages, manages the buffer pool, persists changes to a write-ahead log, and handles concurrency control. A transaction manager and recovery manager guarantee atomicity and durability across crashes.

What differs between engines is mostly how each of those layers is built and which trade-offs they bias toward. The single biggest discriminator is the storage engine, and within storage engines the dominant split is between B-tree page systems (PostgreSQL, MySQL InnoDB, SQL Server, Oracle, SQLite, LMDB) and log-structured merge tree systems (RocksDB, LevelDB, Cassandra, ScyllaDB, HBase, and the storage layer underneath CockroachDB, TiDB, and YugabyteDB). Columnar OLAP engines (Snowflake, BigQuery, Redshift, ClickHouse, DuckDB, Druid, Pinot) reuse the same skeleton but reorder data by column instead of by row, and they execute in vectorized batches rather than tuple at a time. Once you understand the row-store skeleton + one columnar engine, every new database largely fits the pattern.

2. Storage engines

2.1 B-tree and B+-tree

The B-tree, introduced by Rudolf Bayer and Edward McCreight in 1972 (“Organization and Maintenance of Large Ordered Indices”, Acta Informatica), is a self-balancing search tree with high fan-out designed for block-oriented storage. Each node fills one disk page (typically 4 KiB to 16 KiB). With fan-out f and N keys the tree height is O(log_f N), which is usually three to five levels for tables up to the low billions of rows — so any point lookup is at most a handful of page reads.

The B+-tree is the variant that virtually all production relational databases actually use. It stores keys and values only in leaf pages, and the leaves are linked in a doubly-linked list, which makes range scans cheap (walk one leaf to the next without going back up the tree). PostgreSQL’s default index access method, MySQL InnoDB’s clustered primary index, SQL Server’s clustered index, Oracle’s index-organized tables, and SQLite’s tables are all B+-trees.

Updates are in place: to change a value, the engine finds the right page, modifies it, and (eventually) writes the modified page back. Pages are split when they overflow and merged or rebalanced when they underflow. The cost is write amplification — to change one byte you must rewrite at least one whole page plus its WAL record — and contention on hot pages from concurrent writers.

2.2 Log-structured merge tree (LSM)

The LSM tree was introduced by Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil in 1996 (“The Log-Structured Merge-Tree”, Acta Informatica). Writes go first to an in-memory ordered structure (the memtable, typically a skip list or red-black tree) plus an append-only WAL. When the memtable reaches a threshold it is flushed as an immutable, sorted file on disk called an SSTable (Sorted String Table — the term originates from Google’s Bigtable paper). Background compaction merges SSTables in tiers or levels to reclaim space from overwritten or deleted keys and to keep read paths short.

LSM trees are write-optimized: every write is a sequential append to memory and then to disk, which is dramatically faster than the random page I/O of a B-tree. The cost is read amplification (a read may have to check the memtable plus several SSTables before finding a key) and space amplification (obsolete versions live on disk until compaction removes them). Bloom filters per SSTable + per-file min/max key indexes mitigate read amp by skipping files that cannot contain the key.

Production LSM engines: RocksDB (Facebook, 2012, forked from LevelDB), LevelDB (Google, 2011), Cassandra and ScyllaDB, HBase, MongoDB’s WiredTiger (configurable as LSM), and the Pebble engine (CockroachDB’s Go-native rewrite of RocksDB, 2020).

2.3 The trade-off in one line

PropertyB-treeLSM
Write amplificationHigher (page-grain)Lower (append)
Read amplificationLower (one path)Higher (memtable + N SSTables)
Space amplificationLower (in place)Higher (until compaction)
Sequential write rateModerateVery high
Update-in-place workloadExcellentPenalized
Append-mostly workloadGoodExcellent
Predictable read latencyYesSubject to compaction jitter

2.4 Hybrids and other shapes

  • LMDB (Howard Chu, 2011) — B+-tree with copy-on-write pages and a single writer; gives MVCC reads without locking, used inside OpenLDAP and many embedded systems.
  • Bw-tree (Levandoski, Lomet, Sengupta, ICDE 2013) — latch-free B-tree with delta records, used in Microsoft Hekaton in-memory engine and Azure Cosmos DB.
  • Fractal tree (Tokutek, ca. 2007) — B-tree variant with cascaded message buffers at internal nodes; shipped as TokuDB / TokuMX, now in Percona Server.
  • Heap + secondary indexes — PostgreSQL stores tuples in an unordered heap file and points to them from every index, including the primary key index. Trade-off: cheap secondary indexes (only one place to update the row) but a HOT/VACUUM machinery to manage dead tuples.

3. Indexing

Indexes are auxiliary structures the planner uses to skip work. Choosing them correctly often matters more than every other tuning lever combined.

3.1 Clustered vs non-clustered

A clustered (primary) index defines the physical order of the table itself — InnoDB, SQL Server, and Oracle IOTs all cluster the heap on the primary key. Lookups by primary key are essentially free; secondary index entries store the primary key as the row pointer, so secondary lookups do two B+-tree descents.

A non-clustered index sits alongside an unordered heap (PostgreSQL, Oracle default). Each index entry holds a physical row pointer (the ctid in Postgres). Updates that move a row (HOT updates avoid this) must update every secondary index.

3.2 Other index shapes

  • Hash index — O(1) expected lookup, no range or order. Postgres has a WAL-logged hash AM; MySQL’s MEMORY engine has hash indexes natively.
  • Inverted index — token → posting list of doc IDs. The foundation of Lucene, Elasticsearch, OpenSearch, Solr, and Postgres GIN. Used for full-text search and for arrays and JSONB containment queries.
  • Bitmap index — one bit per row per distinct value; ideal for very low cardinality columns in OLAP. Native in Oracle and in Postgres’ bitmap-heap scan plan node (built on the fly during query execution).
  • R-tree (Guttman 1984) and quadtree — spatial / multidimensional. PostGIS uses GiST, which can implement R-tree; SQL Server and Oracle Spatial have their own implementations.
  • GiST and SP-GiST — generalized search trees in Postgres; pluggable operator classes implement R-tree, range types, trigram, ltree, and more.
  • BRIN (Block Range Index, Postgres 9.5) — stores min/max per range of table blocks; tiny, useful for naturally-ordered append-only data (event logs, time-series).
  • Bloom filter (Burton Bloom 1970) — probabilistic set membership with one-sided false positives. Used inside LSM SSTables to skip files, and as a Postgres bloom extension index for many-column equality predicates.
  • Covering / index-only scans — when every column the query needs lives inside the index, the executor never visits the heap. Postgres calls these “INCLUDE columns”; SQL Server calls them “included columns”.

3.3 Selectivity and cardinality

The optimizer’s decision to use an index hinges on selectivity — what fraction of the table satisfies the predicate. Statistics (histograms, most-common-value lists, n_distinct, correlation) are maintained by ANALYZE in Postgres, by dbms_stats in Oracle, and by automatic update in SQL Server. Outdated stats are one of the most common sources of catastrophic plan regressions.

4. MVCC — Multi-Version Concurrency Control

MVCC was invented by Reed (MIT TR-205, 1978) and popularized in commercial systems by Oracle in the early 1980s. Its core idea: readers never block writers and writers never block readers, because every modification creates a new version of the row rather than overwriting the old one. Each transaction reads from a consistent snapshot defined by a transaction ID or timestamp.

4.1 PostgreSQL implementation

Every tuple carries xmin (the transaction ID that inserted it) and xmax (the transaction ID that deleted or updated it). A transaction’s visible set is computed from its snapshot: a tuple is visible if its inserting txn is committed and was committed-before-snapshot, and its deleting txn is not committed-before-snapshot. Updates are implemented as xmax on the old tuple + a new tuple with fresh xmin — old versions live in the heap until VACUUM reclaims them. Long-running transactions hold back the “oldest-visible-xmin” horizon and prevent VACUUM from reclaiming dead tuples, causing bloat.

4.2 Other implementations

  • Oracle — old versions live in UNDO segments; the heap always holds the latest row; readers reconstruct the snapshot by walking UNDO.
  • InnoDB — same shape as Oracle: clustered B+-tree of current rows + an undo log in the system tablespace; UNDO is also used for rollback.
  • SQL Server — historically lock-based; “READ_COMMITTED_SNAPSHOT” and “SNAPSHOT” levels added MVCC via the version store in tempdb (2005).
  • Spanner — every row carries a TrueTime commit timestamp; reads at a timestamp T see the version with the largest commit_ts ≤ T. The timestamp-based MVCC plus paxos groups give external consistency.

4.3 vs lock-based concurrency

Two-phase locking (Eswaran et al. 1976) is the classic alternative: readers take shared locks, writers take exclusive locks, and locks are held until commit. It is simpler to reason about for some workloads, but readers and writers serialize. Most modern engines default to MVCC and use locking only for write-write conflicts and for explicit SELECT ... FOR UPDATE.

5. ACID and isolation levels

The acronym (Härder + Reuter 1983) describes four guarantees:

  • Atomic — a transaction either fully commits or fully aborts. Implemented via the WAL: redo records replay committed work after a crash; undo records roll back uncommitted work.
  • Consistent — application-level invariants and database constraints hold before and after the transaction. The DBMS enforces declared constraints (foreign keys, CHECK, NOT NULL, UNIQUE); the application is responsible for the rest.
  • Isolation — concurrent transactions appear to execute in some serial order, modulated by the chosen isolation level (below).
  • Durable — once COMMIT returns success, the transaction survives a crash. Achieved by fsync of the WAL to local storage and / or by synchronous replication to a quorum.

5.1 The four ANSI SQL levels

The standard defines isolation in terms of three classical anomalies (dirty read, non-repeatable read, phantom read):

LevelDirty readNon-repeatablePhantomCommon defaults
Read UncommittedPossiblePossiblePossiblerare in practice
Read CommittedPreventedPossiblePossiblePostgreSQL, Oracle, SQL Server
Repeatable ReadPreventedPreventedPossibleMySQL InnoDB
SerializablePreventedPreventedPreventedstrict workloads

The ANSI definitions are written against locking implementations and have known holes; Berenson, Bernstein, Gray, Melton, O’Neil, and O’Neil’s 1995 SIGMOD paper “A Critique of ANSI SQL Isolation Levels” formalizes the gaps and introduces snapshot isolation.

5.2 Snapshot isolation and SSI

Snapshot isolation (SI) — a txn reads from a consistent snapshot taken at its start and commits only if no other concurrent txn wrote any of the same rows (the first-committer-wins rule, sometimes implemented as first-updater-wins via row locks). SI prevents the three ANSI anomalies but permits write skew: two transactions read overlapping sets and each updates a disjoint row, jointly violating an invariant. Berenson’s canonical example is the on-call doctors example.

Serializable Snapshot Isolation (SSI) (Cahill, Röhm, Fekete, SIGMOD 2008) adds runtime tracking of read-write dependencies on top of SI and aborts one transaction in a cycle. Postgres ships SSI as SERIALIZABLE since 9.1. CockroachDB implements its own SSI variant.

5.3 Read phenomena beyond ANSI

  • Lost update — two txns read, modify, and write back the same row; the later write overwrites the earlier.
  • Read skew — within a txn the data drifts: read A, then read B after B was updated by another committed txn.
  • Write skew — described above.
  • Phantom write / G2-item — discussed in Adya, Liskov, O’Neil (ICDE 2000) for “Generalized Isolation Level Definitions” which replaces the ANSI anomaly catalog with implementation-agnostic definitions.

6. Write-ahead log (WAL)

The write-ahead log rule is: any change to a page must be recorded in the log and the log must be flushed to durable storage before the modified page itself can be written back. The rule was formalized for industrial systems by ARIES (Mohan, Haderle, Lindsay, Pirahesh, Schwarz, TODS 1992), which is still the template every relational engine follows.

6.1 What the WAL stores

Two complementary records per change:

  • Redo — enough information to re-apply the change to a page in an old state.
  • Undo — enough information to back out the change for a transaction rollback or to roll back uncommitted work after a crash.

PostgreSQL’s WAL stores logical-physical (XLogRecord) entries with full-page images for the first change to a page after each checkpoint. InnoDB has a redo log (ib_logfile) and a separate undo log (in the system tablespace). Oracle has the online redo log and undo segments. SQL Server has the transaction log (.ldf).

6.2 Recovery

ARIES recovery has three phases: analysis (replay the log forward from the last checkpoint to determine which transactions were in flight and which pages were dirty), redo (replay all logged changes — including those of uncommitted transactions — to bring pages forward), and undo (use the undo log to roll back the still-uncommitted transactions). The trick is that redo is physiological — log records describe page-level changes in terms of slot offsets, so applying them is idempotent and order-preserving.

6.3 Group commit and fsync batching

fsync is expensive (one disk round-trip or one NVMe sync IO). To preserve throughput, engines batch many transactions’ commit records and issue one fsync, returning success to all of them at once. This is group commit. The cost is a small latency increase for individual transactions; the win is sometimes 10× to 50× more commits per second.

7. Query execution

7.1 The compilation pipeline

SQL text → tokenizer → parser → AST → analyzer (binder) → logical plan → optimizer → physical plan → executor.

The logical plan is a relational-algebra tree (Project, Filter, Join, Scan, Aggregate). The optimizer applies rewrites (predicate pushdown, projection pushdown, subquery flattening, join reordering) and then chooses a physical plan operator for each logical operator using cost estimates.

7.2 Physical operators

  • ScansSeqScan (read every page), IndexScan (descend a B-tree and fetch heap tuples), IndexOnlyScan (no heap visit), BitmapHeapScan (build a bitmap of TIDs from one or more indexes, then visit the heap in physical order).
  • Filter — apply a residual predicate to streaming tuples.
  • Project — drop or compute output columns.
  • Sort — external merge sort with sort-area memory; spills to disk when the input exceeds work_mem (PG) / sort_buffer_size (MySQL).
  • Aggregate — hash aggregate (build a hash table keyed on group columns) or sort-aggregate (require sorted input).
  • Joins — see 7.4.
  • Limit, WindowAgg, Materialize, Memoize / cache — auxiliary operators.

7.3 Cost-based optimization (CBO)

The optimizer enumerates a search space of plan trees and picks the one with the lowest estimated cost. Costs are derived from estimated cardinalities at each operator, which in turn come from statistics on base tables and from selectivity formulas for predicates. Classic techniques include System R-style dynamic programming (Selinger et al. SIGMOD 1979) for left-deep join trees, and Volcano / Cascades (Graefe 1995) for richer search spaces with rule-based equivalence and enforcer operators (used by SQL Server, CockroachDB, and Apache Calcite).

7.4 Joins in detail

  • Nested-loop join — for each outer row, scan the inner. Cost O(N · M) without an index, O(N · log M) when the inner has an index on the join key. Best when the outer is tiny.
  • Hash join — build a hash table on the smaller side (“build”), probe with rows from the larger side. Cost O(N + M) assuming the build fits in memory. Spills to disk via Grace / hybrid-hash for large builds.
  • Merge join (sort-merge) — sort both sides on the join key, then walk them in lockstep. Cost O(N log N + M log M) plus a linear merge. Best when inputs are already sorted (e.g., produced by an index scan) and the inputs are large.

Modern optimizers choose adaptively based on cardinality estimates and sometimes re-plan mid-query (Oracle adaptive plans, SQL Server adaptive joins, Postgres’ Memoize node for nested-loop inner caching).

7.5 Execution models

  • Iterator (Volcano) model — each operator exposes open / next / close and next returns one tuple. Simple, generic, high per-tuple overhead.
  • Vectorized executionnext returns a batch (vector) of values per column. Pioneered by MonetDB/X100 (Boncz, Zukowski, Nes; CIDR 2005) and used by ClickHouse, DuckDB, Snowflake’s execution engine, and Meta’s Velox library. Cache-friendly and amenable to SIMD.
  • Compiled / push model — Hyper / Umbra (Neumann, VLDB 2011 “Efficiently Compiling Efficient Query Plans for Modern Hardware”) generate LLVM IR per query. CockroachDB has a vectorized engine; Spark has Tungsten code gen.

8. Buffer pool and cache

The buffer pool is the in-memory cache of disk pages. It is shared across all backends and is usually the single largest resource sized in DB configuration (shared_buffers in Postgres, innodb_buffer_pool_size in MySQL, buffer pool size in SQL Server, db_cache_size in Oracle).

Page replacement policies in practice:

  • LRU — classic least-recently-used. Easy to implement, prone to scan pollution (a single sequential scan can evict the entire working set).
  • CLOCK / second-chance — approximation of LRU with a reference bit and a rotating hand. Postgres uses a CLOCK variant.
  • LRU-K (O’Neil et al. 1993) — tracks the k-th most recent reference; less scan-vulnerable.
  • 2Q (Johnson, Shasha 1994) — splits hot and cold queues. MySQL InnoDB uses a midpoint-insertion LRU which is essentially 2Q.
  • ARC (Megiddo, Modha 2003) — adaptive replacement; used in WiredTiger.

Many engines warm the buffer pool on restart by snapshotting / replaying a list of “hot” page IDs (innodb_buffer_pool_dump_at_shutdown in MySQL, pg_prewarm in Postgres).

9. Replication

9.1 Physical replication

The replica receives raw block-level WAL records and applies them identically. The replica is byte-for-byte the same as the primary at any given LSN.

  • PostgreSQL streaming replication + physical hot-standby (since 9.0).
  • Oracle Data Guard.
  • MySQL Group Replication’s certification mode is logical, but block-level shipping is available via storage-level mirroring.
  • SQL Server Always On Availability Groups.

Strengths: fast, low overhead. Weaknesses: replica must be the same major version and same architecture; cannot replicate a subset of tables.

9.2 Logical replication

The primary emits row-level change events (insert / update / delete tuples with column values). The replica replays them through the SQL layer.

  • MySQL binlog in ROW format (default since 5.7).
  • PostgreSQL logical decoding (since 9.4) and built-in logical replication (since 10) via publications and subscriptions.
  • Debezium + Kafka Connect sources from binlog / logical-decoding to Kafka topics, enabling CDC pipelines.
  • Oracle GoldenGate.

Strengths: cross-version, cross-platform, partial schemas; foundation of CDC pipelines, ETL into warehouses, zero-downtime schema migrations. Weaknesses: higher CPU, lag-prone for huge transactions, schema-drift risks.

9.3 Sync, async, semi-sync

  • Asynchronous — commit returns as soon as the primary’s local WAL is flushed. Replicas may lag. Risk: data loss on primary failure equal to the lag window.
  • Synchronous — commit waits for at least one replica to acknowledge receipt (or, stricter, apply). Higher latency; zero data loss.
  • Semi-synchronous (MySQL) — primary waits for replica receipt but not apply; bounded data loss.

PostgreSQL’s synchronous_commit ranges from off to remote_apply. RDS Multi-AZ and Aurora use synchronous block-level replication to the standby AZ.

9.4 Failover and orchestration

  • Patroni (Postgres) and Repmgr — leader election + automated failover, typically backed by etcd / Consul / ZooKeeper.
  • MHA and Orchestrator (Vitess / Shlomi Noach) — MySQL.
  • AWS RDS Multi-AZ — managed failover via attached EBS replication.

10. Sharding and partitioning

10.1 Partitioning (single node)

Splitting one logical table into many physical tables to keep individual indexes manageable and to enable partition pruning at planning time.

  • Range — by date or numeric range (most common for time-series).
  • List — by enumerated value (e.g., region).
  • Hash — uniform spread for OLTP; pays off when range pruning isn’t needed.

Postgres declarative partitioning (since 10), MySQL native partitioning, Oracle composite partitioning, SQL Server partition functions / schemes.

10.2 Sharding (across nodes)

Same shape as partitioning but on different machines, with cross-node routing in front.

  • Vitess (YouTube origin, now CNCF) — proxies in front of MySQL shards; reshards via VReplication.
  • Citus (now part of Microsoft) — distributed Postgres extension; reference + distributed tables with shard placement.
  • CockroachDB, TiDB, YugabyteDB — natively distributed, automatic range-based sharding, Raft per range; no application-visible shard key required.

10.3 Cross-shard transactions

Crossing shards requires distributed atomicity, traditionally via two-phase commit (Gray 1978): a coordinator collects prepare votes from each participant, then commits or aborts. 2PC is blocking on coordinator failure and adds latency.

Calvin (Thomson, Diamond, Weng, Ren, Shao, Abadi; SIGMOD 2012) replaces 2PC with deterministic transaction ordering across all nodes: every replica agrees on a transaction order via a sequencer, then executes the same order deterministically. FaunaDB and FoundationDB borrow from this lineage.

11. Specific engines

PostgreSQL

MVCC with xmin / xmax on heap tuples; B+-tree default index AM plus GiST, SP-GiST, GIN, BRIN, hash; pluggable table access methods (since 12) with zheap and other heap alternatives in development. WAL with full-page writes at checkpoint boundaries. Streaming + logical replication. Extension ecosystem: PostGIS (spatial), pgvector (HNSW + IVFFlat for vector search), TimescaleDB (time-series + hypertables + continuous aggregates), Citus (sharding), pgRouting, pg_partman, pg_stat_statements.

MySQL / MariaDB

InnoDB B+-tree clustered primary key; MVCC via undo log; row-level locking on the primary index; redo log (ib_logfile) for crash recovery; binary log (binlog) for replication and PITR. MyISAM (legacy) is non-transactional. Group Replication and InnoDB Cluster for multi-primary write-set certification; MariaDB Galera for synchronous multi-primary.

SQLite

Single-file, embedded; B-tree both for tables (clustered on rowid or PK) and for indexes. Two journaling modes: rollback journal (legacy) and WAL (since 3.7). Excellent durability and concurrency for an embedded engine; serializable isolation by default.

Oracle

Block-based heap with clustered IOTs as an option; UNDO segments give MVCC and rollback; SCN (System Change Number) drives global ordering; online redo logs + archive logs. Data Guard for replication; RAC for shared-disk clustering with cache fusion across instances.

SQL Server

Clustered B+-tree (clustered index) by default; non-clustered indexes carry the clustering key as the row locator; tempdb hosts the version store for SI / RCSI; transaction log for recovery; Always On Availability Groups for HA / DR. Hekaton (Diaconu et al. SIGMOD 2013) is the in-memory OLTP engine — latch-free Bw-tree, native compilation of stored procs. Columnstore indexes give DW-style columnar storage in the same engine.

MongoDB

Document store on top of WiredTiger (Keith Bostic et al., acquired 2014). WiredTiger supports both LSM and B-tree page-tree backends; B-tree is the default. Replica sets use a custom oplog (logical log of operations); sharded clusters use mongos routers + config servers.

Cassandra and ScyllaDB

LSM SSTables on every node; tunable consistency per operation (ONE, QUORUM, ALL, etc.); peer-to-peer via gossip + consistent hashing; no leader. ScyllaDB is a Seastar-based, shared-nothing, per-core C++ rewrite of Cassandra with the same protocol but better per-node throughput.

DynamoDB

Managed, partitioned LSM-like storage; hash + optional range key partitioning; per-table provisioned or on-demand capacity. DynamoDB Streams (CDC) and Global Tables (multi-region active-active). Inspired by the Dynamo paper (DeCandia et al., SOSP 2007) but the production service is a substantially different system.

Spanner

Bigtable-style rows stored in tablets, each replicated by a Paxos group across zones / regions; TrueTime API exposes time as an interval [earliest, latest] with bounded uncertainty, enabling externally consistent transactions via commit-wait. Described in Corbett et al., “Spanner: Google’s Globally-Distributed Database”, OSDI 2012, plus the 2017 SIGMOD paper on Spanner SQL.

CockroachDB

Distributed SQL on top of Pebble (LSM, Go-native RocksDB replacement); data sharded into 64-MiB ranges, each replicated by Multi-Raft; distributed SQL with vectorized execution; SSI isolation; HLC (Hybrid Logical Clock) instead of TrueTime.

TiDB

Distributed SQL on top of TiKV (RocksDB + Raft) for OLTP rows + TiFlash (columnar, ClickHouse-derived) for OLAP, with Raft Learner-based replication into the columnar replica for HTAP queries on fresh data.

YugabyteDB

DocDB storage layer (RocksDB + Raft); Postgres wire protocol on the YSQL side; Cassandra wire protocol on the YCQL side; sharding via tablets and automatic split.

SingleStore (MemSQL)

Universal storage: rows-in-memory + columnar-on-disk in the same table; distributed across aggregator and leaf nodes; vectorized execution. HTAP focus.

12. OLAP and columnar engines

Column stores rearrange data so that all values of a single column lie next to each other. The benefits cascade: queries that touch a few columns of a wide table only read those columns; per-column compression hits dramatic ratios; vectorized execution operates on tight loops over arrays of values; late materialization defers tuple reconstruction until after filters are applied.

12.1 Cloud warehouses

  • Snowflake (2014 launch; Dageville et al. SIGMOD 2016) — fully-disaggregated storage (S3 / blob), virtual warehouses (compute clusters), micro-partition format, automatic clustering.
  • Google BigQuery — Capacitor columnar format on Colossus; Dremel-style serving tree execution.
  • Amazon Redshift — derived from ParAccel; massively-parallel, zone-mapped columnar; RA3 nodes separate compute and managed storage.
  • Databricks SQL + Delta Lake — Parquet + transaction log over object storage; Photon vectorized C++ engine.
  • Apache Iceberg + Trino / Presto — table format with manifest / snapshot files atop Parquet / ORC, queried by Trino, Spark, or Flink.

12.2 Open-source columnar OLTP+OLAP

  • ClickHouse (Yandex 2009, open-source 2016) — MergeTree family (LSM-ish columnar with background merge); vectorized execution; tight C++.
  • DuckDB (Mark Raasveldt and Hannes Mühleisen, CIDR 2019) — embedded columnar engine; vectorized; runs anywhere SQLite runs.
  • Apache Druid — real-time ingestion + segmented columnar; time-series
    • OLAP cubes; bitmap and roaring-bitmap indexes.
  • Apache Pinot — similar real-time OLAP; star-tree index; LinkedIn origin.
  • MotherDuck — managed DuckDB-as-a-service with hybrid local-cloud execution.

12.3 Common columnar techniques

  • Dictionary encoding — per-column dictionary of distinct values, integer codes in the data; cheap for low-cardinality columns.
  • Run-length encoding (RLE) — efficient for sorted columns.
  • Bit-packing + delta encoding — for integer columns with small ranges.
  • General-purpose codecs — ZSTD, LZ4, Snappy on top of the encoded data.
  • Zone maps / min-max indexes — per-row-group min / max, allowing block skipping.
  • Bloom filters for high-cardinality equality lookups.
  • Late materialization — defer assembling tuples; carry column vectors through filters / joins, only emit final tuple shape once.

13.1 Disaggregated compute and storage

Snowflake (2014) and Amazon Aurora (SIGMOD 2017, “Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases”) pioneered separating the database compute layer from the storage layer: storage is a distributed, log-structured, multi-tenant service; compute nodes are stateless and can scale independently. Snowflake, Aurora, PlanetScale (Vitess + Boost), Neon (Postgres on serverless storage), and Databricks Lakehouse all share this shape.

13.2 Serverless

  • Aurora Serverless v2 — per-second autoscaling of an Aurora compute instance.
  • Neon — Postgres compute that detaches from storage; branching of databases as copy-on-write snapshots; cold-start measured in seconds.
  • PlanetScale — managed Vitess + branch-and-deploy for MySQL; later added the Boost caching tier.
  • Turso — managed libSQL (SQLite fork) at the edge.

13.3 Vector databases for RAG

Embedding-based similarity search has become a first-class workload:

  • pgvector — Postgres extension with HNSW + IVFFlat indexes.
  • Pinecone, Weaviate, Qdrant, Chroma, Milvus, LanceDB — dedicated vector engines, mostly HNSW-based; some add hybrid sparse + dense retrieval, filtering, and tenant isolation.
  • Elasticsearch / OpenSearch also ship dense-vector fields and HNSW indexes alongside their inverted index.

13.4 HTAP

Hybrid OLTP + OLAP systems run transactional and analytical workloads on the same engine, often by keeping a row store for writes and a columnar replica for scans:

  • TiDB / TiFlash — Raft Learner replicates TiKV rows into TiFlash columnar.
  • SingleStore — single engine with both row and column tables.
  • MariaDB ColumnStore — columnar engine plugin on the same server.

13.5 SQL on everything

  • Trino / Presto — federated SQL on Iceberg, Hive, Delta Lake, JDBC, S3, Kafka.
  • DuckDB queries Parquet, CSV, Arrow, S3, Iceberg, Delta directly.
  • Apache Spark SQL — SQL on the Spark DataFrame engine.
  • Polars — Rust columnar DataFrame engine with SQL frontend.

14. Common pitfalls

  • N+1 query — a loop in the application layer fires one SELECT per parent row. Fix by eager-loading via JOIN or IN (...), or by batching at the ORM (Hibernate @BatchSize, Sequelize include, Django prefetch_related, ActiveRecord includes).
  • Missing indexes on FK and WHERE columns — every column used in equality or range predicates of frequent queries deserves an index. MySQL notably does not auto-create indexes for FKs in InnoDB (it does enforce them, which still requires lookups).
  • Index bloat in Postgres — repeated updates leave many dead tuples; the index entries still point at them until VACUUM runs. Monitor pg_stat_user_tables and pgstattuple; REINDEX CONCURRENTLY to rebuild.
  • Long-running transactions — they hold back the MVCC horizon, prevent VACUUM, and bloat undo / version stores. Set timeouts (idle_in_transaction_session_timeout in PG).
  • Connection pool exhaustion — every backend takes RAM; running out of connections is a top-N production incident. Use PgBouncer / Odyssey / AWS RDS Proxy in front of Postgres; in MySQL use ProxySQL or AWS RDS Proxy. Application-side pool sized below the DB limit.
  • Auto-incrementing PK contention — every insert hits the rightmost page of the clustered B+-tree, becoming a hotspot under high write rate. Mitigations: UUIDv7 / ULID for time-sortable but distributed keys; hash partitioning; sequence caching.
  • Hot-row contention — repeated writes to the same row under MVCC create long undo chains; under locking they serialize. Shard the counter or use HyperLogLog / approximate counters.
  • Cross-shard joins — expensive; require shipping data across the network. Co-locate related tables on the same shard key (Vitess keyspace, Citus colocation_id) or denormalize.
  • Schema migrations on huge tables — naive ALTER TABLE rewrites and locks. Use pt-online-schema-change or gh-ost for MySQL, pg_repack for Postgres, or schema versioning patterns (Reshape, Skeema). Spanner and CockroachDB run online schema changes natively via background backfill.
  • Plan regressions after stats updates — major statistics refreshes can flip plan choices catastrophically. Use plan freezing (Oracle SQL Plan Management), pg_hint_plan, or query store baselines (SQL Server).
  • Replication lag — read-your-writes on a read replica is not guaranteed under async replication. Either route critical reads to the primary, use causal-consistency tokens (Aurora session tokens, Spanner bounded-staleness reads), or wait on a known LSN.
  • Backup and PITR gaps — file-system snapshots without log archives are inconsistent. Always pair pg_basebackup / xtrabackup with continuous WAL / binlog archiving and verify restores periodically.

15. Cross-references

  • distributed-systems-fundamentals — quorum, CAP / PACELC, and replication consistency models referenced throughout sections 4, 9, and 10.
  • consensus-protocols — Paxos and Raft, used inside Spanner, CockroachDB, TiKV, YugabyteDB, and Patroni.
  • _index — Compute domain index and other Tier-1 references.

16. Citations

  • Bayer, R. and McCreight, E., “Organization and Maintenance of Large Ordered Indices”, Acta Informatica 1, 1972 — original B-tree paper.
  • O’Neil, P., Cheng, E., Gawlick, D., O’Neil, E., “The Log-Structured Merge-Tree (LSM-Tree)”, Acta Informatica 33, 1996.
  • Guttman, A., “R-Trees: A Dynamic Index Structure for Spatial Searching”, SIGMOD 1984.
  • Bloom, B. H., “Space/time trade-offs in hash coding with allowable errors”, CACM 1970.
  • Selinger, P. G., et al., “Access Path Selection in a Relational Database Management System”, SIGMOD 1979 — System R optimizer.
  • Eswaran, K. P., Gray, J. N., Lorie, R. A., Traiger, I. L., “The Notions of Consistency and Predicate Locks in a Database System”, CACM 1976 — two-phase locking.
  • Härder, T., Reuter, A., “Principles of Transaction-Oriented Database Recovery”, ACM Computing Surveys 1983 — coined the ACID acronym.
  • Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging”, ACM TODS 1992.
  • Berenson, H., Bernstein, P., Gray, J., Melton, J., O’Neil, E., O’Neil, P., “A Critique of ANSI SQL Isolation Levels”, SIGMOD 1995.
  • Cahill, M. J., Röhm, U., Fekete, A. D., “Serializable Isolation for Snapshot Databases”, SIGMOD 2008.
  • Adya, A., Liskov, B., O’Neil, P., “Generalized Isolation Level Definitions”, ICDE 2000.
  • Graefe, G., “The Cascades Framework for Query Optimization”, IEEE Data Engineering Bulletin 1995.
  • Boncz, P., Zukowski, M., Nes, N., “MonetDB/X100: Hyper-pipelining Query Execution”, CIDR 2005 — vectorized execution.
  • Neumann, T., “Efficiently Compiling Efficient Query Plans for Modern Hardware”, VLDB 2011.
  • Corbett, J. C., et al., “Spanner: Google’s Globally-Distributed Database”, OSDI 2012; Bacon, D. F., et al., “Spanner: Becoming a SQL System”, SIGMOD 2017.
  • DeCandia, G., et al., “Dynamo: Amazon’s Highly Available Key-value Store”, SOSP 2007.
  • Diaconu, C., et al., “Hekaton: SQL Server’s Memory-Optimized OLTP Engine”, SIGMOD 2013.
  • Thomson, A., Diamond, T., Weng, S.-C., Ren, K., Shao, P., Abadi, D. J., “Calvin: Fast Distributed Transactions for Partitioned Database Systems”, SIGMOD 2012.
  • Verbitski, A., et al., “Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases”, SIGMOD 2017; and “Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes”, SIGMOD 2018.
  • Dageville, B., et al., “The Snowflake Elastic Data Warehouse”, SIGMOD 2016.
  • Raasveldt, M., Mühleisen, H., “DuckDB: an Embeddable Analytical Database”, CIDR 2019.
  • Levandoski, J. J., Lomet, D. B., Sengupta, S., “The Bw-Tree: A B-tree for New Hardware Platforms”, ICDE 2013.
  • Kleppmann, M., “Designing Data-Intensive Applications”, O’Reilly 2017 — Chapters 3 (Storage and Retrieval), 5 (Replication), 6 (Partitioning), 7 (Transactions).
  • Hellerstein, J. M., Stonebraker, M., Hamilton, J., “Architecture of a Database System”, Foundations and Trends in Databases 2007.
  • RocksDB documentation, “RocksDB Wiki” — github.com/facebook/rocksdb.
  • Pebble documentation — github.com/cockroachdb/pebble.
  • WiredTiger architecture guide — source.wiredtiger.com.
  • PostgreSQL documentation, “Chapter 70: Index Access Method Interface Definition” and “Chapter 73: Database Physical Storage”.
  • MySQL Reference Manual, “Chapter 17: The InnoDB Storage Engine”.