Vector Database Taxonomy
A working catalog of the production vector-search and vector-database landscape circa 2026. The category exists to answer one question: given a query vector (a numeric embedding of a query produced by a model), find the k closest stored vectors in a corpus of millions to billions, in milliseconds, with controllable accuracy. That’s it. Everything else — sparse retrieval, hybrid search, filtering, multi-vector, named vectors, GPU index build, object-storage tiers — is layered on top of that core primitive.
The selection axes that matter when picking a vector store:
- Latency (p99) at recall — recall@10 ≥ 0.95 at p99 < 100 ms is the standard production target. The trade-off curve is governed primarily by the index algorithm and its tuning parameters.
- Index build time and reindexing cost — a 10M-vector HNSW index typically takes 10 minutes to 2 hours to build; rebuilding daily is expensive. Some stores support incremental insertion (Pinecone, Qdrant, Weaviate, Milvus); some require periodic rebuilds (early IVF setups).
- Memory footprint — naive HNSW holds the entire graph + vectors in RAM. A 10M × 768-dim float32 corpus is ~30 GB of raw vectors before the graph. Quantization (PQ, SQ, binary) and disk-backed variants (DiskANN, Vespa) cut this by 4× to 32×.
- Insert / delete / update throughput — most use cases have continuous insertion; many have hard delete (GDPR right-to-be-forgotten) and update; tombstone handling and consolidation matter.
- Filter expressiveness and post-filter vs pre-filter strategy — most queries combine a vector similarity with structured filters (“similar documents where tenant_id = 42 and date > 2025-01-01”); pre-filtering during the graph traversal is much faster than post-filtering after retrieving 10× candidates.
- Hybrid (dense + sparse) retrieval support — sparse retrieval (BM25, SPLADE) catches keyword-exact matches that dense embeddings miss; production RAG almost always combines both with reciprocal-rank fusion (RRF).
- Multi-vector and named vectors — a single document may have one vector per chunk, plus a separate title vector, plus a separate dense + sparse pair. Native support saves a lot of application complexity.
- Deployment — managed serverless (Pinecone Serverless, Turbopuffer) vs managed dedicated (Weaviate Cloud, Qdrant Cloud, Zilliz Cloud) vs self-hosted (Milvus, Weaviate, Qdrant, Vespa) vs embedded (Chroma, LanceDB, sqlite-vec).
- Pricing model — per-vector + queries (Pinecone classic), per-RU (Pinecone Serverless), per-CPU + memory (Qdrant Cloud, Weaviate Cloud, Milvus dedicated), per-GB object storage + queries (Turbopuffer, Lance Cloud), free + self-hosted.
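The memory-footprint figures in the list above follow from simple arithmetic. The sketch below reproduces the ~30 GB number and the 4× to 32× quantization range; the graph estimate assumes an HNSW layout with 2×M links per vector at the bottom layer and 4-byte neighbor ids, which is an illustrative assumption rather than a property of any particular store.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bits_per_dim: int = 32) -> int:
    """Bytes needed to store the vectors alone (no graph, no metadata)."""
    return num_vectors * dim * bits_per_dim // 8


N, DIM = 10_000_000, 768

float32_gb = raw_vector_bytes(N, DIM, 32) / 1e9  # 30.72 GB of raw float32 vectors
int8_gb = raw_vector_bytes(N, DIM, 8) / 1e9      # 7.68 GB with int8 scalar quantization (4x)
binary_gb = raw_vector_bytes(N, DIM, 1) / 1e9    # 0.96 GB with 1 bit per dimension (32x)

# Assumed graph overhead: 2*M bottom-layer links per vector, 4 bytes per neighbor id.
M = 16
graph_gb = N * 2 * M * 4 / 1e9                   # 1.28 GB for M=16

print(f"float32={float32_gb} GB, int8={int8_gb} GB, binary={binary_gb} GB, graph~{graph_gb} GB")
```

At these sizes the raw vectors, not the graph links, dominate RAM, which is why quantization is usually the first lever to pull.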
ANN algorithms — the core math
All production vector search uses approximate nearest neighbor (ANN) search, not exact search. The accuracy is tunable; the speed-up vs exact is typically 100× to 10,000×. Three families dominate:
Graph-based: HNSW and successors
- HNSW (Hierarchical Navigable Small World) — Yury Malkov + Dmitry Yashunin 2018 (improving on Malkov’s earlier NSW). The dominant graph index. Multi-layer skip-list-like graph: top layers sparse for fast routing, bottom dense for fine-grained search. Build parameters: M (max edges per node, typically 16-64), efConstruction (search-list size during build, 100-500). Query parameters: ef (search-list size at query time, 50-500). Memory-hungry — must hold graph + vectors in RAM for best performance. In essentially every vector DB: Pinecone, Weaviate, Qdrant, Milvus, Elasticsearch, OpenSearch, Redis, Vespa, MongoDB Atlas, Cassandra, PostgreSQL (pgvector since 0.5).
- HNSWlib — Malkov’s own reference C++ implementation; Apache 2.0; widely embedded.
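To make the role of ef concrete, here is a minimal single-layer sketch of the best-first search that HNSW runs on each layer. It is not the HNSW construction algorithm: the hypothetical build_knn_graph helper simply links each node to its M exact nearest neighbors, and the layer hierarchy is omitted. Only the ef-bounded traversal is meant to be representative.

```python
import heapq
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(vectors, M):
    """Stand-in for HNSW construction (NOT the real insertion algorithm):
    link every node to its M exact nearest neighbors, in both directions."""
    graph = {i: set() for i in range(len(vectors))}
    for i, v in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        for j in sorted(others, key=lambda j: l2(v, vectors[j]))[:M]:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def search_layer(vectors, graph, query, entry, ef):
    """Best-first graph search that keeps the ef closest nodes seen so far.
    Returns (distance, node_id) pairs sorted by ascending distance."""
    visited = {entry}
    d0 = l2(query, vectors[entry])
    candidates = [(d0, entry)]  # min-heap: closest unexpanded node first
    results = [(-d0, entry)]    # max-heap via negation: worst kept node on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -results[0][0]:
            break  # the closest candidate is worse than everything we keep
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2(query, vectors[nb])
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst
    return sorted((-neg_d, n) for neg_d, n in results)


random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
graph = build_knn_graph(vectors, M=8)
query = [random.gauss(0, 1) for _ in range(8)]
hits = search_layer(vectors, graph, query, entry=0, ef=32)[:10]  # top-10 of the ef list
```

Raising ef widens the list of nodes the traversal is allowed to keep, which generally trades latency for recall; setting ef to the corpus size makes the search visit everything reachable from the entry point.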
Inverted-file / partitioning: IVF family
- IVF (Inverted File Index) — Hervé Jégou et al. INRIA 2011 (paper “Product Quantization for Nearest Neighbor Search”). K-means clusters the corpus into nlist partitions (typically nlist ≈ sqrt(N)); query visits nprobe nearest partitions. Faster index build than HNSW, less memory if combined with PQ; coarser recall trade-off.
- IVF-PQ — IVF + Product Quantization (decompose each vector into M sub-vectors, quantize each sub-vector to a codebook entry — typically compresses 4× to 32× with controllable accuracy loss). Mainstay for billion-scale corpora.
- IVF-SQ — IVF + Scalar Quantization (float32 → int8); 4× compression, very small accuracy loss.
- Used in FAISS (Meta’s library), Milvus, Vespa, and as a fallback option in others.
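The nlist / nprobe mechanics can be sketched in a few dozen lines. This is an illustrative pure-Python IVF-Flat (no PQ), with a basic Lloyd's k-means as the assumed training step; the class name IVFFlat and its methods are hypothetical and are not the FAISS API.

```python
import math
import random


def l2sq(a, b):
    """Squared Euclidean distance (monotonic in L2, cheaper to compute)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, nlist, iters=10, seed=0):
    """Plain Lloyd's k-means; returns nlist centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        buckets = [[] for _ in range(nlist)]
        for v in vectors:
            nearest = min(range(nlist), key=lambda i: l2sq(v, centroids[i]))
            buckets[nearest].append(v)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


class IVFFlat:
    """Inverted-file index storing full vectors (no compression)."""

    def __init__(self, vectors, nlist):
        self.vectors = vectors
        self.centroids = kmeans(vectors, nlist)
        self.lists = [[] for _ in range(nlist)]  # one posting list per centroid
        for idx, v in enumerate(vectors):
            c = min(range(nlist), key=lambda i: l2sq(v, self.centroids[i]))
            self.lists[c].append(idx)

    def search(self, query, k, nprobe):
        """Scan only the nprobe partitions whose centroids are closest to the query."""
        order = sorted(range(len(self.centroids)),
                       key=lambda i: l2sq(query, self.centroids[i]))
        candidates = [idx for c in order[:nprobe] for idx in self.lists[c]]
        candidates.sort(key=lambda idx: l2sq(query, self.vectors[idx]))
        return candidates[:k]


random.seed(1)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(400)]
nlist = math.isqrt(len(vectors))  # the sqrt(N) rule of thumb gives 20 here
index = IVFFlat(vectors, nlist)
query = [random.gauss(0, 1) for _ in range(8)]
approx = index.search(query, k=5, nprobe=4)
```

With nprobe equal to nlist the search degenerates to an exact scan; smaller nprobe values skip partitions and may miss neighbors that sit just across a cluster boundary, which is the recall trade-off the IVF bullet describes.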
Disk-friendly: DiskANN, SPANN, ScaNN
- DiskANN — Microsoft Research 2019 (Subramanya et al.). SSD-backed graph index — keeps the graph compressed in RAM but vectors on SSD. Enables billion-scale on a single machine with modest RAM. Used by Pinecone, Milvus (as alternative index), Vespa, and Microsoft Bing.
- SPANN — Microsoft Research 2021 (Chen et al.). Hybrid memory + disk; centroid-based partitioning on SSD with fast routing in RAM. Successor to DiskANN at Microsoft scale.
- ScaNN (Scalable Nearest Neighbors) — Google 2020 (Guo et al.; “Accelerating Large-Scale Inference with Anisotropic Vector Quantization”). Anisotropic PQ that optimizes for inner-product loss specifically. Powers Google Vertex AI Vector Search (formerly Matching Engine) and Vespa.
Other notable algorithms
- Annoy (Approximate Nearest Neighbors Oh Yeah) — Erik Bernhardsson at Spotify 2013. Random-projection forest. Light, embedded-friendly, immutable after build. Used in Spotify recommendations and many ad-hoc embeds.
- NMSLIB — academic library (Bilegsaikhan Naidan + Leonid Boytsov); HNSW and other graphs; the precursor to HNSWlib.
- FAISS — Meta AI Research 2017 (Johnson, Douze, Jégou); the most widely-used C++/Python library for similarity search. Implementations of Flat, IVF, IVF-PQ, HNSW, OPQ, LSH, more. GPU support since launch (CUDA). Apache 2.0. Foundation for many other systems (Milvus, Weaviate’s HNSW choice before they wrote their own, internal use at most large companies).
- Pyserini — University of Waterloo IR group; BM25 + dense + hybrid retrieval research library on top of Lucene.
- Binary embeddings + binary indexes — embeddings quantized to 1 bit per dimension; Hamming distance via popcount; 32× smaller than float32; “good enough” recall for top-N retrieval that gets reranked. Cohere, Voyage, OpenAI text-embedding-3 all support binary natively. Binary HNSW (Qdrant, Vespa, Milvus 2024).
- Matryoshka representations (MRL) — Aditya Kusupati et al. (UW + Google) 2022. Train embeddings such that arbitrary truncation prefixes (e.g. first 64 of 768 dims) remain useful — adaptive trade-off between size and recall at query time, not at training time. OpenAI text-embedding-3 large/small uses MRL; Cohere Embed v3; Voyage AI; Snowflake arctic-embed.
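The binary-embedding pattern (1 bit per dimension, Hamming distance via popcount, then a rerank) can be sketched as follows. The sign-threshold binarization and the oversample factor are illustrative assumptions; real embedding providers and stores may use different thresholds or packed byte layouts.

```python
import math
import random


def binarize(vec):
    """Pack sign bits into a Python int: bit i is 1 when vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits: popcount of the XOR."""
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, vectors, codes, k, oversample=4):
    """Shortlist k*oversample ids by Hamming distance on the binary codes,
    then rerank the shortlist with exact float dot products."""
    qcode = binarize(query)
    shortlist = sorted(range(len(codes)),
                       key=lambda i: hamming(qcode, codes[i]))[:k * oversample]
    return sorted(shortlist, key=lambda i: -dot(query, vectors[i]))[:k]


def unit(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


random.seed(2)
vectors = [unit([random.gauss(0, 1) for _ in range(64)]) for _ in range(300)]
codes = [binarize(v) for v in vectors]  # 64 bits per vector instead of 64 float32s
top = two_stage_search(vectors[7], vectors, codes, k=5)
```

Only the shortlist is scored against full-precision vectors, so the float vectors can live on slower storage while the compact codes stay in RAM.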
Pure-play vector databases — vector-search as the primary product
The category that emerged 2019-2022 as the “vector-native” alternative to grafting vector search onto existing engines.
- Pinecone — Pinecone Systems 2019; founder Edo Liberty (ex-Yahoo Research scientist, ex-AWS principal scientist; previously authored the SAMG sampling algorithm and several NeurIPS/ICML papers on streaming sketching). Closed-source SaaS-only. Series B $100M Apr 2023 at $750M valuation (Andreessen Horowitz lead), continued growth through 2024. Pinecone Serverless v3 (Jan 2024) — separated storage (object storage) from compute (query nodes); 50× cost reduction vs original “pods” model for most workloads; pricing per RU (read unit) + WU (write unit) + storage GB. Pinecone Inference (2024) — hosted embedding models (BGE, Llama, Cohere) so applications don’t need a separate embeddings service. Index types: pod-based (legacy: p1, s1, p2 — HNSW-derivative) + serverless (DiskANN derivative). Supports metadata filtering, hybrid (sparse + dense), namespaces (tenant isolation), pod replicas (HA). Production tier supports thousands of QPS.
- Weaviate — SeMI Technologies 2019 (Bob van Luijt + Etienne Dilocker; Amsterdam), later rebranded to Weaviate B.V.; BSD-3-Clause OSS core + commercial Weaviate Cloud Services. GraphQL API as primary query interface (REST as secondary), with vector + structured + hybrid filters. HNSW index. Modules: text2vec-openai, text2vec-cohere, text2vec-huggingface, multi2vec-clip, qna-openai, generative-openai — embeddings + RAG inline. Named vectors (since v1.24, 2024) — multiple vectors per object indexed independently (e.g., title embedding + body embedding + thumbnail embedding). Multi-tenancy with per-tenant collections (since v1.20). Series B funding at a $200M+ valuation. Weaviate Embedded (in-process Python library, runs Weaviate as a subprocess for development).
- Qdrant — Qdrant Solutions GmbH 2021 (founder Andrey Vasnetsov, German-based, ex-Wikimedia); Apache 2.0 + commercial Qdrant Cloud + Hybrid Cloud + Private Cloud. Rust core — single most-cited reason for performance benchmarks favoring Qdrant. HNSW + scalar / product / binary quantization. Sparse vectors (since v1.7 2023) + named vectors + RBAC. Multi-vector support (since v1.10) — store multiple vectors per point. GPU-accelerated index building (since v1.13 2024) for huge ingest jobs. Pricing per node CPU/RAM hours. Series A $28M Jan 2024 (Spark Capital lead). Adopted by GitHub Copilot’s documentation search and Cohere Coral.
- Milvus + Zilliz Cloud — Zilliz Inc. 2019 (founder Charles Xie, ex-Oracle, ex-Tencent); Milvus is Apache 2.0 OSS and a graduated project of the LF AI & Data Foundation (graduated Jun 2021). Milvus 2.x has cloud-native architecture: stateless coordinator services + worker nodes + object-storage data lake (S3/MinIO) + meta store (etcd) + message queue (Pulsar or Kafka). Milvus 3.0 (2024) brings disaggregated stage-storage with explicit decoupling of memory tier (hot) vs object storage (cold) and tiered query routing. Multiple index types in one schema (FLAT, IVF_FLAT, IVF_SQ8, IVF_PQ, HNSW, DiskANN, GPU_IVF_FLAT, GPU_IVF_PQ, SCANN). Zilliz Cloud is the managed offering — multi-region, SOC 2, BYOC option. Series B $60M 2022.
- Vespa — Yahoo from 2003 (search engine for Yahoo.com), open-sourced 2017; spun out as independent Vespa.ai Oct 2023 (CEO Jon Bratseth, the original Yahoo Vespa architect); Apache 2.0 + commercial Vespa Cloud. Combines structured search + full-text search + vector search + tensor compute in one engine — tensor is a first-class field type, not bolted-on. Query language YQL blends SQL-like syntax with tensor math. HNSW vector index with built-in support for tensor multi-vector (the colBERT late-interaction pattern of one vector per token, used in colBERT and colPali, is natively expressible). Used at Yahoo Mail (16PB index), Spotify Wrapped, OkCupid, BNP Paribas.
- LanceDB — LanceDB Inc. 2022 (founders Chang She + Lei Xu, ex-Cloudera + ex-Tubi; both prior contributors to pandas + Apache Arrow). Apache 2.0 + commercial. Built on the Lance columnar file format (Apache 2.0; Lance is to vector data what Parquet is to columnar analytics — random-access friendly, multi-modal). Rust core, Python + JS + Rust + TS bindings. Embedded-by-default — runs in-process like SQLite, scales out via LanceDB Cloud. Versioning + time-travel built-in. Series A $11M seed Feb 2023; growth 2024. Used by Character.AI, Midjourney’s data pipeline, Harvey AI.
- Chroma — Chroma Inc. 2022 (founders Jeff Huber + Anton Troynikov, ex-Standard Cyborg + Meta Reality Labs). Apache 2.0 OSS. AI-native, developer-first — most-used vector DB in early prototypes and YC Demo Days. In-process embedded by default; client/server mode for production; Chroma Cloud (managed, 2024 launched, GA 2025). Python + JS clients. Designed for the developer who doesn’t want to think about indexes — sane defaults, HNSW under the hood, metadata filters.
- Marqo — Marqo AI 2022 (Australian); open-source + cloud. CLIP integration by default — multimodal (text + image) search out of the box without users having to wire up an embedding service.
- MyScale — MyScale Inc. 2023; ClickHouse-extended with vector capabilities. SQL-native vector search; columnar storage + MPP scan + vector index; ClickHouse compatibility means anyone who knows ClickHouse can use it directly.
- Turbopuffer — Simon Hørup Eskildsen 2023 (ex-Shopify engineering; previously wrote about Shopify’s database scaling); Series A; pricing of $0.10/GB-month storage + $0.01/M queries, dramatically cheaper than Pinecone for moderate-QPS workloads. Built specifically to power very large RAG corpora at moderate QPS — perfect fit for AI app companies indexing millions to billions of documents per tenant.
- Vald — Yahoo Japan Group 2018; CNCF Sandbox. HNSW + NGT (Yahoo Japan’s own graph index). Designed for Kubernetes-native deployment; sharded with auto-recovery.
- Tencent VectorDB — Tencent Cloud 2023; cloud-only product offered in Tencent Cloud for the Chinese market.
Vector search added to existing engines — the “extension” route
For most teams, the question isn’t “which standalone vector DB?” but “can I just use my existing database for this?” The answer in 2026 is increasingly yes.
Relational
- pgvector — Andrew Kane (ex-Instacart) 2021; Postgres extension; MIT-style PostgreSQL License. The single most influential vector tool of the post-ChatGPT era. Two index types: ivfflat (IVF, requires training set) and hnsw (since v0.5, 2023). Stores vector(dim) columns alongside any other Postgres column types — pre-filter via standard WHERE is free. v0.7 (2024) added halfvec (16-bit float storage, 50% smaller than float32) and sparse vector (sparsevec) and binary vector (bit-based) with cosine/Hamming/L2 distance operators. Available everywhere Postgres runs: AWS RDS Postgres, Aurora PostgreSQL, GCP Cloud SQL Postgres, Azure Database for Postgres, Neon, Supabase, Crunchy Data, Aiven, Timescale, Tembo, pgEdge.
- pg_vectorize, pg_search, pg_bm25 (ParadeDB) — Postgres extensions adding embedding-job orchestration (pg_vectorize), Tantivy-backed BM25 (pg_search from ParadeDB), and full-text + vector hybrid scoring inside Postgres.
- sqlite-vec — Alex Garcia 2024; the successor to sqlite-vss (Garcia’s earlier extension that wrapped FAISS but had build complexity). sqlite-vec is single-file, written in C, no external dependencies. Provides vec0 virtual tables with vector columns. Available in Python, Node.js, Ruby, Go, Rust, browser via WASM. The story of “SQLite + vectors” is finally clean.
- DuckDB vss extension — DuckDB Labs 2024; HNSW vector index for DuckDB’s columnar engine. In-process OLAP + vector — extremely fast for analytical RAG queries that join document scores with structured aggregations.
- MariaDB Vector — MariaDB 11.6 (2024); native VECTOR type + HNSW. Late entrant.
Search engines
- Elasticsearch dense_vector — since 7.0 (2019), HNSW from 8.0 (Feb 2022). Combined with Elasticsearch’s filtering, BM25, faceting, aggregations. Elastic License 2.0 + Server Side Public License (since Jan 2021). ELSER (Elastic Learned Sparse EncodeR; 2023) — Elastic’s own sparse retrieval model, alternative to dense embeddings, for query-document matching without the embedding-model dependency. Sep 2024 Elastic re-added Affero GPL v3 as a licensing option, restoring a true-open-source path. Elasticsearch Cloud at Elastic.co.
- OpenSearch k-NN plugin — AWS-led OpenSearch fork of Elasticsearch (created Jan 2021 after Elastic’s relicensing). Apache 2.0. k-NN plugin supports HNSW via three backends: NMSLIB (default), FAISS (supports IVF/IVFPQ in addition to HNSW), Lucene (Lucene’s native HNSW). OpenSearch was transferred from AWS to the OpenSearch Software Foundation under the Linux Foundation in Sep 2024 — defensive foundation play. Managed: AWS OpenSearch Service, AWS OpenSearch Serverless, Aiven for OpenSearch.
- Solr Dense Vector — Apache Solr 9.0 (2022) added DenseVectorField with HNSW (Lucene-backed). Used by Lucidworks + most Solr-shop legacy systems.
- Typesense Vector — Typesense (Jason Bosco + Kishore Nallan; founded 2017); HNSW + RBAC. Typesense Cloud managed. Differentiated by ease-of-use and built-in semantic search.
- Meilisearch vector search — Meilisearch (Quentin de Quelen, Clément Renault; French; 2018) — added vector support 2023 with Arroy (their HNSW+) and Hannoy (their LMDB-backed disk index). Meilisearch Cloud.
NoSQL and managed cloud DBs
- Redis VSS (Vector Similarity Search) — Redis Stack since 2022; HNSW + FLAT brute-force; Cosine + L2 + IP distance. Redis 7.4 (2024) brings vector capabilities into core Redis. Triple-license SSPL/RSALv2/AGPLv3 since March 2024. Valkey (Linux Foundation fork of Redis pre-relicensing, Mar 2024) — vector capabilities being added.
- Cassandra Vector Search — DataStax-led contribution to Apache Cassandra 5.0 (Sep 2024) — SAI (Storage-Attached Index) extension supporting vector type with HNSW. DataStax Astra managed Cassandra exposes this as Astra DB vector. SAI Sparse Asymmetric Index variant for cross-vector-and-structured filtering.
- MongoDB Atlas Vector Search — GA 2024 (preview 2023). HNSW under the hood (Atlas-only, not OSS MongoDB). Embedded into Atlas’s aggregation pipeline as $vectorSearch. Atlas free tier includes vector capability.
- Azure Cosmos DB Vector Index — preview 2023, GA 2024. Available in Cosmos DB for NoSQL and Cosmos DB for MongoDB (vCore). DiskANN-derived index.
- Convex Vector Search — Convex (the TypeScript backend platform) added vector search 2023. Convex Vector Search GA 2024 — store vectors alongside documents in Convex; query with filter expressions.
- Couchbase Vector Search — Couchbase 7.6 (2024); FTS service extended with vector search; HNSW.
- Aerospike Vector Search — Aerospike 7.0 (2024); positioned for very low-latency vector workloads.
- SingleStore Vector — SingleStore (formerly MemSQL) added native vector indexing 2023 — HNSW + IVF; integrated with the columnar analytics engine.
Cloud-managed vector services
- AWS OpenSearch Service + Bedrock Knowledge Bases — Bedrock KB (2023) is AWS’s managed RAG pipeline: chunking + embedding (Titan or Cohere) + storage (OpenSearch Serverless, Aurora pgvector, Pinecone, MongoDB Atlas, or Redis) + retrieval + optional reranking (Cohere Rerank 3, Amazon Rerank).
- AWS Kendra — Bedrock’s older sibling; managed enterprise search with embeddings; declining vs Bedrock KB.
- Vertex AI Vector Search — Google Cloud; formerly “Matching Engine” (rebranded 2024). ScaNN-powered. Tightly integrated with Vertex AI’s embedding APIs and Gemini grounding.
- Azure AI Search vector mode — formerly Cognitive Search; vector index since 2023; integrated vectorization (Azure-managed embedding pipeline) 2024 GA. Hybrid (BM25 + vector) with semantic ranker.
- Databricks Vector Search — managed vector capability inside the Databricks Lakehouse, indexed off Delta tables; auto-sync from underlying table changes.
- Snowflake Cortex Search — Snowflake’s managed vector + RAG capability; built on the Snowflake compute model.
Embedding models — the producer side
The choice of embedding model has a larger effect on retrieval quality than the choice of vector index. Top contemporary models:
- OpenAI text-embedding-3-large (3072 dim, Matryoshka-truncatable to 256/512/1024) + text-embedding-3-small (1536 dim). Default for OpenAI-shop applications. Pricing $0.02/1M tokens (small).
- Cohere Embed v3 (1024 dim) — multilingual (100+ languages); english-only and multilingual variants; specialized search-query vs search-document variants (“input_type” parameter); int8 + binary quantization supported. Embed v4 late 2024 — multimodal text+image+document.
- Voyage AI voyage-3 / voyage-3-large (1024 dim) — Voyage AI (Tengyu Ma, ex-Stanford) 2023-2024; among the top-performing on MTEB. voyage-finance-2, voyage-law-2, voyage-code-3 domain-specific variants. Acquired by MongoDB Feb 2025 for ~$220M — embedding models bundled with MongoDB Atlas going forward.
- BAAI bge family — Beijing Academy of AI; open-weight models. bge-large-en-v1.5 (1024 dim) + bge-m3 (multilingual, 8K context, 1024 dim, supports dense + sparse + multi-vector simultaneously) — bge-m3 is unique in producing all three retrieval signals from one forward pass.
- Nomic-embed v1.5 (768 dim) — Nomic AI; open-weight (Apache 2.0); long context (8192 tokens). Nomic-embed-text-v2 (2024) brings multilingual + Matryoshka.
- Jina embeddings v3 (1024 dim) — Jina AI; multilingual; 8K context. jina-clip-v2 for multimodal.
- mxbai-embed-large (1024 dim) — Mixedbread (Sean Lee + others; German); Apache 2.0; competitive with proprietary models on MTEB.
- Snowflake arctic-embed (and arctic-embed-l-v2.0 2024) — Apache 2.0; Matryoshka-truncatable.
- Google text-embedding-005 / Gecko — Vertex AI; Gemini-derived.
- Sentence-Transformers / all-MiniLM-L6-v2 — Nils Reimers (now at Cohere); the venerable 384-dim baseline. Cheap, fast, decent quality; still widely used for high-QPS workloads where the latest 1024-dim model is overkill.
Dimension reduction techniques for shrinking storage cost:
- Matryoshka representation learning (MRL) — Kusupati et al. 2022. Train the embedding such that the first 64 / 128 / 256 / 512 dimensions are independently useful. OpenAI v3, Cohere v3+, Voyage, Nomic, Arctic all support MRL.
- Binary quantization — convert float32 to 1 bit per dimension. 32× compression; Hamming distance; surprisingly good top-N recall when followed by a reranker. Cohere, Voyage, OpenAI v3 all explicitly support.
- int8 quantization — Cohere’s preferred path; 4× compression; near-lossless.
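Two of the techniques above are short enough to sketch directly. Truncation only preserves quality when the model was trained with MRL, and the symmetric max-abs int8 scheme below is one common choice assumed for illustration; providers may calibrate their quantization ranges differently.

```python
import math


def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` components of an MRL-trained embedding and
    rescale to unit length so cosine / dot-product scores stay comparable."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def quantize_int8(vec):
    """Symmetric scalar quantization: map [-max_abs, max_abs] onto [-127, 127].
    Returns the integer codes and the scale needed to decode them."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # all-zero vector: any scale works
    return [round(x / scale) for x in vec], scale


def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


short = truncate_and_renormalize([3.0, 4.0, 12.0], 2)  # roughly [0.6, 0.8]
original = [1.0, -0.5, 0.25, 0.0]
codes, scale = quantize_int8(original)                  # 1 byte per dim instead of 4
restored = dequantize_int8(codes, scale)
```

The per-component reconstruction error of this scheme is bounded by half the scale, which is why int8 is usually described as near-lossless for retrieval.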
Hybrid retrieval — dense + sparse + rerank
Production RAG almost never uses dense alone. The standard pipeline:
- Dense retrieval — embed the query, find top-k (typically k=20-50) via HNSW/DiskANN/IVF.
- Sparse retrieval — BM25 (Lucene/Elasticsearch/Tantivy) or learned sparse (SPLADE, Elastic ELSER, BGE-M3 sparse). Top-k.
- Fusion — combine the two ranked lists, usually via Reciprocal Rank Fusion (RRF) (Cormack et al. 2009; score = sum(1 / (k + rank))).
- Rerank — pass the fused top-N through a heavier cross-encoder model that scores each query-document pair jointly. Cuts the candidate set to top-5 or top-10 for the LLM.
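The fusion step is small enough to show in full. In the sketch below, k is the RRF smoothing constant from the formula (60 is the value used in the Cormack et al. paper), not the retrieval top-k; the document ids and the alphabetical tie-break are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids: score(d) = sum over lists of 1 / (k + rank),
    with rank starting at 1. Ids missing from a list contribute nothing for it."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["d3", "d1", "d7", "d2"]   # from the vector index
sparse_hits = ["d1", "d9", "d3", "d4"]  # from BM25 or a learned sparse model
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['d1', 'd3', 'd9', 'd7', 'd2', 'd4']
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers, which is a large part of why it is the default fusion method.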
Rerankers available as APIs:
- Cohere Rerank 3 / Rerank 3.5 (2024) — multilingual; the most-used managed reranker.
- Jina Rerank v2 — Jina AI; multilingual.
- Voyage AI rerank-2 — Voyage’s reranker; English-strong.
- BAAI BGE-reranker-v2-m3 — open-weight; runs locally.
- MS MARCO cross-encoder — sentence-transformers; the open-weight baseline.
Sparse retrieval models:
- SPLADE / SPLADE++ (Naver Labs Europe; 2021-2022) — learned sparse embeddings (sparse term-expansion via masked-language-model probabilities).
- Elastic ELSER — Elastic’s proprietary learned sparse; trained on Elastic’s data.
- BGE-M3 sparse output — bge-m3 produces dense + sparse + multi-vector simultaneously.
- BM25 — the venerable 1994 baseline (Robertson + Sparck Jones); still extremely competitive after dense retrieval became common.
2024-2026 trends
- Serverless pricing dominates new launches — Pinecone Serverless v3 (Jan 2024), Turbopuffer (object-storage-priced), MongoDB Atlas Search (pay-as-you-go), Convex, Cosmos DB pay-per-request. Per-pod / per-CPU dedicated is now the option, not the default.
- Object-storage-primary architectures — Turbopuffer, Lance, Milvus 3.0’s disaggregated stage-storage, Vespa Cloud’s tiered architecture; following the same WarpStream-for-Kafka playbook of “S3 as the durability layer, RAM as cache.”
- Multi-vector and named vectors become standard — Weaviate (named vectors v1.24), Qdrant (multi-vector v1.10), Vespa (tensor-native from day one), pgvector (multiple columns); store one vector per chunk + one per title + one per image and query each independently.
- GPU index build — Milvus, Qdrant (v1.13+), Pinecone Serverless; cuts a 100M-vector HNSW build from hours to minutes.
- Sparse + dense unified in one index — Qdrant sparse vectors, BGE-M3 unified dense+sparse output, Vespa tensor-sparse, pgvector sparsevec. The dense/sparse split is converging at the storage layer.
- Vector capabilities in every existing DB — Postgres, MySQL/MariaDB, MongoDB, Redis, Cassandra, SQLite, DuckDB, Elasticsearch, OpenSearch, Solr, Cosmos DB, Couchbase, Aerospike, SingleStore all now have HNSW. The standalone-vector-DB category compresses to differentiated cases (huge scale, multimodal, agentic).
- Embedding model bundling — Pinecone Inference, Atlas + Voyage (post-acquisition), Cohere Embed in Azure AI Search, Cohere + Bedrock, Vertex AI text-embedding. The DB and the embedding model are increasingly sold together.
- Rerankers cheap and ubiquitous — Cohere Rerank 3, Voyage rerank-2, Jina, AWS Bedrock Rerank, BGE-reranker — rerank as a managed API call costs <$0.50 per million tokens and is now expected in production RAG.
- Long-context models reduce the need for fine chunking — Gemini 2.0 Pro (2M tokens), Claude Opus 4.x (200K-1M), GPT-4.1 (1M). For some use cases, ingest the whole corpus into context and skip retrieval entirely; for others, retrieve large coarse chunks (10K-20K tokens) rather than 500-token paragraphs.
- Binary embeddings + Matryoshka mainstream — 32× storage reduction is real; binary + reranker is now a standard cost-optimization pattern.
Filter strategies — pre-filter vs post-filter vs filtered-graph
A subtle correctness + performance issue that defines a lot of vector-store engineering. The typical query is “find similar vectors where tenant_id = 42 AND date > 2025-01-01.”
- Post-filter — retrieve top-K candidates by vector similarity, then drop the ones that don’t pass the filter. Fast to implement but can return fewer than K results (or zero) when the filter is selective. Acceptable when the filter is permissive, i.e. most vectors pass it.
- Pre-filter — apply the filter first to identify candidate IDs, then run vector search over the filtered subset. Correct, but slow on large corpora: the graph index assumes any neighbor may be traversed, so the search over the filtered subset typically falls back to a brute-force scan. Used as a fallback in pgvector with WHERE clauses.
- Filtered-graph search (graph-with-filter) — traverse the HNSW graph with the filter applied at each step. This is the state-of-the-art approach in Qdrant (payload index + filtered HNSW), Weaviate (multi-tenant collections), Milvus (partition + scalar index), Vespa (filter pushdown into HNSW), and Pinecone (post-2023 architecture). The result is correct top-K with bounded latency overhead.
- Single-tenant indexes — for large multi-tenant deployments, give each tenant its own HNSW index rather than filtering a shared one. Pinecone namespaces, Weaviate multi-tenancy mode, Qdrant collections, Milvus partitions all support this. Trade-off: each tenant’s index build cost is amortized over fewer queries.
Cardinality awareness matters: a tenant with 100K vectors is fine on a shared index with filtering; a tenant with 100M is fine with a dedicated index; the messy middle (1M-10M) benefits from explicit tenant partitioning where supported.
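The correctness gap between post-filtering and pre-filtering shows up even on a toy brute-force example. The data below is contrived for illustration (one-dimensional vectors, a tenant filter that passes 10% of items); filtered-graph search is not shown because it needs a full graph index.

```python
def l2sq(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def post_filter(query, items, predicate, k):
    """Take the top-k by similarity first, then drop items failing the filter."""
    top = sorted(items, key=lambda it: l2sq(query, it["vec"]))[:k]
    return [it["id"] for it in top if predicate(it)]


def pre_filter(query, items, predicate, k):
    """Apply the filter first, then take the top-k among the survivors."""
    allowed = [it for it in items if predicate(it)]
    return [it["id"] for it in sorted(allowed, key=lambda it: l2sq(query, it["vec"]))[:k]]


# 100 items on a line; only every tenth item belongs to tenant 42.
items = [{"id": i, "vec": [float(i)], "tenant_id": 42 if i % 10 == 0 else 7}
         for i in range(100)]


def is_tenant_42(item):
    return item["tenant_id"] == 42


print(post_filter([0.0], items, is_tenant_42, k=5))  # [0]: only 1 of 5 survives
print(pre_filter([0.0], items, is_tenant_42, k=5))   # [0, 10, 20, 30, 40]
```

Oversampling the post-filter (retrieving 10× candidates) narrows the gap but cannot close it for sufficiently selective filters, which is the motivation for filter-aware graph traversal.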
Vector-aware data formats and lakehouse integration
A 2024-2026 pattern: vectors as a column type in lakehouse + open table formats, queried alongside structured columns by Spark / Trino / DuckDB / Polars.
- Apache Iceberg — vector columns are stored as binary or as fixed-length float arrays; query engines provide vector-search UDFs. No native HNSW index in Iceberg 1.5/1.6 (2024) — the index lives separately. Pinecone, Databricks Vector Search, Snowflake Cortex sync from Iceberg.
- Apache Hudi, Delta Lake — same story; vector columns supported but indexes are external.
- Lance — LanceDB’s columnar file format; first-class vector indexing built in (IVF-PQ + HNSW). The standout open format for vector-native data lakes.
- Parquet vector arrays — Parquet FIXED_LEN_BYTE_ARRAY or LIST<FLOAT> columns; queryable in DuckDB vss, Polars, Spark MLlib. No native index — relies on the query engine.
- Arrow Flight + Arrow IPC — increasingly the wire format for ferrying embeddings between embedding service → vector store → reranker. Lance, Polars, DataFusion all Arrow-native.
Benchmarks and evaluation
Comparing vector stores is harder than comparing OLTP databases because recall is a tunable parameter, not a property. The standard methodology fixes recall@10 ≥ 0.95 and compares latency + cost.
- ANN-Benchmarks — Erik Bernhardsson + community 2017; the academic standard. SIFT1M, GIST1M, GloVe, Deep1B, NYT datasets across HNSW / IVF / Annoy / FAISS / NMSLIB / ScaNN. Recall vs queries-per-second curves.
- VectorDBBench — Zilliz 2023; Milvus + Pinecone + Weaviate + Qdrant + Elasticsearch + Redis + PostgreSQL pgvector head-to-head on standardized datasets (Cohere-1M, Cohere-10M, OpenAI-500K, OpenAI-5M, SIFT-128-Euclidean). Reports recall, QPS, p99 latency, build time, memory.
- MTEB (Massive Text Embedding Benchmark) — Hugging Face + Niklas Muennighoff 2022; the benchmark for embedding models, not for vector stores. 56+ tasks across 8 task types (retrieval, reranking, clustering, classification, STS, etc.) and now multilingual extensions (MMTEB 2024). Models rank here.
- BEIR — heterogeneous retrieval benchmark; classic baseline for retrieval quality.
- MS MARCO — Microsoft 2016; the foundational web-search retrieval dataset.
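The two quantities that the fixed-recall methodology compares are easy to compute once an exact (brute-force) ground truth is available. A minimal sketch follows, using the nearest-rank percentile definition as an assumption; benchmark harnesses may interpolate instead.

```python
import math


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # ground truth from a brute-force scan
approx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]   # ANN result that missed one neighbor
latencies_ms = list(range(1, 101))          # 100 hypothetical query latencies

print(recall_at_k(approx, exact, 10))       # 0.9
print(percentile(latencies_ms, 99))         # 99
```

A fair comparison tunes each store's parameters (ef, nprobe, and so on) until recall@10 clears the target, and only then reads off latency and cost.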
Caching, batching, and serving patterns
Production vector search rarely runs one query at a time:
- Query embedding cache — many user queries repeat (“how do I cancel?”); cache the query embedding by hash. Redis or in-memory LRU.
- Batched embedding — when ingesting documents, batch chunks into 32-128-vector embedding model calls rather than one-at-a-time; 10-50× throughput improvement.
- Async rerank pipeline — return the top-10 dense-retrieval result immediately to the LLM, run a reranker in parallel that may swap in better results for the next conversation turn.
- HyDE (Hypothetical Document Embeddings) — Gao et al. 2022; embed a hypothetical answer (LLM-generated from the query) rather than the query itself; often improves recall on short-query / long-document corpora.
- Multi-query retrieval — generate N paraphrases of the query (via LLM), retrieve top-K for each, fuse with RRF.
- GraphRAG — Microsoft Research 2024; build an entity-relation knowledge graph from the corpus, embed both nodes and edges, query both. Better at multi-hop reasoning than naive chunking.
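The first two patterns, query-embedding caching and batched embedding, can be sketched with the standard library alone. Here fake_embed is a hypothetical stand-in for a real embedding-model call, and normalizing the query with strip().lower() before hashing is an assumption that may not suit case-sensitive corpora.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """In-memory LRU cache of query embeddings, keyed by a hash of the query text."""

    def __init__(self, embed_fn, max_entries=10_000):
        self._embed = embed_fn
        self._max = max_entries
        self._store = OrderedDict()

    def get(self, text):
        key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)     # mark as most recently used
            return self._store[key]
        vec = self._embed(text)
        self._store[key] = vec
        if len(self._store) > self._max:
            self._store.popitem(last=False)  # evict the least recently used entry
        return vec


def batched(items, size):
    """Yield consecutive slices of at most `size` items, e.g. chunks per model call."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


calls = []


def fake_embed(text):
    """Hypothetical stand-in for a real embedding-model call; records each invocation."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed, max_entries=2)
v1 = cache.get("How do I cancel?")
v2 = cache.get("how do i cancel?  ")  # same key after normalization: no second model call
chunk_batches = list(batched(list(range(10)), 4))
```

In a multi-process deployment the same keying scheme works with Redis in place of the in-process OrderedDict.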
Adjacent
- database-engine-taxonomy — pgvector, sqlite-vec, DuckDB vss, Cassandra SAI, MongoDB Atlas Search live in the broader database catalog.
- llm-landscape — vector stores are foundational to RAG, the dominant LLM deployment pattern; embedding models cross-reference here.
- rag-embeddings-vector-search — Tier 2 reference for the retrieval-augmented-generation system architecture.
- ml-framework-comparison — the frameworks used to train and serve these embedding models.
- cloud-provider-service-mapping — Bedrock KB, Vertex AI Vector Search, Azure AI Search mapped across hyperscalers.
- distributed-systems-fundamentals — sharding, replication, consistency background for distributed vector stores.