RAG, Embeddings, Vector Search — Compute Reference
1. At a glance
Retrieval-Augmented Generation (RAG) is the dominant production pattern for grounding large language models in external knowledge. The shape is simple: retrieve relevant context from a corpus, augment the LLM prompt with that context, generate the answer. The corpus is anything the model didn’t see during pretraining — your wiki, your codebase, your customer tickets, last week’s news, a regulatory PDF — and retrieval brings in only what matters for the current query.
RAG solves four practical problems at once:
- Hallucination — when the model is given grounded text, it cites and quotes instead of confabulating. Reduction is large but not absolute; you still need evaluation.
- Knowledge cutoff — pretraining ends on a date; RAG injects fresh information without retraining.
- Citation / auditability — retrieved chunks carry source URLs and IDs that the model can quote, satisfying compliance and trust requirements.
- Cost — a small model with good retrieval often beats a frontier model with no retrieval, at a fraction of the inference cost.
The seminal paper is Lewis et al. 2020, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (NeurIPS), which proposed a dense retriever plus seq2seq generator fine-tuned end-to-end and evaluated on knowledge-intensive tasks such as Natural Questions. The technique stayed niche until ChatGPT (November 2022) created consumer-scale demand for a way to plug company-specific data into a chat model without fine-tuning. By 2023 every major vector database had launched; by 2024 RAG was the default architecture taught in every “build with LLMs” tutorial; by 2026 it is a commodity stack with established best practices, well-understood failure modes, and a mature evaluation discipline.
The naive version — chunk, embed, top-k cosine, stuff into prompt — gets you 60% of the way. The production version layers in hybrid retrieval, re-ranking, query rewriting, metadata filters, eval harnesses, and frequently a graph or summary index. This reference covers the full stack at the level a senior engineer needs to design, debug, and budget for.
2. Architecture
A production RAG system has three operational stages, each with its own latency and cost profile:
Stage 1 — Ingest (offline, periodic)
- Source loaders pull from S3, Confluence, Notion, GitHub, RDBMS, web crawl, etc.
- Parsers / extractors turn PDF / DOCX / HTML / PPTX into clean text. The hard cases are tables, scanned PDFs (OCR), code blocks, math. Tools: Unstructured.io, LlamaParse, Marker, PyMuPDF, Apache Tika.
- Chunkers split documents into retrievable units (see §3).
- Embedders call an embedding model to produce a dense vector per chunk.
- Metadata extraction attaches structured fields: source URL, author, date, ACL, language, doc type. This is where filters live.
- Index writes push vectors + metadata into the vector DB; optionally also a BM25 / inverted-text index for hybrid.
Ingest is throughput-bound and embarrassingly parallel. Budget: embedding cost is the dominant line item for the initial build of a large corpus; after that, storage dominates.
Stage 2 — Retrieve (online, per query)
- Query understanding — optionally rewrite, expand, decompose, or classify the query.
- Embed query with the same model used at ingest (a mismatched embedder is a common silent failure).
- ANN search — top-k (typically k=20-100) by cosine / dot product.
- Hybrid merge — optionally fuse with BM25 results via RRF.
- Metadata filter — restrict by ACL, recency, language, etc.
- Re-rank — pass top-N through a cross-encoder or reranker model to keep top-k′ (typically k′=3-10).
- Context assembly — concatenate, dedupe, summarize if over budget.
Retrieve is latency-bound. Budget: target <300 ms end-to-end before the LLM call begins.
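The context-assembly step at the end of this stage can be sketched as below. This is a minimal illustration, not a reference implementation: it assumes the chunks arrive already ranked, approximates tokens by whitespace-separated words, and the name assemble_context and the 2000-token default are hypothetical.

```python
def assemble_context(ranked_chunks, max_tokens=2000):
    """Deduplicate ranked chunks and keep as many as fit in the token budget.

    Token counts are approximated by whitespace-separated words (an
    assumption); swap in the model's real tokenizer for accurate budgets.
    """
    seen = set()
    kept, used = [], 0
    for chunk in ranked_chunks:
        key = " ".join(chunk.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # duplicate of a chunk we already kept
        cost = len(chunk.split())
        if used + cost > max_tokens:
            continue  # would overflow the budget; a later, shorter chunk may fit
        seen.add(key)
        kept.append(chunk)
        used += cost
    return kept
```

Because the loop walks chunks in rank order, the budget is spent on the highest-ranked material first; summarizing overflow instead of dropping it is a possible extension that this sketch leaves out.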
Stage 3 — Generate (online, per query)
- Prompt construction — system prompt + retrieved chunks + user query + instructions for citation format.
- LLM call — frontier or fine-tuned model, often with structured output (JSON, citations).
- Postprocess — verify citations resolve, optionally fact-check, return.
Generate is cost-bound. LLM tokens dominate unit economics at scale.
3. Chunking
Chunking is the single highest-leverage knob in the system and the most under-thought. Bad chunks break everything downstream. Strategies, in order of sophistication:
- Fixed-size character / token — split every N characters or tokens. Trivial, works surprisingly well for prose. Pair with overlap (50-100 tokens) so a thought isn’t sliced.
- Sentence boundary — split on NLTK / spaCy sentence detection, then pack into target size. Better for narrative text.
- Paragraph — split on \n\n. Good baseline for blog posts, articles.
- Recursive — try paragraph splits first; if a chunk is still too big, fall back to sentence; then word; then character. LangChain RecursiveCharacterTextSplitter is the reference implementation and a sensible default.
- Semantic — embed candidate breakpoints, cut where adjacent sentences have low cosine similarity (Greg Kamradt’s “semantic chunking”). Tracks topic boundaries instead of syntactic ones.
- Structure-aware — Markdown headers as anchors (MarkdownHeaderTextSplitter), HTML tag hierarchy, code AST (tree-sitter), PDF section headings. Always prefer this when structure exists.
- Late chunking (Jina 2024) — embed the whole document with a long-context embedder, then pool sub-ranges. Preserves long-range context.
Typical target: 200-1000 tokens per chunk with 10-15% overlap. Smaller chunks are more precise (the retrieved unit matches the query closely) but produce more noise and lose surrounding context. Larger chunks dilute the embedding (one chunk talks about three things, none of which dominate) and cap how much you can fit in the prompt.
Two rules that hold up: (a) never split a code block, a table, or a list mid-item; (b) prepend the section header / breadcrumb / document title to every chunk so the embedder sees its disambiguating context (this is half of Anthropic’s “contextual retrieval” trick).
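Fixed-size chunking with overlap, plus rule (b) about prepending a breadcrumb, can be sketched as follows. This is a simplified illustration: whitespace-separated words stand in for tokens, and the function name chunk_words, its defaults, and the breadcrumb format are all hypothetical.

```python
def chunk_words(text, size=200, overlap=30, breadcrumb=""):
    """Split text into windows of `size` words that share `overlap` words
    with the previous window, prefixing each chunk with a breadcrumb."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        body = " ".join(words[start:start + size])
        chunks.append(f"{breadcrumb}\n{body}" if breadcrumb else body)
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

A call such as chunk_words(section_text, size=300, overlap=40, breadcrumb="Handbook > Billing > Refunds") yields chunks that each carry their own disambiguating context. The sketch does not implement rule (a); protecting code blocks, tables, and list items requires a structure-aware splitter.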
4. Embedding models
An embedding model maps text to a fixed-length dense vector such that semantically similar inputs land near each other. Quality varies wildly by domain (general web vs. legal vs. code vs. medical) and by language.
Closed / API-only
- OpenAI — text-embedding-3-small (1536 dim, $0.02/1M tokens) is the cost-effective default; text-embedding-3-large (3072 dim) tops their lineup. Both support Matryoshka representation learning so you can truncate to 256 / 512 / 1024 dim for storage savings with mild quality loss.
- Cohere — embed-english-v3.0, embed-multilingual-v3.0 (1024 dim); the input_type parameter (search_query vs search_document) is mandatory, and passing the wrong value is a frequent source of bugs.
- Voyage AI — voyage-3, voyage-large-3-instruct; consistently at or near the top of MTEB leaderboards in 2025-26. Voyage was acquired by MongoDB in early 2025.
- Google Vertex — text-embedding-gecko, text-multilingual-embedding.
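The Matryoshka truncation mentioned for the OpenAI models amounts to keeping a prefix of the vector and re-normalizing it. The sketch below shows only that arithmetic; it is meaningful only for embedders trained with a Matryoshka objective, and the helper name truncate_embedding is hypothetical.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-renormalize the result."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalize
    return [x / norm for x in head]
```

Re-normalizing matters because cosine and dot product only coincide on unit-length vectors. Query and document vectors must be truncated to the same dimension before they are compared.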
Open-weight
- BGE family (BAAI, Beijing Academy of Artificial Intelligence) — bge-large-en-v1.5 is the workhorse open model; bge-m3 (Chen et al. 2024) is multi-functional + multi-lingual + multi-granularity (dense + sparse + ColBERT-style multi-vector in one model).
- E5 (Wang et al. 2022, Microsoft) — e5-large-v2, multilingual-e5-large, e5-mistral-7b-instruct; instruction-tuned variants take a task description string.
- Stella (Dun Zhang) — punches above its weight on MTEB at small parameter counts.
- GTE (Alibaba Damo) — gte-large, gte-Qwen2-7B-instruct.
- Nomic Embed (Nomic AI 2024) — nomic-embed-text-v1.5, fully open including training data; Matryoshka-trained.
- mxbai-embed-large-v1 (Mixedbread AI) — competitive small model.
- jina-embeddings-v3 (Jina AI 2024) — long context (8192), task-LoRA architecture.
Frameworks and tuning
Sentence-Transformers (Reimers + Gurevych 2019, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”) is the canonical training and inference library. It exposes a clean API (model.encode(texts)) and is the de facto path for fine-tuning your own embedder on domain data via contrastive loss (MultipleNegativesRankingLoss is the workhorse).
Instruction-tuned embeddings take the task as a prefix:
"Represent this sentence for retrieving relevant documents: <query>"
E5-instruct, BGE-instruct, Voyage-instruct, and the Qwen embedders all use this pattern. The query and document use different instruction strings.
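A small sketch of keeping query and document instructions apart, so the two sides are never embedded with the same string by accident. The query prefix is the example quoted above; the empty document prefix and the helper name with_instruction are assumptions, and the correct strings always come from the specific model card.

```python
PREFIXES = {
    "query": "Represent this sentence for retrieving relevant documents: ",
    "document": "",  # assumption: this model embeds documents without a prefix
}


def with_instruction(texts, kind):
    """Prepend the instruction string for `kind` ('query' or 'document')."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown input kind: {kind!r}")
    return [PREFIXES[kind] + t for t in texts]
```

Routing every embed call through one function like this makes the query/document distinction explicit at each call site, which is where the mismatch usually creeps in.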
Late-interaction (multi-vector)
Single-vector models compress a passage into one dense point — lossy by construction. ColBERT (Khattab + Zaharia 2020 SIGIR, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT”) emits one vector per token and scores query-document pairs by MaxSim (each query token’s best match). ColBERTv2 (Santhanam et al. 2022 NAACL) added residual compression so the index size is competitive with single-vector indexes. ColBERT consistently wins on recall and is increasingly viable for production via PLAID (Santhanam 2022) and ColPali for vision-language. The 2024-26 trend is hybrid systems that use single-vector for first-stage, ColBERT for re-rank.
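MaxSim scoring as described above fits in a few lines. This sketch assumes the per-token vectors are already computed and normalized, and it uses plain Python lists where a real system would use batched tensor operations; the function names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """For each query token take its best-matching document token,
    then sum those maxima over all query tokens."""
    if not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

For example, with query token vectors [[1, 0], [0, 1]], a document whose tokens cover both directions scores higher than one that repeats a single direction, which is the behaviour a single pooled vector cannot express.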
MTEB
The Massive Text Embedding Benchmark (Muennighoff et al. 2022) is the standard evaluation suite; the original English leaderboard averages 56 datasets across classification, clustering, pair classification, retrieval, reranking, STS, and summarization. Treat the leaderboard as directional, not gospel: gains on MTEB sometimes don’t transfer to your specific domain. Always run an in-domain eval before committing to an embedder.
5. Vector indexes and ANN algorithms
For corpora above ~100k vectors, exact nearest-neighbor search becomes too slow. Approximate Nearest Neighbor (ANN) algorithms trade a small recall hit for orders of magnitude speedup.
- Flat / exact — brute-force cosine over every vector. Fine for <100k vectors and gives ground-truth recall; useful as a recall baseline for evaluating ANN tuning.
- HNSW (Hierarchical Navigable Small World) (Malkov + Yashunin 2018, IEEE TPAMI “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs”) — the mainstream default in 2026. Builds a multi-layer graph where higher layers are sparse “highways” and lower layers are dense local neighborhoods; queries greedy-descend from top to bottom. Tunable via M (graph degree), efConstruction (build quality), efSearch (query effort). Excellent recall/latency, high RAM cost (vectors live in memory + graph overhead).
- IVF (Inverted File) — partition vector space into Voronoi cells via k-means; at query time, search only the nearest few cells (nprobe cells). Cheaper to build than HNSW but lower recall at equal latency.
- IVF-PQ (Jégou, Douze, Schmid 2011 IEEE TPAMI, “Product quantization for nearest neighbor search”) — IVF combined with product quantization, which splits each vector into sub-vectors and clusters each sub-vector space independently. A 1024-dim float32 vector (4 KB) compresses to ~64 bytes. Memory-efficient at the cost of recall; the canonical billion-scale baseline.
- OPQ (Optimized Product Quantization) — rotates the vector space before PQ so sub-vectors are more independent; recovers some of the recall PQ loses.
- ScaNN (Guo et al. 2020, Google, “Accelerating Large-Scale Inference with Anisotropic Vector Quantization”) — anisotropic quantization tuned for inner product; Google’s internal default; benchmarks well at high recall targets.
- DiskANN / Vamana (Subramanya et al. 2019, Microsoft NeurIPS, “DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node”) — graph index designed so most of it sits on SSD, not RAM; a single commodity node can serve billions of vectors. Used in Bing and Microsoft 365 search backends.
- FreshDiskANN (Singh et al. 2021) — extends DiskANN with incremental insert / delete so the index doesn’t need full rebuilds on each update.
- SPANN (Chen et al. 2021, Microsoft) — alternative billion-scale design combining IVF with a small in-memory centroids index and disk-resident posting lists.
- GPU — cuVS (NVIDIA RAPIDS, 2024) and FAISS-GPU push graph (CAGRA) / IVF / brute-force search onto GPUs; useful for very large indexes or very high QPS.
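To make the IVF entry in the list above concrete, here is a toy inverted-file index in pure Python. It is a teaching sketch under simplifying assumptions: k-means is seeded with the first k points for determinism, distances are squared Euclidean, and all function names are hypothetical. Production libraries such as FAISS implement the same idea with far more engineering.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Plain Lloyd's algorithm, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            cells[nearest].append(p)
        for c, members in enumerate(cells):
            if members:  # keep the old centroid if a cell went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_ivf(points, k):
    """Return (centroids, inverted lists of point indexes per cell)."""
    centroids = kmeans(points, k)
    lists = [[] for _ in range(k)]
    for idx, p in enumerate(points):
        cell = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        lists[cell].append(idx)
    return centroids, lists


def ivf_search(query, points, centroids, lists, top_k=3, nprobe=1):
    """Scan only the `nprobe` cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, points[i]))[:top_k]
```

Raising nprobe trades latency for recall: with nprobe equal to the number of cells the search degenerates to the exact brute-force scan, which is a convenient sanity check.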
Tuning rules of thumb
- HNSW M=16, efConstruction=200, efSearch=100 is a reasonable starting point for production.
- Always measure recall@k against an exact baseline on your data before going live.
- ANN recall is sensitive to filter selectivity — if metadata filters reduce the candidate set by 99%, post-filtering top-k may miss everything. Use filter-aware indexes (see §14).
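Measuring ANN recall against an exact baseline, as the second rule recommends, can be sketched like this. It assumes you already have, for each test query, the ID list returned by the ANN index and the ID list from a brute-force scan; the function names are hypothetical.

```python
def ann_recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbours that the ANN index also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_ann_recall(approx_runs, exact_runs, k):
    """Average recall@k over paired per-query result lists."""
    scores = [ann_recall_at_k(a, e, k) for a, e in zip(approx_runs, exact_runs)]
    return sum(scores) / len(scores) if scores else 0.0
```

Running this over a few hundred sampled queries while sweeping efSearch (or nprobe) gives the recall/latency curve needed to pick an operating point.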
6. Vector databases
The 2022-23 explosion of vector databases left a crowded market that is now consolidating around three patterns: purpose-built managed services, OSS embeddable libraries, and extensions of existing databases.
Managed / purpose-built
- Pinecone — first mover (2019), serverless tier launched in early 2024, popular for “just works” production deployments.
- Weaviate — OSS-core + cloud; strong hybrid retrieval, modules ecosystem.
- Qdrant (Cloud + OSS) — Rust-based, strong filtering, increasingly common 2024-26.
- Chroma — OSS-first, easy embedded developer experience; also offers cloud.
- Vespa (Yahoo / Vespa.ai) — the most mature hybrid retrieval engine; production-grade ranking pipeline; steep learning curve.
- Vald (Yahoo Japan) — Kubernetes-native ANN platform.
- Azure AI Search — Microsoft managed hybrid (was Azure Cognitive Search); deeply integrated with Azure OpenAI.
- Vertex AI Vector Search (Google) — formerly Matching Engine, built on ScaNN.
- Amazon OpenSearch Serverless (with vector engine) — AWS-native; also Bedrock Knowledge Bases sits on top.
- MongoDB Atlas Vector Search — vector field type in MongoDB; popular for teams already on Atlas.
- Turbopuffer (2023) — object-storage-backed, very cheap at scale.
OSS embeddable libraries
- FAISS (Johnson, Douze, Jégou; Meta 2017) — the OG ANN library; C++ with Python bindings; not a database (no persistence layer, no metadata), but the algorithmic backbone of many products. Implements Flat, IVF, IVF-PQ, HNSW, OPQ, GPU variants.
- hnswlib — header-only C++ HNSW reference implementation by the algorithm’s authors.
- Annoy (Spotify) — tree-based ANN; simpler and slower than HNSW but easy to mmap; still used inside Spotify and many older systems.
- Voyager (Spotify 2023) — Spotify’s HNSW-based successor to Annoy, with better recall.
- USearch (Unum) — extremely fast SIMD HNSW; tiny binary; supports custom metrics.
- ScaNN (Google OSS release) — usable standalone outside Vertex.
- sqlite-vss / sqlite-vec — vector search inside SQLite; the local-first / edge default.
Database extensions (the 2023+ shift)
The biggest change of the last two years is teams skipping a dedicated vector DB and using their existing OLTP database with a vector extension. Single source of truth, no sync, transactional ACID guarantees on vectors and rows together.
- pgvector (Postgres) — by far the most popular; HNSW + IVF-Flat indexes; halfvec / sparsevec / bit types added 2024; mature ecosystem; runs at hundreds of millions of vectors with care.
- pgvecto.rs — Rust-based Postgres extension, generally faster than pgvector at the cost of being newer.
- Redis Vector Similarity Search (VSS) — in-memory, HNSW + Flat; well-suited for low-latency caches.
- Elasticsearch / OpenSearch kNN — bolted onto the dominant BM25 engines; hybrid is native.
- ClickHouse — vector index types (HNSW, USearch) added 2024; useful when your data is already there.
- MariaDB Vector (11.7, 2024) — late entrant but native to a popular SQL stack.
- SingleStore Vector — vector functions in a distributed SQL engine.
- DuckDB-VSS — HNSW for embedded analytics workflows.
Local-first
- LanceDB (Rust, columnar Lance format) — disk-resident, very fast, batteries-included; ideal for laptops, edge, and pipeline-as-code.
- Chroma, Qdrant, Milvus Lite — all run embedded or single-node for development.
Picking one
For a greenfield project in 2026: if you already have Postgres, start with pgvector; you almost certainly won’t outgrow it before you’ve solved the harder problems upstream. If you need hybrid retrieval at scale and want a configurable ranking pipeline, choose Vespa or Weaviate. If you want fully managed with minimal ops, choose Pinecone, Qdrant Cloud, or Azure AI Search. If you’re at the billion-vector scale or want indexes that live on SSD or object storage rather than RAM, look at disk-resident systems (Milvus with its DiskANN index, Turbopuffer, custom).
7. Similarity metrics
The choice of distance function is constrained by how the embedder was trained — not by what feels intuitive.
- Cosine similarity — dot(a,b) / (||a|| · ||b||). The most common metric for text embeddings; invariant to vector magnitude. Range [-1, 1], higher = more similar.
- Dot product (inner product) — sum(a_i · b_i). When vectors are L2-normalized (most modern embedders normalize), dot product equals cosine and is cheaper. ScaNN and DiskANN often assume inner product.
- L2 / Euclidean distance — sqrt(sum((a_i - b_i)^2)). Used by some image embedders and by quantization-heavy pipelines.
- Manhattan / L1 — sum(|a_i - b_i|). Rare for embeddings; more common in nearest-neighbor with categorical data.
- Hamming distance — count of differing bits, used for binary embeddings (1-bit quantization of dense vectors; ~32x storage reduction with 90-95% of recall retained).
Always match the model card’s recommended metric. Using cosine where the model was trained for dot product, or vice versa, silently degrades recall by 10-30%.
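The formulas in the list above translate directly into code. This sketch uses plain Python lists for clarity, represents binary embeddings as integers for the Hamming case, and the function names are illustrative; a real system would use vectorized operations.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a_bits, b_bits):
    """Number of differing bits between two integers used as bit vectors."""
    return bin(a_bits ^ b_bits).count("1")
```

On unit-length vectors, dot(normalize(a), normalize(b)) agrees with cosine(a, b) up to floating-point error, which is why normalized embedders can be indexed with the cheaper inner-product metric.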
8. Hybrid retrieval
Pure dense retrieval misses exact-keyword matches: acronyms, identifiers, product codes, rare named entities, code symbols. Pure sparse (BM25) misses paraphrase and semantic similarity. Hybrid retrieval combines both and consistently outperforms either alone in published benchmarks (Bruch et al. 2023 “An Analysis of Fusion Functions for Hybrid Retrieval”).
Sparse retrieval options
- BM25 (Robertson + Walker 1994) — the classical baseline; tokenize, IDF-weight, length-normalize. Lucene, Tantivy, Elasticsearch implementations.
- SPLADE (Formal et al. 2021, “SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking”) — learned sparse representations from a BERT MLM; each document maps to a sparse vector in vocab space. Better recall than BM25, fits inverted-index infra. v2 and v3 variants improve quality.
- uniCOIL (Lin + Ma 2021) — learns term weights inside an inverted index.
Fusion
- Reciprocal Rank Fusion (RRF) (Cormack, Clarke, Buettcher 2009 SIGIR, “Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods”) — score(d) = sum_i 1 / (k + rank_i(d)), typically k=60. Simple, robust, works without score calibration. The default choice.
- Convex combination — α · dense_score + (1-α) · sparse_score; requires normalizing scores to comparable ranges.
- Learned fusion — train a small model (LambdaMART, neural reranker) to combine.
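The RRF formula from the list above, score(d) = sum_i 1 / (k + rank_i(d)), can be written as a short function. This sketch assumes each input is a list of document IDs ordered best-first with 1-based ranks, and it breaks ties by document ID purely to keep the output deterministic; the function name rrf is illustrative.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc ID as a deterministic tie-breaker.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For instance, rrf([dense_ids, bm25_ids]) rewards documents that appear in both lists even when neither retriever ranked them first, and it never needs the raw scores to be on a comparable scale.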
Native hybrid implementations
Vespa, Weaviate, OpenSearch, Elastic, Qdrant, and Pinecone all expose hybrid search as a single query. Pgvector + Postgres FTS or pg_search (BM25) is a popular DIY hybrid.
9. Re-ranking
First-stage retrieval optimizes for recall (don’t miss anything relevant); re-ranking optimizes for precision (rank the top results correctly). A typical setup retrieves top-100 fast and re-ranks down to top-5 slow.
Cross-encoders
A cross-encoder (Reimers + Gurevych 2019 EMNLP) encodes the (query, document) pair jointly through a transformer and outputs a single relevance score. This is much more accurate than the bi-encoder used for embedding, because the model can attend across the query-document boundary, but nothing can be precomputed: you must run a full forward pass for every (query, candidate) pair at query time, so cost grows linearly with the number of candidates. Hence the two-stage pipeline.
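The two-stage shape can be sketched with pluggable scoring functions. The toy scorers here (word overlap as the cheap first stage, Jaccard similarity standing in for a cross-encoder) are assumptions chosen only so the example runs without a model; in practice the first stage is an ANN lookup and the second a reranker model call.

```python
def word_overlap(query, doc):
    """Cheap stand-in for first-stage retrieval: count of shared words."""
    return len(set(query.split()) & set(doc.split()))


def jaccard(query, doc):
    """Stand-in for an expensive pairwise scorer such as a cross-encoder."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_then_rerank(query, corpus, cheap_score, expensive_score,
                         n_candidates=100, top_k=5):
    """Stage 1 scores everything cheaply; stage 2 runs the expensive
    scorer only on the shortlisted candidates."""
    shortlist = sorted(corpus, key=lambda d: cheap_score(query, d),
                       reverse=True)[:n_candidates]
    reranked = sorted(shortlist, key=lambda d: expensive_score(query, d),
                      reverse=True)
    return reranked[:top_k]
```

The expensive scorer is invoked exactly once per shortlisted candidate, so n_candidates is the knob that bounds reranking latency.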
Common open models:
- cross-encoder/ms-marco-MiniLM-L-6-v2 — small, fast, strong baseline.
- bge-reranker-v2-m3 (BAAI) — multilingual, top open reranker as of 2025.
- bge-reranker-large — the larger sibling.
- mxbai-rerank-large-v1.
- jina-reranker-v2.
Managed reranker APIs
- Cohere Rerank (rerank-v3.5) — popular API, top of leaderboards in many evals.
- Voyage rerank-2.
- Mixedbread mxbai-rerank-large-v1 also offered via API.
LLM-as-judge re-ranking
- RankGPT (Sun et al. 2023, “Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents”) — prompt an LLM to rank a list of passages directly.
- RankLLM — open-source framework around the same idea, supports listwise / pairwise / pointwise.
- Strong quality, expensive latency-wise; viable when the candidate set is small and the answer is high-value.
Late-interaction as re-ranker
ColBERT / ColBERTv2 as a second-stage re-ranker is a sweet spot for many systems: much faster than cross-encoders because document token vectors are precomputed, and more accurate than single-vector bi-encoders.
10. Multi-modal embeddings
Modern embedders increasingly span modalities — text, image, audio, video — projected into a shared vector space so cross-modal retrieval works (image → captions, text query → matching photos).
- CLIP (Contrastive Language-Image Pretraining) (Radford et al. 2021, OpenAI, “Learning Transferable Visual Models From Natural Language Supervision”) — the watershed paper. Trained on 400M (image, caption) pairs with a contrastive loss; one image encoder + one text encoder share a joint embedding space.
- OpenCLIP (LAION 2022) — open re-implementation trained on LAION-5B; ViT-L, ViT-H, ViT-G variants.
- SigLIP (Zhai et al. 2023 Google, “Sigmoid Loss for Language Image Pre-Training”) — replaces softmax contrastive loss with sigmoid; trains better at smaller batch sizes; SigLIP 2 (2025) and SigLIP So400m are common production picks.
- EVA-CLIP (Sun et al. 2023 BAAI) — masked image modeling pretraining on top of CLIP; high MTEB-vision scores.
- ImageBind (Girdhar et al. 2023 Meta) — binds six modalities (image, text, audio, depth, thermal, IMU) to a shared space via image as anchor.
- Voyage-multimodal-3 — production image+text embedder.
- Cohere Embed v3 Image — same.
- GCP Vertex multimodal embeddings — image+text+video.
- JinaCLIP, Marqo FashionCLIP, PubMedCLIP — domain-tuned variants.
- ColPali (Faysse et al. 2024) — ColBERT-style late-interaction for document images / PDFs; query embedder is text, document embedder is a vision-language model run on rendered pages. Skips OCR entirely and outperforms classical OCR+chunk pipelines.
11. Advanced patterns
The naive pipeline plateaus around 60-70% retrieval quality on hard benchmarks. The 2023-26 research wave produced a set of patterns that close most of the remaining gap.
- HyDE (Hypothetical Document Embeddings) (Gao, Ma, Lin, Callan 2022, “Precise Zero-Shot Dense Retrieval without Relevance Labels”) — ask the LLM to generate a hypothetical answer to the query, embed that, retrieve against the corpus. Works because the hypothetical answer lives in the same “answer space” as real documents while the raw query often doesn’t.
- Query expansion / decomposition — rewrite the query with synonyms, or break a multi-part question into sub-queries retrieved independently and merged.
- Multi-hop reasoning — agentic loop: retrieve, reason, identify what’s still missing, re-retrieve. Tools like LangGraph and DSPy formalize this.
- Self-RAG (Asai et al. 2023, “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection”) — train the LLM to emit special tokens that decide when to retrieve and whether the retrieved content was useful.
- Corrective RAG (CRAG) (Yan et al. 2024) — a lightweight retrieval evaluator labels each retrieved chunk as correct / incorrect / ambiguous; the system rewrites queries and falls back to web search when retrieval fails.
- Adaptive-RAG (Jeong et al. 2024) — classify query complexity and route easy queries to direct LLM, medium to single-step RAG, hard to multi-hop.
- GraphRAG (Edge et al. 2024 Microsoft, “From Local to Global: A Graph RAG Approach to Query-Focused Summarization”) — at ingest, extract entities and relations into a knowledge graph; cluster the graph hierarchically; at query time, retrieve the relevant subgraph + community summaries instead of raw chunks. Dramatically improves on whole-corpus “summarize this entire dataset” questions that flat RAG cannot answer.
- Contextual Retrieval (Anthropic 2024 blog post) — before embedding each chunk, prepend a 50-100 token summary describing how the chunk fits in the whole document. Mitigates the context-loss problem of small chunks; in their evaluation, contextual embeddings combined with contextual BM25 cut the retrieval failure rate by 49% (67% when reranking was added).
- Cache-Augmented Generation (CAG) — for very stable corpora that fit in a long-context window, precompute the model’s KV cache for the corpus and reuse it on every query. Skips retrieval entirely, lowers latency, increases compute per query.
- Recursive / hierarchical retrieval (RAPTOR, Sarthi et al. 2024) — cluster chunks, summarize each cluster, recurse; index summaries alongside chunks so retrieval can land at the right level of abstraction.
- Small-to-Big — embed small sentences but retrieve the surrounding paragraph for context.
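The Small-to-Big entry that closes the list above can be sketched as follows: sentences are the indexed unit, but the parent paragraph is what gets returned. The sentence splitter is deliberately naive, the word-overlap scorer stands in for embedding similarity, and every name in the block is hypothetical.

```python
def word_overlap(query, text):
    """Toy relevance score standing in for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def build_small_to_big(paragraphs):
    """Index each sentence separately, remembering its parent paragraph."""
    index = []
    for pid, para in enumerate(paragraphs):
        for sentence in para.split(". "):  # naive splitter, for illustration
            sentence = sentence.strip().rstrip(".")
            if sentence:
                index.append((sentence, pid))
    return index


def retrieve_parents(query, index, paragraphs, score=word_overlap, top_k=2):
    """Rank sentences, then return their parent paragraphs without repeats."""
    ranked = sorted(index, key=lambda item: score(query, item[0]), reverse=True)
    seen, out = set(), []
    for _, pid in ranked:
        if pid not in seen:
            seen.add(pid)
            out.append(paragraphs[pid])
        if len(out) == top_k:
            break
    return out
```

Matching happens on the precise small unit while the prompt receives the surrounding context, which is the trade the pattern is designed to make.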
12. Frameworks
- LangChain (Harrison Chase, 2022) — the most popular orchestration library; supports every embedder, every vector DB, every LLM. Often criticized for over-abstraction and rapid API churn; still the path of least resistance for greenfield projects, especially with LangGraph (2024) for stateful agentic workflows.
- LlamaIndex (Jerry Liu, 2022) — narrower focus on RAG; excellent for ingest pipelines (LlamaParse, LlamaHub connectors), index abstractions (vector, summary, tree, knowledge graph), and managed eval.
- Haystack (deepset, since 2020) — production-leaning pipelines, good observability, Elasticsearch heritage.
- Semantic Kernel (Microsoft) — C# / Python orchestration for Azure-native stacks.
- DSPy (Khattab et al. 2023, “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines”) — declarative programming model where prompts are compiled from few-shot examples and a metric; less about RAG specifically, more about replacing prompt engineering with optimization.
- txtai (NeuML) — single-file all-in-one embeddings + RAG library.
- EmbedChain — light wrapper for simple chatbots over docs.
- R2R (SciPhi) — opinionated production RAG framework with observability built in.
13. Evaluation
A RAG system has at least three failure modes: retrieval misses the right document, retrieval finds it but the LLM ignores it, the LLM uses it but answers wrong anyway. Each needs its own metric.
Retrieval metrics (need labeled query-doc pairs)
- Precision@k — fraction of top-k that are relevant.
- Recall@k — fraction of all relevant docs that appear in top-k.
- MRR (Mean Reciprocal Rank) — 1 / rank of the first relevant doc, averaged.
- NDCG (Normalized Discounted Cumulative Gain) — graded relevance, position-discounted.
- Hit@k — did any relevant doc appear in top-k.
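The retrieval metrics listed above are small enough to implement directly. This sketch assumes `ranked` is a list of unique document IDs ordered best-first, `relevant` is a set of relevant IDs, and `gains` maps IDs to graded relevance; the function names are illustrative, and NDCG uses the common log2 discount.

```python
import math


def precision_at_k(ranked, relevant, k):
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def hit_at_k(ranked, relevant, k):
    return any(d in relevant for d in ranked[:k])


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc; average over queries to get MRR."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gains, k):
    """`gains` maps doc ID to graded relevance; missing IDs count as 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

Each function scores a single query; averaging over a labeled query set gives the headline numbers.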
End-to-end RAG metrics
- Faithfulness / groundedness — does the answer follow from the retrieved context? (Detects hallucination.)
- Context relevance — were the retrieved chunks actually about the query?
- Answer relevance — does the answer address the question?
- Answer correctness — does the answer match the ground truth (when one exists)?
Frameworks
- Ragas (Es et al. 2023, “RAGAS: Automated Evaluation of Retrieval Augmented Generation”) — the most cited eval framework; LLM-judges faithfulness, context relevance, answer relevance without needing ground truth labels (reference-free) and with them (reference-based).
- TruLens — instrument and evaluate; ships “feedback functions” for groundedness, etc.
- DeepEval — pytest-like; large set of metrics including hallucination, bias, toxicity.
- promptfoo — CLI / YAML eval harness; good for regression testing prompt changes.
- LangSmith (LangChain) — managed tracing + eval.
- Phoenix (Arize) — OSS observability with built-in RAG evals.
- OpenAI Evals — framework + a community of contributed evals.
Synthetic test set generation
- Ragas Synthetic Test Set Generator — uses an LLM to generate (question, ground-truth context, ground-truth answer) triples from your corpus; varies question types (simple, multi-context, reasoning, conditional).
- Giskard — similar capability with additional security/bias probing.
Hallucination detection
- HHEM (Vectara Hughes Hallucination Evaluation Model) — open model that scores whether one text is supported by another.
- Factool (Chern et al. 2023) — extract claims, look each up, score.
- SelfCheckGPT (Manakul 2023) — sample multiple generations, check consistency.
- LLM-as-judge with rubric — most common pragmatic choice; calibrate against humans on a sample.
Reference-based judges (compare to ground truth) are more reliable than reference-free; build a small gold set (50-200 examples) by hand, then let LLM judges score scaled runs against it.
14. Production pitfalls
The list every team rediscovers the hard way:
- Chunk boundaries cutting code, tables, or list items in half. Use structure-aware chunkers; never trust raw character splits on technical documents.
- Embedding model version drift. When you upgrade text-embedding-3-small → some-new-model, your existing index is now in the wrong vector space and silently returns garbage. Pin the model version per index, store it in metadata, and run dual-write + dual-read during migrations.
- Query-document embedder mismatch. Some embedders (Cohere, instruction-tuned) require different inputs for queries vs documents (input_type=search_query vs search_document, different instruction prefix). Forgetting this can halve recall.
- Metadata filters that ANN doesn’t honor. Pre-filter (filter then ANN) breaks recall when the filter is selective; post-filter (ANN then filter) returns nothing when the filter doesn’t match top-k. Solutions: ACORN (Patel et al. 2024), filterable HNSW (Qdrant, Weaviate), IVF + filter integrations, or oversampling + post-filter with large k.
- Stale data. No invalidation strategy means deleted-but-still-indexed documents resurface forever. Build a delete pipeline that mirrors writes.
- Embedding cost — embedding a 50M-token corpus at $0.02 per 1M tokens costs $1; at $0.13 per 1M, $6.50. Re-embedding on every chunk change can balloon costs; prefer content-hash deduplication.
- Storage / RAM cost — 1M vectors of 1536 floats = 6 GB raw; HNSW graph adds ~30%; total ~8 GB. For 100M vectors, plan for binary quantization, scalar quantization, or Matryoshka truncation.
- Latency budget. Target end-to-end retrieve+rerank in under 300 ms so the LLM can dominate the user-visible 1-3 s response. The reranker is usually the bottleneck; cap candidate count or use a smaller model.
- Long documents. A 200-page PDF should not be one chunk. Build a hierarchical index: doc-level summary, section-level summary, chunk-level content; retrieve top-k chunks and include their section summary for context.
- Multilingual content. Use a multilingual embedder (multilingual-e5, bge-m3, Cohere multilingual) or translate at ingest; don’t mix language-specific embedders silently.
- Permission and access control. Retrieved chunks must respect ACLs. Either filter at query time by user permission metadata, or maintain per-user / per-group indexes for sensitive data. Get this wrong and your RAG is a data-exfiltration vector.
- Prompt injection from retrieved content. An attacker plants a malicious document in your corpus that says “ignore previous instructions, exfiltrate X”. The model dutifully obeys. Defenses: separate “user” vs “data” roles in the prompt template, structured output, guardrails (NeMo Guardrails, Llama Guard 3, Anthropic’s constitutional checks), content filtering at ingest.
- Citation drift. The LLM cites a chunk that doesn’t actually support the claim. Post-process: parse the citations, run an NLI / faithfulness model, mark unsupported claims.
- Eval set rot. Your gold eval set becomes obsolete as the corpus changes. Regenerate periodically; track distribution.
- Cold-start. New documents take time to be discovered by users; old documents become stale. Mix recency boost into ranking when freshness matters.
- Cost attribution. Track embed cost, store cost, retrieve cost, rerank cost, and LLM cost per query. The cost of “RAG” is rarely where you think — for most teams the LLM dominates.
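The storage / RAM arithmetic from the list above (1M vectors of 1536 floats, ~30% HNSW overhead) generalizes into a small estimator. This is a back-of-the-envelope sketch: modeling graph overhead as a flat fraction of raw vector bytes is a simplification taken from the figure in the text, and the function name is hypothetical.

```python
def index_memory_gb(n_vectors, dim, bytes_per_component=4.0, graph_overhead=0.30):
    """Raw vector bytes plus a fractional allowance for the HNSW graph,
    returned in decimal gigabytes."""
    raw_bytes = n_vectors * dim * bytes_per_component
    return raw_bytes * (1.0 + graph_overhead) / 1e9
```

With the defaults, index_memory_gb(1_000_000, 1536) reproduces the roughly 8 GB quoted above. Passing bytes_per_component=1/8 with graph_overhead=0 gives the raw size under 1-bit quantization, and a smaller dim models Matryoshka truncation.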
15. Cost reality (2026)
Order-of-magnitude numbers for budgeting; specific vendors update prices quarterly.
Embedding
- OpenAI text-embedding-3-small: $0.02 per 1M tokens.
- OpenAI text-embedding-3-large: $0.13 per 1M.
- Cohere embed-v3: ~$0.10 per 1M.
- Voyage embed: $0.06-0.18 per 1M depending on model.
- Self-host BGE on a single GPU: marginal cost is GPU-hours; ~$0.005-0.01 per 1M tokens at moderate utilization.
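To turn the per-token prices above into a one-time ingest budget, multiply by the corpus token count. The sketch below assumes a hypothetical corpus of 10M chunks at 400 tokens each (neither number comes from this reference) and uses the midpoint of the self-hosted range; treat the output as order-of-magnitude only.

```python
# Prices in USD per 1M tokens, taken from the list above.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "cohere-embed-v3": 0.10,
    "self-hosted-bge": 0.0075,  # midpoint of the ~$0.005-0.01 range
}


def embed_cost_usd(n_chunks: int, tokens_per_chunk: int, price_per_m: float) -> float:
    """One-time cost to embed a corpus at a flat per-token price."""
    total_tokens = n_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_m


# Hypothetical corpus: 10M chunks x 400 tokens = 4B tokens.
N_CHUNKS, TOKENS_PER_CHUNK = 10_000_000, 400

corpus_costs = {
    model: embed_cost_usd(N_CHUNKS, TOKENS_PER_CHUNK, price)
    for model, price in PRICE_PER_M_TOKENS.items()
}

for model, usd in corpus_costs.items():
    print(f"{model}: ${usd:,.0f}")
```

Under these assumptions the full corpus costs tens to hundreds of dollars to embed, and re-embedding after a model change costs the same again.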
Vector DB storage and queries
- Pinecone serverless: ~$0.04 per 1M read units + ~$2 per 1M write units (2025 pricing).
- pgvector on a $100/month VM: handles ~10M vectors @ 1024 dim with HNSW comfortably.
- Turbopuffer / object-store backed: ~$0.10/GB-month at billion scale.
- Self-hosted Qdrant / Weaviate: dominated by VM and EBS costs.
LLM inference (dominant at scale)
- Frontier models (Claude Opus, GPT-class, Gemini Ultra equivalents): $15-75 per 1M output tokens.
- Mid-tier (Sonnet, GPT-4o-mini-class): $2-15 per 1M output tokens.
- Open-weight self-hosted (Llama 3.x 70B, Qwen 2.5 72B on H100/H200): ~$0.20-1 per 1M depending on utilization; throughput-bound.
A typical RAG query in 2026 has: ~200 input tokens (query) + 2000 input tokens (retrieved context) + 400 output tokens. At Sonnet-class pricing that’s on the order of $0.01 per query in LLM cost; the embedding + retrieval cost is a rounding error. Optimization priority: get the LLM bill down before the vector DB bill.
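A back-of-envelope version of this per-query arithmetic is sketched below. The token counts come from the paragraph above; the LLM prices ($3 per 1M input tokens, $15 per 1M output tokens) are assumed placeholder values for a mid-tier model, since this section lists only output prices, and the query is assumed to be embedded with text-embedding-3-small.

```python
# Token counts from the text.
QUERY_TOKENS = 200
CONTEXT_TOKENS = 2000
OUTPUT_TOKENS = 400

# Assumed prices in USD per 1M tokens (placeholders, not vendor quotes).
LLM_INPUT_PRICE = 3.00
LLM_OUTPUT_PRICE = 15.00
EMBED_PRICE = 0.02  # text-embedding-3-small, from the embedding list


def per_million(tokens: int, price_per_m: float) -> float:
    """Cost in USD of `tokens` tokens at a price quoted per 1M tokens."""
    return tokens / 1_000_000 * price_per_m


llm_cost = (
    per_million(QUERY_TOKENS + CONTEXT_TOKENS, LLM_INPUT_PRICE)
    + per_million(OUTPUT_TOKENS, LLM_OUTPUT_PRICE)
)
embed_cost = per_million(QUERY_TOKENS, EMBED_PRICE)  # only the query is embedded

print(f"LLM cost per query:       ${llm_cost:.4f}")
print(f"Embedding cost per query: ${embed_cost:.6f}")
print(f"Embedding share of total: {embed_cost / (llm_cost + embed_cost):.4%}")
```

With these assumed prices the LLM accounts for well over 99% of the per-query cost, which is the reason to optimize the LLM bill first. The exact figure moves with the input price and the amount of retrieved context.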
16. Cross-references
- transformer-architecture — the model family that consumes retrieved context; attention is what makes retrieved chunks usable.
- _index — Compute domain index.
- linear-algebra-essentials — cosine, dot product, vector norms, Euclidean and Manhattan distances; the mathematical primitives behind every similarity score.
- probability-fundamentals — softmax, KL divergence, contrastive loss; the training-time math for embedders.
- genai-llm-runtime — vLLM, TGI, llama.cpp; serving the LLM half of the pipeline.
- semiconductor-materials — GPU and HBM hardware that determines embedder throughput and ANN-on-GPU economics.
17. Citations
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
- Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
- Reimers, N., Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
- Malkov, Y. A., Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE TPAMI.
- Jégou, H., Douze, M., Schmid, C. (2011). Product Quantization for Nearest Neighbor Search. IEEE TPAMI.
- Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML (CLIP).
- Khattab, O., Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR.
- Santhanam, K. et al. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL.
- Guo, R. et al. (2020). Accelerating Large-Scale Inference with Anisotropic Vector Quantization. ICML (ScaNN).
- Subramanya, S. J. et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. NeurIPS.
- Johnson, J., Douze, M., Jégou, H. (2017). Billion-scale similarity search with GPUs. (FAISS).
- Wang, L. et al. (2022). Text Embeddings by Weakly-Supervised Contrastive Pre-training. (E5).
- Chen, J. et al. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings.
- Muennighoff, N. et al. (2022). MTEB: Massive Text Embedding Benchmark. EACL.
- Zhai, X. et al. (2023). Sigmoid Loss for Language Image Pre-Training. (SigLIP).
- Girdhar, R. et al. (2023). ImageBind: One Embedding Space To Bind Them All. CVPR.
- Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models.
- Gao, L., Ma, X., Lin, J., Callan, J. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels. (HyDE).
- Asai, A. et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.
- Yan, S.-Q. et al. (2024). Corrective Retrieval Augmented Generation. (CRAG).
- Jeong, S. et al. (2024). Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity.
- Edge, D. et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft.
- Sarthi, P. et al. (2024). RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. ICLR.
- Anthropic (2024). Introducing Contextual Retrieval. (blog).
- Formal, T. et al. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.
- Cormack, G. V., Clarke, C. L. A., Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods. SIGIR.
- Bruch, S. et al. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM TOIS.
- Khattab, O. et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines.
- Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation.
- Sun, W. et al. (2023). Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. (RankGPT).
- Manakul, P. et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models.
- Chern, I. et al. (2023). FacTool: Factuality Detection in Generative AI.
- Patel, L. et al. (2024). ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data.
- pgvector documentation — https://github.com/pgvector/pgvector.
- FAISS documentation and wiki — https://github.com/facebookresearch/faiss/wiki.
- MTEB leaderboard — https://huggingface.co/spaces/mteb/leaderboard.
- Robertson, S., Walker, S. (1994). Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. SIGIR (BM25).