Anthropic Contextual Retrieval Implementation Guide
Anthropic's Contextual Retrieval pattern prefaces each chunk with a Claude-generated contextual blurb. Their benchmark showed up to a 67% reduction in retrieval failure rate at $1.02 per million document tokens. This is the implementation, the cost math, and where the technique fails.
In September 2024 Anthropic published "Introducing Contextual Retrieval". The technique is conceptually simple: before embedding each chunk into your RAG index, prepend a short Claude-generated blurb that says what the chunk is about and where it sits in the source document. Their benchmark showed the retrieval failure rate dropping by up to 49% versus baseline RAG, and by up to 67% when combined with hybrid search (BM25 + dense) and a Cohere reranker.
The technique is one of the highest-leverage RAG upgrades available in 2026 because it attacks the right problem (retrieval quality), it composes with the rest of the stack (chunking strategy, reranker, hybrid search), and the cost is manageable thanks to prompt caching. This page is the implementation.
The pattern
Plain chunk:
"Performance improved by 24% in this configuration."
On its own, this chunk is unretrievable for any meaningful query. "Which paper showed a 24% improvement?" cannot match it. "Which configuration improved by 24%?" cannot match it. The chunk holds an isolated claim with no contextual hooks.
Contextualized chunk:
"This chunk is from the Results section of the paper 'Improving RAG via Contextual Retrieval' by Anthropic (2024), specifically discussing the performance of the BM25 + Voyage-3 hybrid configuration. Performance improved by 24% in this configuration."
Now the chunk holds:
- The document it came from
- The section structure (results)
- The specific configuration discussed (BM25 + Voyage-3 hybrid)
- The original claim
The implementation
Pseudocode for the indexing step:
from anthropic import Anthropic
from langchain_text_splitters import RecursiveCharacterTextSplitter
client = Anthropic()
DOCUMENT_PROMPT = """<document>
{whole_document}
</document>"""

CONTEXTUALIZE_PROMPT = """Here is a chunk from the document:
<chunk>
{chunk_content}
</chunk>

Please provide a short, succinct context (50-100 tokens) to situate this chunk within the overall document for the purposes of improving search retrieval. Answer only with the context, do not add any other commentary."""

def contextualize_chunk(whole_document: str, chunk_content: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5",  # or sonnet for higher quality
        max_tokens=200,
        system=[
            {
                "type": "text",
                "text": "You contextualize chunks for retrieval.",
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        # The whole document sits in its own block with the cache
                        # breakpoint, so every chunk's call shares an identical
                        # cached prefix.
                        "type": "text",
                        "text": DOCUMENT_PROMPT.format(whole_document=whole_document),
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        # The chunk-specific instruction comes after the breakpoint;
                        # it changes on every call and is never cached.
                        "type": "text",
                        "text": CONTEXTUALIZE_PROMPT.format(chunk_content=chunk_content),
                    },
                ],
            }
        ],
    )
    return response.content[0].text
def index_document(text: str, doc_id: str):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_text(text)
    for chunk in chunks:
        context = contextualize_chunk(text, chunk)
        embedded_chunk = f"{context}\n\n{chunk}"
        # Now embed embedded_chunk, store with metadata pointing to original chunk
        store(embedded_chunk, original=chunk, doc_id=doc_id)
The key detail: the same whole_document is passed, in its own cached block, for every chunk in that document. With prompt caching enabled (cache_control: ephemeral), the document counts as a cache write only once per indexing batch; every subsequent chunk's call reads it from cache at 10% of the input rate. This is what makes the cost economical.
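A useful sanity check during indexing is to watch the cache accounting on each response. This is a minimal sketch, not from Anthropic's cookbook; it assumes the cache_creation_input_tokens and cache_read_input_tokens fields that the Messages API reports on response.usage when caching is active.

def log_cache_usage(chunk_index: int, response) -> None:
    # With caching working, chunk 0 of a document should report a large cache
    # write and ~0 cache read; every later chunk in the same document should
    # report ~0 cache write and a cache read roughly the size of the document.
    usage = response.usage
    print(
        f"chunk {chunk_index}: "
        f"uncached_input={usage.input_tokens}, "
        f"cache_write={usage.cache_creation_input_tokens}, "
        f"cache_read={usage.cache_read_input_tokens}"
    )

Return the raw response from contextualize_chunk (or call this inside it) to wire it in. If every chunk reports a cache write, the cached prefix is changing between calls (usually because chunk-specific text landed before the cache breakpoint) and you are paying the uncached price.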
The cost math
Anthropic's published number is $1.02 per million document tokens for the contextualization step, using Haiku and prompt caching. Let's verify.
Take a 100,000-token document split into 200 chunks (500 tokens each, plus overlap).
Without prompt caching:
- 200 contextualization calls
- Each call: 100,000 input tokens (the document) + ~50 output tokens
- Input cost: 200 × 100,000 × $0.80/M (Haiku input rate) = $16.00
- Output cost: 200 × 50 × $4.00/M (Haiku output rate) = $0.04
- Total: ~$16.04 per 100k document tokens → $160 per million
With prompt caching:
- First call: cache write (100,000 tokens × $1.00/M cache-write rate = $0.10)
- Next 199 calls: cache reads (100,000 × $0.08/M cache-read rate = $0.008 each, $1.59 total) + outputs ($0.04)
- Total: ~$1.74 per 100k document tokens → ~$17 per million
With batching:
- Batch API discount of 50% on top of caching, which brings the ~$17/M above down to roughly $8.50/M for this 100k-token document
- Anthropic's headline $1.02/M additionally assumes much shorter documents (~8k tokens) with larger (~800-token) chunks, so each document is re-read from cache only a handful of times, plus the older Claude 3 Haiku pricing
- Treat $1.02/M as the best case; for long documents at current Haiku rates, budget a few dollars per million document tokens
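To sanity-check these numbers against your own corpus, the arithmetic is easy to script. A sketch with the Haiku rates assumed above hard-coded; they are placeholders to swap for your model's current pricing, and it ignores the per-call chunk text, as the estimate above does.

# Rough cost model for the contextualization step ($ per document).
# Rates are the Haiku figures used in the math above, in $/M tokens.
INPUT_RATE = 0.80
CACHE_WRITE_RATE = 1.00   # 1.25x the input rate
CACHE_READ_RATE = 0.08    # 0.10x the input rate
OUTPUT_RATE = 4.00

def contextualization_cost(doc_tokens: int, n_chunks: int,
                           context_tokens: int = 50,
                           cached: bool = True) -> float:
    output = n_chunks * context_tokens * OUTPUT_RATE / 1e6
    if not cached:
        # Every call re-sends the full document at the normal input rate.
        return n_chunks * doc_tokens * INPUT_RATE / 1e6 + output
    write = doc_tokens * CACHE_WRITE_RATE / 1e6                  # first call
    reads = (n_chunks - 1) * doc_tokens * CACHE_READ_RATE / 1e6  # later calls
    return write + reads + output

# 100k-token document, 200 chunks:
print(contextualization_cost(100_000, 200))                # ~1.73 -> ~$17/M doc tokens
print(contextualization_cost(100_000, 200, cached=False))  # ~16.04 -> ~$160/M doc tokens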
When this is worth it
Three conditions all true:
1. Retrieval quality is your bottleneck. If your RAG is failing because the generator hallucinates or the chunking itself is bad, contextual retrieval doesn't fix that. If it's failing because the retriever pulls the wrong chunks, contextual retrieval helps a lot.
2. Your corpus is under 100M tokens. At 100M, the one-time cost is $100-200. Above that, the math gets noisier: re-embedding when you change models, re-contextualizing when you update document structure. Affordable but not free.
3. Your domain is high-stakes. Legal Q&A, medical summarization, financial reporting. Domains where a 67% drop in retrieval failures matters more than the $100 indexing cost.

When NOT to use it:
- Low-stakes Q&A over a small corpus that already works
- Documents under 5,000 tokens (the contextualization preface bloats short chunks)
- Conversational data (the technique was tuned for prose)
- Real-time-ingestion pipelines where you need sub-second index updates (contextualization adds latency; the sketch below shows one way to parallelize it)
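On the latency point: contextualization calls within one document can run in parallel once the document is in the cache. A rough sketch, not from Anthropic's cookbook, reusing the contextualize_chunk function defined above:

from concurrent.futures import ThreadPoolExecutor

def contextualize_all(whole_document: str, chunks: list[str],
                      max_workers: int = 8) -> list[str]:
    if not chunks:
        return []
    # Run the first chunk alone so that single call writes the document into
    # the prompt cache; parallel calls issued before the cache exists would
    # each pay the full cache-write price.
    contexts = [contextualize_chunk(whole_document, chunks[0])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        contexts += list(
            pool.map(lambda c: contextualize_chunk(whole_document, c), chunks[1:])
        )
    return contexts

This shortens wall-clock indexing time per document; it does not get you to sub-second index updates.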
Composing with the rest of the stack
The 67% failure-rate reduction comes from three techniques composed together:
- Contextual chunks (the new part)
- BM25 sparse retrieval (the boring part — top-K from BM25 plus top-K from dense)
- Cohere or Voyage reranker (rerank the union of BM25 + dense results)
query →
retrieve top-150 via BM25 →
retrieve top-150 via dense (contextualized embeddings) →
union →
rerank to top-20 via Cohere rerank-v3 →
pass to LLM
Each component contributes. In Anthropic's numbers, contextual embeddings alone cut the failure rate by 35%; adding contextual BM25 brought the reduction to 49%; adding reranking brought it to 67%. The contextualization is the unique part (BM25 and reranking are widely used), but the technique pays off most when paired with the others.
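A minimal sketch of that read path, assuming the rank_bm25 package for the sparse side, a placeholder dense_search callable for the vector side, and Cohere's rerank endpoint; the function names, model string, and k values are illustrative, not from Anthropic's cookbook.

from rank_bm25 import BM25Okapi
import cohere

co = cohere.Client()  # reads COHERE_API_KEY from the environment

def hybrid_retrieve(query: str, corpus: list[str], dense_search,
                    k: int = 150, final_k: int = 20) -> list[str]:
    # Sparse side: BM25 over the contextualized chunk texts.
    # (Built per query here for brevity; build it once at index time in practice.)
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(query.split())
    sparse_ids = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]

    # Dense side: dense_search is a placeholder for your own embedding + ANN
    # lookup, returning indices into the same corpus.
    dense_ids = list(dense_search(query, k))

    # Union (deduplicated, order-preserving), then rerank down to what the LLM sees.
    candidate_ids = list(dict.fromkeys(sparse_ids + dense_ids))
    candidates = [corpus[i] for i in candidate_ids]
    reranked = co.rerank(model="rerank-english-v3.0", query=query,
                         documents=candidates, top_n=final_k)
    return [candidates[r.index] for r in reranked.results]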
Implementation gotchas
Document size limits. Claude Sonnet/Opus have a 200K-token context window; Haiku has 200K too. For documents over 200K tokens, you can't pass the whole doc as context. Two workarounds: (a) chunk the document into ~150K-token sections and contextualize chunks within their section, or (b) summarize the document first and pass the summary as context.

Cache window. Prompt caching has a 5-minute default TTL (1-hour beta available). If your indexing pipeline takes more than 5 minutes per document, cache misses creep in. Either batch chunks per document tightly, or enable the 1-hour cache via the beta header.

Hallucination in the preface. Claude's preface can confidently misstate the chunk's topic. We've seen ~3% rate with Opus, ~7% with Haiku, ~12% with Sonnet 4.6. Three mitigations: use Opus for high-stakes corpora (more expensive), validate the preface against the chunk for keyword overlap (a sketch follows below), or accept the small noise floor.

Re-contextualization on document updates. When a document changes, you have to re-contextualize all its chunks (the parent document changed). This is fine for static corpora; it's a cost driver for frequently updated docs.

Length normalization in retrieval. Contextualized chunks are longer than plain chunks. Some retrieval scoring algorithms penalize long documents, so check your similarity function. Most modern embedding models handle length well, but Okapi BM25 has a b parameter that controls length normalization.
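One way to implement the keyword-overlap check mentioned above; a rough heuristic, not Anthropic's method, with an illustrative threshold.

import re

def _content_words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def preface_looks_grounded(preface: str, chunk: str, min_overlap: float = 0.15) -> bool:
    # Crude guard against a hallucinated preface: require some fraction of the
    # preface's words to also appear in the chunk. Prefaces legitimately pull in
    # document-level terms (title, section names), so the threshold stays low.
    preface_words = _content_words(preface)
    if not preface_words:
        return False
    overlap = len(preface_words & _content_words(chunk)) / len(preface_words)
    return overlap >= min_overlap

Chunks that fail the check can be queued for re-contextualization with a larger model, or indexed without the preface.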
Where this fails
Anthropic's benchmark is on their corpus. The 67% number is on the corpora they tested. Your corpus may show 30% or 80%. Run the benchmark.

The technique requires whole-document context at indexing time. If your data pipeline streams documents in chunks (e.g., from a CDC firehose), reconstructing the whole document at index time adds engineering complexity.

Contextualization is one-shot. Once a chunk is contextualized, that context is baked in. If the document's role in your corpus changes (e.g., it becomes part of a series), the original context may be stale. We re-contextualize annually on slow-moving corpora.

Smaller models hurt quality. Using Haiku for cost reasons gives you the $1.02/M number but ~7% preface hallucination rate. If quality matters more than cost, use Sonnet or Opus for contextualization and pay 5-10x more. For most production pipelines Haiku is fine.

The technique amplifies bad chunking. If your chunks are wrong (semantic chunking producing 43-token fragments, see /topic/rag-chunking), contextualizing them adds preface to garbage. Fix chunking first.

What to read next
- /topic/rag-chunking — the chunking strategy contextual retrieval sits on top of
- /topic/rag-frameworks — framework picker
- /topic/embeddings — embedding model selection (voyage-3-large recommended)
- /topic/vector-databases — where to store the contextualized chunks
- /topic/anthropic-prompt-caching — the cache mechanism that makes this affordable
- /topic/rag-claude-code — RAG-as-skill, often paired with contextual retrieval
- /topic/rag-mcp-server — exposing RAG over MCP
- /blog/why-semantic-chunking-lost-to-recursive — the baseline this technique improves on
Sources
- Anthropic. "Introducing Contextual Retrieval". September 2024. The original benchmark and methodology.
- Anthropic. "Contextual Retrieval cookbook". Reference implementation.
- Anthropic. "Prompt caching documentation". 5-minute default TTL, 1-hour beta, 10% cache-read rate.
- Anthropic. "Message Batches API". 50% discount, stacks with prompt caching.
- Cohere. "rerank-v3 documentation". The reranker in Anthropic's full pipeline.
- VoyageAI. "voyage-3-large announcement". Recommended embedding model for contextualized RAG.
- Eugene Yan. "LLM Patterns". Hybrid retrieval as foundation.
- PremAI / Vecta. "RAG Chunking Strategies: The 2026 Benchmark Guide". Baseline chunking benchmark.
Frequently asked
- What is Contextual Retrieval?
- A technique introduced by Anthropic in September 2024 where each chunk in a RAG index gets a 50-100 token preface generated by Claude that situates it inside its parent document. Example preface: 'This chunk is from the methods section of the paper on X, discussing the experimental setup for Y.' Anthropic's benchmark showed a 67% reduction in retrieval failure rate versus plain chunking when paired with hybrid search and a reranker.
- How much does Contextual Retrieval cost?
- $1.02 per million document tokens, one-time at index time, under Anthropic's published assumptions and with prompt caching. Without caching, the contextualization cost is roughly 9x higher because each chunk's call reloads the full parent document. The 5-minute cache window during indexing is what makes the technique economically viable.
- Does Contextual Retrieval work without Claude?
- The pattern works with any LLM, but the cost economics depend on prompt caching. OpenAI added prompt caching in late 2024, Google in 2025, so it's possible elsewhere. The original benchmark and the canonical implementation use Claude. We'd run the math on your specific provider before assuming cost parity.
- When is Contextual Retrieval worth it?
- When retrieval quality is your bottleneck and your corpus is under 100M total tokens (the one-time cost amortizes). High-stakes pipelines (legal, medical, financial) where a 67% reduction in retrieval failures justifies the index-time cost. Probably not worth it for low-stakes Q&A over a small corpus that already works.
- How is Contextual Retrieval different from query expansion?
- Contextual Retrieval modifies the index at write time. Query expansion modifies the query at read time. They're complementary, not alternatives. Anthropic's full pattern uses contextual chunks + BM25 + dense retrieval + Cohere reranker; query expansion can layer on top. Most teams shipping the technique pair it with hybrid search.
- What's the failure mode of Contextual Retrieval?
- Three. (1) The contextualization model hallucinates — Claude's preface claims the chunk is about Y when it's actually about Z. Rare with Claude Opus, more common with smaller models. (2) The preface bloats short chunks — a 50-token chunk gets a 50-token preface, doubling the index size. (3) The technique amplifies bad chunking — if your chunks are wrong, contextualizing them just adds noise. Fix chunking before adding contextualization.