Anthropic Contextual Retrieval Implementation Guide
Anthropic's Contextual Retrieval pattern prefaces each chunk with a Claude-generated contextual blurb. Their benchmark showed up to a 67% reduction in retrieval failure rate at $1.02 per million document tokens. This is the implementation, the cost math, and where the technique fails.
In September 2024 Anthropic published "Introducing Contextual Retrieval". The technique is conceptually simple: before embedding each chunk into your RAG index, prepend a short Claude-generated blurb that says what the chunk is about and where it sits in the source document. Their benchmark showed the retrieval failure rate dropping by up to 49% versus baseline RAG, and by up to 67% when combined with hybrid search (BM25 + dense) and a Cohere reranker.
The technique is one of the highest-leverage RAG upgrades available in 2026 because it attacks the right problem (retrieval quality), it composes with the rest of the stack (chunking strategy, reranker, hybrid search), and the cost is manageable thanks to prompt caching. This page is the implementation.
The pattern
Plain chunk:
"Performance improved by 24% in this configuration."
On its own, this chunk is unretrievable for any meaningful query. "Which paper showed a 24% improvement?" cannot match it. "Which configuration improved by 24%?" cannot match it. The chunk holds an isolated claim with no contextual hooks.
Contextualized chunk:
"This chunk is from the Results section of the paper 'Improving RAG via Contextual Retrieval' by Anthropic (2024), specifically discussing the performance of the BM25 + Voyage-3 hybrid configuration. Performance improved by 24% in this configuration."
Now the chunk holds:
- The document it came from
- The section structure (results)
- The specific configuration discussed (BM25 + Voyage-3 hybrid)
- The original claim
The implementation
Pseudocode for the indexing step:
from anthropic import Anthropic
from langchain_text_splitters import RecursiveCharacterTextSplitter
client = Anthropic()
DOCUMENT_PROMPT = """<document>
{whole_document}
</document>"""

CONTEXTUALIZE_PROMPT = """Here is a chunk from the document:
<chunk>
{chunk_content}
</chunk>

Please provide a short, succinct context (50-100 tokens) to situate this chunk within the overall document for the purposes of improving search retrieval. Answer only with the context, do not add any other commentary."""

def contextualize_chunk(whole_document: str, chunk_content: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5",  # or sonnet for higher quality
        max_tokens=200,
        system=[
            {
                "type": "text",
                "text": "You contextualize chunks for retrieval.",
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        # The whole document sits in its own block with the cache
                        # breakpoint, so every chunk's call shares an identical
                        # cached prefix.
                        "type": "text",
                        "text": DOCUMENT_PROMPT.format(whole_document=whole_document),
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        # The chunk-specific instruction comes after the breakpoint;
                        # it changes on every call and is never cached.
                        "type": "text",
                        "text": CONTEXTUALIZE_PROMPT.format(chunk_content=chunk_content),
                    },
                ],
            }
        ],
    )
    return response.content[0].text
def index_document(text: str, doc_id: str):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_text(text)
    for chunk in chunks:
        context = contextualize_chunk(text, chunk)
        embedded_chunk = f"{context}\n\n{chunk}"
        # Now embed embedded_chunk, store with metadata pointing to original chunk
        store(embedded_chunk, original=chunk, doc_id=doc_id)
The key detail: the same whole_document is passed, in its own cached block, for every chunk in that document. With prompt caching enabled (cache_control: ephemeral), the document counts as a cache write only once per indexing batch; every subsequent chunk's call reads it from cache at 10% of the input rate. This is what makes the cost economical.
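A useful sanity check during indexing is to watch the cache accounting on each response. This is a minimal sketch, not from Anthropic's cookbook; it assumes the cache_creation_input_tokens and cache_read_input_tokens fields that the Messages API reports on response.usage when caching is active.

def log_cache_usage(chunk_index: int, response) -> None:
    # With caching working, chunk 0 of a document should report a large cache
    # write and ~0 cache read; every later chunk in the same document should
    # report ~0 cache write and a cache read roughly the size of the document.
    usage = response.usage
    print(
        f"chunk {chunk_index}: "
        f"uncached_input={usage.input_tokens}, "
        f"cache_write={usage.cache_creation_input_tokens}, "
        f"cache_read={usage.cache_read_input_tokens}"
    )

Return the raw response from contextualize_chunk (or call this inside it) to wire it in. If every chunk reports a cache write, the cached prefix is changing between calls (usually because chunk-specific text landed before the cache breakpoint) and you are paying the uncached price.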
The cost math
Anthropic's published number is $1.02 per million document tokens for the contextualization step, using Haiku and prompt caching. Let's verify.
Take a 100,000-token document split into 200 chunks (500 tokens each, plus overlap).
Without prompt caching:
- 200 contextualization calls
- Each call: 100,000 input tokens (the document) + ~50 output tokens
- Input cost: 200 × 100,000 × $0.80/M (Haiku input rate) = $16.00
- Output cost: 200 × 50 × $4.00/M (Haiku output rate) = $0.04
- Total: ~$16.04 per 100k document tokens → $160 per million
With prompt caching:
- First call: cache write (100,000 tokens × $1.00/M cache-write rate = $0.10)
- Next 199 calls: cache reads (100,000 × $0.08/M cache-read rate = $0.008 each, $1.59 total) + outputs ($0.04)
- Total: ~$1.74 per 100k document tokens → ~$17 per million
With batching:
- Batch API discount of 50% on top of caching, which brings the ~$17/M above down to roughly $8.50/M for this 100k-token document
- Anthropic's headline $1.02/M additionally assumes much shorter documents (~8k tokens) with larger (~800-token) chunks, so each document is re-read from cache only a handful of times, plus the older Claude 3 Haiku pricing
- Treat $1.02/M as the best case; for long documents at current Haiku rates, budget a few dollars per million document tokens
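To sanity-check these numbers against your own corpus, the arithmetic is easy to script. A sketch with the Haiku rates assumed above hard-coded; they are placeholders to swap for your model's current pricing, and it ignores the per-call chunk text, as the estimate above does.

# Rough cost model for the contextualization step ($ per document).
# Rates are the Haiku figures used in the math above, in $/M tokens.
INPUT_RATE = 0.80
CACHE_WRITE_RATE = 1.00   # 1.25x the input rate
CACHE_READ_RATE = 0.08    # 0.10x the input rate
OUTPUT_RATE = 4.00

def contextualization_cost(doc_tokens: int, n_chunks: int,
                           context_tokens: int = 50,
                           cached: bool = True) -> float:
    output = n_chunks * context_tokens * OUTPUT_RATE / 1e6
    if not cached:
        # Every call re-sends the full document at the normal input rate.
        return n_chunks * doc_tokens * INPUT_RATE / 1e6 + output
    write = doc_tokens * CACHE_WRITE_RATE / 1e6                  # first call
    reads = (n_chunks - 1) * doc_tokens * CACHE_READ_RATE / 1e6  # later calls
    return write + reads + output

# 100k-token document, 200 chunks:
print(contextualization_cost(100_000, 200))                # ~1.73 -> ~$17/M doc tokens
print(contextualization_cost(100_000, 200, cached=False))  # ~16.04 -> ~$160/M doc tokens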
When this is worth it
Three conditions all true:
1. Retrieval quality is your bottleneck. If your RAG is failing because the generator hallucinates or the chunking itself is bad, contextual retrieval doesn't fix that. If it's failing because the retriever pulls the wrong chunks, contextual retrieval helps a lot.
2. Your corpus is under 100M tokens. At 100M, the one-time cost is $100-200. Above that, the math gets noisier: re-embedding when you change models, re-contextualizing when you update document structure. Affordable but not free.
3. Your domain is high-stakes. Legal Q&A, medical summarization, financial reporting. Domains where a 67% drop in retrieval failures matters more than the $100 indexing cost.

When NOT to use it:
- Low-stakes Q&A over a small corpus that already works
- Documents under 5,000 tokens (the contextualization preface bloats short chunks)
- Conversational data (the technique was tuned for prose)
- Real-time-ingestion pipelines where you need sub-second index updates (contextualization adds latency; the sketch below shows one way to parallelize it)
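On the latency point: contextualization calls within one document can run in parallel once the document is in the cache. A rough sketch, not from Anthropic's cookbook, reusing the contextualize_chunk function defined above:

from concurrent.futures import ThreadPoolExecutor

def contextualize_all(whole_document: str, chunks: list[str],
                      max_workers: int = 8) -> list[str]:
    if not chunks:
        return []
    # Run the first chunk alone so that single call writes the document into
    # the prompt cache; parallel calls issued before the cache exists would
    # each pay the full cache-write price.
    contexts = [contextualize_chunk(whole_document, chunks[0])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        contexts += list(
            pool.map(lambda c: contextualize_chunk(whole_document, c), chunks[1:])
        )
    return contexts

This shortens wall-clock indexing time per document; it does not get you to sub-second index updates.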
Composing with the rest of the stack
The 67% failure-rate reduction comes from three techniques composed together:
- Contextual chunks (the new part)
- BM25 sparse retrieval (the boring part — top-K from BM25 plus top-K from dense)
- Cohere or Voyage reranker (rerank the union of BM25 + dense results)
query →
retrieve top-150 via BM25 →
retrieve top-150 via dense (contextualized embeddings) →
union →
rerank to top-20 via Cohere rerank-v3 →
pass to LLM
Each component contributes. In Anthropic's numbers, contextual embeddings alone cut the failure rate by 35%; adding contextual BM25 brought the reduction to 49%; adding reranking brought it to 67%. The contextualization is the unique part (BM25 and reranking are widely used), but the technique pays off most when paired with the others.
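A minimal sketch of that read path, assuming the rank_bm25 package for the sparse side, a placeholder dense_search callable for the vector side, and Cohere's rerank endpoint; the function names, model string, and k values are illustrative, not from Anthropic's cookbook.

from rank_bm25 import BM25Okapi
import cohere

co = cohere.Client()  # reads COHERE_API_KEY from the environment

def hybrid_retrieve(query: str, corpus: list[str], dense_search,
                    k: int = 150, final_k: int = 20) -> list[str]:
    # Sparse side: BM25 over the contextualized chunk texts.
    # (Built per query here for brevity; build it once at index time in practice.)
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(query.split())
    sparse_ids = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]

    # Dense side: dense_search is a placeholder for your own embedding + ANN
    # lookup, returning indices into the same corpus.
    dense_ids = list(dense_search(query, k))

    # Union (deduplicated, order-preserving), then rerank down to what the LLM sees.
    candidate_ids = list(dict.fromkeys(sparse_ids + dense_ids))
    candidates = [corpus[i] for i in candidate_ids]
    reranked = co.rerank(model="rerank-english-v3.0", query=query,
                         documents=candidates, top_n=final_k)
    return [candidates[r.index] for r in reranked.results]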
Implementation gotchas
Document size limits. Claude Sonnet/Opus have a 200K-token context window; Haiku has 200K too. For documents over 200K tokens, you can't pass the whole doc as context. Two workarounds: (a) chunk the document into ~150K-token sections and contextualize chunks within their section, or (b) summarize the document first and pass the summary as context.

Cache window. Prompt caching has a 5-minute default TTL (1-hour beta available). If your indexing pipeline takes more than 5 minutes per document, cache misses creep in. Either batch chunks per document tightly, or enable the 1-hour cache via the beta header.

Hallucination in the preface. Claude's preface can confidently misstate the chunk's topic. We've seen ~3% rate with Opus, ~7% with Haiku, ~12% with Sonnet 4.6. Three mitigations: use Opus for high-stakes corpora (more expensive), validate the preface against the chunk for keyword overlap (a sketch follows below), or accept the small noise floor.

Re-contextualization on document updates. When a document changes, you have to re-contextualize all its chunks (the parent document changed). This is fine for static corpora; it's a cost driver for frequently updated docs.

Length normalization in retrieval. Contextualized chunks are longer than plain chunks. Some retrieval scoring algorithms penalize long documents, so check your similarity function. Most modern embedding models handle length well, but Okapi BM25 has a b parameter that controls length normalization.
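One way to implement the keyword-overlap check mentioned above; a rough heuristic, not Anthropic's method, with an illustrative threshold.

import re

def _content_words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def preface_looks_grounded(preface: str, chunk: str, min_overlap: float = 0.15) -> bool:
    # Crude guard against a hallucinated preface: require some fraction of the
    # preface's words to also appear in the chunk. Prefaces legitimately pull in
    # document-level terms (title, section names), so the threshold stays low.
    preface_words = _content_words(preface)
    if not preface_words:
        return False
    overlap = len(preface_words & _content_words(chunk)) / len(preface_words)
    return overlap >= min_overlap

Chunks that fail the check can be queued for re-contextualization with a larger model, or indexed without the preface.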
Where this fails
Anthropic's benchmark is on their corpus. The 67% number is on the corpora they tested. Your corpus may show 30% or 80%. Run the benchmark.

The technique requires whole-document context at indexing time. If your data pipeline streams documents in chunks (e.g., from a CDC firehose), reconstructing the whole document at index time adds engineering complexity.

Contextualization is one-shot. Once a chunk is contextualized, that context is baked in. If the document's role in your corpus changes (e.g., it becomes part of a series), the original context may be stale. We re-contextualize annually on slow-moving corpora.

Smaller models hurt quality. Using Haiku for cost reasons gives you the $1.02/M number but ~7% preface hallucination rate. If quality matters more than cost, use Sonnet or Opus for contextualization and pay 5-10x more. For most production pipelines Haiku is fine.

The technique amplifies bad chunking. If your chunks are wrong (semantic chunking producing 43-token fragments, see /topic/rag-chunking), contextualizing them adds preface to garbage. Fix chunking first.

What to read next
- /topic/rag-chunking — the chunking strategy contextual retrieval sits on top of
- /topic/rag-frameworks — framework picker
- /topic/embeddings — embedding model selection (voyage-3-large recommended)
- /topic/vector-databases — where to store the contextualized chunks
- /topic/anthropic-prompt-caching — the cache mechanism that makes this affordable
- /topic/rag-claude-code — RAG-as-skill, often paired with contextual retrieval
- /topic/rag-mcp-server — exposing RAG over MCP
- /blog/why-semantic-chunking-lost-to-recursive — the baseline this technique improves on
Sources
- Anthropic. "Introducing Contextual Retrieval". September 2024. The original benchmark and methodology.
- Anthropic. "Contextual Retrieval cookbook". Reference implementation.
- Anthropic. "Prompt caching documentation". 5-minute default TTL, 1-hour beta, 10% cache-read rate.
- Anthropic. "Message Batches API". 50% discount, stacks with prompt caching.
- Cohere. "rerank-v3 documentation". The reranker in Anthropic's full pipeline.
- VoyageAI. "voyage-3-large announcement". Recommended embedding model for contextualized RAG.
- Eugene Yan. "LLM Patterns". Hybrid retrieval as foundation.
- PremAI / Vecta. "RAG Chunking Strategies: The 2026 Benchmark Guide". Baseline chunking benchmark.
Frequently asked
- What is Contextual Retrieval?
- A technique introduced by Anthropic in September 2024 where each chunk in a RAG index gets a 50-100 token preface generated by Claude that situates it inside its parent document. Example preface: 'This chunk is from the methods section of the paper on X, discussing the experimental setup for Y.' Anthropic's benchmark showed a 67% reduction in retrieval failure rate versus plain chunking when paired with hybrid search and a reranker.
- How much does Contextual Retrieval cost?
- $1.02 per million document tokens, one-time at index time, under Anthropic's published assumptions and with prompt caching. Without caching, the contextualization cost is roughly 9x higher because each chunk's call reloads the full parent document. The 5-minute cache window during indexing is what makes the technique economically viable.
- Does Contextual Retrieval work without Claude?
- The pattern works with any LLM, but the cost economics depend on prompt caching. OpenAI added prompt caching in late 2024, Google in 2025, so it's possible elsewhere. The original benchmark and the canonical implementation use Claude. We'd run the math on your specific provider before assuming cost parity.
- When is Contextual Retrieval worth it?
- When retrieval quality is your bottleneck and your corpus is under 100M total tokens (the one-time cost amortizes). High-stakes pipelines (legal, medical, financial) where a 67% reduction in retrieval failures justifies the index-time cost. Probably not worth it for low-stakes Q&A over a small corpus that already works.
- How is Contextual Retrieval different from query expansion?
- Contextual Retrieval modifies the index at write time. Query expansion modifies the query at read time. They're complementary, not alternatives. Anthropic's full pattern uses contextual chunks + BM25 + dense retrieval + Cohere reranker; query expansion can layer on top. Most teams shipping the technique pair it with hybrid search.
- What's the failure mode of Contextual Retrieval?
- Three. (1) The contextualization model hallucinates — Claude's preface claims the chunk is about Y when it's actually about Z. Rare with Claude Opus, more common with smaller models. (2) The preface bloats short chunks — a 50-token chunk gets a 50-token preface, doubling the index size. (3) The technique amplifies bad chunking — if your chunks are wrong, contextualizing them just adds noise. Fix chunking before adding contextualization.