
RAG Chunking Strategy 2026 (Recursive Beats Semantic)

The 2026 chunking benchmark from Vecta measured 69% retrieval accuracy for recursive 512-token chunking and 54% for semantic chunking. The "advanced" advice underperformed the baseline. This is the chunking guide that updates with the data.

For two years the RAG advice was: graduate from naive fixed-size chunking to semantic chunking. Split on sentence boundaries, group by embedding similarity, each chunk holds a coherent unit of meaning. Vendors recommended it. Tutorials recommended it. LangChain and LlamaIndex shipped semantic chunkers as the "advanced" option.

In February 2026 Vecta published a benchmark that flipped the consensus. Across 50 academic papers, identical embedding model, identical retrieval logic:

  • Recursive 512-token chunking: 69% retrieval accuracy
  • Semantic chunking: 54%

A 15-point gap, in the wrong direction. The "advanced" technique underperformed the baseline. This page is the chunking guide that updates with the data — what to ship, why, and where the exceptions live.

For the full benchmark walkthrough and the "why" of semantic chunking's failure, see /blog/why-semantic-chunking-lost-to-recursive. This page is the operational guide.

The strategy that ships

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
    is_separator_regex=False,
)

Three settings, each with a reason:

chunk_size=512. Token-level (use a tokenizer like tiktoken, not characters). Vecta's benchmark shows 512 outperforms 1024 in academic prose. Below 512 you lose multi-sentence context; above 512 retrieval struggles to discriminate because each chunk holds too much. 512 is the production sweet spot.

chunk_overlap=64. Roughly 12-15% of chunk size. Less and you miss answers that straddle boundaries; more and you bloat the index without recall gain. LangChain's default 50-token overlap works; 64 has slightly better p95 recall in our internal testing.

separators in this order. Paragraph → line → sentence → word → character. The recursive splitter tries each in order, so it falls back to mid-word splits only when a chunk would otherwise exceed chunk_size. This is what makes recursive chunking "sentence-respecting" by default.
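
To sanity-check the output before indexing, split a document and inspect the chunks. A minimal sketch using the splitter configured above; "paper.txt" is a placeholder path:

# Reuses the `splitter` configured above.
with open("paper.txt") as f:
    text = f.read()

chunks = splitter.split_text(text)  # returns a list of strings
# Note: len() here counts characters, not tokens — see the token-counting fix below.
print(f"{len(chunks)} chunks; first chunk is {len(chunks[0])} characters")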

In TypeScript via @langchain/textsplitters:

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,
  chunkOverlap: 64,
  separators: ["\n\n", "\n", ". ", " ", ""],
});

LlamaIndex equivalent:

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=64,
)

Different libraries, same idea. The recursive sentence-respecting approach wins across implementations.

The five chunking strategies, by domain

Strategy | Best for | Vecta score (prose) | Notes
Recursive 512-token | Long-form prose, technical docs | 69% | The default ship
Recursive 1024-token | Long prose, low recall budget | 64% | Loses to 512 in benchmark
Markdown-aware | Documents with heading hierarchy | 67% | Use when structure exists
Code-aware (Tree-sitter) | Source code | N/A | Splits on function boundaries
Semantic | Conversational data, short docs | 54% | Lost the prose benchmark
Token-fixed (no boundary respect) | Never | 51% | Always inferior

The implicit pattern: structure beats semantics, prose-with-overlap beats prose-without, sentence-respecting beats character-respecting. Recursive chunking with the right separators captures all three.

Where each strategy wins

Recursive — the default

Long-form prose, academic papers, technical documentation, blog posts, legal briefs. The benchmark shows it clearly wins on this content type. Use it unless you have a specific reason not to.

Markdown-aware

Documents with clear heading hierarchy. Split on #, ##, ### boundaries first; fall back to recursive within sections. The heading structure encodes semantic boundaries that the document author intended. Use when the source content is markdown or has consistent heading conventions.
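
A sketch of that two-stage split using LangChain's MarkdownHeaderTextSplitter. The header map mirrors the #/##/### boundaries above, and "guide.md" is a placeholder path:

from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

markdown_text = open("guide.md").read()  # placeholder source document

# Stage 1: split on the heading hierarchy the author wrote.
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)
sections = header_splitter.split_text(markdown_text)  # Documents with header metadata

# Stage 2: recursive splitting within each section, same settings as the default.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(sections)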

Code-aware

Source code. Splitting on function/class boundaries respects the unit-of-meaning that exists naturally in code. Splitters built on the tree-sitter parser understand function and class definitions across 60+ languages. Far superior to recursive splitting on code.
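
One way to get this without writing a parser yourself is LlamaIndex's CodeSplitter, which wraps tree-sitter. A sketch; the chunk_lines values are illustrative defaults, not benchmarked, and "module.py" is a placeholder path:

from llama_index.core.node_parser import CodeSplitter

code = open("module.py").read()  # placeholder source file

# Splits on AST boundaries (functions, classes) rather than raw character counts.
splitter = CodeSplitter(
    language="python",
    chunk_lines=40,          # target lines per chunk
    chunk_lines_overlap=15,
    max_chars=1500,
)
chunks = splitter.split_text(code)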

Semantic

Conversational data — chat logs, meeting transcripts, support ticket histories. Speaker turns are natural semantic boundaries; embedding similarity between turns by the same speaker is high, between different speakers' turns is lower. Semantic chunking respects this where fixed-token splits don't.

Also: very short documents (under 2,000 tokens total). When the whole doc is small, recursive's "split into 4 chunks of 500 each" is arbitrary; semantic chunking can produce 2-3 meaningfully different chunks. If you do reach for it, LangChain's SemanticChunker is one implementation, as sketched below.
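
A sketch assuming LangChain's SemanticChunker (it lives in langchain_experimental); the embedding model choice and "transcript.txt" path are assumptions, and the percentile threshold is the knob worth tuning:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

transcript_text = open("transcript.txt").read()  # placeholder conversational source

# Splits where the embedding distance between adjacent sentences spikes.
chunker = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-large"),
    breakpoint_threshold_type="percentile",  # also: "standard_deviation", "interquartile"
)
chunks = chunker.split_text(transcript_text)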

Custom domain-specific

Some domains have structure neither recursive nor markdown captures. Legal contracts have numbered clauses. Medical records have section headers like "HISTORY OF PRESENT ILLNESS." Patent filings have claim boundaries. For these, write a domain-aware splitter that respects the structure. Worth the engineering time when the structure is consistent.
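
As an illustration only — the clause pattern below is hypothetical and would need tuning to your documents — a domain-aware splitter can be a regex pass over the structure with a recursive fallback for oversized pieces:

import re
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical pattern: split a contract before numbered clause headings like "12.3 Term".
CLAUSE_RE = re.compile(r"\n(?=\d+(?:\.\d+)*\s+[A-Z])")

def split_contract(text: str, fallback: RecursiveCharacterTextSplitter) -> list[str]:
    clauses = [c.strip() for c in CLAUSE_RE.split(text) if c.strip()]
    chunks = []
    for clause in clauses:
        # Clauses that fit in one chunk stay whole; long ones fall back to recursive splitting.
        if len(clause) <= 2000:
            chunks.append(clause)
        else:
            chunks.extend(fallback.split_text(clause))
    return chunks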

Chunk overlap details

64-token overlap is the default we ship. Three details:

Overlap goes at the start of the next chunk. The recursive splitter takes the last 64 tokens of the previous chunk and prepends them to the next. The retrieval-time effect: any answer that spans a chunk boundary appears in both chunks, so the retriever picks up either one.

Overlap costs index size, not retrieval quality. A pipeline with overlap has roughly 12% more chunks than one without. Storage cost up; recall up; query latency unchanged.

Higher overlap (100+) is for specific cases. Very dense content (academic abstracts, legal definitions) where almost every sentence introduces a new concept. The diminishing returns kick in fast — 100 tokens is the practical max we'd recommend.

Token vs character counts

A common bug: LangChain's RecursiveCharacterTextSplitter counts characters by default, so chunk_size=512 means 512 characters, not tokens. 512 characters is roughly 100-150 tokens depending on content — much smaller than intended.

Fix: pass a token counter:

import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda x: len(encoding.encode(x)),
)

This is the most common silent bug we see in RAG pipelines: chunk sizes are wrong by 3-5x because the splitter is counting characters while the developer thought tokens. Always pass the token counter when you mean tokens.
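
LangChain also ships a convenience constructor that wires up the tiktoken counter for you; as far as we know it is equivalent to the lambda above. The cl100k_base encoding name is an assumption — swap it for your model's tokenizer:

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tiktoken encoding used for length counting
    chunk_size=512,
    chunk_overlap=64,
)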

When to upgrade beyond recursive

Two upgrades that beat recursive at the next quality tier.

Anthropic's Contextual Retrieval

The pattern: each chunk gets a 50-100 token preface generated by Claude that situates it inside the document. "This chunk is from the methods section of the paper on X, discussing the experimental setup for Y."

Anthropic's benchmark: a 49% reduction in retrieval failure rate over plain chunking, rising to 67% when combined with reranking. Cost: $1.02 per million document tokens once, at index time, with prompt caching. After that, no per-query cost — the contextualization is baked into the chunk text.
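
A minimal sketch of that index-time step, assuming the anthropic Python SDK. The model id, prompt wording, and token budget are placeholders, not Anthropic's published prompt:

import anthropic

client = anthropic.Anthropic()

def contextualize(document: str, chunk: str) -> str:
    # Ask the model for a short preface that situates the chunk within the document.
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model id
        max_tokens=150,
        system=[
            {
                "type": "text",
                "text": f"<document>\n{document}\n</document>",
                "cache_control": {"type": "ephemeral"},  # prompt caching: pay for the doc once
            }
        ],
        messages=[
            {
                "role": "user",
                "content": f"Write a 50-100 token context situating this chunk in the document:\n{chunk}",
            }
        ],
    )
    # Prepend the generated context so it is baked into the indexed chunk text.
    return response.content[0].text + "\n\n" + chunk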

This is the upgrade we ship for high-stakes pipelines. Full implementation at /topic/contextual-retrieval.

Parent-document retriever / small-to-big

Index small chunks (for retrieval precision) and large parent chunks (for generation context). At query time, retrieve the small chunk, then return the parent chunk to the LLM. Wins when the answer is in a sentence but the context needed is the surrounding paragraph.

LangChain's ParentDocumentRetriever and LlamaIndex's RecursiveRetriever both implement this. The trade-off: complexity (two index layers, two ID systems) for quality (better context per retrieved unit).
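
A sketch of the LangChain version. The Chroma vector store, OpenAI embeddings, and 2048-token parent size are assumptions for illustration, not benchmarked choices:

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = [Document(page_content=open("paper.txt").read())]  # placeholder corpus

# Small chunks go in the vector index; large parents live in the docstore.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=0)

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="parents", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)          # indexes children, stores parents
results = retriever.invoke("query")    # returns parent chunks whose children matched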

Where this fails

  • The Vecta benchmark covers academic prose. Other corpus types — legal contracts, source code, multilingual docs — weren't in the test set. The 69% recursive / 54% semantic gap might be smaller (or larger) on your corpus. Run the benchmark on your data before generalizing.
  • Embedding model matters more. Voyage-3-large outperforms OpenAI text-embedding-3-large by 10.58% in the same benchmark — a bigger lift than any chunking change. If you're optimizing chunking before optimizing the embedding model, priorities are inverted. See /topic/embeddings.
  • Reranking dominates chunking. Adding a Cohere or Voyage reranker on top of any chunking strategy lifts recall by 15-25 points. Reranker first, chunking second.
  • LangChain's SemanticChunker is one implementation. Other semantic chunkers tune the percentile threshold differently. Vecta didn't sweep these. Possible (not proven) that a properly tuned semantic chunker could close the gap. We'd run our own test before assuming.
  • Late-chunking is research, not production. "Embed the whole document, then chunk the embeddings" is interesting in the literature but not production-stable as of May 2026. Re-evaluate in 6 months.


Frequently asked

What chunk size should I use for RAG in 2026?
512 tokens with 64-token overlap, using recursive character splitting that respects sentence boundaries. Vecta's February 2026 benchmark across 50 academic papers showed 512-token recursive chunking at 69% accuracy, beating 1024-token (64%) and semantic chunking (54%). Below 256 tokens you lose multi-sentence context; above 1024 the retriever struggles to discriminate.
Should I use semantic chunking?
Probably not for prose. Vecta's benchmark showed semantic chunking shipped chunks averaging 43 tokens — too short for retrieval to work. Default LangChain SemanticChunker over-segments academic prose because the embedding-distance gradient triggers splits everywhere. Semantic chunking can help on conversational data and short documents; for everything else, recursive wins.
What's the chunk-overlap sweet spot?
12-15% of chunk size. With 512-token chunks, 64 tokens of overlap. Below 50 tokens you miss answers that straddle a boundary; above 100 tokens you bloat the index without recall gain. LangChain's default of 50 works too, but 64 has slightly better p95 recall in our testing.
Does chunking strategy matter more than the embedding model?
No. The embedding model has bigger impact. Voyage-3-large beats OpenAI text-embedding-3-large by 10.58% in Vecta's benchmark — a larger gap than any chunking change produces. Fix your embedding model first, then your reranker, then chunking. Chunking is a tier-2 optimization, not the top of the priority list.
What about markdown-aware or code-aware chunking?
Use them when the content has structure. Markdown-aware chunking scored 67% in Vecta's benchmark on documents with heading hierarchy — close to recursive 512-token. Code-aware chunking (Tree-sitter splitting on function boundaries) wins on codebases. Structure-aware chunking exploits a signal recursive ignores; use it when the signal exists.
Should I move to contextual retrieval after recursive chunking?
Yes if quality matters. Anthropic's Contextual Retrieval prefaces each chunk with a 50-100 token contextual blurb generated by Claude. Their benchmark showed a 49% reduction in retrieval failure rate over plain recursive chunking, 67% when combined with reranking. Cost is $1.02 per million document tokens, one-time, with prompt caching. For pipelines where retrieval quality is the bottleneck, this is the next upgrade — not 'tune semantic chunking harder.'
