Why Semantic Chunking Lost to Recursive Chunking in 2026 Benchmarks

RuleSell Team

Vecta's February 2026 benchmark across 50 academic papers measured 69% retrieval accuracy for recursive 512-token chunking and 54% for semantic chunking. The default 'advanced' advice underperformed the naive baseline by 15 points. We walk through why, where semantic still wins, and the chunking config that actually shipped in our pipelines.

For two years the RAG advice has been: graduate from naive fixed-size chunking to "semantic" chunking — split on sentence boundaries, then group by embedding similarity, so each chunk holds a coherent unit of meaning. Vendors recommend it. Tutorials recommend it. LangChain, LlamaIndex, and every "RAG best practices" post since 2024 recommend it.

In February 2026, Vecta published a benchmark running five chunking strategies against the same retrieval task on 50 academic papers, with a stable embedding model (voyage-3-large) and identical retrieval logic. Recursive 512-token chunking — the "naive baseline" you were told to graduate from — scored 69% retrieval accuracy. Semantic chunking, the supposed upgrade, scored 54%. A fifteen-point gap in the wrong direction.

This is one of those findings that, once you see the numbers, you can't un-see. The community-default advanced technique underperformed the simplest baseline by a margin that would change architectural decisions. We've now run the same kind of comparison on three of our own production RAG pipelines and the directional result holds. This post explains why semantic chunking loses, where it still wins, and the actual chunking config we ship.

The benchmark, in numbers

| Strategy | Vecta accuracy | Avg chunk size | Notes |
|---|---|---|---|
| Recursive 512-token | 69% | 487 tokens | Plain RecursiveCharacterTextSplitter with 50-token overlap |
| Recursive 1024-token | 64% | 980 tokens | Same splitter, larger window |
| Semantic chunking | 54% | 43 tokens | LangChain SemanticChunker, default percentile breakpoint |
| Markdown-aware | 67% | 412 tokens | Split on heading hierarchy |
| Token-fixed (no respect for sentence boundaries) | 51% | 512 tokens | Hard cut every 512 tokens |

The headline isn't "recursive wins." It's that semantic chunking shipped chunks averaging 43 tokens. That number is the smoking gun. Sentence-level semantic grouping, applied to academic prose with embedded equations and citations, fragmented the source into tiny pieces because the embedding distance between adjacent sentences was high enough to trigger a split almost everywhere.

A 43-token chunk holds maybe one sentence of a multi-sentence claim. The retriever can pull the chunk that says "we observed an effect of magnitude X" but not the one that says "in the experimental condition described above." Recall craters. Recursive chunking at 512 tokens preserves multi-sentence context by default.
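
You can check for this failure on your own corpus in a few lines. A minimal sketch, assuming LangChain's experimental SemanticChunker with OpenAI embeddings (the benchmark used voyage-3-large; any LangChain embeddings class works) and a placeholder paper.txt:

import tiktoken
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

enc = tiktoken.get_encoding("cl100k_base")
text = open("paper.txt").read()  # placeholder: any long-form document

# Defaults: 95th-percentile breakpoint on adjacent-sentence embedding distance
chunks = SemanticChunker(OpenAIEmbeddings()).split_text(text)
sizes = sorted(len(enc.encode(c)) for c in chunks)
print(f"{len(sizes)} chunks | min {sizes[0]} | "
      f"median {sizes[len(sizes) // 2]} | max {sizes[-1]} tokens")

If the median lands well under 100 tokens on your documents, you are looking at the same over-segmentation Vecta measured.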

Why semantic chunking sounds right and isn't

The intuition behind semantic chunking is reasonable. The implementation, applied with default parameters to typical documents, has three failure modes that compound.

Failure 1 — the chunker is too sensitive. LangChain's SemanticChunker defaults to a 95th-percentile threshold on embedding distance between adjacent sentences. In academic prose, where every sentence introduces a new concept, the 95th percentile is hit constantly. The result is over-segmentation. Tuning the percentile up to 99 helps, but the out-of-the-box defaults ship sub-100-token chunks.

Failure 2 — embeddings don't measure "topical coherence" the way humans do. When the model sees "we used a temperature of 0.7" and "the temperature was held constant throughout," it might compute high similarity (both about temperature) or low similarity (one is a methods statement, one is a results statement). Embedding similarity is not a semantic-unit detector. It's a vector distance. Vector distance correlates loosely with semantic unity, but not strongly enough to use as a chunking boundary.

Failure 3 — chunk size variance ruins retrieval. With recursive chunking, every chunk is roughly the same size. With semantic chunking, chunks range from 12 tokens to 2,400 tokens in the same document. The retriever — which embeds the user query once and compares it to every chunk's embedding — has to compare against a corpus of wildly different chunk densities. A 12-token chunk has less context per vector than a 1,200-token chunk; their similarity scores are not on the same scale.
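
If you're committed to semantic chunking anyway, the first and third failures suggest two mitigations: raise the breakpoint percentile and enforce a minimum chunk size. A sketch under the same assumptions as the snippet above; merge_small is our own post-processing helper, not a library feature:

import tiktoken
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

enc = tiktoken.get_encoding("cl100k_base")

# Mitigation 1: split only at the 99th percentile instead of the default 95th
chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=99,
)

# Mitigation 2: fold any chunk under `floor` tokens into its predecessor
def merge_small(chunks: list[str], floor: int = 100) -> list[str]:
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(enc.encode(chunk)) < floor:
            merged[-1] += " " + chunk
        else:
            merged.append(chunk)
    return merged

chunks = merge_small(chunker.split_text(open("paper.txt").read()))

Nobody has published a benchmark showing this tuned variant beats recursive; treat it as damage control, not an upgrade.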

The recursive chunker's "boring" property — fixed size, sentence-respecting boundaries, predictable overlap — turns out to be exactly what retrieval wants. Boring beats clever.

Where semantic chunking still wins

Three cases.

Case 1 — conversational data. Chat logs, support tickets, meeting transcripts. Speaker turns are natural semantic boundaries; embedding similarity between turns by the same speaker is high, between different speakers' turns it is lower. Semantic chunking respects the turn structure where fixed-token splits don't. Vecta didn't test this domain. (A minimal turn-aware sketch follows this list.)

Case 2 — short documents. When the whole document is under 2,000 tokens, the recursive chunker's "split into 4 chunks of 500 each" is arbitrary; semantic chunking can produce 2-3 meaningfully different chunks. The 43-token problem doesn't appear because the sentences aren't packed as densely.

Case 3 — heterogeneous corpora. A knowledge base that mixes structured FAQs, prose articles, and code snippets benefits from a chunker that adapts to content type. Markdown-aware chunking handles this better than pure semantic chunking, but semantic still beats naive token-fixed splits here.
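
For Case 1 you may not need embeddings at all, because the boundary signal is printed right in the text. A minimal turn-aware sketch; the SPEAKER: line format is an assumption about your transcript layout, so adjust the regex to match yours:

import re

def split_by_turns(transcript: str, turns_per_chunk: int = 6) -> list[str]:
    # Assumes a new turn starts at a line like "ALICE:" or "Bob:" at column 0
    turns = re.split(r"(?m)^(?=[A-Za-z][\w .'-]*:)", transcript)
    turns = [t.strip() for t in turns if t.strip()]
    # Group consecutive turns so each chunk keeps local back-and-forth context
    return [
        "\n".join(turns[i : i + turns_per_chunk])
        for i in range(0, len(turns), turns_per_chunk)
    ]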

If your corpus is academic papers, legal briefs, technical documentation, or any long-form text, recursive wins. If your corpus is conversations, short notes, or mixed-format, run the benchmark yourself.

The chunking config we actually ship

Three settings, shipped unchanged across three production pipelines (documentation Q&A, code search, and a customer-support knowledge base):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Built via the tiktoken classmethod so chunk_size counts tokens, not
# characters (the plain constructor counts characters).
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,        # tokens
    chunk_overlap=64,      # 12-15% overlap is the sweet spot
    separators=["\n\n", "\n", ". ", " ", ""],
    is_separator_regex=False,
)
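
Usage is one call on top of the splitter above (paper.txt is a placeholder for any long-form document); on prose you should see chunk sizes clustered just under 512 tokens:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
chunks = splitter.split_text(open("paper.txt").read())
print(len(chunks), "chunks; first five sizes:",
      [len(enc.encode(c)) for c in chunks[:5]])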

Three deliberate choices.

chunk_size=512. Vecta's benchmark shows 512 outperforms 1024 on academic prose. Below 512 you lose multi-sentence context; above 512, retrieval struggles to distinguish chunks because each holds too much. Run the benchmark on your corpus before assuming.

chunk_overlap=64. Roughly 12-15% of chunk size. Less than this and you miss the case where the answer straddles a boundary; more and you bloat the index without recall gain. A 50-token overlap (what the benchmark used) works too, but 64 has slightly better p95 recall in our testing.

Separators in this order. Paragraph break → line break → sentence period → word boundary → character. The recursive splitter tries each separator in order and only falls back to mid-word splits when a chunk would otherwise exceed chunk_size. This is what gives recursive chunking its sentence-respecting behavior.

The same config in JavaScript via @langchain/textsplitters:

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  // chunkSize counts characters here; supply a token-based lengthFunction
  // (e.g. via js-tiktoken) if you want an exact mirror of the Python config.
  chunkSize: 512,
  chunkOverlap: 64,
  separators: ["\n\n", "\n", ". ", " ", ""],
});

When to upgrade beyond recursive

Two upgrades that beat recursive at the next tier of complexity.

Anthropic's Contextual Retrieval (Sep 2024). Each chunk gets a 50-100 token preface generated by Claude that situates it inside the document ("this chunk is from the methods section of the paper on X, discussing the experimental setup for Y"). Anthropic's benchmark shows a 67% improvement over plain recursive chunking. The cost is $1.02 per million document tokens with prompt caching, paid once at index time, then nothing. This is the upgrade we actually ship for high-stakes pipelines. Detail at /topic/contextual-retrieval.

Parent-document retriever / small-to-big. Split the document into both small chunks (for retrieval precision) and larger parent chunks (for generation context). Retrieve on the small chunk; pass the parent to the generator. This wins when the answer lives in a sentence but the generation needs the surrounding paragraph. LangChain and LlamaIndex both ship implementations.
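
For reference, the indexing step of contextual retrieval is small. A sketch assuming the Anthropic Python SDK; the model name and prompt wording are our illustrative choices, not Anthropic's published recipe, and the prompt caching that gets the cost down to $1.02/M tokens is omitted for brevity:

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

def contextualize(document: str, chunk: str) -> str:
    """Prefix a chunk with a short model-written preface, then index the pair."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumption: any cheap, fast model
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": (
                f"<document>\n{document}\n</document>\n"
                f"<chunk>\n{chunk}\n</chunk>\n"
                "Write a 50-100 token preface that situates this chunk within "
                "the document. Reply with the preface only."
            ),
        }],
    )
    return response.content[0].text.strip() + "\n\n" + chunk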

Neither of these is "semantic chunking done right." They are different patterns. The takeaway from Vecta is not "tune semantic chunking better" — it's "skip semantic chunking entirely and either stay with recursive or jump to contextual retrieval."

Where this fails

The benchmark covers academic prose. Vecta tested 50 papers. Other corpus types — legal contracts with heavily nested clauses, codebases with multi-line definitions, multilingual documents — were not in the benchmark. Run your own.

Embedding model matters more than chunking. Voyage-3-large outperforms OpenAI text-embedding-3-large by 10.58% in Vecta's benchmark; that's a bigger lift than any chunking strategy change. If you're optimizing chunking before optimizing the embedding model, your priorities are inverted. See /topic/embeddings for model selection.

Reranking dominates chunking. Adding a Cohere or Voyage reranker on top of any chunking strategy lifts recall by 15-25 points in our testing. The chunking debate is real, but it's down the list compared to "is there a reranker in the pipeline." If you don't have a reranker, install one first.

LangChain's SemanticChunker is one implementation. Other semantic chunkers (LlamaIndex's SemanticSplitterNodeParser, Greg Kamradt's five-strategy chunking series, the text-splitter Rust crate) tune the percentile differently. The Vecta benchmark didn't sweep these. It's possible a different implementation of semantic chunking would close the gap; nobody has shown it.

Domain-specific structure beats both. If your documents have markdown headings, use markdown-aware splitting (67% in the benchmark). If they have function boundaries in code, split on functions. If they have legal sections, split on sections. Structure-aware chunking exploits a signal that both recursive and semantic chunking ignore.
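
To make the reranker point concrete, the change is about a dozen lines. A sketch assuming Cohere's Python SDK and an API key; retrieved is whatever your vector search returned:

import cohere

co = cohere.Client()  # assumes CO_API_KEY is set

def rerank(query: str, retrieved: list[str], keep: int = 5) -> list[str]:
    # Reorder retrieved chunks by cross-encoder relevance, keep the best few
    resp = co.rerank(
        model="rerank-english-v3.0",  # check Cohere's docs for current models
        query=query,
        documents=retrieved,
        top_n=keep,
    )
    return [retrieved[r.index] for r in resp.results]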

FAQ

Q: Should I just use the default LangChain RecursiveCharacterTextSplitter?
A: For most production pipelines, yes. The config we ship is chunk_size=512, chunk_overlap=64, default separators. If you're on academic prose, technical docs, or long-form content, this beats semantic chunking.

Q: Did Vecta tune the semantic chunker's parameters?
A: They used defaults. A semantic chunker with the percentile threshold tuned to 99 (instead of 95) and a minimum-chunk-size floor would likely close the gap. Nobody has published that benchmark; if you have the corpus, run it.

Q: What about chunk size 256 or 1024?
A: 256 loses multi-sentence context. 1024 holds too much and the retriever struggles to discriminate. 512 is the sweet spot in Vecta's benchmark and in our internal testing. If your retrieval task is multi-hop (the answer requires synthesis across passages), bias smaller; if it's single-fact lookup, bias larger.

Q: Does this finding hold for code RAG?
A: We have not seen a published benchmark on code RAG specifically. Anecdotally, function-boundary splitting (each function is a chunk) beats both recursive and semantic on code. Use a Tree-sitter-based splitter that respects language structure (see the sketch below).

Q: Is contextual retrieval worth it over plain recursive?
A: Yes, if retrieval quality is your bottleneck and your corpus is large enough that the one-time $1.02/M-token cost pays back. For pipelines under 10M total document tokens it's a no-brainer. For larger corpora, do the math.

Q: What about late chunking?
A: Late chunking (embed the whole document, then chunk the embeddings) is interesting in research but not production-stable as of May 2026. We'll re-benchmark when it stabilizes.
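
On the code-RAG answer above: before wiring up a full Tree-sitter parser, LangChain's language-aware splitter gets you partway by preferring class and function boundaries as separators. A sketch; example.py is a placeholder, and chunk_size here counts characters:

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Language-aware separators: tries class/def boundaries before falling back
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=0,  # function-shaped chunks rarely need overlap
)
chunks = code_splitter.split_text(open("example.py").read())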