

RAG Framework Picker: LangChain, LlamaIndex, DSPy, Haystack, RAGFlow

Five RAG frameworks dominate production usage in 2026. Each is good at something specific and bad at something else. This is the picker — what each framework optimizes for, when it loses, and the configuration we'd actually ship.

As of May 2026, the RAG framework landscape has consolidated around five frameworks that account for the majority of production usage. They are not interchangeable. Each was built for a different problem, optimizes for a different metric, and falls down in a different way. Picking the wrong one means rewriting your retrieval pipeline; picking the right one means you spend your time tuning instead of fighting the framework.

This page is the picker. What each framework does well, where it loses, the production stack we'd ship for common workloads, and when the right answer is "no framework, just write the 150 lines yourself."

The five

LangChain — 105,000 stars, the default

What it is: the largest RAG framework by mind-share and integrations. Wraps every model, every vector DB, every embedding model, every document loader. Python and TypeScript first-class.

What it does well: integrations. If a vendor has a Python SDK, LangChain has a wrapper. Switching from OpenAI to Anthropic to Cohere is a one-line config change. The community is large enough that almost every common problem has a Stack Overflow answer or LangChain doc page.
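As a rough sketch of what that swap looks like (assuming a current langchain release with the init_chat_model helper and the provider packages installed; model names here are illustrative, not a recommendation):

# Sketch: swapping providers in LangChain via init_chat_model
# (assumes langchain >= 0.2 with langchain-openai / langchain-anthropic installed;
#  model identifiers are illustrative)
from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o", model_provider="openai")
# Switching providers is the same call with a different string:
# llm = init_chat_model("claude-3-5-sonnet-latest", model_provider="anthropic")

answer = llm.invoke("Summarize the retrieved context: ...")
print(answer.content)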

Where it loses: the abstractions get in the way at scale. HN consensus from production-RAG threads is consistent: "LangChain abstracts you away from the parts you actually need to tune." LCEL (LangChain Expression Language) is powerful but the learning curve is steep, and debugging a long chain is harder than debugging the equivalent direct calls. At 5M+ docs production scale, teams report rewriting the hot path to skip LangChain.

When to use it: most production setups under 1M docs. Teams new to RAG. Workflows that span many providers. Prototyping.

LlamaIndex — 40,800 stars, the retrieval-quality choice

What it is: framework built around document ingestion and retrieval as first-class concepts. Where LangChain is "wrap every model," LlamaIndex is "do retrieval well." Jerry Liu, the founder, has been publicly opinionated about retrieval mattering more than orchestration.

What it does well: ingestion of structured documents, multi-step retrieval (parent-document, hypothetical-questions, RAG-Fusion), and agentic retrieval where the agent decides what to retrieve next. The Recursive Retriever and Auto-Retriever patterns are LlamaIndex inventions and they're production-tested.
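A minimal sketch of the LlamaIndex flavor of this (assuming the llama_index.core package layout; the directory path is a placeholder, and the recursive/agentic retrievers layer on top of the same index object):

# Sketch: basic LlamaIndex ingestion + top-k retrieval
# (assumes llama-index >= 0.10 core layout; "./contracts" is a placeholder path)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./contracts").load_data()
index = VectorStoreIndex.from_documents(documents)

# Plain top-k retrieval; the Recursive Retriever and Auto-Retriever patterns wrap
# this same index with an extra LLM step that decides what to fetch next.
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What is the termination clause?")
print(response)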

Where it loses: smaller integration surface than LangChain. Less mainstream familiarity — onboarding a new team member takes longer because there are fewer Stack Overflow answers. The "agentic retrieval" pattern is powerful but expensive (multiple LLM calls per query).

When to use it: document-heavy workloads. Multi-hop retrieval (answer requires synthesis across passages). When retrieval quality is the primary bottleneck.

DSPy — 23,000 stars, programmatic prompt optimization

What it is: Stanford research framework for writing RAG (and other LLM) pipelines as composable modules with declared input/output signatures, then optimizing the prompts inside them automatically against a metric. Not a wrapper, not orchestration — a different programming model.

What it does well: when prompts have many degrees of freedom and you can define a metric (RAGAS score, accuracy on a test set, judge-LLM score), DSPy compiles a better prompt than a human would write. The bootstrapped few-shot and MIPRO optimizers reliably outperform hand-written prompts on the right problems.
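A sketch of what "declared signatures plus a compiled optimizer" looks like in practice (assuming the dspy-ai package; the LM/retriever configuration, training set, and metric are illustrative placeholders, not any particular pipeline):

# Sketch: a DSPy RAG module and a bootstrapped few-shot compile
# (assumes dspy-ai; lm/rm config, trainset, and the metric are placeholders)
import dspy
from dspy.teleprompt import BootstrapFewShot

# dspy.settings.configure(lm=..., rm=...)  # point at your LM and retriever first

class RAG(dspy.Module):
    def __init__(self, k=5):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=k)
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.answer(context=context, question=question)

def exact_match(example, pred, trace=None):
    # Placeholder metric: did the gold answer appear in the prediction?
    return example.answer.lower() in pred.answer.lower()

# The "compile" step: DSPy rewrites the prompts inside RAG against the metric.
compiled_rag = BootstrapFewShot(metric=exact_match).compile(RAG(), trainset=trainset)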

Where it loses: the learning curve is steep. The "compile" step is itself an LLM workflow that costs tokens and time — a first compilation can take 30+ minutes and $20+ in tokens. Teams that ship every two weeks don't want a 30-minute step in the loop.

When to use it: high-stakes RAG where you have a metric and the compilation cost amortizes (compile once, ship for months). Research-grade work. Optimization on top of an existing LangChain or LlamaIndex pipeline.

Haystack — 20,200 stars, the enterprise pick

What it is: deepset's framework, built around pipelines as a first-class concept with strong observability and component-level isolation. Originally an information-retrieval framework that pivoted into LLM-era RAG.

What it does well: regulated-industry production. Component-level versioning, strong audit logging, on-prem-friendly deployment. Haystack pipelines can be exported as YAML and version-controlled separately from code. The opinionated structure pays off for teams that need to explain to compliance what the pipeline does.
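A sketch of the pipeline-as-artifact idea (assuming Haystack 2.x component names; the specific retriever and embedder choices are illustrative):

# Sketch: Haystack 2.x pipeline, exportable to YAML for audit and versioning
# (assumes haystack-ai 2.x; component choices are illustrative)
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

store = InMemoryDocumentStore()
pipe = Pipeline()
pipe.add_component("embedder", SentenceTransformersTextEmbedder(model="BAAI/bge-m3"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=store))
pipe.connect("embedder.embedding", "retriever.query_embedding")

# The part compliance cares about: the whole pipeline serializes to YAML.
with open("rag_pipeline.yml", "w") as f:
    f.write(pipe.dumps())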

Where it loses: speed of iteration. The structure that wins for compliance loses for experimentation. Less hot in the open-source community than LangChain/LlamaIndex — fewer recent tutorials, smaller plugin ecosystem.

When to use it: financial services, healthcare, government. Anywhere "explain the pipeline to an auditor" is a real requirement. Anywhere on-prem deployment matters.

RAGFlow — 48,500 stars, the document-extraction choice

What it is: InfiniFlow's open-source RAG framework specialized for parsing complex documents — PDFs with tables, charts, multi-column layouts, scanned content.

What it does well: document understanding. RAGFlow's parser handles structured PDFs (financial reports, legal contracts, medical records) substantially better than LangChain's default PyPDF or PDFMiner loaders. Table extraction works on tables that other frameworks turn into garbage.

Where it loses: less flexibility outside the document-heavy use case. The opinionated UI and pipeline structure don't fit all teams. Less developed support for streaming or low-latency inference.

When to use it: knowledge bases over PDFs. Compliance review over contracts. Medical-record Q&A. Anywhere the bottleneck is "the framework can't read my source documents" rather than "the framework can't orchestrate well."

The framework that doesn't exist on this list: yours

A minimal RAG pipeline is ~150 lines:

# Pseudocode for a minimal Python RAG
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient, models

embedder = SentenceTransformer("BAAI/bge-m3")
client = OpenAI()
db = QdrantClient(":memory:")
db.create_collection("docs", vectors_config=models.VectorParams(
    size=1024, distance=models.Distance.COSINE))  # BGE-M3 dense vectors are 1024-d

def chunk(text, size=512, overlap=64):
    # recursive char splitter, ~30 lines
    ...

def index(documents):
    point_id = 0
    for doc in documents:
        for c in chunk(doc.text):
            v = embedder.encode(c)
            db.upsert("docs", points=[models.PointStruct(
                id=point_id, vector=v.tolist(),
                payload={"text": c, "doc": doc.id})])
            point_id += 1

def retrieve(query, k=10):
    v = embedder.encode(query)
    return db.search("docs", query_vector=v.tolist(), limit=k)

def generate(query, context):
    return client.chat.completions.create(
        model="gpt-5.4",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )

For most teams' RAG, this is enough. The framework adds nothing the team couldn't write themselves. The HN production-RAG threads at 5M+ docs scale consistently report: "we started with LangChain, rewrote in 200 lines, never looked back."

The case for a framework: integrations (you need five or more providers to be swappable), team size (past about three engineers, shared abstractions start to pay off), and tooling (LangSmith and other observability that hooks into LangChain natively).

The case against: simplicity. Fewer moving parts. Less rewrite-when-the-abstraction-leaks.

The production stack we'd actually ship

Independent of framework:

  1. Chunking: recursive 512-token with 64-token overlap. Not semantic (semantic chunking lost in the Vecta 2026 benchmark, see /blog/why-semantic-chunking-lost-to-recursive).
  2. Embedding model: voyage-3-large if you have budget; BGE-M3 self-hosted if cost-sensitive. Both beat OpenAI text-embedding-3-large in independent benchmarks. See /topic/embeddings.
  3. Vector DB: pgvector if you already run Postgres and you're under 10M vectors; Qdrant self-hosted on a $30/mo VPS for cost; Pinecone if you want managed and have the budget. See /topic/vector-databases.
  4. Hybrid search: BM25 alongside vector search. Recall lifts 15-25 points over vector-only (a minimal fusion sketch follows below).
  5. Reranker: Cohere rerank-v3 or Voyage rerank-2. Add a reranker before asking whether the framework choice is your problem; it usually isn't.
  6. Contextual retrieval: Anthropic's pattern (/topic/contextual-retrieval) for high-stakes pipelines. Roughly a 67% reduction in retrieval failures for a one-time cost of about $1.02 per million document tokens.
  7. Generation: Claude Opus 4.7, GPT-5.4, or Gemini 3.1 Pro depending on the cost/quality trade-off. Mix via a gateway (/topic/llm-gateway-decision).

The framework decision matters less than this stack. A bad stack in LangChain is bad. A good stack in 150-line Python is good.
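For the hybrid-search and reranker items above, the fusion step most teams use is reciprocal rank fusion over the BM25 and vector result lists. A minimal, framework-agnostic sketch (the search functions and the Cohere model name are placeholders):

# Sketch: reciprocal rank fusion of BM25 and vector result lists
# (framework-agnostic; bm25_search/vector_search are placeholders that each
#  return a ranked list of (doc_id, score) pairs)
def rrf(bm25_hits, vector_hits, k=60, top_n=10):
    scores = {}
    for hits in (bm25_hits, vector_hits):
        for rank, (doc_id, _score) in enumerate(hits):
            # Each list contributes 1 / (k + rank); k=60 is the usual default.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# fused = rrf(bm25_search(query), vector_search(query))
# A cross-encoder rerank then reorders the fused candidates, e.g. with Cohere
# (model name illustrative):
# co = cohere.Client()
# reranked = co.rerank(model="rerank-v3.5", query=query, documents=texts, top_n=5)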

When framework choice does matter

Three cases.

Multi-provider routing as a feature. If you need to switch providers at runtime based on cost or availability, LangChain's swap-the-model-string pattern is real value. Building this from scratch is 20+ lines of error-handling per provider.

Long-context agentic retrieval. LlamaIndex's Auto-Retriever and recursive patterns are nontrivial to reimplement. If your retrieval pattern is "agent decides what to retrieve next based on what it found," LlamaIndex is the shortest path.

Compliance and audit. Haystack's YAML-pipeline-as-config is a real win for regulated environments. Telling an auditor "the pipeline is in this YAML file, version-controlled separately, with these components" is easier than telling them "it's distributed across these 8 Python files."

Where this fails

Frameworks change quickly. LangChain shipped breaking changes between v0.1, v0.2, and v0.3 in 2024-2025. LlamaIndex did similar restructuring. Lock to a minor version and budget for migrations. The 150-line custom pipeline is more stable precisely because there are fewer moving parts.

Star count doesn't equal production quality. Dify (90.5k stars) is huge but heavily no-code; production teams ship less Dify than the star count suggests. RAGFlow (48.5k) is genuinely production-quality but specialized.

The HN dissent is real. "LangChain abstracts you away from the parts you need to tune" is a recurring complaint at scale. Don't dismiss it. Plan to outgrow your framework if your traffic warrants it.

Frameworks don't fix retrieval. A bad chunking strategy, a bad embedding model, or a missing reranker will produce bad RAG regardless of framework. If your pipeline doesn't work, fix the retrieval before changing the framework.


Frequently asked

Which RAG framework should I use?
LangChain for breadth of integrations and team familiarity (105k stars, most documented). LlamaIndex for retrieval-quality and document-heavy work (40.8k stars, 'retrieval king'). DSPy when you want to optimize prompts programmatically (23k stars, Stanford). Haystack for enterprise/regulated industries with strict observability needs (20.2k stars). RAGFlow for document-heavy domains with complex table extraction (48.5k stars). Pick by job, not popularity.
Is LangChain dying?
No, but the criticism is real. The HN consensus is that LangChain's abstractions get in the way of the parts you actually need to tune. Teams still ship LangChain in production at scale — Uber, Klarna, others — but a growing minority writes their own minimal pipeline. The framework choice matters less than the retrieval discipline.
DSPy vs LangChain — which?
Different categories. DSPy is a programmatic-prompt-optimization layer; LangChain is an orchestration library. You can run DSPy inside a LangChain pipeline. Use DSPy when prompts have many degrees of freedom and you want to optimize against a metric. Use LangChain when you need to wire components together. Most production RAG uses LangChain + an embedded reranker, not DSPy directly.
Do I need a framework at all?
No. A minimal RAG pipeline is ~150 lines of Python or TypeScript: chunk → embed → store → retrieve → rerank → generate. HN engineers running RAG at 5M+ docs frequently report that the framework abstractions cost them more than they saved. If your pipeline is stable and small, frameworkless is reasonable.
What's the production RAG stack we'd actually ship?
Recursive 512-token chunking → voyage-3-large (or BGE-M3 self-hosted) → pgvector or Qdrant → Cohere rerank-v3 or voyage rerank → Claude or GPT-5 for generation. Wrap in either LangChain or your own minimal pipeline. The framework matters less than the retrieval pipeline.
Is RAGFlow the right choice for PDF-heavy work?
Often yes. RAGFlow's table extraction and PDF parsing are stronger than LangChain's defaults. The 48.5k stars are concentrated in document-heavy verticals (legal, healthcare, financial reporting). If your corpus is mostly clean prose, LangChain or LlamaIndex are easier. If it's PDFs with embedded tables, charts, and inconsistent formatting, RAGFlow's parser is worth the framework lock-in.
