RAG-as-MCP-Server: The Missing Pattern (2026)

Exposing a retrieval pipeline as an MCP server is the cleanest way to share RAG across Claude Code, Cursor, Codex, and any other MCP-capable agent. Here is the pattern, with code.

If you have a RAG pipeline today, it lives inside one tool, probably a Python script that LangChain or LlamaIndex glues together. The embedder, the vector store, the reranker, the chunking strategy — all of it is local to that one process. The moment a teammate wants the same retrieval inside Cursor, or you want to call it from a CI agent, or you want Claude Code to use it without re-indexing, the integration falls apart.

The fix is structural, not incremental. Expose retrieval as an MCP server. One contract, one index, every agent.

The SERP for "RAG MCP server" is wide open as of May 2026 — TheNewStack's April post is the only editorial slot in the top 10, surrounded by raw GitHub repos. The pattern is real, it works, and almost nobody has written it down properly.

Why this pattern exists

Three forces converged in early 2026:
  1. Agent diversification. Most teams now run 2-3 coding agents: Claude Code in the terminal, Cursor in the IDE, sometimes Codex in CI. The "one agent gets RAG, the others don't" problem is universal. (See cluster A6 on AGENTS.md vs CLAUDE.md — same root cause: tool portability.)
  2. MCP became the spec. Model Context Protocol is described by its authors as "USB-C for AI" — a uniform way for agents to call external tools. As of May 2026, MCP support ships in Claude Code, Cursor, Codex, OpenCode, Cline, and Continue. Build once, integrate everywhere.
  3. Lazy schema loading. Claude Code in particular only loads an MCP tool's schema into context when the tool is actually called (see claude-code issue #20421). That makes MCP-exposed retrieval cheaper than CLAUDE.md-resident retrieval instructions.
The result: a RAG pipeline is no longer a feature of a single agent; it is a service the team's agents share.

What the server looks like

The minimum viable RAG MCP server exposes one tool: retrieve. A more useful one exposes three: retrieve, list_sources, and get_source. Here is a stdio-transport server in roughly 80 lines (Python, using the official mcp SDK).

```python
# rag_server.py
from mcp.server.fastmcp import FastMCP
from qdrant_client import QdrantClient
from voyageai import Client as Voyage

mcp = FastMCP("rag")
qdrant = QdrantClient(host="localhost", port=6333)
voyage = Voyage()  # uses VOYAGE_API_KEY

@mcp.tool()
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Retrieve top_k passages relevant to query, with citations."""
    embedding = voyage.embed([query], model="voyage-3-large").embeddings[0]
    hits = qdrant.search(
        collection_name="docs",
        query_vector=embedding,
        limit=top_k,
        with_payload=True,
    )
    return [
        {
            "text": h.payload["text"],
            "source_path": h.payload["source_path"],
            "score": h.score,
            "chunk_id": h.id,
        }
        for h in hits
    ]

@mcp.tool()
def list_sources() -> list[str]:
    """List all source files indexed in this RAG corpus."""
    # implementation: scroll the qdrant collection, dedupe by source_path
    ...

@mcp.tool()
def get_source(source_path: str) -> str:
    """Fetch the full text of an indexed source file."""
    ...

if __name__ == "__main__":
    mcp.run()
```

Three design choices in that snippet matter:
  • Citations are first-class. Every chunk returns source_path and chunk_id. The agent uses these to cite back to the user; they are not an afterthought. (Cluster A4 §6, HN consensus: "rerankers are the highest value 5 lines of code you'll add" — but only after you fix citation pass-through.)
  • The embedder is named, not magic. Voyage-3-large beats OpenAI text-embedding-3-large by 10.58% at 1/24 the storage cost (Voyage blog, January 2025). Naming the model in the server means you can swap it without touching agent configs.
  • The server is stateless. Each retrieve call hits Qdrant and Voyage; there is no per-agent session. This is what makes CI and parallel agents safe.
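To make the citations-first-class point concrete, here is what a retrieve call hands back to the agent. The field names are the contract; the values below are made up for illustration:

```json
[
  {
    "text": "Chunking is recursive at 512 tokens, falling back to sentence splits...",
    "source_path": "docs/architecture/chunking.md",
    "score": 0.87,
    "chunk_id": "3f9a1c02"
  }
]
```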
To register the server with Claude Code, add it to .mcp.json:

```json
{
  "mcpServers": {
    "rag": {
      "command": "python",
      "args": ["rag_server.py"]
    }
  }
}
```

Cursor uses the same MCP config format (since Cursor 0.45). Codex CLI uses a similar one. One server, three agents.
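For Cursor, the same shape goes in .cursor/mcp.json at the project root (or ~/.cursor/mcp.json globally). One caveat worth showing: the IDE may launch the server from a different working directory, so an absolute path is the safer sketch (the path below is a placeholder):

```json
{
  "mcpServers": {
    "rag": {
      "command": "python",
      "args": ["/absolute/path/to/rag_server.py"]
    }
  }
}
```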

The transport decision (stdio vs Streamable HTTP)

The 2026 MCP spec defines two transports. The choice is load-bearing.

| Transport | When to use | Trade-offs |
|---|---|---|
| stdio | Local-only, single-user, single-machine | Fastest, no auth needed, no network surface. Server lives or dies with the agent. |
| Streamable HTTP | Team-shared, remote retrieval, multi-tenant | Survives across agent restarts. Needs OAuth 2.1 with PKCE per the 2026 roadmap. Adds a network hop. |

For a solo dev running RAG against their own repo: stdio. For a team sharing a corpus index hosted on a VPS: Streamable HTTP. Mixing them — stdio at dev, Streamable HTTP in CI — is fine as long as both are launched from the same server code (see the sketch below).

The .well-known endpoint convention (added to the MCP 2026 roadmap) means Streamable HTTP RAG servers can publish their tool schemas at https://rag.yourcompany.com/.well-known/mcp for automatic agent discovery. As of May 2026, only ~30% of public MCP servers implement this; ship it if you can.
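A minimal sketch of that dual launch, replacing the last two lines of rag_server.py. It assumes the official mcp Python SDK's FastMCP, whose run() accepts a transport argument in recent versions; MCP_TRANSPORT is our own convention here, not part of the spec:

```python
# rag_server.py (bottom): pick the transport at launch time.
import os

if __name__ == "__main__":
    # "stdio" for local single-user dev; "streamable-http" to serve the
    # shared corpus from a VPS or to CI agents.
    transport = os.environ.get("MCP_TRANSPORT", "stdio")
    mcp.run(transport=transport)
```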

Where this pattern fails

We were going to write a "best practices" section. Here are the failure modes instead — the things you will hit if you ship this naively.
  1. Token cost on large top_k. Each chunk is ~512 tokens. retrieve(top_k=20) puts 10K tokens of retrieved context into every agent call that uses the result. Default to top_k=5. Force a reranker before returning if you must return more (sketch after this list).
  2. Indexing drift. The server's index is now a piece of infrastructure. It needs a CI job that re-indexes on doc changes. Without one, the agent will cite stale chunks and you will not notice until a user complains. (HN production-RAG threads, cluster A4 §6: "73% of RAG failures are retrieval-side." Stale index is the largest sub-category.)
  3. Tool overload in the agent. If you already have 12 MCP servers wired up, adding a 13th for RAG bloats the agent's tool-list context. Run an audit before adding. Cluster A2 finds 3-5 MCP servers is the sustainable number for most projects.
  4. Cross-agent prompt-injection surface. Any text in the corpus can contain injected instructions that the agent then follows ("indirect prompt injection," Microsoft Dev blog, 2025). Sanitize at index time — strip HTML comments, code-fence anything that looks like a prompt, never embed user-uploaded content without a per-source allowlist.
  5. The "we built RAG once, it works everywhere" overclaim. Different agents tolerate different chunk shapes. Cursor's auto-complete cares about line-level chunks. Claude Code prefers paragraph chunks. Codex CLI handles both. If you target all three, optimize for paragraph chunks and accept that Cursor will sometimes return a citation that splits mid-thought.
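For failure mode 1, a minimal reranking sketch, assuming Voyage's rerank endpoint (rerank-2 is Voyage's reranker model; the voyage client and retrieve function are the ones from rag_server.py; a Cohere reranker or local cross-encoder slots in the same way):

```python
# Sketch: over-fetch from Qdrant, rerank, return only the best top_k.
# Payload fields (source_path, chunk_id) carry through, so citations
# survive the rerank.
def retrieve_reranked(query: str, top_k: int = 10) -> list[dict]:
    candidates = retrieve(query, top_k=top_k * 3)  # over-fetch candidates
    reranked = voyage.rerank(
        query=query,
        documents=[c["text"] for c in candidates],
        model="rerank-2",
        top_k=top_k,
    )
    # r.index points back into candidates
    return [candidates[r.index] | {"score": r.relevance_score}
            for r in reranked.results]
```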

Indexing pipeline (separate from the server)

A common mistake: putting the indexing logic inside the MCP server itself. Don't. The server should only read the index. Indexing is a separate process — a script you run on doc changes, on a cron, or in CI.
```python
# index.py — run on a schedule or a doc-change webhook
import voyageai
from qdrant_client import QdrantClient

def chunk_recursive(text: str, size: int = 512) -> list[str]:
    """Recursive 512-token chunking — beats semantic chunking in the
    Vecta Feb 2026 benchmark (69% vs 54% accuracy)."""
    ...

def index_corpus(docs_dir: str):
    voyage = voyageai.Client()
    qdrant = QdrantClient(host="localhost", port=6333)
    # walk() and h() are stand-ins: directory traversal and a stable
    # content hash used as the chunk ID, respectively.
    for path in walk(docs_dir):
        text = open(path).read()
        chunks = chunk_recursive(text)
        embeddings = voyage.embed(chunks, model="voyage-3-large").embeddings
        qdrant.upsert(
            collection_name="docs",
            points=[
                {"id": h(c), "vector": e, "payload": {"text": c, "source_path": path}}
                for c, e in zip(chunks, embeddings)
            ],
        )
```

Separation of concerns: indexing runs in a CI job, the MCP server runs alongside the agents. They share only the Qdrant collection.

How agents call this in practice

In Claude Code, after .mcp.json registration:

```
What does the chunking strategy do?
```

Claude Code invokes retrieve(query="chunking strategy", top_k=5) automatically, receives the 5 chunks with source_path, and answers with citations. The user never types "use the RAG server" — the agent picks the tool because the schema description tells it the tool retrieves passages.

The trick is the tool description. Write it for an LLM, not a human:

```python
@mcp.tool()
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Retrieve relevant passages from the project documentation.
    Use this whenever the user asks about anything documented in the
    codebase, architecture, or design decisions. Returns chunks with
    source_path for citation. Default top_k=5; raise to 10 only for
    broad questions."""
```

That second sentence is doing real work. Without it, Claude Code will sometimes try to read raw files instead of calling the retriever.

Sources

  • Model Context Protocol. Official site. The "USB-C for AI" framing and the transport spec.

Frequently asked

What is a RAG MCP server?
A RAG MCP server is a Model Context Protocol server whose primary tool is a retrieval function — typically `retrieve(query, top_k)` returning ranked chunks with citations. The agent (Claude Code, Cursor, Codex, etc.) calls the tool the same way it calls any other MCP tool; the server hides the vector DB, the embedder, the reranker, and the chunking strategy behind a single contract. The pattern matters because it lets one retrieval pipeline serve every agent your team uses, instead of re-implementing RAG inside each one.
Why expose RAG as an MCP server instead of as an inline RAG library?
Three reasons. First, portability — the same retrieval logic works in Claude Code, Cursor, Codex, OpenCode, and any other MCP client without rewriting the integration. Second, lazy schema loading — Claude Code only pays the context cost of the retrieval tool when it actually calls it, which is cheaper than always-on rules. Third, governance — you can centralize index updates, access control, and rate limits on the server side instead of letting every agent maintain its own index.
How does this differ from Anthropic's Contextual Retrieval?
Contextual Retrieval is a chunking technique — it prepends 50-100 token context to each chunk before embedding, claiming a 67% improvement on a specific eval (Anthropic, September 2024). RAG-as-MCP-server is an architectural pattern — how you expose retrieval to an agent. They compose: you can use Contextual Retrieval inside the server's indexing pipeline, then expose the result via MCP. Posts that conflate the two are wrong on both counts.
Do I need TheNewStack's tutorial to build one?
TheNewStack's 'Build a RAG MCP server' post (April 2026) is currently the only editorial walkthrough in the top 10 search results. It is a fine starting point but skips three things this page covers: schema design for citation pass-through, transport choice (stdio vs Streamable HTTP), and how to make the server stateless enough that a CI agent can call it. Use TheNewStack to understand the shape; use this page to ship it.
