RAG Inside a Claude Code Project (as a Skill)
Anthropic owns 'contextual retrieval' definitionally. RuleSell owns RAG-as-a-Claude-Code-skill pragmatically. Here is the one-click skill bundle pattern, with the indexing job and the install.
Anthropic published "Introducing Contextual Retrieval" in September 2024 and owns that phrase definitionally. Search "contextual retrieval Anthropic" and the top result will always be theirs. But there is a different question they did not answer: how do I add RAG to my existing Claude Code project without leaving Claude Code? Not the chunking technique. The packaging. The "where does the skill live, what does the SKILL.md say, how does it trigger" question.

The answer is a Claude Code skill. We have shipped this pattern in five RuleSell projects and the install cost is roughly one engineer-day. Here is the entire pattern, including the trade-offs you will hit.
The skill vs MCP vs CLAUDE.md decision
There are three places retrieval logic can live in a Claude Code project. The decision is load-bearing because it determines how much context is spent before the user even asks a question.

| Location | When to use | Context cost |
|---|---|---|
| CLAUDE.md instructions ("retrieve from docs/ when answering") | Project under 50 files, all docs are short | Always loaded — high, every turn |
| Claude Code skill | Project-local corpus, one team, one tool | Triggered — zero until invoked |
| MCP server | Multi-agent (Cursor + Codex + Claude Code), shared corpus, remote retrieval | Tool schema only loaded on call (Claude Code 2026+) |

If the answer is "we are a small team, the docs live in this repo, and only Claude Code reads them," the skill wins. Always-on CLAUDE.md instructions for retrieval are the anti-pattern HumanLayer warns about — they bloat context and reduce the model's ability to recall what it actually needs ("context rot," per Anthropic's engineering blog).

What the skill looks like
The minimum viable skill is two files:

```
.claude/skills/rag/
  SKILL.md
  scripts/
    retrieve.py
```
SKILL.md
```markdown
---
name: rag
description: |
  Retrieve relevant passages from this project's documentation when the user
  asks any question that is documented in /docs, /architecture, or any .md
  file in the repo. Use whenever the user's question references a concept,
  command, or design decision that might be documented. Returns 5 chunks with
  source paths for citation.
---

# RAG Skill

When triggered, run:

    python .claude/skills/rag/scripts/retrieve.py "<user question>"

The script returns JSON with shape:

    {"chunks": [{"text": "...", "source_path": "...", "score": 0.83}, ...]}

Cite each chunk inline using the source_path. If no chunk scores above 0.5,
say so explicitly — do not make up content not present in the chunks.
```
The `description` field is the most important text in this file. Claude Code uses it to decide when to trigger the skill. Two failure modes to avoid:
- Too vague. "Search the docs" triggers on every question and bloats every turn. The description above is specific — it names the directories and the chunk count.
- Too narrow. "Search docs/architecture.md" only triggers when the user names that exact file. Wider is usually better, with named directories instead of named files.
scripts/retrieve.py
```python
import sys, json

from qdrant_client import QdrantClient
from voyageai import Client as Voyage

query = sys.argv[1]

# Embed the query with the same model (and default 1024 dims) used at index time
voyage = Voyage()
qdrant = QdrantClient(host="localhost", port=6333)
embedding = voyage.embed([query], model="voyage-3-large").embeddings[0]

hits = qdrant.search(
    collection_name="project_docs",
    query_vector=embedding,
    limit=5,
    with_payload=True,
)

# Emit JSON in the shape SKILL.md promises, so the agent can cite source_path
print(json.dumps({
    "chunks": [
        {
            "text": h.payload["text"],
            "source_path": h.payload["source_path"],
            "score": h.score,
        }
        for h in hits
    ]
}))
```
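Before wiring it into the skill, it is worth a quick sanity check outside Claude Code, doing the same shell-out the skill performs. A minimal smoke test; the query string is just an example:

```python
# Quick local check that retrieve.py runs and returns well-formed JSON.
# The query string is an arbitrary example, not a required phrase.
import json
import subprocess

out = subprocess.run(
    ["python", ".claude/skills/rag/scripts/retrieve.py", "how does auth work"],
    capture_output=True, text=True, check=True,
)
for chunk in json.loads(out.stdout)["chunks"]:
    print(f"{chunk['score']:.2f}  {chunk['source_path']}")
```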
Note what is not in the skill: the indexing logic. That belongs in a separate `scripts/index.py` triggered by CI on doc changes. Mixing them is a common mistake — the skill becomes slow because every call re-indexes, and the indexer never runs because nobody remembered to schedule it.
The indexing job (separate)
```python
# scripts/index.py — run in CI on push to main
import glob
import hashlib

import voyageai
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

voyage = voyageai.Client()
qdrant = QdrantClient(host="localhost", port=6333)

# Rebuild the collection from scratch on each run
qdrant.recreate_collection(
    "project_docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

def chunk_recursive(text: str, size: int = 512) -> list[str]:
    """Recursive character splitter, ~512 tokens.
    Beats semantic chunking in the Vecta Feb 2026 benchmark (69% vs 54%)."""
    ...

for path in glob.glob("docs/**/*.md", recursive=True):
    text = open(path).read()
    chunks = chunk_recursive(text)
    embeddings = voyage.embed(
        chunks, model="voyage-3-large", output_dimension=1024
    ).embeddings
    qdrant.upsert(
        "project_docs",
        points=[
            PointStruct(
                # Stable-ish ID derived from path + chunk prefix
                id=int(hashlib.sha256((path + c[:50]).encode()).hexdigest()[:8], 16),
                vector=e,
                payload={"text": c, "source_path": path},
            )
            for c, e in zip(chunks, embeddings)
        ],
    )
```
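The `chunk_recursive` body is elided above. For reference, here is a minimal sketch of one way to write it, assuming a crude four-characters-per-token heuristic in place of a real tokenizer:

```python
# Minimal sketch of a recursive character splitter. Assumes ~4 chars per
# token; a production version would measure with a real tokenizer.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def chunk_recursive(text: str, size: int = 512) -> list[str]:
    max_chars = size * 4  # crude chars-per-token estimate
    if len(text) <= max_chars:
        return [text]
    for sep in SEPARATORS:
        if sep in text:
            chunks, current = [], ""
            for piece in text.split(sep):
                candidate = current + sep + piece if current else piece
                if len(candidate) > max_chars and current:
                    chunks.append(current)
                    current = piece
                else:
                    current = candidate
            if current:
                chunks.append(current)
            # Recurse into any chunk that is still too large
            return [c2 for c in chunks for c2 in chunk_recursive(c, size)]
    # No separator found: hard split
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```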
Three opinionated choices in the indexing script:
- Recursive 512-token chunking — Vecta's February 2026 benchmark across 50 academic papers found recursive beat semantic chunking 69% to 54%. Most "advanced RAG" tutorials lead with semantic; the empirical evidence says don't.
- Voyage-3-large at 1024 dimensions — beats OpenAI text-embedding-3-large by 10.58% on Voyage's own benchmark, at 1/24 the storage cost when using int8 at 512 dims (Voyage AI, January 2025). The 1024-dim float setting in this script is the cost-quality balance; drop to 512-dim int8 if storage matters more than the last ~3% recall (a sketch of the 512-dim option follows this list).
- Qdrant at localhost — fine for solo dev. For a team, point at a managed Qdrant Cloud collection or self-host on Hetzner ($30/mo runs 10M+ vectors comfortably; cluster A4 §12 numbers).
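One way to take that cheaper setting: request 512-dim vectors from Voyage and let Qdrant scalar-quantize them to int8 at the collection level. This is a sketch under those assumptions, not the exact int8 output mode Voyage benchmarks, but it approximates the same storage profile:

```python
# Sketch: 512-dim collection with Qdrant-side int8 scalar quantization.
import voyageai
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, ScalarQuantization, ScalarQuantizationConfig, ScalarType, VectorParams,
)

voyage = voyageai.Client()
qdrant = QdrantClient(host="localhost", port=6333)

qdrant.recreate_collection(
    "project_docs",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True)
    ),
)

# voyage-3-large supports truncating its output to 512 dims
embeddings = voyage.embed(
    ["example chunk"], model="voyage-3-large", output_dimension=512
).embeddings
```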
The one-click skill bundle pitch
The pattern above is reproducible but not painless. The first-time setup hits four sharp edges:
- Voyage API key vs OpenAI API key — which to start with, when to switch.
- Qdrant vs pgvector vs Chroma — pgvector is enough for 70% of teams (cluster A4 surprise finding; Supabase HNSW benchmark beats Qdrant at 1M scale).
- Chunk size tuning — 512 is the median; legal docs want 256, code wants 128, narrative docs want 1024 (see the sketch after this list).
- The skill description routing — vague descriptions trigger constantly, narrow ones trigger never.
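For the chunk-size edge, a cheap way to encode those defaults is a per-directory map the indexer consults. The directory names below are hypothetical placeholders for your own layout:

```python
# Hypothetical per-corpus chunk sizes (directory names are placeholders)
CHUNK_TOKENS = {
    "docs/legal/": 256,    # dense, clause-level retrieval
    "src/": 128,           # code wants small chunks
    "docs/guides/": 1024,  # narrative docs tolerate long chunks
}
DEFAULT_CHUNK_TOKENS = 512

def chunk_size_for(path: str) -> int:
    for prefix, size in CHUNK_TOKENS.items():
        if path.startswith(prefix):
            return size
    return DEFAULT_CHUNK_TOKENS

# In index.py: chunks = chunk_recursive(text, size=chunk_size_for(path))
```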
The RuleSell skill bundle compresses these into an `npx skills add rulesell/rag-claude-code` install that:
- Drops the skill folder into `.claude/skills/rag/`
- Writes a `scripts/index.py` parameterized for your stack (pgvector or Qdrant)
- Adds a `.github/workflows/reindex.yml` that runs the indexer on doc-changed PRs
- Sets a `description` field with battle-tested trigger phrases
You can build all of this by hand from this page. The bundle exists because most teams would rather skip the four sharp edges.
Where this fails
1. Cross-language code corpora. Voyage-code-3 handles 12 programming languages well; outside that set (Verilog, Solidity, Crystal), recall drops below 60%. Switch to a code-specific model — BGE-M3 self-hosted is the cheapest fallback at $5-20/M docs.
2. Versioned docs. A skill that retrieves from docs/v1/ and docs/v2/ without metadata filtering will mix versions in citations and confuse the user. Add a version payload field at index time and a version parameter to the retrieve script (see the sketch after this list).
3. Private repos with sub-modules. The indexer walks the local filesystem; sub-modules either need their own skill or need to be ingested into the same corpus with source_repo metadata. Pick one before indexing — switching later requires reindexing.
4. The skill triggering inside autoresearch loops. If you run autoresearch (cluster A6) over the same project, the RAG skill will trigger inside the autoresearch loop and consume budget faster than expected. Set --budget on autoresearch and watch the per-call cost.
5. "It says nothing relevant." Below score 0.5 in the snippet above, the skill returns chunks anyway. Tune the threshold — if your eval corpus shows 0.6 is the cliff, set it there and return an empty list otherwise. Forcing the agent to answer from nothing is worse than letting it say "no relevant docs."
Composing with other skills
This skill is designed to play well with the rest of a typical Claude Code setup:
- With `/init` — `/init` generates the CLAUDE.md; the RAG skill lives in `.claude/skills/rag/` and does not need to be mentioned in CLAUDE.md at all. Skills auto-discover from the folder.
- With autoresearch — autoresearch's investigation loop will call this skill when investigating a question with project-doc relevance. Tighten the description if you only want it triggered on direct user questions.
- With Superpowers — Superpowers skills handle methodology (brainstorm → plan → implement). The RAG skill is a tactical retrieval primitive that any of those phases can call.
- With AGENTS.md — if you also support Cursor or Codex on the project, prefer the MCP server pattern (see /topic/rag-mcp-server). The skill pattern is Claude-Code-only.
What to read next
- /topic/rag-mcp-server — the alternative packaging, when you need multi-agent support
- /topic/embeddings — which model to use and why Voyage shows up everywhere
- /topic/vector-databases — pgvector vs Qdrant vs Pinecone decision tree
- /topic/claude-md — why this logic does NOT belong in CLAUDE.md
- /for/codebase-rag — same pattern, applied to indexing code instead of docs
- /for/documentation-sites — when your docs site is the corpus
Sources
- Anthropic. "Introducing Contextual Retrieval". The 67%-improvement claim, the prompt-caching cost ($1.02 per 1M document tokens), and the chunking-context technique.
- Voyage AI. "voyage-3-large". Source for the 10.58% beat over OpenAI text-embedding-3-large at 1/24 storage cost.
- PremAI. "RAG chunking strategies: the 2026 benchmark guide". The Vecta benchmark showing recursive 512-token beats semantic chunking 69%/54%.
- HumanLayer. CLAUDE.md "<60 lines" standard, referenced across Claude Code engineering posts; rationale is "frontier models handle ~150-200 instructions reliably."
- Eugene Yan. "LLM Patterns". The hybrid-retrieval-beats-embeddings finding, and Anthropic engineering-team-internal recommendations.
- Hacker News. "Production RAG at 5M+ documents" (id 45645349). The "73% of RAG failures are retrieval-side" practitioner consensus.
Related GitHub projects
claude-code
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands.
⭐ 122,880
everything-claude-code
The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
⭐ 180,405
Frequently asked
- What does 'RAG as a Claude Code skill' actually mean?
- It means packaging the retrieval pipeline — chunker, embedder, vector store client, citation formatter — as a Claude Code skill that triggers when the user asks a question matching the skill's description. The skill lives at `.claude/skills/rag/SKILL.md` plus a small `scripts/retrieve.py` it shells out to. Unlike an always-on rule in CLAUDE.md, the skill only loads its context when the trigger phrases hit, which keeps your CLAUDE.md small.
- Why a skill instead of an MCP server?
- Both work. Pick the skill when the corpus is project-local (lives in this repo, indexed against this repo's docs), when you want zero dependencies beyond Claude Code, and when one developer or a small team is using it. Pick the MCP server when you want the same retrieval pipeline shared across Cursor, Codex, and Claude Code, or when the corpus is large enough to need a separate indexing service. See /topic/rag-mcp-server for the alternative pattern.
- How is this different from Anthropic's Contextual Retrieval?
- Contextual Retrieval is a chunking technique — prepend 50-100 tokens of chunk-specific context before embedding (Anthropic, September 2024). This page describes a *packaging* pattern — how to wrap whatever chunking technique you choose, including Contextual Retrieval, as a Claude Code skill. They are complementary. You can absolutely use Contextual Retrieval as the indexing step inside the skill.
- Will this skill work with /init or the autoresearch skill?
- Yes. Skills compose. A typical setup: `/init` creates the project's CLAUDE.md, the autoresearch skill handles overnight investigation, and the RAG skill answers documentation lookups. They never trigger simultaneously because their description phrases do not overlap. The trick is being precise in the SKILL.md `description` field — Claude Code routes based on it.
- What is the install cost for a team?
- Roughly one engineer-day. ~30 minutes to write the SKILL.md, ~2 hours to write the indexing script, ~1 hour to integrate the embedder and vector DB clients, the rest is testing and tuning chunk size + top_k. The published RuleSell skill bundles compress this to a 'one-click' install — drop in the skill folder, set two env vars, run the indexer.