Evaluating Claude Code skills: trigger precision and output quality (2026)
A Claude Code skill has two failure modes: it never fires when it should, or it fires and produces generic output. Most teams test only one. Here's the eval bundle that covers both.
# Evaluating Claude Code skills: trigger precision and output quality
A Claude Code skill fails in two specific ways, and most teams evaluate only one of them.
Failure A — the skill never fires. Description language doesn't match how users phrase the request. The skill sits in ~/.claude/skills/ doing nothing while Claude reaches for its general behavior instead. Output quality is irrelevant because the output never happens.
Failure B — the skill fires and produces generic output. Triggering works, the skill loads, Claude follows the instructions — and the result reads exactly like what Claude would have written without the skill. Trigger was right; substance was empty.
These are independent failure modes with independent fixes. A skill can have a 95th-percentile trigger description and useless output, or a precise output template and a description so vague Claude never selects it. Treating them as one number is how teams ship skills that score "good" in their eval suite while being invisible in production.
This page covers the two-surface eval bundle: how to test trigger precision, how to test output quality once the trigger fires, and how to encode the whole thing as a ruleset that runs in CI.
## Surface 1: trigger precision
Trigger precision is two numbers, not one.

- Recall — when a user's request is genuinely in-scope for the skill, how often does Claude select it? A skill at 60% recall is missing 40% of the cases it was built to handle.
- Precision — when Claude selects the skill, how often is the request actually in-scope? A skill at 50% precision fires for unrelated requests half the time and burns context on irrelevant instructions.

The target depends on the skill's cost of over-firing. A heavy ruleset skill with a 5,000-token system prompt needs high precision because every false positive costs measurably. A light helper skill that just adds three lines of guidance tolerates lower precision because the cost of an over-fire is small.
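As a sanity check on the arithmetic, here is a minimal sketch of both numbers computed from a finished eval run; the function and argument names are illustrative, not part of any tool.

```typescript
// Minimal helper for the two trigger-precision numbers (names are illustrative).
// inScopeFired: in-scope prompts where the skill was selected (true positives)
// inScopeTotal: total in-scope prompts in the eval set
// otherFired:   adjacent- or out-of-scope prompts where the skill fired (false positives)
function triggerMetrics(inScopeFired: number, inScopeTotal: number, otherFired: number) {
  const recall = inScopeFired / inScopeTotal;
  const precision = inScopeFired / (inScopeFired + otherFired);
  return { recall, precision };
}

// Example: 12 of 15 in-scope prompts fired, plus 2 over-fires elsewhere.
console.log(triggerMetrics(12, 15, 2)); // { recall: 0.8, precision: ~0.857 }
```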
### Building the eval set

Three buckets, 30-50 prompts total (example below):

- In-scope (10-20 prompts). Things a real user would say where the skill should clearly trigger. Vary phrasing — formal, casual, abbreviated, misspelled — because production users do all of these.
- Adjacent-scope (10-15 prompts). Requests that are close to the skill's domain but on the wrong side of the line. The point of this bucket is to surface description overreach. If your skill triggers for 80% of the adjacent bucket, the description is too broad.
- Out-of-scope (10-15 prompts). Completely unrelated requests. The skill must reliably skip these. Failures here mean the description is leaking signal into the matcher.
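For concreteness, here is what the in-scope bucket (prompts.in-scope.json) might hold for a hypothetical release-notes skill; every entry is invented for illustration, and the variation in formality and spelling is deliberate.

```json
[
  "write the release notes for v2.3",
  "draft a changelog entry for the release we just tagged",
  "can u do the relase notes for this sprint",
  "Please prepare formal release notes for version 2.3.0",
  "rel notes for 2.3?"
]
```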
### Running the eval
Anthropic's Skills 2.0 release (March 2026) added a built-in trigger-precision evaluator that runs your prompt set through the actual skill-selection logic. If you're on a recent Claude Code build, use it — it's faster and matches production behavior. If not, the manual version is straightforward: start a clean Claude Code session per prompt, paste the prompt, and record which skill (if any) Claude picks before generating any output. Promptfoo's `assert: javascript` mode works for this if you parse the skill-selection log. DeepEval has no native skill-selection metric but can run the harness with a custom assertion. A 30-line Node script that wraps `claude-code --print-skill-selected` is the simplest implementation and the one we recommend for first runs.
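A minimal sketch of that wrapper script, assuming `claude-code --print-skill-selected` prints the selected skill name (or nothing) for a given prompt, as described above; the skill name and file layout are illustrative.

```typescript
// run-trigger-eval.ts — sketch of the first-run harness described above.
// Assumes `claude-code --print-skill-selected <prompt>` prints the chosen skill
// name (or an empty line) per clean session, per the flag referenced in the text.
import { execFileSync } from "node:child_process";
import { readFileSync } from "node:fs";

const SKILL = "release-notes-writer"; // hypothetical skill under test

const load = (path: string): string[] => JSON.parse(readFileSync(path, "utf8"));
const inScope = load("prompts.in-scope.json");
const adjacent = load("prompts.adjacent-scope.json");
const outOfScope = load("prompts.out-of-scope.json");

// Ask Claude Code which skill it would select for one prompt.
const selectedSkill = (prompt: string): string =>
  execFileSync("claude-code", ["--print-skill-selected", prompt], { encoding: "utf8" }).trim();

const fired = (prompts: string[]) => prompts.filter((p) => selectedSkill(p) === SKILL).length;

const truePositives = fired(inScope);
const falsePositives = fired(adjacent) + fired(outOfScope);
const recall = truePositives / inScope.length;
const precision = truePositives / Math.max(truePositives + falsePositives, 1);

console.log(JSON.stringify({ recall, precision, truePositives, falsePositives }, null, 2));
```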
### Iterating on the description
After the first eval pass, the description is wrong somewhere. The two common patterns:

- Recall low, precision high. Your description is too narrow or uses jargon users don't say. Broaden the trigger language; add 3-5 alternate phrasings.
- Recall high, precision low. Your description leaks signal. Tighten the trigger language; add a "do not trigger when" section that names the adjacent-scope cases explicitly (see the sketch after this list).
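For concreteness, a hypothetical SKILL.md description after one tightening pass, with the adjacent-scope cases named explicitly. The frontmatter fields follow the standard SKILL.md layout; the skill itself and its phrasings are invented.

```markdown
---
name: release-notes-writer
description: >
  Use when the user asks to draft, edit, or review release notes or a
  changelog entry for a tagged release ("write the release notes",
  "draft the changelog for v2.3", "rel notes for 2.3?").
  Do not trigger for commit-message help, README edits, or blog-post
  announcements; those are adjacent but out of scope.
---
```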
## Surface 2: output quality
Once trigger precision is calibrated, the output eval is standard LLM evaluation — but with one specific shape that matters for skills. A skill's output is conditioned on its instructions. The judge prompt should test whether the output reflects those instructions, not whether the output is "good" in some abstract sense. If your skill says "always produce a 3-step plan before any code," the judge checks for the 3-step plan. If the skill says "refuse if the user asks for X," the judge checks for refusal on the X cases. Judges that ignore the skill's specific contract test Claude's baseline, not your skill.
### The output eval set

5-10 prompts that you already verified will trigger the skill (from the recall bucket). Run them through, collect outputs, and judge against a binary rubric per skill instruction. A skill with 4 distinct instructions gets 4 judges, each pass/fail (sketched below). This is where Hamel's process (see /topic/llm-evals) applies: 100 traces, custom annotation, binary judges, error taxonomy. The Claude Code skill case is a constrained version of the general LLM-eval case, with the skill instructions playing the role of the error taxonomy.
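A minimal sketch of per-instruction binary judges, using the two example instructions from above (the 3-step plan and the refusal on X). callJudgeModel is a stand-in for whatever judge client you actually use, and the rubric wording is illustrative.

```typescript
// One binary judge per skill instruction; each verdict is strictly pass/fail.
type JudgeResult = { instruction: string; pass: boolean };

// Stand-in for your LLM-judge client (Anthropic SDK, Promptfoo provider, etc.).
async function callJudgeModel(prompt: string): Promise<string> {
  throw new Error("wire this to your judge model; it should return 'PASS' or 'FAIL'");
}

// Rubrics mirror the skill's stated instructions, not abstract quality.
const rubrics = [
  {
    instruction: "3-step plan before any code",
    rubric:
      "Does the response contain a numbered 3-step plan before the first code block? Reply PASS or FAIL only.",
  },
  {
    instruction: "refuse requests for X",
    rubric:
      "If the user asked for X, does the response refuse? Reply PASS or FAIL only. If the user did not ask for X, reply PASS.",
  },
];

export async function judgeOutput(userPrompt: string, skillOutput: string): Promise<JudgeResult[]> {
  const results: JudgeResult[] = [];
  for (const { instruction, rubric } of rubrics) {
    const verdict = await callJudgeModel(
      `${rubric}\n\nUser request:\n${userPrompt}\n\nResponse:\n${skillOutput}`
    );
    results.push({ instruction, pass: verdict.trim().toUpperCase().startsWith("PASS") });
  }
  return results;
}
```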
## Putting it together as an eval bundle

The artifact that survives across tools is a ruleset that packages:

- `prompts.in-scope.json` — the recall bucket
- `prompts.adjacent-scope.json` — the precision-tightening bucket
- `prompts.out-of-scope.json` — the negative-control bucket
- `trigger-eval.config.yaml` — the runner config (Promptfoo, DeepEval, or custom)
- `output-judges/*.md` — one judge per skill instruction
- `output-prompts.json` — the 5-10 representative outputs
- `thresholds.yaml` — the targets (recall ≥ 0.85, precision ≥ 0.90, each output judge ≥ 0.90)
- `ci.yaml` — the GitHub Actions / equivalent config to run this on every change to the SKILL.md (a minimal gate script is sketched after this list)
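A minimal sketch of the gate that ci.yaml would invoke once the runners finish, assuming they write their combined metrics to a results.json file; that layout, like the exit-code convention, is an assumption rather than any tool's native output. The hard-coded numbers mirror thresholds.yaml.

```typescript
// check-thresholds.ts — sketch of the CI gate; assumes the eval runners wrote
// their metrics to results.json (an assumed layout, not any tool's native output).
import { readFileSync } from "node:fs";

const thresholds = { recall: 0.85, precision: 0.9, outputJudge: 0.9 }; // mirrors thresholds.yaml

type Results = {
  recall: number;
  precision: number;
  outputJudges: Record<string, number>; // pass rate per skill instruction
};

const results: Results = JSON.parse(readFileSync("results.json", "utf8"));

const failures: string[] = [];
if (results.recall < thresholds.recall) failures.push(`recall ${results.recall} < ${thresholds.recall}`);
if (results.precision < thresholds.precision) failures.push(`precision ${results.precision} < ${thresholds.precision}`);
for (const [judge, rate] of Object.entries(results.outputJudges)) {
  if (rate < thresholds.outputJudge) failures.push(`judge "${judge}" ${rate} < ${thresholds.outputJudge}`);
}

if (failures.length > 0) {
  console.error("Eval bundle failed:\n" + failures.join("\n"));
  process.exit(1); // fail the CI job on any missed threshold
}
console.log("All thresholds met.");
```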
## What goes wrong without this
Three failure patterns recur in skills that ship without two-surface evaluation:

- **The "looks great in testing" skill.** Author tests with 5 prompts they wrote themselves while writing the skill. All 5 trigger. All 5 produce reasonable output. Shipped. In production, users say things in 50 different ways and trigger precision is actually 30%. The skill is invisible.
- **The "fires for everything" skill.** Description was written aspirationally rather than descriptively. Claude picks it for nearly any request that mentions a related word. Context fills up with irrelevant instructions. The skill ranks as "high coverage" in dashboards while degrading every conversation.
- **The "good trigger, generic output" skill.** Trigger language was iterated on carefully; output template is one line of "follow best practices." Claude fires the skill reliably and then writes what it would have written anyway. The metric (trigger rate) goes up; the actual value (output quality) is zero.

Each of these is caught by the right surface of the bundle. None is caught by a single combined metric.

## What to read next
- /topic/llm-evals — the general process this is a specialization of
- /topic/claude-md — the layer above skills
- /topic/skill-not-triggering — debugging recall failures specifically
- /topic/promptfoo-after-openai — why tool-portability matters
## Sources
- Anthropic. "Claude Code Skills 2.0 release notes" — built-in trigger-precision evaluator, March 2026.
- Husain, Hamel. "Evals FAQ" — the general process this specializes.
- Anthropic. "Demystifying evals for AI agents" — the trajectory-vs-outcome split that informs the two-surface model.
- Promptfoo repository — `assert: javascript` mode for custom skill-selection assertions.
- Shankar, Shreya. X on tool-agnostic eval curricula.
## Related GitHub projects
claude-code
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands.
⭐ 122,880
everything-claude-code
The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
⭐ 180,405
promptfoo
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.
⭐ 21,181
## Frequently asked
- Why do Claude Code skills need a separate eval discipline?
- Because they have two surfaces that can fail independently. A skill can have perfect output and still be useless if Claude never picks it up — the trigger description didn't match how users actually phrase requests. A skill can also fire reliably and produce generic output that wastes context. Standard LLM evals test the output. Skill evals have to test the matching step first, then the output conditional on matching. Anthropic's March 2026 Skills 2.0 release ships a trigger-precision evaluator natively; most teams still aren't using it.
- What is 'trigger precision'?
- The rate at which Claude correctly selects your skill when the user's request is in-scope, and correctly skips it when out-of-scope. Two numbers, not one: recall (did Claude pick it when it should have?) and precision (did Claude avoid picking it when it shouldn't have?). A skill at 90% recall and 50% precision fires constantly and pollutes context. A skill at 50% recall and 95% precision is invisible. The target is roughly 85/90 for most skills; trigger-only skills (like project-init helpers) tolerate lower precision because the cost of an over-fire is small.
- How do I build a trigger-precision eval bundle?
- Start with 30-50 example user prompts split into three buckets: in-scope (should trigger), adjacent-scope (might reasonably trigger), and out-of-scope (must not trigger). Run Claude Code in a controlled session and record which skill it picks for each prompt. Calculate recall on the in-scope bucket and precision across all three. Iterate on the SKILL.md description until both numbers hit target. Promptfoo, DeepEval, or a 30-line custom runner all work — the discipline is what matters.
- What about evaluating output quality once the skill triggers?
- Standard LLM eval applies: pick 5-10 representative user prompts that should trigger the skill, run them through, and judge the outputs against the skill's stated purpose. Binary judges work fine here — did the output do the thing the skill is supposed to do, yes or no. The trap is conflating output evals with trigger evals. Test them separately; the failures look completely different.
- Is there a published reference outside Anthropic's docs?
- Anthropic's docs cover the API for trigger configuration, but practitioner-facing pieces on the two-surface eval pattern are still rare in May 2026. Most of what exists lives in scattered Reddit/HN threads and our own ruleset bundles. The gap is exactly why this page exists — if you know of a good reference we missed, mail the maintainers and we'll link it.
- Does the same approach work for MCP servers?
- Partially. MCP servers have a similar two-surface problem (tool-selection precision plus tool-output quality), but the eval tooling is much weaker — Promptfoo never handled MCP transport or tool-schema validation, see HN 47412524. The MCP eval landscape is its own page; see /topic/mcp-eval for the current state of MCPSpec, MCPjam, mcpbr, and agent-vcr.