LLM evals: the Hamel process encoded as rulesets (2026)
Hamel Husain's eval process: 60-80% of dev time on error analysis, custom annotation tools, binary judges, review 100 traces. Here's how to encode that as a tool-agnostic ruleset that survives the next acquisition.
Most teams treat evals as a tool selection problem. They ask "should we use LangSmith or Langfuse?" and lose six weeks comparing dashboards. Then OpenAI acquires Promptfoo and ClickHouse acquires Langfuse in the same two-month window, and the question turns out to have been the wrong one. The right question is what process produces working evals, independent of which tool you happen to run it through. Hamel Husain has answered that question more clearly than anyone in the space. This page walks through his framework, names the exact disciplines that fail when teams skip them, and shows how to encode the whole thing as a ruleset that survives the next acquisition.
The premise: evals are error analysis
The single most important sentence in Hamel's evals FAQ is this:
60-80% of dev time on error analysis.
Read it twice. The implication is that the eval tooling — the dashboards, the YAML configs, the judge frameworks — accounts for at most 20-40% of the work. The bulk is humans reading traces, finding patterns, and naming failure modes. Teams that invert this ratio (heavy on tools, light on annotation) ship evals that pass while their app gets worse. The companion claim, from the same FAQ:
Review at least 100 traces.
And, on the question of what a healthy eval suite looks like:
If you're passing 100% of your evals, you're likely not challenging your system enough — 70% pass rate might indicate a more meaningful evaluation.
Each of these is the opposite of how vendor blog posts pitch the work. Vendors sell tools; tools are easier to ship than discipline.
The process, step by step
The actual sequence — not all at once, and not in any vendor's quickstart:
1. Instrument production traces. Every LLM call, every tool call, every retrieval, with the prompt, the response, the model+version, the user input, the relevant business context. Tracing is the prerequisite for everything else. Phoenix, Langfuse, Helicone, Logfire — pick one based on stack fit, not on which has the prettiest dashboard. See /topic/llm-observability.
2. Sample roughly 100 traces. Stratified by surface area: different user types, different prompt variants, different tools called. The point is texture, not statistical power. If 100 feels like too many, your traces are too repetitive — sample wider.
3. Hand-annotate failure modes in a custom tool inside your app. This is the step every team wants to skip. The reason to build it inside your app: the annotator sees the same UI, the same customer metadata, the same conversation history as the production user did. Generic playgrounds force a context shift that quietly degrades annotation quality. Hamel's claim: teams with custom annotation tools iterate roughly 10× faster.
4. Build the error taxonomy. After 100 traces you'll have 5-12 recurring categories: hallucinated tool arguments, retrieval-misses-the-fact, refused-when-shouldn't, formatted-wrong, leaked-system-prompt, etc. Name them. The taxonomy is the artifact that survives every tool migration.
5. Iterate on the system before writing judges. Once you know your top three failure modes, the right next move is usually a prompt change, a retrieval fix, a tool-definition tightening, or a model swap — not an automated judge. Judges encode the error taxonomy after you've already made the obvious fixes.
6. Write binary judges for each taxonomy category. Pass/fail, single sentence per judge prompt, calibrated against your hand-annotations until the judge agrees with the human at least 85% of the time. Pairwise comparison ("is response A better than B?") is the other valid format for ranking work. Likert scales are noise. A calibration sketch follows this list.
7. Run the judge suite in CI. Per-PR for prompt changes, nightly against a sampled production trace stream. Alert when any judge crosses a regression threshold.
8. Sample 5-10 new traces every week and re-annotate. Categories drift. New failure modes emerge as users get bolder. Without a continuous annotation pass the taxonomy goes stale and the judges silently miss new failures.
Steps 1, 2, 6, and 7 are tool work. Steps 3, 4, 5, and 8 are discipline. The discipline is where the value is.
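To make step 6 concrete, here is a minimal sketch of calibrating one binary judge against the hand-annotated calibration set. The judge prompt wording, the AnnotatedTrace shape, and the call_judge function are illustrative assumptions rather than anything published in Hamel's materials; the only elements taken from the process above are the binary pass/fail format and the 85% agreement bar.

```python
# Step 6 in miniature: one binary judge per taxonomy category, calibrated against
# hand-annotated traces before it is trusted in CI. `call_judge` stands in for
# whatever client you actually use (OpenAI, Anthropic, a local model); it is not a real API.
from dataclasses import dataclass
from typing import Callable

JUDGE_PROMPT = """You are checking one failure mode only: hallucinated tool arguments.
Answer PASS if every tool call in the trace uses only arguments grounded in the user
input or in earlier tool results. Otherwise answer FAIL. Reply with a single word.

{trace}"""

@dataclass
class AnnotatedTrace:
    trace_id: str
    text: str
    human_pass: bool  # the label from the hand-annotation pass (steps 3-4)

def judge_passes(trace: AnnotatedTrace, call_judge: Callable[[str], str]) -> bool:
    """Run the binary judge on one trace and map its answer to pass/fail."""
    answer = call_judge(JUDGE_PROMPT.format(trace=trace.text)).strip().upper()
    return answer.startswith("PASS")

def calibration_agreement(traces: list[AnnotatedTrace], call_judge: Callable[[str], str]) -> float:
    """Fraction of annotated traces where the judge agrees with the human label."""
    hits = sum(judge_passes(t, call_judge) == t.human_pass for t in traces)
    return hits / len(traces)

# Gate before the judge is allowed into CI (the 85% bar from step 6):
# assert calibration_agreement(calibration_set, call_judge) >= 0.85
```

The binary format is what makes the gate cheap: agreement is a single fraction, so "is this judge trustworthy yet?" is a one-line check rather than a correlation analysis.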
What goes wrong when teams skip steps
The recurring failure patterns, named:
- Skipping the custom annotation tool. Teams use a generic trace viewer, annotation quality collapses within two weeks, the taxonomy never converges, judges get written against a fuzzy mental model, and the eval suite passes everything.
- Skipping the human pass and going straight to LLM-as-judge. The judge hallucinates a coherent-sounding rubric that doesn't reflect actual failure modes. Pass rates look fine. The app gets worse and nobody knows why.
- Using Likert scales because they "give more information." They don't. They give the illusion of information. Judges drift across model versions in ways that look like model regression but are actually rubric drift. Binary judges fail cleanly when they fail.
- Reviewing 10 traces instead of 100. Patterns don't emerge at 10. You see anecdotes, not categories. The taxonomy under-fits and the judges miss whole classes of failure.
- Treating the eval suite as set-and-forget. Annotation must continue weekly. New failure modes show up as the user base broadens; the suite must broaden with them.
Encoding the process as a ruleset
The shape of a process-encoded ruleset (a sketch in code follows the list):
- Trace instrumentation policy. Which calls get traced, what metadata, what retention. Tool-agnostic — same policy whether you're running Phoenix or Langfuse.
- Sampling strategy. How traces are selected for annotation. Stratification rules. Re-sampling cadence.
- Annotation template. The exact columns, the exact category-naming convention, the calibration set of 20 traces that every new annotator must label before contributing to the live set.
- Error taxonomy. The named categories, with definitions and example traces for each.
- Judge prompts. One file per category. Versioned. Each with the calibration accuracy against the human annotations.
- CI configuration. The thresholds, the trace stream, the alert routing.
- Re-annotation cadence. How often the team revisits the taxonomy.
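As a deliberately minimal sketch, the same components can live in one versioned manifest that any runner reads. The field names, the RulesetManifest type, and the file layout it implies are assumptions made for illustration, not a published schema; the components themselves are the seven listed above.

```python
# Illustrative shape of a versioned ruleset manifest. The structure is an assumption
# for this sketch; the components mirror the list above.
from dataclasses import dataclass, field

@dataclass
class JudgeSpec:
    category: str                 # taxonomy name, e.g. "hallucinated-tool-args"
    prompt_path: str              # one judge prompt file per category, versioned
    calibration_accuracy: float   # agreement with human annotations (target >= 0.85)
    regression_threshold: float   # CI alerts when the pass rate drops below this

@dataclass
class RulesetManifest:
    version: str
    trace_policy: dict[str, str]        # which calls get traced, metadata, retention
    sampling: dict[str, str]            # stratification rules, re-sampling cadence
    annotation_template: str            # path to template + 20-trace calibration set
    taxonomy: list[str]                 # named failure categories, defined elsewhere
    judges: list[JudgeSpec] = field(default_factory=list)
    reannotation_cadence_days: int = 7  # the weekly re-annotation pass

# Swapping Promptfoo for Langfuse (or for a homegrown runner) then means re-pointing
# the runner at this manifest, not rewriting the taxonomy, prompts, or thresholds.
```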
The acquisition lens
Two of the three biggest open-source eval tools changed hands in early 2026. Langfuse to ClickHouse in January. Promptfoo to OpenAI in March. The full story is in /topic/promptfoo-after-openai. The implication for evals strategy is simple: the choice of runner is decreasingly stable, so the strategy has to live a layer above it. That's what the Hamel process is — a layer above any specific tool. Shreya Shankar, on X:
AI evals curricula should be tool-agnostic. It is better to learn the processes, because then you can (i) evaluate any tool and (ii) build your own.
The ruleset is the artifact that makes that real.
The numeric case for evals discipline
Galileo's State-of-AI-Evaluation report, summarized: 84.9% of teams that ship production AI features hit a meaningful incident within 6 months. Teams that allocate 40%+ of dev time to evals score 26.7 points higher on a reliability composite. That's a vendor PDF, so treat it as directional — but the direction is consistent with what every honest practitioner says about the field. The cost-of-LLM-as-judge math, also rarely cited: fine-tuned small-model judges (Galileo Luna-2 reference) run roughly $175/month for 1M judge queries. The same workload on a frontier-model judge runs $6,000-16,000/month depending on the metric. If you're at that volume, the right step in the process is "distill the judge" — once the prompt and the rubric stabilize, you fine-tune a 3-8B model on the judgments and cut cost by 36-95×.
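The 36-95× figure is just the ratio of the published monthly prices. A quick check, using the per-metric numbers from the Galileo report (also cited in the FAQ below), makes the bounds explicit:

```python
# Sanity check on the 36-95x claim, using the per-metric monthly prices from the
# Galileo report, at roughly 1M judge queries/month.
LUNA2 = 175  # fine-tuned small-model judge, $/month
FRONTIER = {"GPT-3.5": 6_248, "RAGAS Faithfulness": 7_994, "TruLens Groundedness": 16_641}

for metric, cost in FRONTIER.items():
    print(f"{metric}: {cost / LUNA2:.0f}x the small-judge cost")  # ~36x, ~46x, ~95x
```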
What to read next
- /topic/promptfoo-after-openai — why this process matters now
- /topic/llm-as-judge — the judge-prompt library
- /topic/llm-observability — Phoenix, Langfuse, Helicone, Logfire
- /topic/claude-skill-evals — process applied to Claude Code skills
Sources
- Husain, Hamel. "Evals FAQ" — 60-80% of dev time on error analysis; review 100 traces; 70% pass rate beats 100%.
- Shankar, Shreya. X thread on tool-agnostic eval curricula.
- Yan, Eugene. "LLM evaluators" — pairwise over direct scoring; judge drift over time.
- Galileo. State of AI Evaluation report — 84.9% incident rate, 26.7-point reliability gap, Luna-2 vs frontier-judge cost math.
- Anthropic. "Demystifying evals for AI agents".
- swyx, Latent Space. "In San Francisco, there's more people building agent eval companies than actually building agents."
Related GitHub projects
promptfoo
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.
⭐ 21,181
langfuse
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
⭐ 27,069
Frequently asked
- What is the 'Hamel process' for LLM evals?
- Hamel Husain's framework — published across hamel.dev and the Parlance Labs course on Maven — argues that LLM evaluation is mostly error analysis, not tool selection. The process: instrument production traces, sample ~100 of them, hand-annotate failure modes, build a custom annotation tool inside your app (not in a generic playground), iterate on prompts/retrieval/tool definitions, and only then write automated judges that encode the error taxonomy you discovered. The headline number — 60-80% of dev time on error analysis — comes directly from his evals FAQ.
- Why does Hamel insist on building a custom annotation tool?
- Generic eval playgrounds force you to fit your domain into their schema. Hamel's argument is that teams who built domain-specific annotation interfaces inside their own app iterate roughly 10× faster, because the annotator sees the same context as production users — including domain-specific UI affordances, customer metadata, and the exact prompt that ran. A LangSmith or Langfuse trace viewer is fine for debugging; it's not fine for the systematic annotation pass that produces an error taxonomy.
- Binary judges or 1-5 Likert scoring?
- Binary, almost always. Hamel and Eugene Yan converge here from different angles. Likert scales drift across LLM judge versions, conflate severity with confidence, and give a false sense of granularity. Binary pass/fail forces you to define what 'pass' means precisely enough to encode in a judge prompt, which is the whole point of the exercise. Pairwise comparison ('is A better than B?') is the other valid mode, especially for ranking-style problems.
- Why are rulesets the right encoding for the eval process?
- Because tools get acquired. Promptfoo (OpenAI, March 2026) and Langfuse (ClickHouse, January 2026) both changed hands within the space of eight weeks. The eval process — judge prompts, error taxonomy, sampling strategy, annotation template, regression-detection thresholds — is independent of which runner executes it. Encoding the process as a versioned ruleset means a tool swap is a one-day migration rather than a six-month rewrite. See /topic/promptfoo-after-openai for the consolidation story.
- How many traces do I actually need to review?
- Start with 100. That number is in the evals FAQ and it's a floor, not a target. The point of 100 isn't statistical power — it's giving you enough texture to spot recurring failure modes. Most teams find that 5-8 categories cover 80% of failures after the first 100, and the next 100 mostly confirms the taxonomy. You're done with the annotation pass when new traces stop adding new categories, not when you hit a magic number.
- What about LLM-as-judge cost? Isn't that expensive at scale?
- It is, and the math is published but rarely cited. Galileo's Luna-2 fine-tuned eval model runs roughly $175/month for 1M judge queries. The same volume on GPT-3.5 lands around $6,248/month; RAGAS Faithfulness around $7,994; TruLens Groundedness around $16,641. Fine-tuned small-model judges are 36-95× cheaper than naive frontier-model judges. If your judge cost is a meaningful slice of your bill, the process should include a 'distill the judge' step once the prompt stabilizes.