Evaluate Your Claude Agent (Evals)
You tweaked a prompt and it feels better — but is it? Without evals (evaluations) you're flying blind: every change is a coin flip, and you find out it broke from an angry user, not a test. Evals turn "vibes" into a number you can trust, defend, and watch over time. This is the single biggest thing that separates hobby prompts from production-grade Claude work.
- Why "it looks good to me" is not a test — and what to measure instead
- Build a golden dataset from REAL failures (bottom-up), not imagined ones
- Score with code where you can, and an LLM-as-judge where you can't
- Wire evals into CI so a prompt or model change can never silently regress
The mindset: measure, don't guess
Three rules that save you:
- Bottom-up beats top-down. Collect actual failures first, then design the metric to catch them. An eval built from real breakage predicts real breakage; an eval invented at a whiteboard mostly measures your imagination.
- A number you can re-run. An eval is repeatable: same inputs → comparable score. That's what lets you compare prompt v1 vs v2, or claude-haiku-4-5 vs claude-sonnet-4-6, honestly.
- Cheap to run, run often. If it takes a human an afternoon, it won't happen. Automate it.
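The "number you can re-run" idea can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed harness: `CASES`, `run_eval`, `prompt_v1`, and `prompt_v2` are hypothetical names, and the two `prompt_*` functions are canned stand-ins for real model calls so the example runs offline.

```python
from typing import Callable

# Hypothetical golden cases: an input plus a substring the answer must contain.
CASES = [
    {"input": "2+2", "must_contain": "4"},
    {"input": "capital of France", "must_contain": "Paris"},
    {"input": "3*3", "must_contain": "9"},
]


def run_eval(generate: Callable[[str], str], cases: list[dict]) -> float:
    """Return the pass rate (0.0 to 1.0) of `generate` over `cases`."""
    passed = sum(1 for case in cases if case["must_contain"] in generate(case["input"]))
    return passed / len(cases)


def prompt_v1(text: str) -> str:
    """Stand-in for 'model + prompt v1' (assumption: canned answers, no API call)."""
    answers = {"2+2": "The answer is 4.", "capital of France": "I think it is Lyon."}
    return answers.get(text, "I don't know.")


def prompt_v2(text: str) -> str:
    """Stand-in for 'model + prompt v2' (assumption: canned answers, no API call)."""
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
    return answers.get(text, "I don't know.")


# Same cases, same scoring function: the two scores are directly comparable.
score_v1 = run_eval(prompt_v1, CASES)
score_v2 = run_eval(prompt_v2, CASES)
print(f"v1: {score_v1:.2f}  v2: {score_v2:.2f}")
```

Swapping a real model call into `prompt_v1` and `prompt_v2` leaves the rest unchanged, which is the point: the dataset and the scorer stay fixed while the thing under test varies.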
Build a golden dataset (bottom-up)
Your golden dataset is the heart of every eval — a curated set of inputs with known-good expectations.
- Start from actual bad outputs: production traces, bug reports, support tickets. These are the cases that matter.
- By hand, write cases covering your most critical and most error-prone scenarios. This is your stable anchor set.
- Add de-identified production samples (strip PII) and synthetic cases for under-represented scenarios. Don't trust aggregate metrics on a tiny set: with 20 cases, a single flipped result moves the pass rate by 5 points.
- Every new production regression becomes a new test case. A golden dataset is alive, not frozen.
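One possible shape for such a dataset is a JSONL file with one case per line. The sketch below assumes that layout; the `GoldenCase` fields (`case_id`, `source`, `tags`) and the helper names are illustrative choices, not a standard format. It works on strings rather than files so it runs anywhere.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GoldenCase:
    case_id: str
    input: str
    expected: str
    source: str  # hypothetical labels, e.g. "production_trace", "bug_report", "handwritten", "synthetic"
    tags: list[str] = field(default_factory=list)


def dump_jsonl(cases: list[GoldenCase]) -> str:
    """Serialize cases as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def load_jsonl(text: str) -> list[GoldenCase]:
    """Parse JSONL text back into GoldenCase objects, skipping blank lines."""
    return [GoldenCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


def add_regression(cases: list[GoldenCase], new_case: GoldenCase) -> list[GoldenCase]:
    """Return a new list with the regression case appended; refuse duplicate IDs."""
    if any(c.case_id == new_case.case_id for c in cases):
        raise ValueError(f"duplicate case_id: {new_case.case_id}")
    return cases + [new_case]


golden = [
    GoldenCase("refund-001", "Can I get a refund after 40 days?", "No, the window is 30 days.",
               source="bug_report", tags=["policy"]),
]
# A new production regression becomes a new case.
golden = add_regression(
    golden,
    GoldenCase("refund-002", "Refund for a gift card?", "Gift cards are non-refundable.",
               source="production_trace", tags=["policy", "regression"]),
)
roundtrip = load_jsonl(dump_jsonl(golden))
```

Recording `source` on each case makes it easy to check later how much of the set comes from real failures versus synthetic fill-in.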
Score: code first, judge second
Reach for the cheapest reliable check first.
- Programmatic (deterministic) checks — use these wherever the answer has structure: exact/keyword match, "valid JSON against this schema", "did it call the right tool with the right args", "under N tokens / under X ms". Fast, free, and never flaky.
- LLM-as-judge — for fuzzy dimensions (helpfulness, tone, faithfulness to a source) that resist code. Give the judge a rubric, not a vibe, and calibrate it against human labels before you trust it.
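The programmatic checks listed above might look like the following sketch. It uses only the standard library; the function names are hypothetical, and the tool-call shape (`{"name": ..., "input": {...}}`) is an assumption that should be adapted to whatever structure your agent framework actually emits.

```python
import json


def check_json_keys(output: str, required: dict[str, type]) -> bool:
    """True if `output` parses as a JSON object with the required keys and value types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def check_keywords(output: str, keywords: list[str]) -> bool:
    """True if every keyword appears in the output (case-insensitive)."""
    lowered = output.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def check_tool_call(call: dict, name: str, args: dict) -> bool:
    """True if the call targets the expected tool and carries the expected argument values."""
    actual_args = call.get("input", {})
    return call.get("name") == name and all(actual_args.get(k) == v for k, v in args.items())


# Example checks against hypothetical model outputs.
ok_json = check_json_keys('{"score": 4, "reason": "matches reference"}', {"score": int, "reason": str})
bad_json = check_json_keys("Sure! Here is the JSON you asked for...", {"score": int})
ok_tool = check_tool_call(
    {"name": "get_weather", "input": {"city": "Paris", "unit": "celsius"}},
    name="get_weather",
    args={"city": "Paris"},
)
```

Each check returns a plain boolean, so results can be summed into a pass rate without any model call.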
:::warning Judges have biases
LLM judges drift toward longer answers (verbosity bias) and toward whichever option is shown first (position bias). Defenses: a strict rubric, pairwise comparison instead of absolute scoring, swapping answer order, and re-checking the judge against a human-labeled slice. A judge is one layer, not the whole test.
:::
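Two of these defenses, swapping answer order and re-checking against human labels, can be sketched as below. The judge is modeled as any callable that returns `"first"` or `"second"`; that interface, and the stub judges used to exercise it, are assumptions for illustration rather than a real judge API.

```python
from typing import Callable

# Assumed interface: judge(question, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the order swapped; return "A", "B", or "inconsistent"."""
    a_wins_when_first = judge(question, answer_a, answer_b) == "first"
    a_wins_when_second = judge(question, answer_b, answer_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "inconsistent"  # verdict flipped with position: do not trust it


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the judge's label matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(1 for j, h in zip(judge_labels, human_labels) if j == h)
    return matches / len(human_labels)


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Stub judge that always prefers whatever is shown first."""
    return "first"


def content_judge(question: str, first: str, second: str) -> str:
    """Stub judge that prefers the answer mentioning 'Paris', regardless of position."""
    return "first" if "Paris" in first else "second"


biased = pairwise_verdict(position_biased_judge, "Capital of France?", "Paris", "Lyon")
fair = pairwise_verdict(content_judge, "Capital of France?", "Paris", "Lyon")
calibration = agreement_rate(["pass", "fail", "pass", "pass"], ["pass", "fail", "fail", "pass"])
```

Treating a flipped verdict as `"inconsistent"` rather than picking a winner keeps position bias from leaking into the score; how much agreement with human labels is enough is a judgment call for your use case.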
LLM-as-judge rubric (starter)
You are a strict grader. You are given a QUESTION, a REFERENCE answer, and a MODEL answer.
Score the MODEL answer from 1-5 on (a) faithfulness to the reference and (b) helpfulness.
Output ONLY JSON, nothing else: {"score": <1-5>, "reason": "<one short sentence>"}
QUESTION: {{question}}
REFERENCE: {{reference}}
MODEL: {{model_answer}}
For agents, also test the trajectory
An agent can land the right final answer the wrong way — looping, calling a destructive tool, or burning your budget. So evaluate the path, not just the destination: did it call the right tools, in a sane order, without loops, within budget? Tool-call correctness and trajectory checks catch failures a final-answer-only eval never sees.
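A trajectory check along these lines could be sketched as follows. The `Step` record and the thresholds are hypothetical; a real agent trace will have its own shape, and the tool names here are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    args: tuple  # hashable summary of the call's arguments, used for loop detection
    tokens: int  # tokens spent on this step


def check_trajectory(
    steps: list[Step],
    required_order: list[str],
    forbidden: set[str],
    max_repeats: int,
    token_budget: int,
) -> list[str]:
    """Return a list of problems found in the trajectory; an empty list means it passed."""
    problems = []
    tools = [step.tool for step in steps]

    # Destructive or otherwise disallowed tools must never be called.
    for tool in sorted(set(tools) & forbidden):
        problems.append(f"forbidden tool called: {tool}")

    # Required tools must appear in order (as a subsequence of the calls made).
    remaining = iter(tools)
    if not all(required in remaining for required in required_order):
        problems.append("required tools missing or out of order")

    # The same call repeated too many times suggests a loop.
    for (tool, args), count in Counter((s.tool, s.args) for s in steps).items():
        if count > max_repeats:
            problems.append(f"possible loop: {tool}{args} called {count} times")

    # Total spend must stay within budget.
    total = sum(step.tokens for step in steps)
    if total > token_budget:
        problems.append(f"over budget: {total} > {token_budget} tokens")

    return problems


good = [Step("search", ("q",), 100), Step("read_file", ("a.txt",), 200), Step("answer", (), 50)]
bad = [Step("search", ("q",), 400)] * 3 + [Step("delete_file", ("a.txt",), 100)]
rules = dict(required_order=["search", "answer"], forbidden={"delete_file"},
             max_repeats=2, token_budget=1000)

good_problems = check_trajectory(good, **rules)
bad_problems = check_trajectory(bad, **rules)
```

Returning a list of named problems, rather than a single boolean, makes a failing trace easier to triage.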
Wire it into CI
This is where evals pay off: make regressions impossible to merge.
- Score programmatically where possible; run the judge on the rest.
- Set a threshold (e.g. the score must not fall below the score on the main branch). A prompt change that regresses quality can't ship.
- When a judge flags a live response, route it to a human review queue; the reviewer confirms, adds the case to the golden set, and re-tests after the fix.
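The threshold step can be reduced to a small gate that returns a non-zero exit code when the candidate score drops below the baseline. This is a sketch under assumptions: the `gate` and `main` names, the `tolerance` parameter, and passing the scores as command-line arguments are all illustrative, and how the two scores are produced is left to your eval runner.

```python
import sys


def gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True if the candidate score has not fallen more than `tolerance` below the baseline."""
    return candidate >= baseline - tolerance


def main(argv: list[str]) -> int:
    """Return a process exit code: 0 = pass, 1 = regression, 2 = bad usage."""
    if len(argv) != 3:
        print("usage: eval_gate.py <candidate_score> <baseline_score>", file=sys.stderr)
        return 2
    candidate, baseline = float(argv[1]), float(argv[2])
    if gate(candidate, baseline):
        print(f"eval gate passed: {candidate:.3f} >= baseline {baseline:.3f}")
        return 0
    print(f"eval gate FAILED: {candidate:.3f} < baseline {baseline:.3f}", file=sys.stderr)
    return 1
```

In a script entry point, calling `sys.exit(main(sys.argv))` turns the return value into the exit code, and most CI systems treat a non-zero exit as a failed step. A small `tolerance` can absorb run-to-run noise from a non-deterministic judge.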
Check yourself
- No eval = shipping on vibes. Build one before you trust a prompt or agent.
- Golden dataset from real failures; grow it every week from new regressions.
- Code-based checks first; LLM-as-judge (with a rubric, calibrated) for the fuzzy parts.
- For agents, grade the trajectory, not just the output.
- Run it in CI and fail the build on a drop — that's how quality stops regressing.
Next
- The gap evals exist to close → The Capability–Reliability Gap
- Stack more power moves → Pro Workflows & Power Moves
- Make outputs scoreable by code → Structured Output · Tool Use