Evaluate Your Claude Agent (Evals)

Advanced

You tweaked a prompt and it feels better — but is it? Without evals (evaluations) you're flying blind: every change is a coin flip, and you find out it broke from an angry user, not a test. Evals turn "vibes" into a number you can trust, defend, and watch over time. This is the single biggest thing that separates hobby prompts from production-grade Claude work.

What you'll learn

Why "it looks good to me" is not a test — and what to measure instead
Build a golden dataset from REAL failures (bottom-up), not imagined ones
Score with code where you can, and an LLM-as-judge where you can't
Wire evals into CI so a prompt or model change can never silently regress

The mindset: measure, don't guess

Three rules that save you:

Bottom-up beats top-down. Collect actual failures first, then design the metric to catch them. An eval built from real breakage predicts real breakage; an eval invented at a whiteboard mostly measures your imagination.
A number you can re-run. An eval is repeatable: same inputs → comparable score. That's what lets you compare prompt v1 vs v2, or claude-haiku-4-5 vs claude-sonnet-4-6, honestly.
Cheap to run, run often. If it takes a human an afternoon, it won't happen. Automate it.
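
As a rough illustration of "a number you can re-run", the sketch below scores two prompt versions against the same fixed cases. `EvalCase`, `run_eval`, and the two stub functions are hypothetical names chosen for this example; in a real harness the stubs would call the model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One repeatable test: a fixed input plus a check on the output."""
    name: str
    prompt_input: str
    check: Callable[[str], bool]


def run_eval(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    passed = sum(1 for case in cases if case.check(generate(case.prompt_input)))
    return passed / len(cases)


# Stand-ins for two prompt versions (assumption: each would call the model).
def prompt_v1(text: str) -> str:
    return f"Answer: {text}"


def prompt_v2(text: str) -> str:
    return f"Answer: {text.strip().lower()}"


CASES = [
    EvalCase("lowercases", "  HELLO ", lambda out: out == "Answer: hello"),
    EvalCase("keeps prefix", "ok", lambda out: out.startswith("Answer: ")),
]
```

With these stubs, `run_eval(prompt_v1, CASES)` gives 0.5 and `run_eval(prompt_v2, CASES)` gives 1.0. Because the cases never change between runs, the two numbers are directly comparable.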

Build a golden dataset (bottom-up)

Your golden dataset is the heart of every eval — a curated set of inputs with known-good expectations.

Start from actual bad outputs: production traces, bug reports, support tickets. These are the cases that matter.
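
One possible shape for such a dataset is a JSONL file with one real failure per line. The field names (`case_id`, `source`, `user_input`, `must_contain`) and the two records are made up for illustration; real records would be copied from your own traces and tickets.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    source: str               # where the failure was observed
    user_input: str           # the input that triggered it
    must_contain: list[str]   # the known-good expectation


# Hypothetical records, one JSON object per line.
GOLDEN_JSONL = """\
{"case_id": "refund-001", "source": "support ticket", "user_input": "Can I get a refund after 40 days?", "must_contain": ["30 days"]}
{"case_id": "units-002", "source": "bug report", "user_input": "Convert 5 km to miles", "must_contain": ["3.1"]}
"""


def load_golden(jsonl_text: str) -> list[GoldenCase]:
    """Parse JSONL into cases, rejecting duplicate ids so results stay traceable."""
    cases: list[GoldenCase] = []
    seen: set[str] = set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = GoldenCase(**json.loads(line))
        if case.case_id in seen:
            raise ValueError(f"duplicate case_id on line {line_no}: {case.case_id}")
        seen.add(case.case_id)
        cases.append(case)
    return cases
```

Keeping a `source` field on every case makes it easy to check later that the dataset still comes mostly from real breakage.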

Score: code first, judge second

Reach for the cheapest reliable check first.

Programmatic (deterministic) checks: use these wherever the answer has structure: exact/keyword match, "valid JSON against this schema", "did it call the right tool with the right args", "under N tokens / under X ms". Fast, free, and repeatable: the same output always gets the same score.
LLM-as-judge — for fuzzy dimensions (helpfulness, tone, faithfulness to a source) that resist code. Give the judge a rubric, not a vibe, and calibrate it against human labels before you trust it.
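
A minimal sketch of the programmatic checks listed above, using only the standard library. The function names are hypothetical, the typed-key check is a simple stand-in for full schema validation, and the tool-call shape (a dict with `name` and `args`) is an assumption for this example rather than any particular API's format.

```python
import json


def contains_keywords(output: str, keywords: list[str]) -> bool:
    """Case-insensitive check that every keyword appears in the output."""
    lowered = output.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def is_json_with_keys(output: str, required: dict[str, type]) -> bool:
    """True if output parses as a JSON object containing the required typed keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())


def called_tool(tool_calls: list[dict], name: str, args: dict) -> bool:
    """True if any recorded call matches the expected tool name and arguments."""
    return any(
        call.get("name") == name and call.get("args") == args
        for call in tool_calls
    )
```

Each function returns a plain boolean, so results can be averaged into a pass rate without any model call.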

:::warning Judges have biases
LLM judges drift toward longer answers (verbosity bias) and toward whichever option is shown first (position bias). Defenses: a strict rubric, pairwise comparison instead of absolute scoring, swapping answer order, and re-checking the judge against a human-labeled slice. A judge is one layer, not the whole test.
:::
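
The order-swapping defense can be sketched as follows: ask the judge twice with the answers in opposite positions and accept a preference only when both runs agree. `PairwiseJudge`, `debiased_preference`, and the two stub judges are hypothetical; a real judge would wrap a model call that returns "first" or "second".

```python
from typing import Callable, Optional

# Assumed judge interface: (question, first_answer, second_answer) -> "first" | "second"
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(
    judge: PairwiseJudge, question: str, a: str, b: str
) -> Optional[str]:
    """Return "a" or "b" only if the judge picks the same answer in both orders."""
    forward = judge(question, a, b)    # a is shown first
    backward = judge(question, b, a)   # b is shown first
    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    pick_forward = "a" if forward == "first" else "b"
    pick_backward = "b" if backward == "first" else "a"
    # Disagreement means the verdict depended on position, so treat it as no preference.
    return pick_forward if pick_forward == pick_backward else None


# Stub judges for illustration only.
def always_first(question: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def prefers_shorter(question: str, first: str, second: str) -> str:
    return "first" if len(first) <= len(second) else "second"
```

A judge that always picks the first slot yields `None` here, which is the signal that its verdict cannot be trusted for that pair.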

LLM-as-judge rubric (starter)

You are a strict grader. You are given a QUESTION, a REFERENCE answer, and a MODEL answer.
Give the MODEL answer one overall score from 1-5 that reflects both (a) faithfulness to the reference and (b) helpfulness.
Output ONLY JSON, nothing else: {"score": <1-5>, "reason": "<one short sentence>"}

QUESTION: {{question}}
REFERENCE: {{reference}}
MODEL: {{model_answer}}
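
A sketch of the plumbing around the rubric above: filling the `{{...}}` placeholders, strictly parsing the judge's JSON reply, and measuring agreement with human labels before trusting the judge. All three function names and the `tolerance` parameter are assumptions for this example; the model call itself is deliberately left out.

```python
import json
import re


def fill_template(template: str, **values: str) -> str:
    """Replace each {{name}} placeholder in one pass; a missing name raises KeyError."""
    return re.sub(r"\{\{(\w+)\}\}", lambda match: values[match.group(1)], template)


def parse_judge_reply(reply: str) -> tuple[int, str]:
    """Parse the judge's JSON-only reply; raise ValueError on anything else."""
    data = json.loads(reply)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return score, str(data.get("reason", ""))


def agreement_rate(
    judge_scores: list[int], human_scores: list[int], tolerance: int = 1
) -> float:
    """Fraction of items where judge and human scores differ by at most `tolerance`."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    close = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return close / len(judge_scores)
```

Rejecting malformed replies outright, rather than trying to salvage a score from free text, keeps a chatty judge from silently skewing the numbers.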

For agents, also test the trajectory

An agent can land the right final answer the wrong way — looping, calling a destructive tool, or burning your budget. So evaluate the path, not just the destination: did it call the right tools, in a sane order, without loops, within budget? Tool-call correctness and trajectory checks catch failures a final-answer-only eval never sees.
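
The trajectory questions above can be turned into a deterministic check over the recorded tool calls. This is only a sketch: `ToolCall`, `check_trajectory`, and the tool names used in the usage note are hypothetical, and "loop" is approximated here as the same call repeated back to back more than `max_repeats` times.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def check_trajectory(
    calls: list[ToolCall],
    required_order: list[str],
    forbidden: set[str],
    max_calls: int,
    max_repeats: int = 2,
) -> list[str]:
    """Return a list of problems found in the trajectory; empty means it passed."""
    problems: list[str] = []
    names = [call.name for call in calls]

    # Destructive or otherwise disallowed tools.
    for name in sorted(set(names) & forbidden):
        problems.append(f"forbidden tool called: {name}")

    # Budget, measured here as the number of tool calls.
    if len(calls) > max_calls:
        problems.append(f"over budget: {len(calls)} calls > {max_calls}")

    # Loops: an identical call (same name and args) repeated consecutively.
    run = 1
    for previous, current in zip(calls, calls[1:]):
        run = run + 1 if current == previous else 1
        if run > max_repeats:
            problems.append(f"loop detected on: {current.name}")
            break

    # Order: the required tools must appear as a subsequence of the calls.
    remaining = iter(names)
    if not all(step in remaining for step in required_order):
        problems.append(f"required order not followed: {required_order}")

    return problems
```

For example, a trajectory of `search`, `read_file`, `answer` passes with `required_order=["search", "answer"]`, while three identical `search` calls followed by `delete_file` fails on every one of the four checks.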

Wire it into CI

This is where evals pay off: make regressions impossible to merge.

Score programmatically where possible; run the judge on the rest.
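
One way to make a score drop fail the build is a small gate script that compares the current pass rate with a stored baseline and returns a nonzero exit code on regression. The file layout (two JSON files, each with a `pass_rate` field), the function names, and the `max_drop` default are assumptions for this sketch, not a prescribed setup.

```python
import json
import sys
from pathlib import Path


def gate(current: float, baseline: float, max_drop: float = 0.02) -> bool:
    """Pass unless the score fell more than `max_drop` below the baseline."""
    return current >= baseline - max_drop


def main(argv: list[str]) -> int:
    """Usage: eval_gate.py RESULTS_JSON BASELINE_JSON. Returns the exit code."""
    if len(argv) != 3:
        print("usage: eval_gate.py RESULTS_JSON BASELINE_JSON", file=sys.stderr)
        return 2
    current = json.loads(Path(argv[1]).read_text())["pass_rate"]
    baseline = json.loads(Path(argv[2]).read_text())["pass_rate"]
    if gate(current, baseline):
        print(f"eval gate passed: {current:.3f} (baseline {baseline:.3f})")
        return 0
    print(f"eval gate FAILED: {current:.3f} < baseline {baseline:.3f}", file=sys.stderr)
    return 1


# In a CI step, the script's entry point would be:
#     raise SystemExit(main(sys.argv))
```

A small tolerance such as `max_drop` absorbs run-to-run noise from judge-scored cases; purely programmatic suites can usually set it to zero.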

Check yourself

What is the most reliable first choice for scoring an eval?
Where should a golden dataset's cases mostly come from?
For an AGENT, what should you evaluate beyond the final answer?

Key takeaways

No eval = shipping on vibes. Build one before you trust a prompt or agent.
Golden dataset from real failures; grow it every week from new regressions.
Code-based checks first; LLM-as-judge (with a rubric, calibrated) for the fuzzy parts.
For agents, grade the trajectory, not just the output.
Run it in CI and fail the build on a drop — that's how quality stops regressing.

Sources & further reading

LLM-as-a-Judge: top techniques & best practices — DeepEval — rubrics, calibration, and judge bias.
AI Agent Evaluation Guide 2026 — testing tools, trajectories & monitoring — golden-dataset volume targets and CI integration.
LLM-as-a-Judge: 7 best practices & templates — Monte Carlo — practical judge templates and pitfalls.
LLM Evaluation: practical tips at Booking.com — lessons from production-scale evaluation.
Anthropic — develop your tests / evaluate — official guidance on building empirical evals for Claude.

The gap evals exist to close → The Capability–Reliability Gap
Stack more power moves → Pro Workflows & Power Moves
Make outputs scoreable by code → Structured Output · Tool Use
