
Evaluate Your Claude Agent (Evals)

Advanced

You tweaked a prompt and it feels better — but is it? Without evals (evaluations) you're flying blind: every change is a coin flip, and you find out it broke from an angry user, not a test. Evals turn "vibes" into a number you can trust, defend, and watch over time. This is the single biggest thing that separates hobby prompts from production-grade Claude work.

What you'll learn
  • Why "it looks good to me" is not a test — and what to measure instead
  • Build a golden dataset from REAL failures (bottom-up), not imagined ones
  • Score with code where you can, and an LLM-as-judge where you can't
  • Wire evals into CI so a prompt or model change can never silently regress

The mindset: measure, don't guess

Three rules that save you:

  • Bottom-up beats top-down. Collect actual failures first, then design the metric to catch them. An eval built from real breakage predicts real breakage; an eval invented at a whiteboard mostly measures your imagination.
  • A number you can re-run. An eval is repeatable: same inputs → comparable score. That's what lets you compare prompt v1 vs v2, or claude-haiku-4-5 vs claude-sonnet-4-6, honestly.
  • Cheap to run, run often. If it takes a human an afternoon, it won't happen. Automate it.
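To make the "number you can re-run" idea concrete, here is a minimal sketch of an eval loop. The cases, the `must_contain` field, and the two `variant_*` functions are hypothetical stand-ins; a real harness would call the model where the stand-ins return canned text.

```python
from typing import Callable

# Hypothetical golden cases: each pairs an input with text the answer must contain.
CASES = [
    {"input": "What is 2 + 2?", "must_contain": "4"},
    {"input": "Capital of France?", "must_contain": "Paris"},
]


def run_eval(generate: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    passed = sum(
        1
        for case in cases
        if case["must_contain"].lower() in generate(case["input"]).lower()
    )
    return passed / len(cases)


# Stand-ins for two prompt (or model) variants. Assumption: in practice each
# of these would send the question to the model and return its reply.
def variant_v1(question: str) -> str:
    return "I am not sure."


def variant_v2(question: str) -> str:
    canned = {"What is 2 + 2?": "The answer is 4.", "Capital of France?": "Paris."}
    return canned.get(question, "")


print(f"v1: {run_eval(variant_v1, CASES):.2f}")  # prints "v1: 0.00"
print(f"v2: {run_eval(variant_v2, CASES):.2f}")  # prints "v2: 1.00"
```

Because both variants run against the same cases with the same scoring function, the two numbers are directly comparable.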

Build a golden dataset (bottom-up)

Your golden dataset is the heart of every eval — a curated set of inputs with known-good expectations.

Guided walkthrough (step 1 of 4)
  1. Start from actual bad outputs: production traces, bug reports, support tickets. These are the cases that matter.
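One possible shape for such a dataset is one JSON object per line (JSONL), with a field recording where each case came from. The field names and the two records below are invented for illustration; they are not a required format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str   # stable identifier so results can be tracked over time
    input: str     # what the user (or upstream system) sent
    expected: str  # the known-good expectation for this input
    source: str    # where the failure was found, e.g. a ticket or a trace


# Made-up records standing in for cases harvested from real failures.
RAW_JSONL = """\
{"case_id": "ticket-001", "input": "Cancel my order", "expected": "cancel_order", "source": "support_ticket"}
{"case_id": "trace-002", "input": "Where is my refund?", "expected": "refund_status", "source": "production_trace"}
"""


def load_cases(jsonl_text: str) -> list[GoldenCase]:
    """Parse JSONL into cases, rejecting malformed or duplicate records."""
    cases = []
    seen_ids = set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        # GoldenCase(**record) raises TypeError on missing or unknown fields.
        case = GoldenCase(**json.loads(line))
        if case.case_id in seen_ids:
            raise ValueError(f"duplicate case_id on line {line_no}: {case.case_id}")
        seen_ids.add(case.case_id)
        cases.append(case)
    return cases


cases = load_cases(RAW_JSONL)
print(len(cases), "cases loaded")
```

Keeping a `source` field makes it easy to check later that the dataset still comes mostly from real breakage rather than imagined cases.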

Score: code first, judge second

Reach for the cheapest reliable check first.

  • Programmatic (deterministic) checks — use these wherever the answer has structure: exact/keyword match, "valid JSON against this schema", "did it call the right tool with the right args", "under N tokens / under X ms". Fast, free, and never flaky.
  • LLM-as-judge — for fuzzy dimensions (helpfulness, tone, faithfulness to a source) that resist code. Give the judge a rubric, not a vibe, and calibrate it against human labels before you trust it.
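The programmatic checks can be sketched as small pure functions. This is a simplified illustration: `check_json_fields` tests required keys and types as a stand-in for full schema validation, and the `{"name": ..., "args": ...}` tool-call shape is an assumed format, not one taken from any particular SDK.

```python
import json


def check_keywords(output: str, required: list[str]) -> bool:
    """True when every required keyword appears (case-insensitive)."""
    lowered = output.lower()
    return all(word.lower() in lowered for word in required)


def check_json_fields(output: str, required_types: dict[str, type]) -> bool:
    """True when output is a JSON object whose required fields have the right types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected)
        for key, expected in required_types.items()
    )


def check_tool_call(call: dict, expected_name: str, expected_args: dict) -> bool:
    """True when the agent called the expected tool with exactly the expected args."""
    return call.get("name") == expected_name and call.get("args") == expected_args


print(check_keywords("Your refund was issued today.", ["refund", "issued"]))
print(check_json_fields('{"status": "ok", "count": 3}', {"status": str, "count": int}))
print(check_tool_call({"name": "get_order", "args": {"id": 42}}, "get_order", {"id": 42}))
```

Each function returns a plain boolean, so the same input always yields the same verdict and the results can be averaged into a pass rate.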

:::warning Judges have biases
LLM judges drift toward longer answers (verbosity bias) and toward whichever option is shown first (position bias). Defenses: a strict rubric, pairwise comparison instead of absolute scoring, swapping answer order, and re-checking the judge against a human-labeled slice. A judge is one layer, not the whole test.
:::
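One way to combine pairwise comparison with order swapping is to ask the judge twice, once per order, and accept a verdict only when both runs agree. The sketch below assumes a hypothetical `judge` callable that returns `"first"` or `"second"`; the two toy judges exist only to show the behavior and do not call a model.

```python
from typing import Callable, Optional

# Assumed interface: judge(question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def pairwise_verdict(
    judge: Judge, question: str, answer_a: str, answer_b: str
) -> Optional[str]:
    """Return "A" or "B" only if the judge picks the same answer in both orders."""
    first_pass = judge(question, answer_a, answer_b)   # A is shown first
    second_pass = judge(question, answer_b, answer_a)  # B is shown first
    for verdict in (first_pass, second_pass):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    winner_1 = "A" if first_pass == "first" else "B"
    winner_2 = "B" if second_pass == "first" else "A"
    # Disagreement means position decided the outcome; return None so the
    # pair can be sent to a human instead of being counted.
    return winner_1 if winner_1 == winner_2 else None


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Toy judge that always prefers whatever is shown first."""
    return "first"


def shorter_answer_judge(question: str, first: str, second: str) -> str:
    """Toy judge whose choice depends on content (length), not position."""
    return "first" if len(first) <= len(second) else "second"


q = "What is the capital of France?"
a, b = "Paris.", "It is probably Paris, though I would double-check."
print(pairwise_verdict(position_biased_judge, q, a, b))  # prints "None"
print(pairwise_verdict(shorter_answer_judge, q, a, b))   # prints "A"
```

The cost is two judge calls per pair, which is usually a reasonable price for removing position as a factor.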

LLM-as-judge rubric (starter)

You are a strict grader. You are given a QUESTION, a REFERENCE answer, and a MODEL answer.
Give the MODEL answer a single overall score from 1 to 5 that reflects both (a) faithfulness to the reference and (b) helpfulness.
Output ONLY JSON, nothing else: {"score": <1-5>, "reason": "<one short sentence>"}

QUESTION: {{question}}
REFERENCE: {{reference}}
MODEL: {{model_answer}}
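A rubric like this needs two pieces of glue code: filling the `{{...}}` placeholders and refusing any judge reply that is not the requested JSON. The sketch below shows one way to do both; the helper names are hypothetical, and only the tail of the rubric is reproduced to keep the example short.

```python
import json
import re

# Only the placeholder lines of the rubric, for brevity.
TEMPLATE_TAIL = "QUESTION: {{question}}\nREFERENCE: {{reference}}\nMODEL: {{model_answer}}"


def render_prompt(template: str, fields: dict[str, str]) -> str:
    """Fill {{name}} placeholders in one pass; fail loudly on a missing field."""
    def fill(match: re.Match) -> str:
        name = match.group(1)
        if name not in fields:
            raise KeyError(f"no value supplied for placeholder: {name}")
        return fields[name]

    return re.sub(r"\{\{(\w+)\}\}", fill, template)


def parse_judge_output(raw: str) -> dict:
    """Accept only a JSON object with an integer score in 1..5 and a string reason."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on extra prose
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    if not isinstance(data.get("reason"), str):
        raise ValueError("reason must be a string")
    return {"score": score, "reason": data["reason"]}


prompt = render_prompt(
    TEMPLATE_TAIL,
    {"question": "Capital of France?", "reference": "Paris", "model_answer": "Paris."},
)
print(prompt)
print(parse_judge_output('{"score": 5, "reason": "Matches the reference."}'))
```

Treating a malformed reply as an error, rather than guessing at a score, keeps judge failures visible instead of silently skewing the average.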

For agents, also test the trajectory

An agent can land the right final answer the wrong way — looping, calling a destructive tool, or burning your budget. So evaluate the path, not just the destination: did it call the right tools, in a sane order, without loops, within budget? Tool-call correctness and trajectory checks catch failures a final-answer-only eval never sees.
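A trajectory check can be written as a function over the recorded list of tool calls. The sketch below is one possible version: the `{"name": ..., "args": ...}` record shape, the tool names, and the rule that a repeated identical call counts as a loop are all assumptions made for illustration.

```python
import json


def check_trajectory(
    calls: list[dict],
    required_order: list[str],
    forbidden: set[str],
    max_calls: int,
) -> list[str]:
    """Return a list of problems found in the trajectory; empty means it passed."""
    problems = []
    names = [call["name"] for call in calls]

    # Budget: cap the total number of tool calls.
    if len(calls) > max_calls:
        problems.append(f"over budget: {len(calls)} calls, limit {max_calls}")

    # Safety: destructive or otherwise disallowed tools must never appear.
    for name in names:
        if name in forbidden:
            problems.append(f"forbidden tool called: {name}")

    # Loops: the same tool with the same arguments more than once.
    seen = set()
    for call in calls:
        key = (call["name"], json.dumps(call["args"], sort_keys=True))
        if key in seen:
            problems.append(f"repeated call: {call['name']}")
        seen.add(key)

    # Order: required tools must appear as a subsequence of the calls made.
    remaining = iter(names)
    for required in required_order:
        if required not in remaining:  # membership test advances the iterator
            problems.append(f"missing or out-of-order tool: {required}")
            break

    return problems


good = [
    {"name": "search_orders", "args": {"id": 42}},
    {"name": "get_refund_status", "args": {"id": 42}},
]
bad = [
    {"name": "get_refund_status", "args": {"id": 42}},
    {"name": "get_refund_status", "args": {"id": 42}},
    {"name": "delete_account", "args": {"id": 42}},
]
rules = dict(
    required_order=["search_orders", "get_refund_status"],
    forbidden={"delete_account"},
    max_calls=5,
)
print(check_trajectory(good, **rules))  # prints "[]"
print(check_trajectory(bad, **rules))
```

Returning a list of named problems, rather than a single pass/fail flag, makes it clear which rule a failing trajectory broke.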

Wire it into CI

This is where evals pay off: make regressions impossible to merge.

Guided walkthrough (step 1 of 3)
  1. Score programmatically where possible; run the judge on the rest.
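The gate itself can be very small: compare the current score to a stored baseline and return a non-zero exit code when it drops too far. The function name and the 0.02 tolerance below are hypothetical; a sensible tolerance depends on how noisy the eval is.

```python
def gate(current: float, baseline: float, tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 lets the build pass, 1 fails it."""
    drop = baseline - current
    if drop > tolerance:
        print(f"FAIL: score {current:.3f} is {drop:.3f} below baseline {baseline:.3f}")
        return 1
    print(f"OK: score {current:.3f} (baseline {baseline:.3f})")
    return 0


# Example values only; a real run would compute `current` from the eval suite
# and read `baseline` from a file committed alongside the prompts.
print(gate(current=0.89, baseline=0.90))  # small dip inside tolerance -> 0
print(gate(current=0.85, baseline=0.90))  # clear regression -> 1
```

In a CI job, the script would pass the return value to `sys.exit`, so a regression fails the step and blocks the merge.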

Check yourself

  1. What is the most reliable first choice for scoring an eval?
  2. Where should a golden dataset's cases mostly come from?
  3. For an AGENT, what should you evaluate beyond the final answer?

Key takeaways
  • No eval = shipping on vibes. Build one before you trust a prompt or agent.
  • Golden dataset from real failures; grow it every week from new regressions.
  • Code-based checks first; LLM-as-judge (with a rubric, calibrated) for the fuzzy parts.
  • For agents, grade the trajectory, not just the output.
  • Run it in CI and fail the build on a drop — that's how quality stops regressing.
