Intermediate

The Capability-Reliability Gap

Here is a pattern that burns almost everyone who ships AI to real users for the first time:

The model does the thing perfectly in your test. It fails in production. You're confused, because you saw it work.

What you ran into is the capability-reliability gap.

Capability means the model can do a task — it produces a correct output at least once, under some conditions.

Reliability means the model consistently does the task correctly — across varied inputs, across repeated runs, across slight changes in phrasing or context.

Demos prove capability. Production requires reliability. These are different properties, and most guides confuse them.

Why demos lie

When you test a prompt, you typically:

Run it on inputs you designed yourself
Run it a handful of times
Cherry-pick the output that looks good
Tweak the prompt until it looks right

This process optimizes for capability. The prompt now works on your examples. You've seen a correct output. You ship it.

The problem is that user inputs in production are not your examples. They're messier, more varied, phrased in ways you didn't anticipate. The model was never tested on them. You have no idea how it performs on them.

A single good output is not a performance estimate. It's an anecdote.

Variance is the hidden variable

LLMs are stochastic. Run the same prompt twice and you often get different outputs. This variance is normal and usually fine. But it means that the relevant question is not "did it work?" — it's "what fraction of the time does it work?"

A task where the model succeeds 95% of the time looks great in a demo and fails on roughly one in twenty requests. A task where it succeeds 60% of the time can still look fine when you're the one running it, because a handful of runs with the best output kept will usually include a success. These are very different situations, and you cannot tell them apart without measuring.
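To make the arithmetic concrete, here is a small sketch. It assumes each run is an independent trial with a fixed success rate, which is a simplification; the function names are illustrative.

```python
def prob_all_pass(success_rate: float, runs: int) -> float:
    """Chance that every one of `runs` independent attempts succeeds."""
    return success_rate ** runs


def prob_any_pass(success_rate: float, runs: int) -> float:
    """Chance that at least one of `runs` independent attempts succeeds."""
    return 1 - (1 - success_rate) ** runs


for rate in (0.95, 0.60):
    for runs in (3, 5, 20):
        print(
            f"success rate {rate:.0%}, {runs:>2} runs: "
            f"all pass = {prob_all_pass(rate, runs):.3f}, "
            f"at least one passes = {prob_any_pass(rate, runs):.3f}"
        )
```

The second function is the cherry-picking case: if you keep the best of three runs, a 60% task still hands you a good output most of the time.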

The capability-reliability spectrum in practice

Dimension	Capable but unreliable	Reliable
Inputs tested	Author-designed examples	Diverse, real-user inputs
Sample size	A few runs	Repeated runs on many examples
Failure mode visibility	Failures are rare in testing, common in production	Failures are measured and understood
How you find out it broke	User complaints	Your eval suite
How you improve it	Guess and check prompts	Track pass rate, debug failures systematically
Deployment confidence	Vibe-based	Evidence-based

Evals are the real moat

Better prompts can raise capability. Only evals can tell you whether you've raised reliability.

An eval is a structured test: a set of inputs, expected outputs or evaluation criteria, and a way to measure pass rate. You run the model on the inputs, score the outputs, and get a number. Then you change something — the prompt, the model, the temperature — and run it again. Now you have a signal.
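A minimal sketch of that structure, under stated assumptions: `EvalCase`, `pass_rate`, and the two `fake_model` functions are hypothetical names, and the fake models stand in for a real model call so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt_input: str
    passes: Callable[[str], bool]  # criterion applied to the model's output


def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run the model on every case and return the fraction that pass."""
    results = [case.passes(model(case.prompt_input)) for case in cases]
    return sum(results) / len(results)


# Hypothetical stand-ins for two versions of a prompt or model.
def fake_model_v1(text: str) -> str:
    return text.upper()


def fake_model_v2(text: str) -> str:
    return text.strip().upper()


cases = [
    EvalCase("hello", lambda out: out == "HELLO"),
    EvalCase("  padded  ", lambda out: out == "PADDED"),
    EvalCase("Mixed Case", lambda out: out == "MIXED CASE"),
    EvalCase("", lambda out: out == ""),
]

print("v1 pass rate:", pass_rate(fake_model_v1, cases))
print("v2 pass rate:", pass_rate(fake_model_v2, cases))
```

The point is the shape, not the toy task: same cases, same scoring, one number per variant.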

This is not glamorous. It's the part of AI product work that most tutorials skip entirely. But it's the only way to answer the question that actually matters when you're shipping: "How often does this work on inputs I haven't seen?"

A simple way to start

You don't need infrastructure to begin. Here's a minimum viable eval loop:

Build a golden set. Collect 20–50 real or realistic inputs. For each one, write what a correct output looks like (or criteria for judging it). These are your golden examples.
Run it N times. Run your prompt on each example multiple times. Variance across runs tells you about prompt stability; variance across examples tells you about coverage.
Track pass rate. For each (input, run) pair, record pass or fail. Compute the overall rate. This number is the start of your reliability picture.
Make it a regression test. Every time you change the prompt, run the eval again. If the pass rate drops by more than it normally varies between runs, you've probably broken something. If it rises by more than that, you've likely made a real improvement.
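The four steps above can be sketched in a few lines. Everything here is an assumption for illustration: the tiny golden set, the `make_flaky_model` stand-in (a real version would call your model), and the `tolerance` used to decide what counts as a regression.

```python
import random
from statistics import mean

# Hypothetical golden set of (input, check) pairs. A real one would hold 20-50 examples.
GOLDEN_SET = [
    ("2+2", lambda out: out == "4"),
    ("3+3", lambda out: out == "6"),
    ("10-7", lambda out: out == "3"),
]


def make_flaky_model(error_rate, seed):
    """Stand-in for a real model call: answers wrongly at roughly `error_rate`."""
    rng = random.Random(seed)
    answers = {"2+2": "4", "3+3": "6", "10-7": "3"}

    def model(text):
        return "?" if rng.random() < error_rate else answers[text]

    return model


def run_eval(model, golden_set, n_runs):
    """Return {input: fraction of the n_runs attempts that passed}."""
    per_example = {}
    for text, is_correct in golden_set:
        outcomes = [is_correct(model(text)) for _ in range(n_runs)]
        per_example[text] = sum(outcomes) / n_runs
    return per_example


def overall_pass_rate(per_example):
    # Every example gets the same number of runs, so the mean of the
    # per-example rates equals the pass rate over all (input, run) pairs.
    return mean(per_example.values())


def is_regression(baseline_rate, new_rate, tolerance=0.02):
    """Flag a drop larger than `tolerance` (an assumed noise allowance)."""
    return new_rate < baseline_rate - tolerance


baseline = run_eval(make_flaky_model(0.05, seed=1), GOLDEN_SET, n_runs=20)
candidate = run_eval(make_flaky_model(0.30, seed=2), GOLDEN_SET, n_runs=20)
for text in baseline:
    print(f"{text!r}: baseline {baseline[text]:.2f}, candidate {candidate[text]:.2f}")
print(
    "regression:",
    is_regression(overall_pass_rate(baseline), overall_pass_rate(candidate)),
)
```

Keeping the per-example rates, not just the overall number, is what lets you separate an unstable prompt (one example passing only some of the time) from a coverage gap (one example failing every time).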

That's it. A spreadsheet works. The discipline matters more than the tooling.

Why this is an engineering problem, not a prompting problem

The instinct when a model fails is to rewrite the prompt. Sometimes that's right. But often it's a way of optimizing for the failure case you saw, at the cost of regressing on cases you didn't check.
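One way to catch that trade is to compare results case by case rather than only in aggregate. This is a sketch with made-up case names and results; `diff_results` is a hypothetical helper, and it assumes both runs cover the same cases.

```python
def diff_results(before: dict[str, bool], after: dict[str, bool]):
    """Return (fixed, regressed) case names between two eval runs."""
    fixed = sorted(name for name in before if not before[name] and after[name])
    regressed = sorted(name for name in before if before[name] and not after[name])
    return fixed, regressed


# Illustrative pass/fail results for the same cases under two prompt versions.
before = {"refund request": True, "angry complaint": False, "order status": True}
after = {"refund request": True, "angry complaint": True, "order status": False}

fixed, regressed = diff_results(before, after)
print("fixed:", fixed)
print("regressed:", regressed)
print("pass rate before:", sum(before.values()) / len(before))
print("pass rate after:", sum(after.values()) / len(after))
```

In this example the overall pass rate is identical before and after, yet one case was fixed and another broke. Only the per-case view shows it.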

Reliability engineering for AI looks like:

Defining what "correct" means before you run anything
Measuring against a representative input distribution
Tracking changes over time with consistent methodology
Distinguishing "this model can't do this task" from "this task is underspecified"

Prompt engineering is a tool within that process. It is not a substitute for it.

The honest framing

Most AI capabilities are real. The models genuinely can do remarkable things. The capability-reliability gap is not an argument that the capabilities are fake — it's an argument that knowing they exist is not enough.

If you need a task to work 95% of the time, you need evidence that it works 95% of the time. That evidence comes from running structured tests, not from confidence in the demo.
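How much evidence is enough? One rough way to check is a confidence interval on the measured pass rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration, not something this guide prescribes. It treats every trial as independent, which repeated runs on the same input only approximate.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is about 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


for passes, trials in [(20, 20), (200, 200), (190, 200)]:
    low, high = wilson_interval(passes, trials)
    print(f"{passes}/{trials} passed: plausible pass rate {low:.3f} to {high:.3f}")
```

Under these assumptions, a perfect 20 out of 20 only supports a lower bound of roughly 0.84, which is not evidence of 95%. A perfect 200 out of 200 brings the lower bound to roughly 0.98.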

The engineers who build durable AI products are not necessarily the ones who write the best prompts. They're the ones who know what "working" means before they ship, and who have a measurement that tells them whether it's true.
