Cost & Latency Tradeoffs
Quality, cost, and speed pull against each other. You can't max all three at once — but you can spend each where it matters and save everywhere else.
The triangle
A bigger model is smarter but slower and pricier; a smaller one is fast and cheap but less capable. Good engineering is routing each task to the right point on this triangle.
The biggest levers (roughly in order)
- Right-size the model. Don't run Opus for classification. Start with Sonnet, drop to Haiku for simple/high-volume steps, reserve Opus for the hard parts — Choosing a Model.
- Model tiering / cascades. Use a cheap model first; escalate to a stronger one only when needed (e.g. low-confidence cases).
- Prompt caching. Reuse a stable prompt prefix across calls — big savings for repeated system prompts, RAG context, or agent tool catalogs.
- Trim input tokens. Send only what matters; RAG beats stuffing the whole knowledge base. Shorter inputs = cheaper and often better.
- Cap output with sensible
max_tokensand tight format instructions. - Batch offline work where latency doesn't matter.
Worked example: a cheap-first cascade
Cascading is the lever people skip because it sounds vague. Make it concrete. Say you need to triage 100,000 support emails. The naive approach runs every email through your strongest model. A cascade routes most of the volume to a cheap model and escalates only the hard cases:
Suppose the cheap model resolves 90% of cases on its own and a stronger model costs roughly 5× as much per token. Counting in relative cost units (1 = one cheap pass over one email):
- All-strong:
100k × 5 = 500kcost units. - Cascade:
100k × 1(every email gets the cheap pass)+ 10k × 5(the 10% that escalate)= 150kcost units.
That's about a 70% cut, and the genuinely hard cases still get your best model. The multipliers and split are illustrative — plug in your own token counts and rates and measure the real escalation rate with evals. The lesson holds regardless: the savings come from how rarely you pay the expensive rate, not from the rate itself.
Latency-specific wins
- Stream responses so users see output immediately — huge for perceived speed even when total time is unchanged (Streaming).
- Parallelize independent sub-calls.
- Cache repeated work; precompute where you can.
- Pick a smaller model for the interactive path; do heavy lifting async.
Don't optimize blind
Measure first: where are the tokens and the seconds actually going? Then optimize the biggest line item. And re-check quality with evals after any cost cut — a cheaper setup that's wrong isn't cheaper.