Choosing a Model & Provider
How do you pick among models and providers without getting lost in hype? Use a simple, evergreen process: the leaderboard changes monthly, but the way to choose doesn't.
Read benchmarks skeptically
Public benchmark scores are a starting hint, not a verdict:
- They can be gamed or contaminated (test data leaking into training).
- They measure generic tasks, not your task.
- Small score gaps rarely matter in practice.
Use them to build a shortlist, not to make the final call.
The only benchmark that counts: yours
Run a tiny eval on a handful of your real inputs across 2–3 candidate models. It takes minutes and tells you what no leaderboard can. This "bake-off" is the single best habit in model selection.
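A bake-off can be as small as the sketch below. It assumes each candidate is wrapped in a plain function from prompt to reply and that each case carries its own pass/fail check. The names `bake_off`, `model_a`, and `model_b` are hypothetical, and the two stand-in models only exist so the example runs without network calls. In practice you would swap them for calls to your providers' SDKs.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any function mapping a prompt to a reply.
Model = Callable[[str], str]
# A case pairs one real input with a check that accepts or rejects the reply.
Case = Tuple[str, Callable[[str], bool]]


def bake_off(models: Dict[str, Model], cases: List[Case]) -> Dict[str, float]:
    """Return the fraction of cases each candidate model passes."""
    if not cases:
        raise ValueError("need at least one case")
    results: Dict[str, float] = {}
    for name, model in models.items():
        passed = sum(1 for prompt, check in cases if check(model(prompt)))
        results[name] = passed / len(cases)
    return results


# Stand-in models (hypothetical); replace with real provider calls.
def model_a(prompt: str) -> str:
    return prompt.upper()


def model_b(prompt: str) -> str:
    return prompt


# Use a handful of your own real inputs and task-specific checks here.
cases: List[Case] = [
    ("refund policy", lambda reply: "REFUND" in reply),
    ("shipping time", lambda reply: "SHIPPING" in reply),
]

scores = bake_off({"model-a": model_a, "model-b": model_b}, cases)
for name, rate in scores.items():
    print(f"{name}: {rate:.0%} of cases passed")
```

Pass/fail checks are the simplest option. For open-ended tasks you may prefer to read the outputs side by side or grade them against a rubric.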
A decision scorecard
Weigh what actually matters for your use case:
| Factor | Ask |
|---|---|
| Quality on your task | Does the bake-off show it's good enough? |
| Cost | What is the per-token price at your volume? (Cost & Latency) |
| Latency | Fast enough for the experience? |
| Capabilities | Vision? Long context? Tool use? Structured output? |
| Privacy/compliance | Do data handling, residency, and certifications meet your requirements? (Privacy) |
| Reliability & ecosystem | How solid are uptime, SDKs, docs, support, and the migration story? |
| Lock-in | How hard to switch later? |
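One way to turn the scorecard into a comparison is a weighted average, sketched below. The weights, the 0 to 5 rating scale, and the candidate names are illustrative assumptions, not recommendations. Set the weights to reflect your own priorities and take the quality rating from your bake-off.

```python
from typing import Dict


def weighted_score(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-factor ratings; weights need not sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = set(weights) - set(ratings)
    if missing:
        raise KeyError(f"missing ratings for: {sorted(missing)}")
    return sum(ratings[factor] * w for factor, w in weights.items()) / total_weight


# Illustrative weights: higher means the factor matters more for this use case.
weights = {"quality": 4, "cost": 2, "latency": 2, "lock_in": 1}

# Hypothetical ratings on a 0-5 scale (5 is best).
candidates = {
    "model-a": {"quality": 5, "cost": 2, "latency": 3, "lock_in": 3},
    "model-b": {"quality": 4, "cost": 4, "latency": 4, "lock_in": 3},
}

ranked = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {weighted_score(candidates[name], weights):.2f}")
```

Treat the number as a tiebreaker, not a verdict: a hard requirement such as data residency should rule a candidate out before any scoring happens.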
Practical posture
- Default to a capable mid-tier model and only move up/down on evidence.
- Keep the model name in configuration rather than in string literals scattered through the code, so switching is a one-line change (Errors & Migration).
- Re-evaluate periodically: the frontier moves fast, and today's best may not be next quarter's.
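As a minimal sketch of keeping the model behind config, the snippet below reads the choice from environment variables in one place. The variable names (`LLM_PROVIDER`, `LLM_MODEL`, `LLM_MAX_TOKENS`), the defaults, and the `ModelConfig` class are all hypothetical. Any central config mechanism would serve the same purpose.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelConfig:
    provider: str
    model: str
    max_tokens: int = 1024


def load_model_config(env: Optional[Mapping[str, str]] = None) -> ModelConfig:
    """Build the config from environment variables (names are assumptions)."""
    env = os.environ if env is None else env
    return ModelConfig(
        provider=env.get("LLM_PROVIDER", "example-provider"),
        model=env.get("LLM_MODEL", "example-mid-tier"),
        max_tokens=int(env.get("LLM_MAX_TOKENS", "1024")),
    )


# Passing a dict instead of os.environ keeps the example deterministic.
config = load_model_config({"LLM_MODEL": "example-large"})
print(config)
```

Call sites then read `config.model` instead of hard-coding a name, so moving up or down a tier means changing one setting instead of hunting through the codebase.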
(For the Claude-specific tiers, see Choosing a Claude Model.)