Tokens, Context & Pricing
Cost and limits on the API are all measured in tokens (~¾ of a word). Three things to get right.
1. Count tokens correctly
Don't guess, and don't use another model's tokenizer (e.g. tiktoken) — token counts differ per model family. Use Anthropic's token counting endpoint/SDK helper to measure a request before sending it. Rough planning rule: ~750 words ≈ ~1,000 tokens.
2. max_tokens ≠ context window
max_tokenscaps the length of the reply. If output gets cut off, raise it.- The context window is the total budget for input + output. Big inputs leave less room for output.
Set max_tokens to what the task needs — too low truncates; needlessly high doesn't cost more (you pay for tokens generated) but can let replies ramble.
3. Estimate cost
You're billed for input tokens + output tokens, at per-model rates (Opus > Sonnet > Haiku). A quick estimate:
cost ≈ (input_tokens × input_rate) + (output_tokens × output_rate)
Get the current rates from the official pricing page — we don't hard-code them here on purpose.
Cutting cost (without losing quality)
- Right-size the model — start with Sonnet; reserve Opus for hard parts (Choosing a Model).
- Prompt caching — reuse a stable prompt prefix across calls.
- Trim inputs — send only the context that matters (this is also where RAG helps).
- Batch offline work where latency doesn't matter.
More strategy in Cost & Latency Tradeoffs.