Tokens, Context & Pricing

Beginner

Cost and limits on the API are all measured in tokens (~¾ of a word). Three things to get right.

1. Count tokens correctly

Don't guess, and don't use another model's tokenizer (e.g. tiktoken) — token counts differ per model family. Use Anthropic's token counting endpoint/SDK helper to measure a request before sending it. Rough planning rule: ~750 words ≈ ~1,000 tokens.

2. `max_tokens` ≠ context window

max_tokens caps the length of the reply. If output gets cut off, raise it.
The context window is the total budget for input + output. Big inputs leave less room for output.

Set max_tokens to what the task needs — too low truncates; needlessly high doesn't cost more (you pay for tokens generated) but can let replies ramble.

3. Estimate cost

You're billed for input tokens + output tokens, at per-model rates (Opus > Sonnet > Haiku). A quick estimate:

cost ≈ (input_tokens × input_rate) + (output_tokens × output_rate)

Get the current rates from the official pricing page — we don't hard-code them here on purpose.

Cutting cost (without losing quality)

Right-size the model — start with Sonnet; reserve Opus for hard parts (Choosing a Model).
Prompt caching — reuse a stable prompt prefix across calls.
Trim inputs — send only the context that matters (this is also where RAG helps).
Batch offline work where latency doesn't matter.

More strategy in Cost & Latency Tradeoffs.

1. Count tokens correctly​

2. max_tokens ≠ context window​

3. Estimate cost​

Cutting cost (without losing quality)​

Next​

1. Count tokens correctly

2. `max_tokens` ≠ context window

3. Estimate cost

Cutting cost (without losing quality)

Next