Cut Your Token Usage (and Cost)
You pay for every token in and every token out. The good news: most real workloads are carrying dead weight — bloated system prompts, re-sent context, verbose replies, the wrong model for an easy job. Trim that and the bill drops without touching quality. This page is the power-user toolkit, ordered roughly by leverage.
- Where tokens actually leak — input vs output vs reused context
- The terse / 'caveman' style: what it really saves, and where it backfires
- Prompt caching and batching for dollar-for-dollar structural savings
- Right-sizing the model (Haiku for cheap tasks) and structured output over prose
- Measuring before you ship with the token-counting endpoint
First, find where the tokens go
Before optimizing, split your spend into three buckets — each has a different fix:
The fixes line up cleanly: cache the reused context, trim the input, shorten the output, right-size the model, and batch what isn't time-sensitive.
The terse / "caveman" style (output savings)
The viral move is telling Claude to drop filler and answer in fragments — popularized by the open-source caveman Claude Code skill (MIT-licensed, by Julius Brussee), whose tagline is "why use many token when few token do trick." It forces short sentences, infinitive verbs, and zero pleasantries.
The honest takeaway from independent testing: the style costs nothing on content (code, technical terms, JSON stay exact) but the savings depend entirely on your baseline. If your prompts already say "be concise," most of the win is already banked. The big reductions (40–65%) show up on explanation-heavy answers; structured extraction barely moves.
Lean instruction block — paste into your system prompt
Answer terse. Cut filler, hedging, and pleasantries. Drop articles (a/an/the) and softeners (just, really, basically, actually). No preamble, no restating the question, no "happy to help." Fragments are fine. Keep technical terms and code blocks exact. Pattern per point: [thing] [action] [reason]. Next step if any.
- Put the terse rule once in the system prompt, not in every user turn — repeating it re-pays the input cost each call.
- Never compress code, identifiers, JSON, or numbers. Compress prose only.
- 'Be concise. Return JSON only.' is itself ~60% of the achievable output savings — write it before reaching for fancier tricks.
- Terse style trims OUTPUT only. It does nothing for a 20k-token system prompt you re-send every call — that's an input/caching problem.
- On extended-thinking tasks, the reasoning tokens are unaffected; you only shrink the final visible answer.
Cache the reused prefix (input savings)
If many calls share a large unchanging chunk — a long system prompt, a tool catalog, a reference document — prompt caching processes it once and reuses it at a fraction of the input price on every later call. This is the single highest-leverage structural change for chat and agent workloads, because it pays back on every turn.
The one rule: the cached prefix must be byte-for-byte identical across calls. A stray timestamp or reordered tool list near the top silently drops your hit rate to zero. Full mechanics, the copy-paste cache_control snippet, and how to verify hits live in Prompt Caching & Cost Optimization.
- Move the system prompt, tools, and documents to the front; keep the user's changing turn at the end.
- Attach cache_control to the last stable block so the whole prefix is cached. See the snippet in our prompt-caching page.
- Read cache_read_input_tokens from the response usage — greater than zero means you're saving; zero across repeated calls means a silent invalidator.
Trim the context you send
Caching reuses context cheaply, but the cheapest token is the one you never send. Audit what's actually in the window:
- Prune the system prompt. Long instruction blocks accumulate cruft. Cut examples that no longer earn their tokens; keep one strong example over five mediocre ones.
- Retrieve, don't dump. Instead of pasting an entire document, fetch only the relevant passages (RAG). Sending a 50-page PDF to answer one question is the most common waste.
- Compact long sessions. When a conversation grows, replace old turns with a short running summary instead of carrying every message forever. The history is input tokens you re-pay on every call.
- Right-size the tool catalog. Each tool definition is input tokens on every request. Expose only the tools the current task needs.
Right-size the model
Don't pay Opus rates for a Haiku-grade task. Classification, extraction, simple formatting, and routing usually run great on the smallest model at a fraction of the per-token price. Reserve the larger models for genuinely hard reasoning, and consider routing: a cheap model handles the easy majority, escalating only the hard cases. See Choosing a Model and Tokens, Context & Pricing for the tradeoffs.
Prefer structured output over prose
Asking for JSON (or another tight schema) instead of an explanatory paragraph cuts output tokens and removes the parsing guesswork downstream. Telling Claude to return only a compact object like {"label": ..., "score": ...} generates a fraction of the tokens of a chatty answer — and you skip the "Here's the result:" preamble entirely. Details in Structured Output.
Batch what isn't time-sensitive
For offline work where you don't need an answer in seconds — evals, bulk classification, dataset labeling, summarizing an archive — Anthropic's Message Batches API runs requests asynchronously at a 50% discount on both input and output tokens, with results typically returned within 24 hours.
Stack this with caching and a right-sized model and the combined discount on a large offline job is dramatic.
Measure — don't guess
Optimize against numbers, not vibes. Anthropic's token-counting endpoint returns the exact input-token count for a request before you send it — same shape as a Messages call, and it's free (rate-limited). Use it to compare a bloated prompt against a trimmed one, to make model-routing decisions, and to keep prompts inside the context window.
Count tokens before sending (Python SDK)
import anthropic
client = anthropic.Anthropic()
resp = client.messages.count_tokens(
model="claude-opus-4-8",
system="You are a scientist",
messages=[{"role": "user", "content": "Hello, Claude"}],
)
print(resp.input_tokens) # exact input count, no charge for counting- Don't use another model's tokenizer (e.g. tiktoken) — counts differ per model family. Use Anthropic's endpoint.
- Newer tokenizers can produce ~30% more tokens for the same text than older models — re-count when migrating, don't reuse old estimates.
- Read input_tokens, cache_read_input_tokens, and output_tokens from the response usage to confirm savings landed in production.
See Tokens, Context & Pricing for the counting rules and cost-estimation formula.
A before/after, end to end
A support-triage assistant runs the same 4,000-token system prompt + tool catalog on every ticket and writes a chatty 600-token reply.
- ~4,000 input tokens re-sent at full price every call + ~600 verbose output tokens, on a large model. Nothing cached, synchronous, prose replies.
- Mark the 4,000-token system+tools block with cache_control. After the first call it's served as cache reads at a fraction of input price.
- Move triage to a small model and demand JSON only ({"category", "priority", "reply"}). Output drops from ~600 prose tokens to ~120.
- Overnight ticket backfills go through the Batches API at the 50% discount instead of synchronous calls.
Each lever is multiplicative: cached input × smaller model × terser output × batch discount compounds into a large total reduction — while the answer quality on this easy task is unchanged. Measure each step with count_tokens so you can prove the win rather than assume it.
Check yourself
0/4- Split spend into input, output, and reused context — each bucket has a different fix.
- Terse/'caveman' style cuts output only; gains are big on prose, small on already-concise structured tasks.
- Cache the stable prefix (byte-for-byte identical) for dollar-for-dollar input savings on every call.
- Trim context, right-size the model, and prefer JSON over prose — cheap, compounding wins.
- Batch non-urgent work for a 50% discount, and always measure with count_tokens instead of guessing.
Sources & further reading
- JuliusBrussee/caveman — the open-source "caveman" Claude Code skill (MIT, by Julius Brussee); claims ~65% average output-token reduction.
- "I Benchmarked the Viral 'Caveman' Prompt…" — DEV Community — independent benchmark finding ~9–21% on structured tasks and a leaner 6-line alternative.
- Token counting — Anthropic docs — the
count_tokensendpoint and per-model tokenizer notes. - Prompt caching — Anthropic docs — cache mechanics, eligibility, and pricing.
- Introducing the Message Batches API — Anthropic — async processing at a 50% discount.
- Pricing — Anthropic docs — current per-token rates by model.