Cut Your Token Usage (and Cost)

Intermediate

You pay for every token in and every token out. The good news: most real workloads are carrying dead weight — bloated system prompts, re-sent context, verbose replies, the wrong model for an easy job. Trim that and the bill drops without touching quality. This page is the power-user toolkit, ordered roughly by leverage.

What you'll learn
  • Where tokens actually leak — input vs output vs reused context
  • The terse / 'caveman' style: what it really saves, and where it backfires
  • Prompt caching and batching for structural savings that apply to every eligible call
  • Right-sizing the model (Haiku for cheap tasks) and structured output over prose
  • Measuring before you ship with the token-counting endpoint

First, find where the tokens go

Before optimizing, split your spend into three buckets, because each has a different fix: input tokens, output tokens, and reused context (the same prefix re-sent on every call).

The fixes line up cleanly: cache the reused context, trim the input, shorten the output, right-size the model, and batch what isn't time-sensitive.
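To make the split concrete, here is a minimal sketch that totals spend per bucket from a log of calls. The function name, the tuple layout, and the prices in the usage lines are all illustrative assumptions, not real rates; plug in your own numbers.

```python
def spend_by_bucket(calls, input_price, output_price):
    """Split spend (USD) into the three buckets.

    calls: iterable of (reused_context_tokens, fresh_input_tokens, output_tokens)
    input_price / output_price: USD per million tokens (hypothetical values)
    """
    reused = fresh = output = 0
    for reused_tokens, fresh_tokens, output_tokens in calls:
        reused += reused_tokens
        fresh += fresh_tokens
        output += output_tokens
    return {
        # Without caching, reused context is billed at the normal input rate.
        "reused_context": reused * input_price / 1_000_000,
        "fresh_input": fresh * input_price / 1_000_000,
        "output": output * output_price / 1_000_000,
    }


# 1,000 calls, each re-sending a 4,000-token prefix (made-up prices).
log = [(4000, 200, 600)] * 1000
for bucket, usd in spend_by_bucket(log, input_price=3.0, output_price=15.0).items():
    print(f"{bucket}: ${usd:.2f}")
```

Whichever bucket dominates tells you which section below to read first.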

The terse / "caveman" style (output savings)

The viral move is telling Claude to drop filler and answer in fragments — popularized by the open-source caveman Claude Code skill (MIT-licensed, by Julius Brussee), whose tagline is "why use many token when few token do trick." It forces short sentences, infinitive verbs, and zero pleasantries.

The honest takeaway from independent testing: the style costs nothing on content (code, technical terms, JSON stay exact) but the savings depend entirely on your baseline. If your prompts already say "be concise," most of the win is already banked. The big reductions (40–65%) show up on explanation-heavy answers; structured extraction barely moves.

Lean instruction block — paste into your system prompt

Answer terse. Cut filler, hedging, and pleasantries.
Drop articles (a/an/the) and softeners (just, really, basically, actually).
No preamble, no restating the question, no "happy to help."
Fragments are fine. Keep technical terms and code blocks exact.
Pattern per point: [thing] [action] [reason]. Next step if any.
Pro tip
  • Put the terse rule once in the system prompt, not in every user turn — repeating it re-pays the input cost each call.
  • Never compress code, identifiers, JSON, or numbers. Compress prose only.
  • 'Be concise. Return JSON only.' is itself ~60% of the achievable output savings — write it before reaching for fancier tricks.
Watch out
  • Terse style trims OUTPUT only. It does nothing for a 20k-token system prompt you re-send every call — that's an input/caching problem.
  • On extended-thinking tasks, the reasoning tokens are unaffected; you only shrink the final visible answer.

Cache the reused prefix (input savings)

If many calls share a large unchanging chunk — a long system prompt, a tool catalog, a reference document — prompt caching processes it once and reuses it at a fraction of the input price on every later call. This is the single highest-leverage structural change for chat and agent workloads, because it pays back on every turn.

The one rule: the cached prefix must be byte-for-byte identical across calls. A stray timestamp or reordered tool list near the top silently drops your hit rate to zero. Full mechanics, the copy-paste cache_control snippet, and how to verify hits live in Prompt Caching & Cost Optimization.

Guided walkthrough (step 1 of 3)
  1. Move the system prompt, tools, and documents to the front; keep the user's changing turn at the end.
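As a rough sketch of that ordering, the helper below builds Messages API keyword arguments with the stable parts first and a `cache_control` marker on the system block. The helper names and the fingerprint check are assumptions for illustration: the hash is only a local way to spot prefix drift before you send, not how the API decides on a cache hit.

```python
import hashlib
import json


def build_cached_request(model, system_text, tools, user_text, max_tokens=512):
    """Build request kwargs with the stable prefix (tools, system) up front."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "tools": tools,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # Marks the end of the prefix that should be cached.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # The changing user turn goes last, after the cached prefix.
        "messages": [{"role": "user", "content": user_text}],
    }


def prefix_fingerprint(request):
    """Hash the parts that must stay identical from call to call (local check only)."""
    stable = {"tools": request["tools"], "system": request["system"]}
    blob = json.dumps(stable, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Logging the fingerprint per call is a cheap way to catch a stray timestamp or a reordered tool list: if the value changes between calls that should share a prefix, the cache will likely miss.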

Trim the context you send

Caching reuses context cheaply, but the cheapest token is the one you never send. Audit what's actually in the window:

  • Prune the system prompt. Long instruction blocks accumulate cruft. Cut examples that no longer earn their tokens; keep one strong example over five mediocre ones.
  • Retrieve, don't dump. Instead of pasting an entire document, fetch only the relevant passages (RAG). Sending a 50-page PDF to answer one question is the most common waste.
  • Compact long sessions. When a conversation grows, replace old turns with a short running summary instead of carrying every message forever. The history is input tokens you re-pay on every call.
  • Right-size the tool catalog. Each tool definition is input tokens on every request. Expose only the tools the current task needs.
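The "compact long sessions" idea can be sketched as follows. `compact_history` is a hypothetical helper, and the summary text is assumed to come from somewhere else (for example, a cheap model call you run when the history gets long).

```python
def compact_history(messages, keep_last, summary):
    """Replace all but the last `keep_last` turns with one short summary turn."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    if len(messages) <= keep_last:
        return list(messages)  # nothing old enough to compact
    recent = messages[len(messages) - keep_last:]
    summary_turn = {
        "role": "user",
        "content": f"Summary of the earlier conversation: {summary}",
    }
    return [summary_turn] + list(recent)


history = [
    {"role": "user", "content": "My order arrived damaged."},
    {"role": "assistant", "content": "Sorry about that. Which item?"},
    {"role": "user", "content": "The blue lamp."},
    {"role": "assistant", "content": "Refund or replacement?"},
]
compacted = compact_history(history, keep_last=2, summary="Customer reports a damaged order.")
print(len(history), "->", len(compacted), "turns")
```

One design note: the summary is inserted as a user turn, so it can sit next to another user turn. If your setup requires strictly alternating roles, fold the summary into the first kept user message instead.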

Right-size the model

Don't pay Opus rates for a Haiku-grade task. Classification, extraction, simple formatting, and routing usually run great on the smallest model at a fraction of the per-token price. Reserve the larger models for genuinely hard reasoning, and consider routing: a cheap model handles the easy majority, escalating only the hard cases. See Choosing a Model and Tokens, Context & Pricing for the tradeoffs.
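A minimal routing sketch, under stated assumptions: `call_model` and `is_acceptable` are callables you supply (the first wraps your API call, the second is whatever quality check you trust), and the set of easy task types is only an example.

```python
EASY_TASKS = {"classification", "extraction", "formatting", "routing"}


def answer_with_routing(task_type, prompt, call_model, small_model, large_model, is_acceptable):
    """Try the small model on easy tasks; escalate when the draft fails the check."""
    if task_type in EASY_TASKS:
        draft = call_model(small_model, prompt)
        if is_acceptable(draft):
            return small_model, draft
    # Hard task, or the cheap draft was not good enough: pay for the large model.
    return large_model, call_model(large_model, prompt)
```

Escalation costs a second call on the cases that fail, so this only pays off when the small model handles most of the easy traffic. Track the escalation rate to confirm it does.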

Prefer structured output over prose

Asking for JSON (or another tight schema) instead of an explanatory paragraph cuts output tokens and removes the parsing guesswork downstream. Telling Claude to return only a compact object like {"label": ..., "score": ...} generates a fraction of the tokens of a chatty answer — and you skip the "Here's the result:" preamble entirely. Details in Structured Output.
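On the receiving end, a compact object is also easy to validate. The sketch below assumes the two-field shape from the example above; `parse_label` is a hypothetical helper, and anything that is not bare JSON (such as a chatty preamble) is rejected.

```python
import json


def parse_label(raw):
    """Parse a reply of the form {"label": ..., "score": ...} and reject anything else."""
    obj = json.loads(raw)  # raises ValueError (JSONDecodeError) on prose or preamble
    if not isinstance(obj, dict) or set(obj) != {"label", "score"}:
        raise ValueError(f"unexpected shape: {raw!r}")
    return str(obj["label"]), float(obj["score"])


print(parse_label('{"label": "billing", "score": 0.92}'))
```

Failing loudly on extra text is deliberate: it surfaces prompts that have drifted back toward prose instead of silently paying for it.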

Batch what isn't time-sensitive

For offline work where you don't need an answer in seconds — evals, bulk classification, dataset labeling, summarizing an archive — Anthropic's Message Batches API runs requests asynchronously at a 50% discount on both input and output tokens, with results typically returned within 24 hours.

Stack this with caching and a right-sized model and the combined discount on a large offline job is dramatic.
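Here is one way to prepare such a job, as a sketch. Each batch entry pairs a `custom_id` with the same parameters you would pass to a normal Messages call. The helper name and the ticket-ID scheme are assumptions, and the IDs are assumed to be short and made of letters, digits, hyphens, or underscores.

```python
def build_batch_requests(tickets, model, system_text, max_tokens=256):
    """Turn {ticket_id: text} into a list of batch request entries."""
    requests = []
    for ticket_id, text in tickets.items():
        requests.append(
            {
                # custom_id is how you match results back to inputs later.
                "custom_id": f"ticket-{ticket_id}",
                "params": {
                    "model": model,
                    "max_tokens": max_tokens,
                    "system": system_text,
                    "messages": [{"role": "user", "content": text}],
                },
            }
        )
    return requests


batch = build_batch_requests(
    {"101": "Card was charged twice.", "102": "Cannot reset my password."},
    model="your-model-id",  # placeholder, not a real model name
    system_text="Classify the ticket. Return JSON only.",
)
print(len(batch), "requests ready")
```

With the Python SDK, a list like this would be submitted via `client.messages.batches.create(requests=batch)`; check the current SDK reference before relying on that exact call.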

Measure — don't guess

Optimize against numbers, not vibes. Anthropic's token-counting endpoint returns the input-token count for a request before you send it. It takes the same shape as a Messages call and is free (rate-limited). Treat the result as a close estimate: the count billed on the real call can differ slightly. Use it to compare a bloated prompt against a trimmed one, to make model-routing decisions, and to keep prompts inside the context window.

Count tokens before sending (Python SDK)

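import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

# Same request shape as client.messages.create, but nothing is generated.
resp = client.messages.count_tokens(
    model="claude-opus-4-8",  # count against the model you plan to call
    system="You are a scientist",
    messages=[{"role": "user", "content": "Hello, Claude"}],
)
print(resp.input_tokens)  # input-token count; counting itself is not billed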
Pro tip
  • Don't use another model's tokenizer (e.g. tiktoken) — counts differ per model family. Use Anthropic's endpoint.
  • Newer tokenizers can produce ~30% more tokens for the same text than older models — re-count when migrating, don't reuse old estimates.
  • Read input_tokens, cache_read_input_tokens, and output_tokens from the response usage to confirm savings landed in production.

See Tokens, Context & Pricing for the counting rules and cost-estimation formula.
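To check the production side, a small helper can summarize the usage fields named above. This is a sketch that assumes you pass the usage as a plain dict (for example, converted from the SDK's usage object) and that cached and uncached input are reported in separate fields that sum to the total input.

```python
def summarize_usage(usage):
    """Report total input, output, and the share of input served from cache."""
    fresh = usage.get("input_tokens") or 0
    cache_read = usage.get("cache_read_input_tokens") or 0
    cache_write = usage.get("cache_creation_input_tokens") or 0
    total_input = fresh + cache_read + cache_write
    return {
        "total_input_tokens": total_input,
        "output_tokens": usage.get("output_tokens") or 0,
        # 0.0 means no cache hit; values near 1.0 mean most input was reused.
        "cache_read_share": cache_read / total_input if total_input else 0.0,
    }


example = {
    "input_tokens": 200,
    "cache_read_input_tokens": 4000,
    "cache_creation_input_tokens": 0,
    "output_tokens": 150,
}
print(summarize_usage(example))
```

A cache_read_share that stays at zero after the first call is the usual sign that something near the top of the prompt is changing between requests.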

A before/after, end to end

A support-triage assistant runs the same 4,000-token system prompt + tool catalog on every ticket and writes a chatty 600-token reply.

Guided walkthrough (step 1 of 4)
  1. ~4,000 input tokens re-sent at full price every call + ~600 verbose output tokens, on a large model. Nothing cached, synchronous, prose replies.

The levers compound: caching cuts the input side, terser replies cut the output side, and the smaller model and the batch discount scale both. Together they produce a large total reduction, while answer quality on this easy task is unchanged. Measure each step, using count_tokens for input and the response usage for output, so you can prove the win rather than assume it.
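A worked version of the arithmetic, as a sketch only. Every price here is made up, the 200-token terse reply is an assumed figure, and the 0.1 cache-read multiplier is an assumption. The sketch also ignores the one-time cost of writing the cache and the per-ticket text, so treat the result as a shape, not a quote.

```python
def cost_per_call(input_tokens, output_tokens, input_price, output_price,
                  cached_tokens=0, cache_read_multiplier=0.1, batch_multiplier=1.0):
    """USD for one call; prices are USD per million tokens (hypothetical)."""
    fresh_tokens = input_tokens - cached_tokens
    input_cost = (fresh_tokens + cached_tokens * cache_read_multiplier) * input_price
    output_cost = output_tokens * output_price
    # Input and output costs add; the batch discount scales their sum.
    return batch_multiplier * (input_cost + output_cost) / 1_000_000


# Before: large model, nothing cached, chatty reply, synchronous.
before = cost_per_call(4000, 600, input_price=15.0, output_price=75.0)

# After: small model, prefix cached, terse reply, batched at half price.
after = cost_per_call(4000, 200, input_price=1.0, output_price=5.0,
                      cached_tokens=4000, batch_multiplier=0.5)

print(f"before ${before:.4f}  after ${after:.4f}  reduction {1 - after / before:.1%}")
```

With these made-up numbers the per-call cost falls from about $0.105 to about $0.0007. Your real figures will differ, which is the reason to measure each step.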

Check yourself

  1. The 'caveman' / terse style primarily reduces which tokens?
  2. You re-send the same 10k-token system prompt on every call. Best fix?
  3. Which workload is the best fit for the Message Batches API's 50% discount?
  4. How should you verify a prompt change actually saved tokens?
Key takeaways
  • Split spend into input, output, and reused context — each bucket has a different fix.
  • Terse/'caveman' style cuts output only; gains are big on prose, small on already-concise structured tasks.
  • Cache the stable prefix (byte-for-byte identical) to cut input cost on every call that reuses it.
  • Trim context, right-size the model, and prefer JSON over prose — cheap, compounding wins.
  • Batch non-urgent work for a 50% discount, and always measure with count_tokens instead of guessing.
