Skip to main content

Prompt Caching & Cost Optimization

Advanced

If many of your requests share a large, unchanging chunk — a long system prompt, a big document, a tool catalog — prompt caching lets the API reuse the processed prefix instead of re-reading it every call. That cuts both cost and latency on the cached part.

How it works (the mental model)

You mark a cache breakpoint after the stable prefix. On the first call it's processed and cached; subsequent calls that share the exact same prefix hit the cache and pay much less for it.

Mark the breakpoint (copy-paste)

Add cache_control to the last stable block — here, a large system prompt. The user's turn comes after it and varies freely; everything up to and including the marked block is cached.

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": LARGE_STABLE_PROMPT, # long, unchanging — the cached prefix
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": "Summarize the key points."}], # varies per call
)

print(message.usage.cache_read_input_tokens) # > 0 means you got a hit

The first call pays a small write premium to populate the cache; every later call with the same prefix reads it back at a fraction of the input price. The prefix must be long enough to be eligible — a few thousand tokens, model-dependent — or it silently won't cache.

The invariant that makes or breaks it

:::warning Caching is prefix-exact A cache hit requires the cached prefix to be byte-for-byte identical. The most common bug: a silent invalidator near the top of the prompt — a timestamp, a changing user name, a reordered tool list — that changes the prefix and quietly drops your hit rate to zero. :::

Put everything stable first, everything variable last, and keep the prefix truly constant.

Check that it's actually working

Don't assume — read it back from the response usage:

  • cache_creation_input_tokens — tokens written to the cache this call (the first request).
  • cache_read_input_tokens — tokens served from the cache (the savings).
  • input_tokens — the uncached remainder, billed at full price.

If cache_read_input_tokens stays zero across repeated requests that should share a prefix, a silent invalidator is at work — diff the rendered prompt bytes between two calls to find it.

Where it pays off most

  • Long system prompts reused across users.
  • RAG / document Q&A where the same source text is queried repeatedly.
  • Agents with a fixed tool catalog and instructions over many turns.

Pair caching with batching for offline workloads, and with right-sizing the model (Choosing a Model) for the biggest combined savings — see Cost & Latency.

Next