Prompt Caching & Cost Optimization
If many of your requests share a large, unchanging chunk — a long system prompt, a big document, a tool catalog — prompt caching lets the API reuse the processed prefix instead of re-reading it every call. That cuts both cost and latency on the cached part.
How it works (the mental model)
You mark a cache breakpoint after the stable prefix. On the first call it's processed and cached; subsequent calls that share the exact same prefix hit the cache and pay much less for it.
Mark the breakpoint (copy-paste)
Add cache_control to the last stable block — here, a large system prompt. The user's turn comes after it and varies freely; everything up to and including the marked block is cached.
- Python
- TypeScript
import anthropic
client = anthropic.Anthropic()
message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": LARGE_STABLE_PROMPT, # long, unchanging — the cached prefix
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": "Summarize the key points."}], # varies per call
)
print(message.usage.cache_read_input_tokens) # > 0 means you got a hit
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const message = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
system: [
{
type: "text",
text: LARGE_STABLE_PROMPT, // long, unchanging — the cached prefix
cache_control: { type: "ephemeral" },
},
],
messages: [{ role: "user", content: "Summarize the key points." }], // varies per call
});
console.log(message.usage.cache_read_input_tokens); // > 0 means you got a hit
The first call pays a small write premium to populate the cache; every later call with the same prefix reads it back at a fraction of the input price. The prefix must be long enough to be eligible — a few thousand tokens, model-dependent — or it silently won't cache.
The invariant that makes or breaks it
:::warning Caching is prefix-exact A cache hit requires the cached prefix to be byte-for-byte identical. The most common bug: a silent invalidator near the top of the prompt — a timestamp, a changing user name, a reordered tool list — that changes the prefix and quietly drops your hit rate to zero. :::
Put everything stable first, everything variable last, and keep the prefix truly constant.
Check that it's actually working
Don't assume — read it back from the response usage:
cache_creation_input_tokens— tokens written to the cache this call (the first request).cache_read_input_tokens— tokens served from the cache (the savings).input_tokens— the uncached remainder, billed at full price.
If cache_read_input_tokens stays zero across repeated requests that should share a prefix, a silent invalidator is at work — diff the rendered prompt bytes between two calls to find it.
Where it pays off most
- Long system prompts reused across users.
- RAG / document Q&A where the same source text is queried repeatedly.
- Agents with a fixed tool catalog and instructions over many turns.
Pair caching with batching for offline workloads, and with right-sizing the model (Choosing a Model) for the biggest combined savings — see Cost & Latency.