Prompt Caching & Cost Optimization

Advanced

If many of your requests share a large, unchanging chunk — a long system prompt, a big document, a tool catalog — prompt caching lets the API reuse the processed prefix instead of re-reading it every call. That cuts both cost and latency on the cached part.

How it works (the mental model)

You mark a cache breakpoint after the stable prefix. On the first call it's processed and cached; subsequent calls that share the exact same prefix hit the cache and pay much less for it.
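
To make the mental model concrete, here is a toy sketch of a prefix cache. It is not how the service is implemented: `ToyPrefixCache` is a hypothetical class, the cache key is just a hash of the exact prefix text, and character counts stand in for token counts.

```python
import hashlib


class ToyPrefixCache:
    """Toy stand-in for a prefix cache, keyed by a hash of the exact prefix text."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def call(self, prefix: str, suffix: str) -> dict[str, int]:
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        hit = key in self._seen
        self._seen.add(key)
        # Character counts stand in for token counts in this toy.
        return {
            "cache_creation": 0 if hit else len(prefix),
            "cache_read": len(prefix) if hit else 0,
            "uncached": len(suffix),
        }


cache = ToyPrefixCache()
prefix = "You are a support agent. " * 10

first = cache.call(prefix, "Question A")         # miss: prefix is written
second = cache.call(prefix, "Question B")        # hit: same prefix, new suffix
third = cache.call(prefix + "!", "Question C")   # miss: prefix changed by one character
print(first, second, third, sep="\n")
```

The suffix can change freely without affecting the hit; any change to the prefix, however small, produces a new key and a fresh write.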

Mark the breakpoint (copy-paste)

Add cache_control to the last stable block — here, a large system prompt. The user's turn comes after it and varies freely; everything up to and including the marked block is cached.

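Python

import anthropic

# Placeholder: substitute your real long, unchanging system prompt.
# A prompt this short is below the minimum length and will not be cached.
LARGE_STABLE_PROMPT = "..."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_STABLE_PROMPT,  # long, unchanging: the cached prefix
            "cache_control": {"type": "ephemeral"},  # breakpoint: cache up to here
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key points."}],  # varies per call
)

print(message.usage.cache_read_input_tokens)  # > 0 means you got a hit

TypeScript

import Anthropic from "@anthropic-ai/sdk";

// Placeholder: substitute your real long, unchanging system prompt.
// A prompt this short is below the minimum length and will not be cached.
const LARGE_STABLE_PROMPT = "...";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LARGE_STABLE_PROMPT, // long, unchanging: the cached prefix
      cache_control: { type: "ephemeral" }, // breakpoint: cache up to here
    },
  ],
  messages: [{ role: "user", content: "Summarize the key points." }], // varies per call
});

console.log(message.usage.cache_read_input_tokens); // > 0 means you got a hit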

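The first call pays a write premium to populate the cache (about 25% above the base input price for the default 5-minute cache lifetime); every later call with the same prefix reads it back at roughly a tenth of the input price. Cache entries expire after a short idle period, so hits depend on requests arriving often enough to keep the entry warm. The prefix must also be long enough to be eligible, from about one thousand to a few thousand tokens depending on the model, or it silently won't cache.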
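
The write-premium and read-discount trade-off can be worked through with a little arithmetic. The sketch below is illustrative only: `prefix_cost_usd` is a hypothetical helper, the 1.25x write and 0.10x read multipliers and the $3.00 per million tokens base price are assumptions to replace with current pricing, and it assumes every call after the first is a cache hit.

```python
def prefix_cost_usd(
    prefix_tokens: int,
    calls: int,
    base_price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumed cache-write premium
    read_multiplier: float = 0.10,   # assumed cache-read discount
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) for the shared prefix over `calls` requests."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    per_pass = prefix_tokens * base_price_per_mtok / 1_000_000
    uncached = per_pass * calls
    # One write on the first call, then a discounted read on each later call.
    cached = per_pass * (write_multiplier + read_multiplier * (calls - 1))
    return uncached, cached


# Hypothetical workload: a 10,000-token prefix reused across 100 calls.
uncached, cached = prefix_cost_usd(10_000, calls=100, base_price_per_mtok=3.00)
print(f"without caching: ${uncached:.4f}, with caching: ${cached:.4f}")
```

Under these assumed multipliers, caching costs more for a single call (1.25 prefix passes instead of 1.0) and is already ahead by the second call (1.35 instead of 2.0).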

The invariant that makes or breaks it

:::warning Caching is prefix-exact
A cache hit requires the cached prefix to be byte-for-byte identical. The most common bug is a silent invalidator near the top of the prompt (a timestamp, a changing user name, a reordered tool list) that changes the prefix and quietly drops your hit rate to zero.
:::

Put everything stable first, everything variable last, and keep the prefix truly constant.
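
One way to enforce the stable-first, variable-last rule is to build every request through a single function that never interpolates per-call data into the system block. This is a minimal sketch: `build_request` and `STABLE_INSTRUCTIONS` are hypothetical names, and the prompt text is a placeholder.

```python
from datetime import datetime, timezone

# Hypothetical stable instructions; in practice this is your long system prompt.
STABLE_INSTRUCTIONS = "You are a support assistant. Follow the policy below.\n" * 50


def build_request(user_name: str, question: str, now: datetime) -> dict:
    """Keep the system block constant; put per-call details in the user turn."""
    system = [
        {
            "type": "text",
            "text": STABLE_INSTRUCTIONS,  # no timestamps or user names in here
            "cache_control": {"type": "ephemeral"},
        }
    ]
    # Everything that varies goes after the breakpoint.
    user_turn = f"Current time: {now.isoformat()}\nUser: {user_name}\n\n{question}"
    return {"system": system, "messages": [{"role": "user", "content": user_turn}]}


req_a = build_request("Ada", "How do I reset my password?", datetime.now(timezone.utc))
req_b = build_request("Grace", "Where is my invoice?", datetime.now(timezone.utc))
print(req_a["system"] == req_b["system"])  # the cached prefix is identical
```

The returned dictionary can be splatted into the call from the earlier example, for instance `client.messages.create(model=..., max_tokens=..., **req_a)`.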

Check that it's actually working

Don't assume — read it back from the response usage:

cache_creation_input_tokens: tokens written to the cache on this call (nonzero on the first request, and again whenever the prefix changes or the cache entry has expired).
cache_read_input_tokens: tokens served from the cache (the savings).
input_tokens: the uncached remainder, billed at full price.
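
The three usage fields above can be folded into a single summary for logging. This is a sketch: `summarize_cache_usage` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the `usage` object on a real response.

```python
from types import SimpleNamespace


def summarize_cache_usage(usage) -> dict:
    """Summarize how much of a request's input was written, read, or uncached."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = written + read + uncached
    return {
        "written": written,
        "read": read,
        "uncached": uncached,
        # Share of this request's input that was served from the cache.
        "read_fraction": read / total if total else 0.0,
    }


# Stand-ins for message.usage on a first (write) call and a later (read) call.
first_call = SimpleNamespace(
    cache_creation_input_tokens=9000, cache_read_input_tokens=0, input_tokens=1000
)
later_call = SimpleNamespace(
    cache_creation_input_tokens=0, cache_read_input_tokens=9000, input_tokens=1000
)

print(summarize_cache_usage(first_call))
print(summarize_cache_usage(later_call))
```

Logging `read_fraction` per request makes a drop in hit rate visible as soon as it happens.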

If cache_read_input_tokens stays zero across repeated requests that should share a prefix, a silent invalidator is at work — diff the rendered prompt bytes between two calls to find it.
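
Diffing the rendered prefix of two requests can be sketched as follows. `render_prefix` and `first_difference` are hypothetical helpers, and the JSON serialization is only a local approximation of what is sent; it is useful for spotting differences, not for reproducing how the server builds its cache key.

```python
import json


def render_prefix(request: dict) -> bytes:
    """Serialize the cacheable parts of a request in a fixed order (tools, then system)."""
    prefix = {"tools": request.get("tools", []), "system": request.get("system", [])}
    # No sort_keys: list and key order are part of what we want to compare.
    return json.dumps(prefix, ensure_ascii=False).encode("utf-8")


def first_difference(a: bytes, b: bytes, context: int = 20):
    """Return (offset, snippet_a, snippet_b) at the first differing byte, or None if equal."""
    if a == b:
        return None
    limit = min(len(a), len(b))
    offset = next((i for i in range(limit) if a[i] != b[i]), limit)
    start = max(0, offset - context)
    return offset, a[start:offset + context], b[start:offset + context]


# Two requests that "should" share a prefix, with a date hidden in the system prompt.
call_1 = {"system": [{"type": "text", "text": "Today is 2025-01-01. You are a helpful assistant."}]}
call_2 = {"system": [{"type": "text", "text": "Today is 2025-01-02. You are a helpful assistant."}]}

diff = first_difference(render_prefix(call_1), render_prefix(call_2))
if diff is not None:
    offset, left, right = diff
    print(f"prefix diverges at byte {offset}:\n  {left!r}\n  {right!r}")
```

The snippets around the first differing byte usually point straight at the invalidator, here the embedded date.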

Where it pays off most

Long system prompts reused across users.
RAG / document Q&A where the same source text is queried repeatedly.
Agents with a fixed tool catalog and instructions over many turns.
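
For the agent case, the tool catalog is part of the prefix, so its order needs to be deterministic and the breakpoint can sit on the last tool definition. The sketch below assumes tools are plain dictionaries with a `name` key; `with_cached_tools` and the two example tools are hypothetical.

```python
import copy


def with_cached_tools(tools: list[dict]) -> list[dict]:
    """Return a name-sorted copy of the tool list with a cache breakpoint on the last tool."""
    ordered = sorted(copy.deepcopy(tools), key=lambda tool: tool["name"])
    if ordered:
        # Marking the last tool caches the whole tool list up to and including it.
        ordered[-1]["cache_control"] = {"type": "ephemeral"}
    return ordered


# Hypothetical tool definitions, registered in an arbitrary order.
TOOLS = [
    {
        "name": "search_docs",
        "description": "Search the product documentation.",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
]

cached_tools = with_cached_tools(TOOLS)
print([tool["name"] for tool in cached_tools])
```

Sorting by name is one arbitrary but stable choice; any ordering works as long as it is the same on every turn. The result would be passed as the `tools` argument of the request.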

Pair caching with batching for offline workloads, and with right-sizing the model (Choosing a Model) for the biggest combined savings — see Cost & Latency.
