Advanced

Cache mechanics, pricing tiers, TTL durations, and minimum token thresholds change as Anthropic updates the platform. Do not rely on any specific numbers from third-party guides. Always check the official prompt caching docs and the Models & Pricing page for current values.

Prompt Caching Economics

Every time you call the Claude API, you pay for every input token you send — including your system prompt, your tool definitions, and any context you inject. If you're making many calls with the same large prefix, you're paying to process that prefix from scratch every single time.

Prompt caching changes that. You mark a stable portion of your prompt as cacheable. The first call processes it and stores it. Subsequent calls that hit the cache skip that processing — and pay a fraction of the normal rate for those tokens.

The savings are not cosmetic. For applications with large, stable system prompts or heavy context, caching can change the economics of a feature from "too expensive to ship" to "basically free to run."

The mental model: stable prefix, volatile suffix

Think of every API call as two parts:

The stable prefix — content that does not change across calls. This is where caching applies. Examples:

Your system prompt
Tool definitions
A large reference document or codebase you inject on every call
A long few-shot example block

The volatile suffix — content that changes per call. This is where caching does not apply. Examples:

The current user message
Real-time data you inject per request
Conversation history that grows with each turn

The rule is simple: structure your prompts so that the stable content comes first, and the changing content comes last. Cache breakpoints are positional — everything before the marked breakpoint is eligible for caching; everything after is not.

If you put dynamic content before static content, you break the cache, because the prefix changes on every request.

How the cost model works

Prompt caching introduces a three-way split in how input tokens are priced:

Token type	When it occurs	Cost relative to normal input
Cache write	First call, or after cache expires	Higher than normal input
Cache read	Subsequent calls that hit the cache	Much lower than normal input
Regular input	Tokens after the last cache breakpoint	Normal rate

The exact multipliers are on the official pricing page and fluctuate — check them directly. What does not change is the structure: writes cost more than normal, reads cost much less. The crossover point — when you've made enough cached calls to recoup the write overhead — comes quickly on any prompt with substantial stable content.

Latency follows the same pattern. Cache reads skip the full processing of the cached portion, which reduces time-to-first-token meaningfully on calls with large prefixes.

When prompt caching helps

Caching pays off when two conditions are both true:

You have a substantial stable prefix (there is a minimum token threshold below which caching is not available — check current docs for the exact number by model).
You make that prefix is reused frequently enough to hit the cache more than occasionally.

Scenarios where caching is a natural fit:

Document Q&A — the same large document is injected for many user questions.
Coding assistants — a large codebase or file tree is included in every request.
Agentic loops — the same system prompt and tool definitions are sent on every step of a multi-step workflow.
Conversational agents with long instructions — a detailed persona, rule set, or knowledge base that never changes call-to-call.
Batch processing — many inputs run against the same template.

When prompt caching does not help

Caching is not useful when:

Your prompts are short (below the minimum cacheable token threshold).
Your prefix changes on every call — injecting a timestamp, a user-specific context, or any personalization into the "stable" part invalidates the cache.
You make infrequent calls with long gaps between them. Cached content expires after a TTL (a duration you can configure, within the limits the API supports). If your traffic is sparse, you'll mostly be paying write costs with few reads.
Your prefix is small relative to the per-call dynamic content. The savings scale with the size of what's cached.

Structuring prompts to be cache-friendly

The only structural requirement is ordering: stable content must come before volatile content.

In practice:

[System prompt — instructions, persona, rules]
[Tool definitions — if static]
[Large injected documents or context — if the same across calls]
--------- cache breakpoint here ---------
[Dynamic per-call content — user message, retrieved chunks specific to this request]

You mark the breakpoint with a cache_control field on the last block you want to include in the cache. Everything before that marker is eligible for caching; everything after is regular input.

You can place up to four explicit breakpoints in a single request. This is useful when different parts of your prompt change at different frequencies — for example, tool definitions change rarely, conversation history changes every turn. Each section can have its own breakpoint.

Cache invalidation rules

The cache is sensitive to ordering. Any change to a cached block, or to any block that comes before it in the prompt, invalidates the cache at that breakpoint and all subsequent ones.

Changes to tool definitions invalidate all caches. Changes to the system prompt invalidate the system and message caches. Changes mid-conversation affect only the message cache.

The practical implication: if you're injecting anything that changes per request, make absolutely sure it lives after the last cache breakpoint, not before or within it.

Pre-warming and monitoring

You can pre-warm the cache before user traffic arrives by sending a request with max_tokens: 0 — this writes the cache without generating output. Useful for batch jobs or for front-loading the write cost during off-peak hours.

The API response's usage field tells you how many tokens were read from cache (cache_read_input_tokens), written to cache (cache_creation_input_tokens), and billed as regular input. Monitor these to verify your caching is actually hitting and to measure the savings you're realizing.

Use the Cost Calculator to model expected savings before committing to a caching architecture.

The bottom line

Prompt caching is not a micro-optimization. For any application that sends the same large prefix repeatedly, it's a structural economic decision. The model for thinking about it is straightforward: stable content goes first, volatile content goes last, put the breakpoint at the boundary.

If you're building against the API and haven't looked at caching yet, check your current prompts for large stable sections. If they exist, enabling caching is usually low-effort and the savings are real.

The mental model: stable prefix, volatile suffix​

How the cost model works​

When prompt caching helps​

When prompt caching does not help​

Structuring prompts to be cache-friendly​

Cache invalidation rules​

Pre-warming and monitoring​

The bottom line​

Related​