PLAYBOOK

Prompt caching: the 4 patterns that compound savings

Companion to our 71% bill-cut case study. The case study said what happened. This says how — what to cache, where to put breakpoints, the four prompt shapes that produce the largest sustained savings, and the two anti-patterns we keep finding in audits.

READ · 9 MIN | UPDATED · 2026-04-09 | BY · PINTOED AI STUDIO

What caching actually does

Prompt caching lets the provider skip recomputation on the input prefix that hasn't changed between calls. You pay full price the first time (slightly more, in fact: cache writes carry a ~25% premium over list rate). After that, the cached prefix runs at roughly 10% of list rate and at much lower latency.

The mental model: every call is a new conversation, but the stuff at the front of every conversation is the same — system prompt, tool definitions, retrieved corpus, ongoing conversation history. That stuff goes in the cache. Only the new turn pays full input price.

The math: a workload with a 12K-token system prompt and a 200-token user turn pays 12,200 tokens of input per call uncached. Cached, the steady-state cost is roughly 1,400 tokens of effective input (12K × 10% + 200 × 100%). That's an ~88% reduction in input cost on every warm call.
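A quick sanity check of that arithmetic in Python, with the 10% cached-read rate as the assumption from above:

```python
PREFIX = 12_000   # system-prompt tokens, cached after the first call
TURN = 200        # per-call user turn, always full price
READ_RATE = 0.10  # cached reads billed at ~10% of list input rate

cold = PREFIX + TURN              # first call: 12,200 effective tokens
warm = PREFIX * READ_RATE + TURN  # every call after: 1,400 effective tokens

print(f"cold call: {cold:,} effective input tokens")
print(f"warm call: {warm:,.0f} effective input tokens")
print(f"reduction: {1 - warm / cold:.1%}")  # ~88.5%
```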

Where to put cache breakpoints

Cache breakpoints are the markers that say "everything before this is cacheable, the cache key is the cumulative content up to this point." Place them at semantic boundaries — places where content stops being shared across calls and starts being per-call.

The breakpoints we use, in order of how much we cache at each:

  1. End of system prompt. Always. The system prompt is by definition stable across calls.
  2. End of tool definitions. Always, when tools are present. Tool defs are stable.
  3. End of static context block. The "here's everything we know about this customer / this document / this corpus" block. Stable across the conversation, changes per session.
  4. End of conversation history (mid-conversation). The accumulated history grows; placing a breakpoint at the end of the most recent turn lets future turns hit the cache up to here.

You don't need a breakpoint between every section — too many breakpoints fragment the cache without proportional benefit. Two to four is the right range for most workloads.
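Here's a minimal sketch of the first three breakpoints against Anthropic's Messages API. The cache_control markers are the breakpoints; the prompt strings, tool definition, and model name are placeholder values of ours:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "..."     # stable across every call
CUSTOMER_CONTEXT = "..."  # stable for this session
TOOL_DEFS = [{
    "name": "lookup_order",
    "description": "Fetch an order by id.",
    "input_schema": {"type": "object",
                     "properties": {"order_id": {"type": "string"}}},
}]

response = client.messages.create(
    model="claude-sonnet-4-5",  # model name illustrative
    max_tokens=1024,
    system=[
        # Breakpoint 1: end of system prompt.
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
    ],
    # Breakpoint 2: end of tool definitions (mark the last tool).
    tools=TOOL_DEFS[:-1] + [
        {**TOOL_DEFS[-1], "cache_control": {"type": "ephemeral"}}
    ],
    messages=[{
        "role": "user",
        "content": [
            # Breakpoint 3: end of the static context block.
            {"type": "text", "text": CUSTOMER_CONTEXT,
             "cache_control": {"type": "ephemeral"}},
            # Per-call content sits after the last breakpoint.
            {"type": "text", "text": "What's this customer's renewal risk?"},
        ],
    }],
)
```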

Pattern 1: Static system prompt

The simplest and highest-ROI pattern. The system prompt describes the agent's role, persona, output format, refusal rules. None of it changes between calls. Cache it.

Even on workloads where you don't think there's much to cache — a quick classifier, a one-shot extractor — the system prompt tends to be 800–4,000 tokens once you write it carefully. Caching it trims roughly 90% of the cost of whatever share of the input lives in that block, on every warm call.

Hit rate on this pattern: typically 95%+ once you control the system-prompt rev rate. Every prompt change produces a new prefix that has to be written to the cache from scratch (and the old entry expires after ~5 minutes of inactivity); ship prompt changes daily and you can end up paying full price 80% of the time. Slow the prompt churn.

Pattern 2: Cached corpus / cached document

The "stuff the whole document in" pattern from long-context patterns. Document, FAQ, knowledge corpus, or full customer history sits in the cached prefix. Per-call cost is the few hundred tokens of question, not the 200K tokens of content.

This is the pattern that makes long-context workloads economically viable. Without caching, a 200K-token document costs $1.00+ per call on Sonnet. With caching, steady state is closer to $0.10 per call. Same quality, ten-fold cost difference.

Hit rate: depends on session length. A "Q&A about a contract" session is high — every question after the first hits cache. A one-shot lookup is low — you pay full price every time. Pattern 2 wins on multi-turn sessions and loses on single-call workloads.
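A sketch of the shape, same API as above; contract.txt and the question strings are our placeholders:

```python
import anthropic

client = anthropic.Anthropic()
contract_text = open("contract.txt").read()  # the big document

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                # The document rides in the cached prefix.
                {"type": "text", "text": contract_text,
                 "cache_control": {"type": "ephemeral"}},
                # Only the question is new input on each call.
                {"type": "text", "text": question},
            ],
        }],
    )

# First call writes the cache; every later question in the
# session reads the document at ~10% of list rate.
ask("What are the termination clauses?")
ask("Any auto-renewal language?")
```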

Pattern 3: Conversation-history cascade

For agents and chat sessions, the conversation grows. Each new turn appends to the history. If you place a breakpoint at the end of the latest assistant turn, the next user turn re-hits cache for everything up to that point — 80%+ of the conversation's accumulated tokens.

The trick that compounds: keep the breakpoint moving. After each turn, move it to the new tail. Hit rate at every step stays high; the accumulated savings across a 30-turn agent loop are large.

Total input across a long agentic session then scales closer to N × turn-size rather than the uncached worst case of N² × turn-size, where every turn re-pays the whole accumulated prefix. That's the difference between an agent loop you can afford and one you can't.
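A sketch of the moving breakpoint: strip the marker from the previous tail, pin it to the newest turn. Helper names and the model string are ours:

```python
import anthropic

client = anthropic.Anthropic()
history = []  # accumulated conversation, grows every turn

def send_turn(user_text: str) -> str:
    # Rotate the breakpoint: remove the old marker from the previous tail...
    for msg in history:
        for block in msg["content"]:
            block.pop("cache_control", None)
    # ...and place it on the new user turn, so everything up to and
    # including this turn is cacheable for the next call.
    history.append({
        "role": "user",
        "content": [{"type": "text", "text": user_text,
                     "cache_control": {"type": "ephemeral"}}],
    })
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=history,
    )
    history.append({
        "role": "assistant",
        "content": [{"type": "text", "text": resp.content[0].text}],
    })
    return resp.content[0].text
```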

Pattern 4: Multi-tenant cached templates

For products that serve many customers off the same base prompt — a coding assistant where the system prompt is identical across every user, a support drafter where the tone guide is constant — the cache can be shared across tenants.

In practice this means structuring your prompt so the identical-across-tenants block is at the front, with the per-tenant content (customer name, settings, history) appended after a breakpoint. Across thousands of users, the cache hit rate on the shared block is effectively 100%.

The biggest mistake we see in this pattern is interleaving tenant-specific values into the shared block ("you're helping [User Name]"). That single substitution invalidates the cache for that user. Pull it out, put it after the breakpoint.
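A sketch of the split; SHARED_SYSTEM and the tenant dict are our placeholders:

```python
SHARED_SYSTEM = "You are a support drafter. Tone guide: ..."  # identical for every tenant

def draft_reply(client, tenant: dict, ticket_text: str):
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            # Shared block in front: one cache entry serves every tenant.
            {"type": "text", "text": SHARED_SYSTEM,
             "cache_control": {"type": "ephemeral"}},
            # Per-tenant block after the breakpoint: the substitution
            # never touches the shared cache key.
            {"type": "text",
             "text": f"You're helping {tenant['name']}. Settings: {tenant['settings']}"},
        ],
        messages=[{"role": "user", "content": ticket_text}],
    )
```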

The two anti-patterns we keep finding

Anti-pattern 1: dynamic timestamps in the system prompt

"Today's date is 2026-04-09. The current time is 14:32:18." Putting a per-call value in the system prompt invalidates the cache for every single call. The cache hit rate drops to ~0%.

Fix: if the model needs the date, give it as a tool the model can call, or put it after the breakpoint as part of the per-call user message. Never in the cached prefix.
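A sketch of the second fix, with STATIC_CONTEXT and user_question as our placeholders. The date rides in the per-call message, so the cached prefix stays byte-identical:

```python
from datetime import date

STATIC_CONTEXT = "..."  # the stable per-session block
user_question = "..."   # this turn's question

messages = [{
    "role": "user",
    "content": [
        # Cached prefix: no per-call values anywhere in here.
        {"type": "text", "text": STATIC_CONTEXT,
         "cache_control": {"type": "ephemeral"}},
        # Per-call: the date changes daily, the cache doesn't care.
        {"type": "text",
         "text": f"Today's date is {date.today().isoformat()}.\n\n{user_question}"},
    ],
}]
```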

Anti-pattern 2: shuffled or randomized context

Some teams shuffle the order of retrieved docs in the prompt "for better attention." If the order changes between calls, the cache misses. You're paying full price on every call to fight a problem that may not exist.

Fix: stable ordering. If you really need randomness, you've chosen worse cache economics for marginal quality benefit. Almost always not worth it.
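The fix is one line of deterministic ordering; the doc-dict fields here are hypothetical:

```python
def build_context(docs: list[dict]) -> str:
    # docs: retrieved chunks, e.g. [{"id": "a1", "text": "..."}, ...]
    docs = sorted(docs, key=lambda d: d["id"])   # stable order: same set, same bytes
    return "\n\n".join(d["text"] for d in docs)  # identical prefix across calls
```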

How to measure if it's working

The provider's response includes cache_read_input_tokens and cache_creation_input_tokens. Log them both. The metric that matters is cache hit rate by token: cache_read / (cache_read + cache_creation + uncached).
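A logging sketch against the Anthropic SDK; the uncached term in that formula is the usage object's input_tokens field:

```python
import logging

log = logging.getLogger("cache")

def log_cache_stats(response) -> float:
    u = response.usage
    read = u.cache_read_input_tokens or 0         # prefix tokens billed at ~10%
    written = u.cache_creation_input_tokens or 0  # first-time cache writes
    uncached = u.input_tokens                     # tokens billed at full rate
    hit_rate = read / (read + written + uncached)
    log.info("cache hit rate by token: %.1f%% (read=%d written=%d uncached=%d)",
             hit_rate * 100, read, written, uncached)
    return hit_rate
```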

Targets we use, drawn from the per-pattern numbers above:

  1. Pattern 1 (static system prompt): 95%+.
  2. Pattern 2 (cached corpus): session-dependent. High on multi-turn Q&A, near zero on one-shot lookups.
  3. Pattern 3 (conversation cascade): 80%+.
  4. Pattern 4 (multi-tenant): effectively 100% on the shared block.

If your hit rate is under those numbers, the cache breakpoints are wrong, the prefix has dynamic content, or your traffic pattern is ill-suited to caching (one-shot calls). Diagnose before adding more caching.

The cost ceiling, briefly

On a perfectly-cached workload, your steady-state cost approaches output × output_rate + per_call_input × input_rate, plus a small residual for reading the cached prefix. The prefix mostly fades into the background. Almost every audit we run finds 30–70% of cost is removable just by getting the cache right.

For the bill picture across our actual cohort, see what our clients actually pay for Claude. Profile A's caching strategy is the single biggest difference between $19K/mo and $4.9K/mo on similar workload shapes.

The summary

Cache the static prefix at the right boundaries, keep the prefix actually static, measure the hit rate, fix the anti-patterns. Every audit we've ever run found caching headroom. Often a lot of it. It's the cheapest cost reduction available on any LLM workload — and it's usually already supported by the provider's API; teams just haven't enabled it.

Want us to find the caching headroom in your stack? 1-2 week audit, fast payback.

BOOK A BILL AUDIT → RUN THE CALCULATOR →