
# Anthropic prompt caching with the AI SDK: the 95%-off stack

Batch + cache stacks to 95% off on Anthropic system tokens: $1.00 → $0.50 → $0.05 per MTok. The math is in the docs but nobody publishes it. Here's the provider wrapper, the dashboard recipe, and the cases where caching never hits.

Anthropic's pricing page mentions in one line that the Batch API discount stacks with prompt-cache reads. Their docs don't show the math. The Vercel AI SDK's docs cover the syntax for cache_control but not how to measure cache hit rate. The result is a stack where the savings are real and documented, and almost nobody publishes the worked example. This page is that worked example: the provider wrapper that gets caching on by default, the dashboard recipe that tells you why your cache isn't hitting, and the specific gotchas that turn "should save 90%" into "actually saved 12%."

The math

Start with Claude Haiku 4.5 system-token input at $1.00/MTok. (Sonnet and Opus scale up, but the multipliers are the same.)

| Step | Effective rate | Discount vs base |
|---|---:|---:|
| Standard input | $1.00 / MTok | — |
| Cache write (first call) | $1.25 / MTok | +25% (caching has a write premium) |
| Cache read (subsequent calls) | $0.10 / MTok | -90% |
| Batch API standard input | $0.50 / MTok | -50% |
| Batch API + cache read | $0.05 / MTok | -95% |

The 1-hour cache tier shifts the read rate to roughly $0.10/MTok instead of $0.05. For a workload that revisits the same 50K-token system prompt 1,000 times a day:
  • No caching, no batch: 50K × 1,000 × $1.00/M = $50/day
  • Caching only: 50K × 1 × $1.25/M (first write) + 50K × 999 × $0.10/M = $0.06 + $5.00 ≈ $5.06/day
  • Caching + batch (eligible workloads): roughly $2.50/day
The savings scale with prefix size and call volume. A 100K-token cached prefix at 1,000 calls/day saves roughly $95/day vs uncached (at the stacked $0.05 rate). At 10,000 calls/day, $950/day. The break-even on the cache-write premium happens around the second call: the 25% premium on one write is repaid by the 90% discount on the first read. This is the part nobody publishes. Anthropic mentions stacking in passing; finout.io covers the batch discount specifically; the AI SDK docs cover the syntax. None of them connects the three. The sketch below runs the numbers.
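A minimal sanity-check of the arithmetic above, as a runnable sketch. The rates are hard-coded from the table (Haiku figures); the constants are illustrative, not an API.

```ts
// Per-day cost of a cached 50K-token prefix at 1,000 calls/day.
// Rates ($/MTok) come from the table above; adjust for your model.
const PREFIX_TOKENS = 50_000;
const CALLS_PER_DAY = 1_000;

const BASE = 1.0;          // standard input
const CACHE_WRITE = 1.25;  // first call pays the 25% write premium
const CACHE_READ = 0.1;    // warm 5-minute-tier reads
const STACKED_READ = 0.05; // Batch API + cache read

const cost = (tokens: number, ratePerMTok: number) =>
  (tokens * ratePerMTok) / 1e6;

const noCache = cost(PREFIX_TOKENS * CALLS_PER_DAY, BASE);
const cacheOnly =
  cost(PREFIX_TOKENS, CACHE_WRITE) +
  cost(PREFIX_TOKENS * (CALLS_PER_DAY - 1), CACHE_READ);
const cachePlusBatch =
  cost(PREFIX_TOKENS, CACHE_WRITE) +
  cost(PREFIX_TOKENS * (CALLS_PER_DAY - 1), STACKED_READ);

console.log({ noCache, cacheOnly, cachePlusBatch });
// → { noCache: 50, cacheOnly: ~5.06, cachePlusBatch: ~2.56 }
```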

The provider wrapper

The AI SDK exposes Anthropic's cache_control through providerOptions. The minimal correct shape on a system message:

```ts
import { anthropic } from '@ai-sdk/anthropic';
import { generateText } from 'ai';

const SYSTEM_PROMPT = '…your 30K-token system prompt…';
const userMessage = '…the dynamic user message…';

const result = await generateText({
  model: anthropic('claude-haiku-4-5'),
  messages: [
    {
      role: 'system',
      content: SYSTEM_PROMPT,
      providerOptions: {
        anthropic: { cacheControl: { type: 'ephemeral' } },
      },
    },
    { role: 'user', content: userMessage },
  ],
});
```

Three points of detail that the docs gloss over:

  • Place cacheControl on the boundary, not the whole conversation. The cache breakpoint is the message you mark, and Anthropic caches everything up to and including that marker. Mark the last message of your stable prefix. The user message after it is dynamic and doesn't get cached.
  • The minimum token threshold is per-model: 1024 tokens on Sonnet/Opus, 2048 on Haiku. Below these the SDK still sends the header but Anthropic ignores it. If your prefix is small, caching does nothing.
  • Cache prefixes must be byte-identical. A timestamp in the system prompt invalidates every call. So does a randomly injected ID, a session-specific instruction, or a model-version string that changes on deploy. The discipline is to keep all variable content in messages after the cache marker.

For multi-turn conversations where you want the entire conversation up to the last assistant turn cached:

```ts
messages: [
  { role: 'system', content: SYSTEM_PROMPT },
  ...priorTurns.slice(0, -1),
  {
    ...priorTurns[priorTurns.length - 1],
    providerOptions: {
      anthropic: { cacheControl: { type: 'ephemeral' } },
    },
  },
  { role: 'user', content: currentUserMessage },
]
```

The cache marker moves with each turn. The 5-minute TTL refreshes on each cache read, so an active conversation stays warm naturally.
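To make the byte-identical rule concrete, here's a minimal before/after sketch building on the snippet above. The date interpolation is an illustrative offender, not a prescribed pattern:

```ts
// BAD: a timestamp inside the cached prefix means the prefix is never
// byte-identical, so every call pays the 25% write premium and never reads.
const badSystemPrompt = `${SYSTEM_PROMPT}\nToday is ${new Date().toISOString()}`;

// GOOD: freeze the prefix, put variable content after the cache marker.
const messages = [
  {
    role: 'system' as const,
    content: SYSTEM_PROMPT, // byte-identical on every call
    providerOptions: {
      anthropic: { cacheControl: { type: 'ephemeral' } },
    },
  },
  {
    role: 'user' as const,
    content: `Today is ${new Date().toISOString()}.\n\n${userMessage}`,
  },
];
```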

Measuring hit rate

The AI SDK returns Anthropic's full usage object on each response. The relevant fields:
```ts
result.usage = {
  inputTokens: 30_412,
  outputTokens: 287,
  cachedInputTokens: 28_900,       // tokens read from cache (cheap)
  cacheCreationInputTokens: 1_512, // tokens written to cache (expensive)
}
```

Cache hit rate is cachedInputTokens / inputTokens. A healthy long-prefix conversation lands in the 85-95% range after the first call. A first call always shows 0% cached and a non-zero cacheCreationInputTokens. The minimum useful dashboard:
  • Hit rate over time (rolling 1h, 24h, 7d)
  • Hit rate by endpoint — which routes aren't caching and why
  • Cache-write tokens / cache-read tokens — the write-to-read ratio tells you if your prefix is stable or churning
  • Cost-saved vs no-cache baseline — multiply cached tokens by ($1.00 - $0.10)/M for the saved spend; this is the number to share with finance
A dashboard panel in 30 lines of SQL against any observability backend (Langfuse, Phoenix, Helicone, Logfire — see /topic/llm-observability) is enough.
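A per-request helper to feed those panels, as a sketch. It assumes the usage shape shown above; the saved-spend rates are the Haiku figures from the math section:

```ts
// Derive the dashboard metrics from one response's usage object.
function cacheStats(usage: {
  inputTokens: number;
  cachedInputTokens?: number;
  cacheCreationInputTokens?: number;
}) {
  const read = usage.cachedInputTokens ?? 0;
  const written = usage.cacheCreationInputTokens ?? 0;
  return {
    hitRate: usage.inputTokens > 0 ? read / usage.inputTokens : 0,
    // A high write-to-read ratio means the prefix is churning, not stable.
    writeToRead: read > 0 ? written / read : Infinity,
    // Saved spend vs the no-cache baseline: ($1.00 - $0.10) per MTok read.
    dollarsSaved: (read * (1.0 - 0.1)) / 1e6,
  };
}
```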

The diagnosis tree when caching doesn't hit

In descending order of frequency:

1. Prefix below minimum tokens. Symptom: cachedInputTokens is always 0, no error from the API. Check the prefix size. If it's under 1024 (Sonnet/Opus) or 2048 (Haiku) tokens, the cache_control header is silently ignored. Fix: add shared context to the prefix until it crosses the threshold, or accept that this endpoint doesn't cache.
2. Prefix not byte-identical across requests. Symptom: cachedInputTokens is 0 on the second call but cacheCreationInputTokens is non-zero on every call (you're paying the write premium every time and never reading). Cause: something in the prefix changes between calls. Common offenders: timestamps, request IDs, session IDs, model-version strings, dynamically generated instruction text. Fix: log the cached prefix from two consecutive calls and diff them (a sketch follows this list).
3. Prefix more than 20 blocks deep. Symptom: cache works for short conversations but stops hitting after several turns. Cause: Anthropic looks back through the last 20 content blocks for a match. Conversations with many image blocks, tool-result blocks, or long structured content hit the limit fast. Fix: consolidate the cached prefix into a single content block at the start.
4. TTL expired. Symptom: cache hits intermittently: works during active use, misses after idle periods. Cause: the 5-minute ephemeral cache expired. Fix: move to the 1-hour tier (different type value, slightly higher read cost), or accept the miss for low-traffic routes.
5. Cross-region cold. Symptom: cache hits in one region, misses in another. The cache is region-scoped. Fix: pin requests to a single region if cache hit rate is critical, or accept the regional warm-up cost.
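A minimal diff sketch for cause 2. The two prefix strings are hypothetical stand-ins for whatever you log from consecutive requests:

```ts
// Find where two logged prefixes diverge (usually a timestamp or ID).
function firstDivergence(a: string, b: string): number {
  if (a === b) return -1; // byte-identical: caching should hit
  let i = 0;
  while (i < Math.min(a.length, b.length) && a[i] === b[i]) i++;
  return i;
}

// Hypothetical prefixes captured from two consecutive calls.
const call1 = 'You are a support agent. Generated 2026-01-05T10:00:00Z …';
const call2 = 'You are a support agent. Generated 2026-01-05T10:00:07Z …';

const offset = firstDivergence(call1, call2);
if (offset >= 0) {
  console.log(`prefixes diverge at byte ${offset}:`);
  console.log(call1.slice(offset, offset + 40));
  console.log(call2.slice(offset, offset + 40));
}
```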

Where caching doesn't help

Three classes of workload where the discipline is wasted:

  • Short prompts. Sub-1024-token system prompts can't cache. Most chatbot use cases that don't pass document context fall here.
  • High-variance prompts. If each user generates a unique system prompt (per-tenant personalization, dynamic policy text), there's no shared prefix to cache. Architectural fix: separate the personalization into a non-cached suffix and cache the policy layer (sketched below).
  • Single-shot batch. Caching shines on repeated reads. A batch run that fires each prompt exactly once pays the write premium with no reads to amortize against. Use the Batch API discount alone here, not caching.
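One way to cut that split, as a sketch. POLICY_PROMPT, getTenantPrefs, and the example values are assumptions for illustration, not a prescribed API:

```ts
// Shared policy layer: identical for every tenant, so it caches.
const POLICY_PROMPT = '…stable policy and instruction text, above the 2048-token Haiku threshold…';

// Hypothetical per-tenant lookup and inputs.
const getTenantPrefs = (id: string) => `Tenant ${id} prefers concise answers.`;
const tenantId = 'acme';
const userMessage = 'Summarize my open tickets.';

const messages = [
  {
    role: 'system' as const,
    content: POLICY_PROMPT,
    providerOptions: {
      anthropic: { cacheControl: { type: 'ephemeral' } },
    },
  },
  // Per-tenant personalization lives after the cache marker, uncached.
  {
    role: 'user' as const,
    content: `${getTenantPrefs(tenantId)}\n\n${userMessage}`,
  },
];
```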

The Opus 4.7 tokenizer note

Opus 4.7's tokenizer produces up to 35% more tokens than 4.6 on the same input text (Anthropic's pricing docs note this directly; finout.io has the analysis). The headline $5/$25 per MTok is unchanged from 4.6, but real bills go up on identical workloads. Caching mitigates this because cache reads are billed at the cached token count, which is stable once written. Teams running 4.6 → 4.7 should re-validate cache hit rates after the upgrade: the absolute hit count stays the same, but the percentage shifts because total input tokens went up.
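Back-of-envelope on what that inflation does to an input-heavy bill, taking the up-to-35% figure as the worst case. The workload size is illustrative:

```ts
// 1M input tokens/day as counted by the 4.6 tokenizer, re-tokenized
// under 4.7 at worst-case +35%, with the unchanged $5/MTok input rate.
const tokensPerDay46 = 1_000_000;
const tokensPerDay47 = tokensPerDay46 * 1.35;
const inputRatePerToken = 5 / 1e6; // Opus headline input, $/token

console.log(tokensPerDay46 * inputRatePerToken); // $5.00/day on 4.6
console.log(tokensPerDay47 * inputRatePerToken); // $6.75/day on 4.7, same workload
```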


Frequently asked

How much does prompt caching actually save on Anthropic?
Up to 95% on the cached portion, when batch and cache stack. Standard system-token input is $1.00 per million on Claude Haiku 4.5. The Batch API drops that to $0.50 (50% discount). A cache read on a prefix that's already warm drops it further to $0.05 (a further 90% off the batch rate). The two discounts multiply: batch + cache lands at $0.05/MTok on the system-token portion, which is 95% off the headline rate. Anthropic's pricing docs mention that the stacking exists but don't show the worked example.
Does the Vercel AI SDK support Anthropic prompt caching out of the box?
Yes, since v5. You pass cache_control through providerOptions on the message you want cached: providerOptions: { anthropic: { cacheControl: { type: 'ephemeral' } } }. The SDK forwards it to the Anthropic Messages API. The wrinkle is that the AI SDK doesn't aggregate cache hit rates for you: you read them from the response usage fields (cacheCreationInputTokens vs cachedInputTokens, as shown above) and aggregate yourself.
What's the minimum prompt size for caching to work?
1024 tokens on Sonnet and Opus, 2048 tokens on Haiku. Below those thresholds the cache_control header is silently ignored. This is why teams new to caching report '50% of my requests have no cache hit' — they're trying to cache prompts that are too small. The fix is either (a) increase the prefix size by including more shared context, or (b) accept that small prompts don't cache and focus the discipline on the long-prefix endpoints.
What's the cache lookback window and why does it matter?
Anthropic looks back through the last 20 content blocks for a cache match. If your prompt has more than 20 distinct blocks before the cached prefix, the lookup misses. This is the second most common reason caches don't hit. Practical fix: keep your cached prefix in the first 1-3 blocks of the message array. The lookback detail is in the official caching docs but rarely cited in tutorials.
How long does the cache live?
Five minutes by default (ephemeral cache). Anthropic announced a 1-hour cache tier in 2025 with a different price point — read cost roughly $0.10/MTok instead of $0.05, write cost slightly higher. The 5-minute window is enough for conversation-style workloads where the user takes <5 min between turns. The 1-hour tier is for batch processing or long-running agents that revisit the same context.
Why does my cache never hit?
Three causes, in descending order of frequency. (1) The prompt is below the minimum token threshold (1024/2048). (2) The cached prefix isn't byte-identical across requests; even a single-character difference invalidates it, and a system date in the system prompt is the classic offender. (3) The cached block is further than 20 blocks from the start of the message array. The dashboard recipe on this page tells you which of the three is hitting you.
