Opus 4.7 prompt caching cost, the actual math.

Most write-ups stop at “up to 90% off.” That number is true on the right shape of workload and very wrong on the wrong one. Below is the per-line pricing, the break-even threshold (it is two requests, not two thousand), the cache-invalidation gotchas that quietly destroy your savings, and a worked monthly bill for a real three-person SaaS pipeline.

Matthew Diakonov · 9 min read

Direct answer, verified 2026-04-29 against Anthropic docs

Opus 4.7 prompt caching pricing per million tokens

  • Base input: $5.00 / Mtok
  • Base output: $25.00 / Mtok
  • Cache write, 5-minute TTL (1.25x base): $6.25 / Mtok
  • Cache write, 1-hour TTL (2x base): $10.00 / Mtok
  • Cache read and refresh (0.1x base): $0.50 / Mtok

Minimum cacheable prompt: 4,096 input tokens. Maximum 4 explicit cache breakpoints per request. 20-block automatic lookback. Cache reads refresh the TTL on a hit, so a cache stays warm as long as traffic is continuous within the window.
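One practical consequence of the 4,096 token floor: check your prefix length before you rely on caching. A minimal sketch, assuming the official anthropic Python SDK's token-counting endpoint; the model id and prompt string are illustrative stand-ins, not from these docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "Brand voice + policy doc + glossary..."  # illustrative stand-in

# Count what the prefix will actually meter at before trusting cache_control to fire.
count = client.messages.count_tokens(
    model="claude-opus-4-7",  # model id assumed from the article's naming
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "ping"}],
)
if count.input_tokens < 4096:
    print(f"{count.input_tokens} tokens: under the caching minimum, "
          "the breakpoint will be silently ignored")
```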

Source: platform.claude.com/docs/en/build-with-claude/prompt-caching.

The break-even is request two

People keep saying caching is for “heavy” workloads. That is wrong. Run the arithmetic on a single repeated context of size C tokens.

Without caching, two calls cost 2 × C × $5/M = $10C/M. With caching, the first call writes (1.25x) and the second reads (0.1x), so the same two calls cost C × ($6.25 + $0.50)/M = $6.75C/M. By request two, you are already 32.5% ahead. By request twenty (one write plus nineteen reads, $15.75C/M against $100C/M uncached), you are 84% ahead. The headline 90% number is just the asymptote you approach as the read count grows.

32.5%: cheaper than the uncached path by your second call. There is no minimum scale at which caching turns on. (One cache write at 1.25x plus one read at 0.1x is 1.35x, against 2x base input for two uncached calls.)
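If you want to sanity-check the curve yourself, here is the arithmetic as a few lines of Python, using only the rates in the table above:

```python
# Break-even arithmetic for a repeated prefix, using only the published rates.
BASE_IN = 5.00    # $/Mtok, base input
WRITE_5M = 6.25   # $/Mtok, cache write at the 5-minute TTL (1.25x)
READ = 0.50       # $/Mtok, cache read (0.1x)

def savings_pct(calls: int) -> float:
    """Percent saved on the prefix across `calls` requests in one warm window."""
    uncached = calls * BASE_IN              # every call re-meters the prefix
    cached = WRITE_5M + (calls - 1) * READ  # one write, then reads
    return 100 * (1 - cached / uncached)

for n in (2, 20, 100, 1000):
    print(f"{n:>4} calls: {savings_pct(n):.1f}% saved")
# Prints 32.5% at 2 calls, ~84% at 20, then climbs toward the 90% asymptote.
```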

With cache vs without, on the same workload

Concrete shape: 100 calls in a 5-minute window, each carrying an identical 10,000 token system prompt + tools, plus 500 fresh input tokens and 200 output tokens. The fresh and output costs are unchanged in both setups. The cached prefix is where the delta lives.

100 calls, 10,000 token cached prefix

Without cache, every call resends and re-meters the full 10,000 token prefix. The model is the same. The output is the same. The bill is not.

  • Prefix: 100 x 10,000 x $5 / 1M = $5.00
  • Fresh input: 100 x 500 x $5 / 1M = $0.25
  • Output: 100 x 200 x $25 / 1M = $0.50
  • Total: $5.75

With cache, the prefix is written once and read 99 times; the fresh and output lines are unchanged.

  • Prefix write (call 1): 10,000 x $6.25 / 1M ≈ $0.06
  • Prefix reads (calls 2-100): 99 x 10,000 x $0.50 / 1M ≈ $0.50
  • Fresh input: 100 x 500 x $5 / 1M = $0.25
  • Output: 100 x 200 x $25 / 1M = $0.50
  • Total: ≈ $1.31

The $4.44 you save here scales linearly with both prefix size and request count. Multiply by the number of 5-minute windows in your day and you have a daily savings number you can put in a quote without hand-waving.

A worked monthly bill for a 3-person SaaS

The shape: a small Shopify support tool that classifies incoming tickets and drafts a first reply. 10,000 tickets a day. An 8,000 token system prompt covering the brand voice, the policy doc, and tool definitions. 500 tokens of fresh ticket text per call. 200 tokens of output per call. Continuous traffic during business hours, gaps overnight. The numbers below are reproducible, with assumptions named so you can change them and re-run.

opus-4-7-monthly-bill.md
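A sketch of the workings, with every assumption as a named variable so you can change it and re-run. The rates come from the pricing table above and the workload shape from the paragraph before; 30 billing days is my assumption:

```python
# Worked monthly bill for the 3-person SaaS shape described above.
# Rates are $/Mtok from the Opus 4.7 pricing table; 30 billing days assumed.
BASE_IN, BASE_OUT, WRITE_5M, READ = 5.00, 25.00, 6.25, 0.50

tickets_per_day = 10_000
prefix_tokens = 8_000      # cached system prompt + tools
fresh_tokens = 500         # per-ticket text
output_tokens = 200
cache_writes_per_day = 12  # the one judgment call: raise it for bursty traffic
days = 30

M = 1_000_000
# Cached pipeline
writes = cache_writes_per_day * prefix_tokens * WRITE_5M / M
reads = (tickets_per_day - cache_writes_per_day) * prefix_tokens * READ / M
fresh = tickets_per_day * fresh_tokens * BASE_IN / M
out = tickets_per_day * output_tokens * BASE_OUT / M
cached_daily = writes + reads + fresh + out

# Uncached pipeline: the full prefix is metered at base input on every call
uncached_daily = (tickets_per_day * (prefix_tokens + fresh_tokens) * BASE_IN
                  + tickets_per_day * output_tokens * BASE_OUT) / M

print(f"cached:   ${cached_daily:,.2f}/day   ${cached_daily * days:,.0f}/month")
print(f"uncached: ${uncached_daily:,.2f}/day   ${uncached_daily * days:,.0f}/month")
# cached ~ $115.55/day (~$3,466/month); uncached $475/day ($14,250/month)
```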

The cache_writes_per_day = 12 is the only number that needs judgment. If your traffic is fully continuous, you can hit it as low as ~5 (the cache never expires while reads keep refreshing it). If your traffic is bursty with multiple multi-minute gaps per hour, push it to ~50. The savings hold either way: even at 100 writes a day the cache_writes line is $5, which barely changes the total.

The three caches you actually need, in order

Most production setups need three breakpoints, not four. The fourth is usually a premature optimization that costs more in extra writes than it saves on reads.

Order them by stability, not by length

  1. System prompt + tools

     Brand voice, policy doc, JSON-schema tool definitions. Changes weekly at most. This is your highest-leverage cache (its placement is shown in the sketch after this list).

  2. Conversation history

     For multi-turn agents. Cache up to and including the last assistant turn. Each new user turn appends after the breakpoint and reads everything before.

  3. Per-user reference context

     Order history, account profile, the customer's last 30 days of activity. Stable for the duration of one session, often re-used across two or three calls.
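Here is where those three breakpoints land in a Messages API call, as a sketch assuming the official anthropic Python SDK. The model id and every literal string are illustrative stand-ins, not from a real codebase:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative stand-ins; in production these are your real 8K-token prompt,
# tool schemas, and per-user context.
SYSTEM_PROMPT = "Brand voice + policy doc + glossary..."
TOOLS = [{
    "name": "lookup_order",
    "description": "Fetch an order by id.",
    "input_schema": {"type": "object",
                     "properties": {"order_id": {"type": "string"}},
                     "required": ["order_id"]},
}]

response = client.messages.create(
    model="claude-opus-4-7",  # model id assumed from the article's naming
    max_tokens=1024,
    tools=TOOLS,  # tools precede system in prompt order, so breakpoint 1 covers both
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # breakpoint 1: system prompt + tools
    }],
    messages=[
        {"role": "user", "content": "Where is order #1234?"},
        {"role": "assistant", "content": [{
            "type": "text",
            "text": "Order #1234 shipped yesterday.",
            "cache_control": {"type": "ephemeral"},  # breakpoint 2: history through the last assistant turn
        }]},
        {"role": "user", "content": [
            {"type": "text",
             "text": "Customer profile: 14 orders in the last 30 days...",
             "cache_control": {"type": "ephemeral"}},  # breakpoint 3: per-user reference context
            {"type": "text", "text": "New ticket: it arrived damaged."},  # fresh tail, never cached
        ]},
    ],
)
```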

Six gotchas that quietly destroy your cache hit rate

Every one of these has shown up in a production codebase I have scoped. None of them throw an error. The only signal is a bill that looks the same as the no-cache version.

Things that invalidate a prefix without telling you

1. A timestamp templated into the system prompt

   If your prompt opens with 'Today is 2026-04-29 at 14:32 UTC,' every minute is a fresh write. Move time-sensitive context after the cache breakpoint (a sketch of the fix follows this list), or round to the day if the model only needs the date.

2. A user ID or email injected into the system block

   Cache is positional, not semantic. A per-user variable in the cached segment forces a fresh write on every user. Move per-user data into the post-breakpoint message body.

3. tool_choice flipping between calls

   Setting tool_choice on one request and unsetting it on the next invalidates the message cache, even if the messages are identical. Lock tool_choice for the lifetime of a session if you can.

4. Re-ordering or editing tool definitions

   Adding a new tool to the array re-hashes the prefix from the first changed byte forward. If you change tools often, put the volatile ones at the end and pin the stable ones at the front.

5. Image blocks added or reordered mid-session

   Image content is part of the cache key. A vision app that adds a new screenshot per turn is essentially un-cacheable above the image breakpoint. Cache below the images, not above.

6. Falling under the 4,096 token minimum

   If your system prompt is 3,800 tokens you are paying full freight on every call without realising it. Pad with a stable reference block (a code of conduct, a glossary) or move the system content into a longer cached prefix.
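The fix for gotcha 1, as a sketch; the prompt and ticket strings are illustrative. The point is only where the date string lives relative to the breakpoint:

```python
import datetime

# Illustrative stand-ins:
STABLE_SYSTEM_PROMPT = "Brand voice + policy doc, identical bytes on every call..."
ticket_text = "My order arrived damaged."

today = datetime.date.today().isoformat()  # day-granular, not minute-granular

system = [{
    "type": "text",
    "text": STABLE_SYSTEM_PROMPT,            # nothing volatile above this line
    "cache_control": {"type": "ephemeral"},  # the cached prefix ends here
}]
messages = [{
    "role": "user",
    # The date rides after the breakpoint, so it never invalidates the cache.
    "content": f"Today is {today}.\n\n{ticket_text}",
}]
```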

What the request actually looks like

The whole feature is two extra fields. Here is the minimal Python shape: one cache_control breakpoint on the system block, then normal user messages. The first call writes; every subsequent call in the next 5 minutes reads. Verify with the usage block; do not trust the dashboard until you have logged this on at least the first hundred requests.

opus_47_caching.py
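A minimal sketch of that shape, assuming the official anthropic Python SDK; the model id and prompt string are illustrative stand-ins:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative stand-in; in production this is your stable 4,096+ token prefix.
SYSTEM_PROMPT = "Brand voice, policy doc, glossary..."

def ask(question: str):
    return client.messages.create(
        model="claude-opus-4-7",  # model id assumed from the article's naming
        max_tokens=512,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL by default
        }],
        messages=[{"role": "user", "content": question}],
    )

first = ask("Classify this ticket: refund request, order #1001.")
second = ask("Classify this ticket: where is my parcel?")

# Trust the usage block, not the dashboard.
for name, r in (("first", first), ("second", second)):
    u = r.usage
    print(name, u.cache_creation_input_tokens, u.cache_read_input_tokens, u.input_tokens)
# Expected: first  -> cache_creation ~ prefix size, cache_read = 0
#           second -> cache_creation = 0, cache_read ~ prefix size
```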

For the 1-hour TTL, the cache_control field is {"type": "ephemeral", "ttl": "1h"}. For batch, wrap the same body in a batch request and the read price drops another 50%. There is no other configuration. If your code is more complicated than this, you are caching the wrong thing.

What this costs as a one-off engagement

Wiring caching into an existing Opus pipeline is usually a half-day to a day of work for someone who has done it before. The payback on a real $14K monthly bill is the same week you ship. That is the kind of thing that fits cleanly inside the published tier ladder on the c0nsl homepage: a $500 to $2,000 small integration for a single-app cache rollout, the $2,000 to $10,000+ custom system tier if it comes with a real eval harness and a batch pipeline, or a $1,000 to $5,000 monthly retainer if you want the whole Anthropic line item under named ownership. Every number on this page is one I would put in a quote, with the same workings shown above.

The point of publishing the rate is exactly so the reader can do their own ROI math without a discovery call. If your monthly Opus bill is over $10K and you have not wired caching, the engagement cost is a rounding error against the savings in week one.

Walk your real Opus bill, with a named engineer

Bring last month's Anthropic invoice and one production prompt. I come back with a worked caching plan, an ROI estimate against your actual workload, and a fixed-fee quote.

Frequently asked questions

What does Opus 4.7 prompt caching actually cost per million tokens in 2026?

Cache writes with the default 5-minute TTL are charged at 1.25x the base input price, which puts them at $6.25 per million tokens. Cache writes with the 1-hour TTL are 2x base, or $10 per million tokens. Cache reads are 0.1x base, or $0.50 per million tokens. Base input is $5 per million tokens and base output is $25 per million tokens. Those five numbers are the entire pricing surface. Verified against Anthropic's prompt-caching docs on 2026-04-29.

What is the smallest prompt that is even eligible for caching on Opus 4.7?

4,096 input tokens. Anything shorter than that is silently skipped: the API does not error, it just returns 0 in both cache_creation_input_tokens and cache_read_input_tokens. The first signal that you set up caching but it is not actually firing is an invoice that looks identical to the no-cache version. Always log usage on the first hundred requests of a new pipeline and check both fields.

How many cache breakpoints can I set on a single request?

Four. The system also does an automatic 20-block lookback per breakpoint, so it tries to find the longest cached prefix you have written before. In practice four is more than enough: most production stacks only need three (system prompt + tools, conversation history, per-user reference context). Burning a fourth breakpoint for an extra micro-segment usually costs more in writes than it saves on reads.

What invalidates a cached prefix and forces a fresh write?

Any change to the bytes inside the cached segment, plus a few non-obvious ones. Changing tool_choice anywhere in the prompt invalidates the message cache. Adding, removing, or reordering image blocks invalidates it. Editing the tool definitions invalidates it. Even a single character change inside the system prompt invalidates everything from that point forward, because the cache is positional, not semantic. The most common production bug is a system prompt that templates in a timestamp or a per-user variable and silently destroys the cache on every call.

After how many calls does prompt caching pay for itself?

Two. The math: 1 write at 1.25x plus 1 read at 0.1x equals 1.35x base. Two uncached calls of the same content equals 2.0x base. So caching is already cheaper than a duplicate call by request 2 in the same 5-minute window. Every read after that is pure savings. People who say caching is only worth it 'at scale' have not run this calculation. It is worth it on day one if your workload has any repetition.

Does prompt caching stack with the Batch API discount?

Yes. Batch is a flat 50 percent off the metered rate, applied on top of the cache pricing. So a cached read inside a batch request lands at $0.25 per million tokens, a cached write lands at $3.125, and base input lands at $2.50. The one limitation is that max_tokens: 0 pre-warming requests are rejected inside batches, because pre-warming is meant to drop time-to-first-token and batches do not have one. For any non-realtime workload (overnight document processing, scheduled reports, eval runs), batch plus cache is the cheapest seat in the house.

Should I use the 5-minute or the 1-hour TTL?

Use the 5-minute TTL by default. As long as your pipeline has continuous traffic, every read refreshes the TTL, so the cache stays warm without you having to upgrade. Reach for the 1-hour TTL when you have bursty traffic with gaps longer than 5 minutes (a support pipeline that goes quiet between 2 and 6 a.m., a finance batch that runs once an hour). The math: a 1-hour write costs 2x base versus the 5-minute's 1.25x, so it pays for itself as soon as gaps would otherwise force a second 5-minute re-write within the hour (2 x 1.25x = 2.5x against a single 2x). Keeping a 5-minute cache alive through six re-writes an hour would cost 7.5x for the same coverage.

How do I check that caching is actually happening on my account?

Read the usage block on every response. The fields are cache_creation_input_tokens (what was just written), cache_read_input_tokens (what was just read from a prior write), and input_tokens (what was sent fresh, after the last breakpoint). On the first request of a new conversation you should see cache_creation roughly equal to your system prompt size. On the second request, cache_creation should drop to zero and cache_read should be roughly equal to that same number. If neither field is moving you have not crossed the 4,096 token minimum, or you have a templated value invalidating the prefix.

What is a realistic monthly cost for a small SaaS team that wires this up correctly?

A 3-person SaaS doing 10,000 daily ticket classifications on Opus 4.7 with an 8K cached system prompt, 500 fresh input tokens per ticket, and 200 output tokens per ticket lands at roughly $115 per day, or about $3,450 per month. The same workload run uncached is closer to $14,250 per month. The savings come from the cached portion, not the fresh portion: the 500 fresh input tokens and the 200 output tokens are unchanged in either setup. Wiring the cache correctly is the difference between a $3.5K monthly bill and a $14K one for the exact same product behavior.