The budget_tokens knob is gone. The latency tradeoff did not go with it.
Most posts on this topic still tell you to set a thinking token budget. On Opus 4.7 that exact request returns a 400 error. The tradeoff is real, the knobs changed, and the posts have not caught up. Five effort levels, one underdocumented display flag, three providers with a 3.5x time-to-first-token spread. Here is what each one actually does, with numbers I pulled on 2026-05-04.
Direct answer, verified 2026-05-04
How much latency does Opus extended thinking add?
Roughly 9 to 32 seconds of time to first token on Opus 4.7 with adaptive thinking on, depending on provider and effort. You no longer set a thinking token budget directly: as of Opus 4.7, `thinking: {type: "enabled", budget_tokens: N}` returns a 400. The replacement is `output_config.effort` with five levels (low, medium, high, xhigh, max) plus the `thinking.display: "omitted"` flag for streaming surfaces.
Sources: Anthropic extended thinking docs, effort parameter docs, Artificial Analysis Opus reasoning provider page.
What changed on Opus 4.7
The old playbook was simple: enable thinking, set `budget_tokens` to a number, watch the model burn that many internal tokens before answering. Tune the number, tune the latency. That contract was stable from the first reasoning models through Opus 4.6.
On `claude-opus-4-7` that exact request body returns a 400. The platform docs are blunt: “Manual extended thinking is no longer accepted and returns a 400 error.” The replacement is adaptive thinking, `thinking: {type: "adaptive"}`, with depth controlled by a separate `output_config.effort` field. Adaptive thinking decides whether to think on a given prompt; effort decides how deep to think when it does.
That is a real architectural shift, not a flag rename. You are no longer pinning a number; you are giving the model a behavioral signal and letting it decide. For prompts where you used to set a small budget to keep latency tight, adaptive thinking will often skip the thinking phase entirely now. For prompts where you used to set a large budget, the new `xhigh` tier exists specifically to cover long-horizon agentic work.
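Here is the shape change as a minimal sketch, hitting the Messages API with plain fetch so no SDK version is assumed. The field names follow the docs quoted above; the prompt and token ceiling are placeholders.

```ts
// Old shape: pinned budget. Per the docs quoted above, this body now
// returns a 400 on claude-opus-4-7 (still accepted on 4.6, but deprecated).
const oldBody = {
  model: "claude-opus-4-7",
  max_tokens: 16000,
  thinking: { type: "enabled", budget_tokens: 8000 }, // 400 on 4.7
  messages: [{ role: "user", content: "Refactor this module." }],
};

// New shape: adaptive thinking plus an effort signal. The model decides
// whether to think at all; effort decides how deep when it does.
const newBody = {
  model: "claude-opus-4-7",
  max_tokens: 16000,
  thinking: { type: "adaptive" },
  output_config: { effort: "xhigh" }, // low | medium | high | xhigh | max
  messages: [{ role: "user", content: "Refactor this module." }],
};

const res = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "x-api-key": process.env.ANTHROPIC_API_KEY!,
    "anthropic-version": "2023-06-01",
  },
  body: JSON.stringify(newBody),
});
```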
“A 3.5x time-to-first-token spread for Claude Opus 4.7 reasoning on identical model weights, between Amazon Bedrock (32.50 s) and Google Vertex (9.29 s) at default effort. The serving stack moves the wall clock more than the prompt does.”
Artificial Analysis, Claude 4 Opus thinking provider page, snapshotted 2026-05-04
The five effort levels, what they actually mean for latency
Effort is supported on Opus 4.7, Opus 4.6, Sonnet 4.6, Opus 4.5, and Mythos Preview. Five levels, with the new `xhigh` tier exclusive to 4.7. The docs describe what each one is for; the piece they understate is the latency shape, which is what matters if you are wrapping this in a chat surface or a coding agent.
| Effort level | Where supported | Use it for | Latency shape |
|---|---|---|---|
| low | Opus 4.7, 4.6, Sonnet 4.6 | Short scoped tasks, classification, routing subagents. May skip thinking entirely. | fastest, often sub-2 second TTFT on a non-thinking turn |
| medium | Opus 4.7, 4.6, Sonnet 4.6 | A drop-in for the average workflow where you want good results at lower cost. | moderate thinking, TTFT typically 4-8 seconds |
| high (default) | All effort-supporting models | Complex reasoning, difficult coding, agentic tasks. Equivalent to omitting the parameter. | default benchmarked TTFT (Vertex 9.29s, Anthropic 10.04s, Bedrock 32.50s) |
| xhigh | Opus 4.7 only | Recommended starting point for coding agents and long-horizon agentic work with millions of tokens. | deeper thinking, TTFT 12-20 seconds typical, more output tokens |
| max | Mythos Preview, Opus 4.7, 4.6, Sonnet 4.6 | Reserve for genuinely frontier problems. Often overthinks structured-output tasks. | deepest thinking, TTFT 17+ seconds on Anthropic at max effort |
Latency shapes are blended observations from Anthropic's documentation and Artificial Analysis benchmarks for Opus 4.7 reasoning at default effort, plus the page on this site that walks through wall-clock numbers per agent turn. Re-pull on the day you make a decision; provider stacks change.
The lever nobody mentions: thinking.display = omitted
Hidden in the streaming section of the extended thinking docs, there is a single configuration value that drops perceived latency without changing model behavior at all: `thinking: {type: "adaptive", display: "omitted"}`. Set it, and the server stops streaming `thinking_delta` events. The model still thinks; you still pay for the thinking tokens; the cache still works. What changes is that the wire payload between the model finishing thinking and the user seeing the first character of the answer drops from N seconds of streamed internal reasoning to a single signature event.
For a chat surface that does not surface thinking, this is free latency. For a coding agent that pipes the response into a tool call without rendering it, this is also free latency. The catch is that any UI feature that depends on watching the thinking stream (a thought bubble, a debugger view, a running summary) will go quiet. If you have one of those, leave display on the default. If you do not, set `omitted` and reclaim the streaming window.
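A minimal request-body sketch for the no-thinking-surface case, assuming the field shape from the docs quoted above:

```ts
// Request body for a surface that never renders thinking. The model
// still thinks and you still pay for those tokens; the server just stops
// streaming thinking_delta events, so the first visible byte arrives
// right after the signature event instead of after the full reasoning.
const body = {
  model: "claude-opus-4-7",
  max_tokens: 16000,
  stream: true,
  thinking: { type: "adaptive", display: "omitted" },
  output_config: { effort: "high" },
  messages: [{ role: "user", content: "Summarize this incident report." }],
};
```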
The provider you point at moves the number more than the prompt
Same model code, same weights, different infrastructure. Pull the Artificial Analysis provider page for Opus 4.7 reasoning. Sort by time to first token. The numbers as of 2026-05-04:
| Provider | TTFT | Output speed | Notes |
|---|---|---|---|
| Google Vertex | 9.29 s | 34.2 t/s | Fastest of the three on 4.7 reasoning. |
| Anthropic API | 10.04 s | 33.3 t/s | Reference baseline; first-party serving stack. |
| Amazon Bedrock | 32.50 s | 17.6 t/s | Roughly 3.5x the TTFT of Vertex on identical weights. |
That is a 3.5x spread for free, before you touch effort, before you touch display, before you change a single character of the prompt. For a streaming chat surface, the difference between Vertex and Bedrock is the difference between a tolerable wait and a user assuming the page hung. If you can pick the provider, this is the cheapest latency win available. If you cannot (compliance, vendor lock, an org-wide AWS bill), the rest of the knobs on this page do more of the work.
Four numbers I track on every Opus deployment
Three of these come from public benchmarks; the fourth is what you pay for max effort on Anthropic's own API in adaptive reasoning, sourced from the wall-clock work on a sibling page on this site.
- Google Vertex TTFT at default effort: 9.29 s (Artificial Analysis, 2026-05-04)
- Anthropic API TTFT at default effort: 10.04 s (same snapshot)
- Amazon Bedrock TTFT at default effort: 32.50 s (same snapshot)
- Anthropic API TTFT at max effort: 17+ s (the sibling wall-clock page)
When extended thinking earns its latency
Anthropic's own 4.7 announcement reports a 13 to 14 percent lift over 4.6 on coding benchmarks at fewer tokens, and roughly a third of the tool errors. That gap is large enough to matter on ambiguous, multi-file, or cross-system work, where a wrong diff triggers another full agent loop and another wall-clock window. On those tasks the math comes out cleanly: an extra 10 to 17 seconds of TTFT at xhigh or max is cheaper than rerunning a 4-minute agent turn at high.
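A back-of-envelope version of that math, as a sketch where the retry rate is an assumed number rather than a measurement:

```ts
// Break-even check: is a deeper-effort turn cheaper in expectation than
// a cheaper turn that sometimes has to be rerun? Numbers are the
// illustrative ones from this section, not benchmarks.
const turnSeconds = 240;    // one 4-minute agent turn
const extraTtft = 17;       // added TTFT at xhigh/max, upper end
const retryRateHigh = 0.10; // assumed: fraction of high-effort turns that ship a wrong diff

const expectedHigh = turnSeconds + retryRateHigh * turnSeconds; // 264 s
const expectedXhigh = turnSeconds + extraTtft;                  // 257 s

// xhigh wins whenever retryRateHigh * turnSeconds > extraTtft,
// i.e. a retry rate above ~7% on a 4-minute turn.
console.log({ expectedHigh, expectedXhigh });
```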
On the other side: codemods, dependency bumps, schema migrations, doc regen, classification, simple lookups. Adaptive thinking will often skip the thinking phase on these by itself; pinning effort to low or medium just confirms what adaptive was going to do anyway. The retry rate gap closes to nothing because there was nothing to retry. Extended thinking on these tasks adds latency and burns thinking tokens for no measurable quality gain. The useful filter is not “is this hard” but “does this task have ambiguity that a deeper reasoning pass could resolve”.
How I wire this up in client work
Three lanes, two questions. The whole router fits in fifty lines and outlives the next model release because it does not encode specific model names, only intent. A sketch follows the three lanes.
- Lane 1, interactive chat surface: a human is staring at the stream
Provider: Vertex or Anthropic, never Bedrock for this lane. Effort: medium for routine, high only when the prompt looks ambiguous. display: omitted, always. If sub-2 second TTFT is required, route to Haiku 4.5 instead of Opus and let the planner promote upward.
- Lane 2, coding or agentic loop: multiple tool calls, ambiguous PR, retry-expensive
Provider: Anthropic with prompt caching wired correctly. Effort: xhigh by default, max only on prompts where evals showed measurable headroom. display: omitted because no human is reading the thinking stream. Set max_tokens to 64k so the model has room to think across subagents.
- Lane 3, batch or background: codemods, doc regen, scheduled jobs, classification
Provider: cheapest available with the prompt cache hit rate you can actually achieve. Effort: low or medium. display: omitted. Adaptive thinking will skip thinking on most prompts in this lane, which is the point.
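Here is the sketch the list above refers to. Lane names, provider labels, and the ambiguity flag are illustrative; the effort strings and the adaptive/display shape follow the request bodies earlier on this page.

```ts
type Lane = "chat" | "agent" | "batch";
type Effort = "low" | "medium" | "high" | "xhigh" | "max";

interface Route {
  provider: "vertex" | "anthropic" | "cheapest";
  effort: Effort;
  thinking: { type: "adaptive"; display: "omitted" };
  maxTokens: number;
}

// Hypothetical router implementing the three lanes above. It encodes
// intent, not model names, so it survives the next model release.
function route(lane: Lane, looksAmbiguous: boolean): Route {
  const thinking = { type: "adaptive", display: "omitted" } as const;
  switch (lane) {
    case "chat": // Lane 1: human staring at the stream
      return {
        provider: "vertex", // or anthropic; never Bedrock in this lane
        effort: looksAmbiguous ? "high" : "medium",
        thinking,
        maxTokens: 8192,
      };
    case "agent": // Lane 2: tool loops, retry-expensive
      return { provider: "anthropic", effort: "xhigh", thinking, maxTokens: 64000 };
    case "batch": // Lane 3: codemods, doc regen, classification
      return { provider: "cheapest", effort: "low", thinking, maxTokens: 8192 };
  }
}
```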
The router is the one piece I always pull out of legacy code on day one of an engagement. Most teams have hardcoded `budget_tokens: 8000` from a tutorial that predates 4.7, are getting a 400 from the API on every request, and have a fallback path silently dropping back to a non-thinking model. Fixing that is the smallest possible engagement that pays for itself the same week.
Related reading on this site
For the cost side of the same picture, the Opus 4.7 prompt caching page walks the bill per agent turn with caching wired correctly. For the comparison against open weights, the open source coding agent latency vs Opus page graphs the 24x throughput spread on the same DeepSeek model across providers, which is the same shape of variance the table above shows for Opus on its three first-class providers.
Bring your current Anthropic request body
If you are still sending budget_tokens to claude-opus-4-7, you are getting 400s. I will read your actual request shape, run the wall-clock numbers on your traffic, and quote a fixed-scope router and effort policy. $75 for the call.
Frequently asked questions
How much latency does Claude Opus extended thinking actually add?
On Claude Opus 4.7 with adaptive thinking on, the time to first text token sits roughly between 9 and 32 seconds at default effort, depending on the provider. Artificial Analysis (snapshotted 2026-05-04) measures Google Vertex at 9.29 seconds, Anthropic's own API at 10.04 seconds, and Amazon Bedrock at 32.50 seconds for the same model at the same default effort. Pushing effort up to xhigh or max can roughly double those numbers because the model is genuinely thinking longer; pushing effort down to low can collapse them by skipping the thinking phase entirely on simple prompts. The provider you point at and the effort level you set are independent multipliers; the spread between fast and slow is more than 3x before you change a single token.
Can I still set a thinking token budget on Opus 4.7?
No. Sending thinking: {type: "enabled", budget_tokens: N} to claude-opus-4-7 returns a 400 error. The platform docs say it explicitly: manual extended thinking is no longer accepted on 4.7. The replacement is thinking: {type: "adaptive"} plus output_config.effort, where effort takes the values low, medium, high (default), xhigh, or max. The xhigh tier is new on 4.7 and is the recommended starting point for coding and agentic work. budget_tokens still works on Opus 4.6 and Sonnet 4.6 for now, but the docs flag it as deprecated and slated for removal.
What is display:"omitted" and how much does it actually save?
When you stream a response from Opus 4.7 with thinking enabled, the default behavior is for the server to emit a sequence of thinking_delta events first, then signature, then the text content block. Setting thinking.display: "omitted" tells the server to skip streaming the thinking_delta events entirely. You get one content_block_start for the thinking block, a single signature_delta, a content_block_stop, and then the text block starts streaming immediately. The wall clock from request to first visible character drops by however long the model spent generating internal reasoning tokens. On a real call where adaptive thinking decided to spend 5,000 tokens reasoning at roughly 50 tokens per second, that is about 100 seconds of streamed reasoning the user never had to watch. If your UI does not surface thinking, this flag is free latency.
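A sketch of the consuming side under that event sequence. `StreamEvent` is a hand-rolled narrowing for illustration, not the SDK's published type, and the renderer is a placeholder:

```ts
// Event names follow the sequence described in this answer.
type StreamEvent =
  | { type: "content_block_start"; block: "thinking" | "text" }
  | { type: "thinking_delta"; text: string } // never arrives with display: "omitted"
  | { type: "signature_delta"; signature: string }
  | { type: "content_block_stop" }
  | { type: "content_block_delta"; text: string };

function onEvent(ev: StreamEvent, render: (s: string) => void): void {
  switch (ev.type) {
    case "thinking_delta":
      break; // absent when display is "omitted"; ignored here either way
    case "content_block_delta":
      render(ev.text); // the first of these is the user's first visible character
      break;
    default:
      break; // start/stop/signature are bookkeeping only
  }
}
```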
Which effort level should I default to for Opus 4.7?
The platform docs say start at xhigh for coding and agentic work, and use high as the minimum for intelligence-sensitive workloads. Step down to medium for cost-sensitive workloads, and only step up to max when your evals show measurable headroom at xhigh. In practice for client work I default to high for chat-style consults, xhigh for coding agents that loop on tool calls, and low for short classification or routing subagents. Note that Opus 4.7 respects effort more strictly than 4.6, so a request that ran fine on 4.6 medium may need to be moved to high on 4.7 to keep the same depth.
Why is Amazon Bedrock so much slower than Google Vertex for the same Opus model?
Same weights, different infrastructure. Bedrock currently sits at a 32.50-second time to first token on Opus 4.7 reasoning while Vertex sits at 9.29 seconds, against Anthropic's own 10.04 seconds (Artificial Analysis, 2026-05-04). The model code is identical; the serving stack, the batching policy, and the regional capacity are not. For an interactive coding agent or a chat surface, that difference is the entire user experience. If you are stuck on Bedrock for compliance reasons you cannot work around, push effort to low and use display: omitted aggressively, then route only the genuinely hard prompts upward. If you are not constrained, Vertex or the Anthropic API is the saner default.
Does extended thinking actually help on coding tasks enough to justify the wait?
On the right tasks, yes. Anthropic's 4.7 announcement reports a 13 to 14 percent gain over 4.6 on coding benchmarks at fewer tokens, with roughly a third of the tool errors. That gap is large enough that a single retry on a wrong diff (another full agent loop, another 12,000 output tokens, another 17 seconds of TTFT) usually costs more wall clock than running at xhigh would have in the first place. The cleanest way to think about it is amortized seconds per shipped PR, not seconds per turn. Extended thinking earns its latency on ambiguous, multi-file, or cross-system work; it is overkill on codemods, dependency bumps, and well-scoped one-file edits.
Should I let adaptive thinking decide, or pin effort manually?
Both. Adaptive thinking decides whether to think on a given prompt; effort decides how deep to think when it does. Leaving thinking on adaptive and pinning effort to high or xhigh gives the model permission to skip thinking on a trivial prompt while still reasoning hard on a complex one. The combination is the cheapest way to keep latency low on the easy 60 percent of prompts and quality high on the hard 40 percent, without writing your own classifier in front of the API. The opposite, manual budget_tokens with a fixed N, is no longer an option on 4.7 anyway.
How do I measure this on my own traffic instead of trusting the benchmarks?
Three numbers. First, time from request sent to the first text content_block_delta event (not the first thinking_delta). Second, total wall clock from request to final stop event. Third, output tokens from the response usage block. Log all three per request, group by effort level and provider, look at p50 and p95. Two weeks of that on real traffic beats every published benchmark, because the prompts that drive your wall clock are not the prompts the benchmarks use. If you want me to set up the logging plus a routing rule on top of it for your team, that is what the call is for.
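A minimal timing harness for those three numbers. The stream plumbing around it is assumed, and `performance.now()` stands in for whatever clock your runtime provides:

```ts
// Per-request latency record matching the three numbers above.
interface LatencyRecord {
  provider: string;
  effort: string;
  ttftTextMs: number;   // request sent -> first text content_block_delta
  totalMs: number;      // request sent -> final stop event
  outputTokens: number; // from the response usage block
}

function makeTimer() {
  const t0 = performance.now();
  let ttftTextMs: number | undefined;
  return {
    // Call on text deltas only; thinking_delta events must not set TTFT.
    onTextDelta() {
      ttftTextMs ??= performance.now() - t0;
    },
    finish(outputTokens: number, provider: string, effort: string): LatencyRecord {
      return {
        provider,
        effort,
        ttftTextMs: ttftTextMs ?? NaN, // NaN: stream ended with no text block
        totalMs: performance.now() - t0,
        outputTokens,
      };
    },
  };
}
// Log one record per request, group by (provider, effort), chart p50/p95.
```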