Open source coding agents do not have one wall clock number.

Every thread arguing “DeepSeek is sixty times cheaper than Opus, why would anyone pay for Anthropic” is comparing list prices. The list price is one input. The provider you point your coding agent at is the input nobody graphs, and it is wider than the model gap. The same DeepSeek V3.2 model on Google Vertex outputs 218.7 tokens per second; the same model on DeepInfra FP4 outputs 9.0 tokens per second. That is a 24x spread on identical weights, before you have decided which agent loop wraps the call.

Matthew Diakonov
8 min read

Direct answer, verified 2026-05-04

Open source coding agent latency cost vs Opus 4.7

The latency cost of an open source coding agent depends almost entirely on the provider you point it at, not the model name on the API call. The same open source model varies up to 24x in throughput across hosting providers. Opus 4.7's $5 / $25 per million token list price can deliver a lower wall clock than a cheap provider serving the same open source model slowly, and a fast provider serving a frontier open source model can beat Opus on wall clock by 4x. Pick the provider before you argue the model.

- Fastest open source path: Cerebras Qwen3 235B (~1,500 t/s, sub 0.5 s TTFT, premium per token)
- Sweet spot: Vertex DeepSeek V3.2 (218.7 t/s, 0.83 s TTFT, ~$0.10 per agentic turn)
- Quality ceiling: Anthropic Opus 4.7 (50.2 t/s, 17.24 s TTFT, ~$1.90 per agentic turn)

Sources: Artificial Analysis (DeepSeek V3.2 providers), Artificial Analysis (Qwen3 Coder 480B providers), Anthropic pricing, DeepSeek API pricing.

The 24x spread on the same model

Pull the Artificial Analysis provider page for DeepSeek V3.2 reasoning. Sort by output speed. Google Vertex sits at the top with 218.7 tokens per second and 0.83 second time to first token. Nebius Fast is next at 125.2 tokens per second. SambaNova at 117.1. DeepInfra FP4 is at the bottom at 9.0 tokens per second with 1.32 second time to first token. Same model. Same weights. Same SWE-bench number. The serving stack and the quantization choice change the wall clock by a factor of 24, well over an order of magnitude.

Qwen3 Coder 480B has the same shape. Eigen AI hits 256.4 tokens per second on it. Google Vertex 169.5. Amazon Bedrock 121.3. DeepInfra FP8 81.1. DeepInfra Turbo FP4 42, but with the lowest time to first token in the set at 0.53 seconds, which makes it the right choice for the start of an interactive turn even though it is not the fastest at sustained output. The point is that “Qwen3 Coder is fast” or “Qwen3 Coder is slow” is not a true sentence. It depends on which provider is serving it.

24x

Throughput spread for the same DeepSeek V3.2 model across hosting providers. Google Vertex at 218.7 tokens per second versus DeepInfra FP4 at 9.0 tokens per second on the public Artificial Analysis provider page.

Artificial Analysis, DeepSeek V3.2 reasoning providers, snapshotted 2026-05-04

The provider table that should be on every comparison page

One table. Output throughput, time to first token, indicative price line. Read it across the rows, not down them. The fastest open source path is more than two orders of magnitude faster than the slowest open source path on the same weights, and the interesting question is where Opus 4.7 sits in that distribution.

| Provider | Model | Output speed | TTFT | Indicative price (input / output) |
| --- | --- | --- | --- | --- |
| Google Vertex | DeepSeek V3.2 reasoning | 218.7 t/s | 0.83 s | $0.28 /Mtok / $0.42 /Mtok |
| Eigen AI | Qwen3 Coder 480B | 256.4 t/s | 0.73 s | varies |
| Google Vertex | Qwen3 Coder 480B | 169.5 t/s | 0.72 s | $0.61 /Mtok blended (JSON + tools) |
| Amazon Bedrock | Qwen3 Coder 480B | 121.3 t/s | n/a | varies |
| DeepInfra (FP8) | Qwen3 Coder 480B | 81.1 t/s | n/a | varies |
| DeepInfra (Turbo, FP4) | Qwen3 Coder 480B | 42 t/s | 0.53 s | $0.41 /Mtok blended |
| DeepInfra (FP4) | DeepSeek V3.2 reasoning | 9.0 t/s | 1.32 s | $0.028 /Mtok cache hit / $0.42 /Mtok |
| Anthropic API | Opus 4.7 max effort | 50.2 t/s | 17.24 s | $5 /Mtok ($0.50 cache) / $25 /Mtok |
| Cerebras | Qwen3 235B | ~1,500 t/s | <0.5 s | premium |

Numbers from the Artificial Analysis provider pages and Anthropic pricing, snapshotted 2026-05-04. Prices vary by provider and change frequently; the structure of the spread is the part that holds. Re-pull these numbers on the day you make a decision.

What an agent loop turn actually costs in wall clock

The unit that matters is not a token. It is one turn of the coding agent loop: a prompt with cached context, a model call, a tool execution, and a result. A 1,500 line PR in Aider or OpenCode or Cline is roughly five of those turns. The wall clock on each turn is time to first token plus output tokens divided by throughput, plus tool latency, plus retries on bad diffs. Three of those four are dominated by the provider, not the model.

One turn of an open source coding agent loop

engineer → agent CLI: task: refactor X. agent CLI → git repo: read 12 files. agent CLI → model API: prompt + 250K cached prefix. model API → agent CLI: first token after TTFT, then a stream of 12K output tokens. agent CLI → git repo: apply diff, run tests. agent CLI → engineer: PR ready for review.

On the canonical 250,000 token cached prefix plus 12,000 output tokens turn, the math falls out cleanly. Vertex DeepSeek lands the output in 55 seconds plus 0.83 seconds TTFT. DeepInfra FP4 lands it in 1,333 seconds plus 1.32 seconds TTFT, more than twenty minutes for the same answer. Opus 4.7 lands it in 240 seconds plus 17.24 seconds TTFT. Cerebras lands it in 8 seconds plus sub half second TTFT. The bills move on a different axis; the wall clock moves on this one.
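
That arithmetic is small enough to keep next to the router. A minimal sketch of it, using the snapshot numbers from the table above; tool latency and retry count are placeholders to fill from your own traffic.

```python
# Wall clock for one agent turn: TTFT + output_tokens / throughput, plus
# tool latency. Provider numbers are the 2026-05-04 snapshot from the
# table above; tool latency is a placeholder to measure yourself.
OUTPUT_TOKENS = 12_000

providers = {
    "Vertex DeepSeek V3.2":   {"ttft_s": 0.83,  "tps": 218.7},
    "DeepInfra FP4 DeepSeek": {"ttft_s": 1.32,  "tps": 9.0},
    "Anthropic Opus 4.7":     {"ttft_s": 17.24, "tps": 50.2},
    "Cerebras Qwen3 235B":    {"ttft_s": 0.5,   "tps": 1_500},
}

def turn_wall_clock(ttft_s: float, tps: float,
                    output_tokens: int = OUTPUT_TOKENS,
                    tool_latency_s: float = 0.0) -> float:
    """Seconds of wall clock for one agent loop turn."""
    return ttft_s + output_tokens / tps + tool_latency_s

for name, p in providers.items():
    print(f"{name:24s} {turn_wall_clock(**p):8.1f} s")
```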

The four numbers that decide

Pull these for your own stack before you argue list price. Cerebras numbers come from public Hacker News reporting; the rest come from Artificial Analysis and Anthropic's own pricing page on 2026-05-04.

- Vertex DeepSeek V3.2: 218.7 t/s
- Anthropic Opus 4.7: 50.2 t/s
- DeepInfra FP4 (DeepSeek V3.2): 9.0 t/s
- Cerebras Qwen3 235B: ~1,500 t/s

The agent layer is also a variable

The conversation usually skips this entirely. Aider, OpenCode, Cline, Continue, Roo Code, and Goose are not the same shape of loop. They make different decisions about how aggressively to cache, how many turns to run before checking back with the engineer, how much code they paste into the next prompt, and whether they batch tool calls. Two agents pointed at the same model and the same provider can produce a 2x difference in turns per PR, which is a 2x difference in API bill and wall clock.

Aider is the smallest of the named agents and the most disciplined about edit format; on git native PRs with a tight review loop it tends to land in fewer turns. OpenCode crossed 147,000 GitHub stars and 6.5 million monthly developers in April 2026 by being the most provider agnostic of the bunch; the provider configuration surface is its primary product differentiator. Cline and Roo Code put the agent inside the IDE with inline diff acceptance, which trades token efficiency for review ergonomics. Continue is closer to a model-routing framework with an editor surface bolted on top.

None of those is wrong. They are different points on a frontier. The right choice for a small team is whichever one matches the review style the engineer already uses; the wrong choice is whichever one the latest YouTube tutorial happens to feature this month.

The routing rule I use in client work

Three lanes, two questions. The whole decision tree fits on a single page of code and inside a small team's head.

  1. Lane 1, interactive
    Human is staring at the token stream right now

    Sub two second time to first token is the only metric that matters. Use Haiku 4.5 on Anthropic, Qwen3 Coder 480B on Cerebras, or DeepSeek V3.2 on Google Vertex. Escalate to Opus 4.7 only on judgment calls; the 17 second time to first token breaks flow on routine work.

  2. Lane 2, ambiguous
    The PR needs a judgment call from a senior engineer

    Subtle race condition, cross subsystem refactor, security boundary, ambiguous API contract. Send to Opus 4.7 with 5 minute prompt caching. The SWE-bench Pro retry rate gap is the difference between one shipped PR and a re-rolled review cycle, and at $150 an hour fully loaded the engineer time delta dominates the API delta.

  3. Lane 3, well scoped
    Codemods, dependency bumps, schema migrations, doc regen

    Cheapest provider for the highest quality open source model you can get cheap. DeepSeek V3.2 with cache hit pricing or Qwen3 Coder on DeepInfra Turbo. The wall clock is fine because nobody is waiting; the bill is the binding number.

The whole router is fifty lines of code and one config file. Once it exists, swapping providers is a one line change and the agent loop never knows. That is the entire point of standing on open source coding agents in the first place: the model is a variable, not a vendor lock.
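
For concreteness, a sketch of the shape that router takes. The lane logic is the three lanes above; the provider entries, model ids, and endpoints are illustrative placeholders for whatever your config file actually holds, not real URLs.

```python
# Minimal three-lane router sketch. Lane rules mirror the section above;
# provider names, model ids, and endpoints are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Route:
    provider: str   # where the call goes
    model: str      # model name the provider expects
    base_url: str   # endpoint for an OpenAI compatible client

# The one config "file": swapping a provider is a one line change here.
ROUTES = {
    "interactive": Route("vertex",    "deepseek-v3.2",    "https://vertex.example/v1"),
    "ambiguous":   Route("anthropic", "opus-4.7",         "https://api.anthropic.example"),
    "well_scoped": Route("deepinfra", "qwen3-coder-480b", "https://deepinfra.example/v1"),
}

def pick_lane(human_waiting: bool, needs_judgment: bool) -> str:
    """Two questions, three lanes."""
    if human_waiting:
        return "interactive"   # sub 2 s TTFT is the binding constraint
    if needs_judgment:
        return "ambiguous"     # lowest retry rate wins; TTFT does not matter
    return "well_scoped"       # cheapest acceptable-quality provider

def route(human_waiting: bool, needs_judgment: bool) -> Route:
    return ROUTES[pick_lane(human_waiting, needs_judgment)]

# The agent loop only ever sees route(...).base_url and route(...).model,
# so the provider stays swappable behind the same call site.
print(route(human_waiting=True, needs_judgment=False))
```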

What the cheap-versus-frontier discourse keeps missing

Three things, every time. First, the per-token list price is one input among four; the others are provider throughput, retry rate, and the engineer's loaded hourly. Three of those four do not appear on any vendor pricing page. Second, the comparison is almost always written by someone selling either the cheap model (course operators promising AI agency riches) or the frontier model (vendor blog posts). The middle ground, which is picking the right provider for the right open source model and then measuring on your own traffic, is not on anybody's sales script. Third, the open source coding agent layer (Aider, OpenCode, Cline, Continue) is a separate variable from the model, and most of the published comparisons collapse the two.

None of that requires a course or a $25,000 retainer to fix. Pull your last month of agent traffic, run the cost-per-shipped PR calculation on it, decide which lane each call belongs in, and ship a fifty line router. The savings are real and the quality floor does not move.

Related reading on this site

Two other pages here go deeper on adjacent slices of this picture. If you want the worked numbers on a single specific comparison: the DeepSeek vs Opus 4.7 wall clock cost per PR page walks the cost per shipped PR math line by line. If you want the architecture pattern for the routing layer above: the throughput vs interactivity coding agents page covers the batch lane and the streaming lane and where Anthropic batch processing fits in.

Bring last month's agent traffic

I will pull the wall clock and bill on your real coding agent calls, mark which lane each one belongs in, and quote a fixed scope router and provider audit. $75 for the call. No course.

Frequently asked questions

What is the actual latency cost of an open source coding agent compared to Opus 4.7?

There is not one number. There is a 24x spread for the same open source model across providers. As of 2026-05-04, Artificial Analysis lists DeepSeek V3.2 reasoning at 218.7 tokens per second on Google Vertex and 9.0 tokens per second on DeepInfra FP4, with time to first token ranging from 0.83 seconds on Vertex to 1.32 seconds on DeepInfra. Opus 4.7 in adaptive max-effort reasoning runs at roughly 50.2 tokens per second on Anthropic's own API with a 17.24 second time to first token. So an open source model on a fast provider can be wall clock cheaper than Opus 4.7 by 4x; the same model on a slow provider can be wall clock more expensive. Picking the model and ignoring the provider is the comparison error nobody is graphing.

Why is provider variance so wide for the same open source model?

Open source models are weights, not infrastructure. Each provider picks a quantization (FP16, FP8, FP4), a serving stack (vLLM, SGLang, TensorRT-LLM, custom), a batching strategy, and a hardware fleet. FP4 quantization on commodity GPUs gives you the cheapest list price and the slowest wall clock. FP8 on H100s with speculative decoding gives you mid-tier price and good throughput. Custom silicon (Cerebras, Groq) running carefully tuned attention kernels gives you 1500 plus tokens per second at a higher per-token rate. Closed APIs like Anthropic do not show you the dial; you get what they ship. Picking the open source model without picking the provider is like picking a car without picking the engine.

Which coding agents actually let me swap providers behind the same model?

All the open source ones, by design. Aider takes any provider that speaks an OpenAI compatible API plus the official Anthropic and Google clients; you set the base URL and the model name. Cline, Roo Code, and Continue ship provider pickers in the IDE settings panel and let you put DeepInfra or Together or Fireworks behind a Qwen3 Coder name. OpenCode, the terminal agent that crossed 147,000 GitHub stars in April 2026, is built around a provider abstraction that routes the same model name to whichever endpoint you configured. Closed agents like Cursor and GitHub Copilot pick the provider for you. The choice is not just open weights versus closed; it is also closed routing versus your own routing.
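
Mechanically, the swap is one parameter on an OpenAI compatible client. A minimal sketch; the endpoint URLs and model ids below are placeholders, and each provider's docs have the real ones.

```python
# Same agent code, two providers: only base_url and the model name change.
# Endpoint URLs and model ids are placeholders; check each provider's docs.
import os
from openai import OpenAI

def make_client(base_url: str) -> OpenAI:
    return OpenAI(base_url=base_url, api_key=os.environ["PROVIDER_API_KEY"])

fast = make_client("https://fast-provider.example/v1")    # e.g. a Vertex style endpoint
cheap = make_client("https://cheap-provider.example/v1")  # e.g. a DeepInfra style endpoint

resp = fast.chat.completions.create(
    model="deepseek-v3.2",  # placeholder model id
    messages=[{"role": "user", "content": "Refactor this function..."}],
)
print(resp.choices[0].message.content)
```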

What does a real coding agent turn cost on the fastest open source provider versus Opus 4.7?

Take the canonical Claude Code style turn: a 250,000 token cached prefix (tool definitions plus file reads plus accumulated diffs), 8,000 tokens of fresh input, 12,000 tokens of output. On DeepSeek V3.2 with cache hit pricing at $0.028 per million cached input and $0.42 per million output, the bill is roughly $0.014 to $0.10 per turn depending on cache state. On Opus 4.7 with 5 minute prompt caching wired correctly, the bill is roughly $1.55 to $2.10 per turn. Wall clock on Google Vertex DeepSeek runs the 12,000 output tokens in about 55 seconds plus the 0.83 second time to first token. Wall clock on Opus 4.7 max effort runs in about 240 seconds plus the 17 second time to first token. The same turn on DeepInfra FP4 runs in 1,333 seconds, more than twenty minutes, which makes the cheap list price irrelevant if a human is waiting.
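
A sketch of that bill math, using the list prices quoted above. The one added assumption is Anthropic's published 5 minute cache write multiplier of 1.25x base input, which is what carries the Opus cold turn to roughly $1.90.

```python
# Per-turn bill for the canonical turn: 250K cached prefix, 8K fresh input,
# 12K output. Rates are the 2026-05-04 list prices quoted above, in $/Mtok.
CACHED, FRESH, OUTPUT = 250_000, 8_000, 12_000

def turn_bill(cached_rate: float, fresh_rate: float, output_rate: float) -> float:
    """Dollars for one turn at the given $/Mtok rates."""
    return (CACHED * cached_rate + FRESH * fresh_rate + OUTPUT * output_rate) / 1e6

print(turn_bill(0.028, 0.28, 0.42))  # DeepSeek V3.2, cache hit         -> ~$0.014
print(turn_bill(0.28,  0.28, 0.42))  # DeepSeek V3.2, cache miss        -> ~$0.077
print(turn_bill(5 * 1.25, 5, 25))    # Opus 4.7, 5 min cache write      -> ~$1.90
```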

Does the SWE-bench gap matter once I am picking on latency cost?

Yes, because retries are billed twice. As of 2026-05-04, GLM-4.7 lands at 74.2 percent on SWE-bench Verified, Qwen3-Coder Next at 70.6 percent, DeepSeek V3.2 at 70.2 percent, and Opus 4.7 at 64.3 percent on the harder SWE-bench Pro. A model that gets the diff wrong on the first attempt pays for a retry: another full agent loop, another set of tokens, another wall clock window with the engineer either staring at the stream or context switching to something else and losing fifteen minutes. The cleanest way to think about it is amortized cost per shipped PR, not cost per turn. Cheap and fast loses to slightly more expensive if the retry rate gap is more than five points and the engineer is loaded above $100 an hour.
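
One way to make amortized cost per shipped PR concrete: model retries as independent attempts with a fixed first pass rate, so expected attempts per shipped PR is one over that rate. A sketch with placeholder inputs; the retry model itself is a simplifying assumption, so feed it your own traffic before trusting the crossover.

```python
# Expected cost to land one PR under independent retries: attempts ~ 1/p.
# Every input here is a placeholder to replace with measured numbers.
def cost_per_shipped_pr(first_pass_rate: float,
                        api_cost_per_attempt: float,
                        engineer_hours_per_attempt: float,
                        engineer_rate_per_hr: float) -> float:
    expected_attempts = 1.0 / first_pass_rate
    return expected_attempts * (api_cost_per_attempt
                                + engineer_hours_per_attempt * engineer_rate_per_hr)

# Example: a 5 turn PR at ~$1.90/turn on Opus 4.7, 15 minutes of engineer
# attention per attempt, $150/hour loaded -> ~$62.67 per shipped PR.
print(cost_per_shipped_pr(0.75, 5 * 1.90, 0.25, 150))
```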

Where does Cerebras fit in this picture?

Cerebras runs Qwen3-235B at roughly 1,500 tokens per second on its WSE-3 wafer scale silicon, which is roughly 30x the throughput of Opus 4.7 in adaptive max-effort reasoning and roughly 7x the throughput of Google Vertex serving DeepSeek V3.2. The list price is higher than commodity GPU providers but the wall clock collapses by a full order of magnitude. For an interactive coding agent where a human is at the keyboard waiting on a stream, Cerebras with a frontier open weight is currently the most aggressive answer to the latency cost question. The catch is availability: throughput at that level burns capacity fast, and provider quotas can become the binding constraint instead of price.

What is the right routing rule for a small team?

Three lanes, two questions. First, is a human staring at the token stream right now. If yes, route to a fast small model on the fastest provider you have access to: Haiku 4.5 on Anthropic, Qwen3 Coder 480B on Cerebras, DeepSeek V3.2 on Google Vertex. Sub two second time to first token is non-negotiable. Second, is the PR ambiguous and is the engineer time loaded above $150 an hour. If yes, route to Opus 4.7 with prompt caching and accept the 17 second time to first token because the retry rate is the lowest available. Third, everything else, the well scoped repetitive PRs, route to the cheapest provider for an open source model with at least 70 percent on SWE-bench Verified. This is fifty lines of routing code and it is the highest leverage architecture decision a coding agent stack can make.

Which open source coding agents am I actually using behind this routing layer?

Aider for git native single PR work where the diff is reviewed before commit. OpenCode for terminal centric multi file work where I want a custom skill set and a fast iteration loop; OpenCode crossed 6.5 million monthly developers in April 2026 by being the most provider agnostic of the bunch. Cline or Roo Code in the IDE for inline review and acceptance flows on shorter changes. The agent picks the loop shape; the model picks the cost ceiling; the provider picks the wall clock. All three are independent and most teams only think about the second one.

Why publish numbers like this for free?

Most of the AI consulting content you see on this part of X is sold by course operators charging $5,000 to teach you to be them. I publish my consultation rate ($75) and my project tiers ($500 to $10,000+) on the homepage because the math should survive being seen. If your team can re-run the cost per shipped PR calculation on your own traffic and arrive at a different answer, that is a better starting point for an engagement than a sales call. The page is also the single place I can point to when a Twitter reply gets longer than three tweets.