DeepSeek vs Opus 4.7. The wall-clock cost per PR.

Every comparison thread on this argues per-token list price (where DeepSeek wins by 60x) or SWE-bench percentages (where Opus 4.7 wins by 8.9 points). Both miss the number that actually shows up on a $150/hour engineer's monthly invoice: cost per shipped PR. I run both, on real codebases, and the answer flips depending on whether the PR is ambiguous or well scoped. Here is the worked example.

Matthew Diakonov
7 min read

Direct answer, verified 2026-05-03

One 1,500-line PR, side by side

  • Opus 4.7 API cost: ~$1.90 (cached 250K prefix + 8K fresh + 12K out)
  • DeepSeek V3.2 API cost: ~$0.10 (cache-hit prefix + flat $0.42/M out)
  • Opus 4.7 wall clock: 10 to 25 min (50.2 t/s, 17.24s TTFT at max effort)
  • DeepSeek V3.2 wall clock: 8 to 15 min (97 t/s, 1.32s TTFT on DeepInfra)
  • Cost per shipped PR with a $150/hour engineer: Opus ~$26.90 vs DeepSeek ~$20.10 on well-scoped work. On ambiguous PRs, the SWE-bench Pro retry-rate gap shrinks the delta to ~$10, and review risk often inverts it.

Sources: Anthropic pricing, DeepSeek API pricing, Artificial Analysis (Opus 4.7), Artificial Analysis (DeepSeek V3.2).

The bill is not what you think it is

The X thread argument always frames this as “DeepSeek is 60x cheaper, why would anyone pay for Opus.” That math is right on the per-token line and wrong on the per-PR line. A PR is not a token. A PR is a unit of shipped work, with three line items: the API bill, the engineer's wall-clock time wrapping that work, and the cost of a regression when the model returns a confident wrong answer that the reviewer waves through.

Engineering teams that have already swapped to a cheaper model and gone back almost always went back because of line three. The bill was lower, the wall clock was lower, and one production incident burned six months of API savings.

What one agentic PR turn actually looks like

Roughly the same shape on either model in a Claude Code or Aider-style loop. The cached prefix grows as the agent reads files and accumulates diffs. The fresh input is the user's next instruction or the result of the previous tool call. The output is the next diff plus a short narration.

One PR, five turns of agentic loop

[Sequence diagram: engineer → agent → model API → repo. Task: refactor X → read 12 files → prompt + 250K cached prefix → diff + plan → apply diff, run tests → test failures + ask for fix → fix diff → PR ready for review.]

Five turns is typical for a 1,500-line PR. Each turn is a separate API call. Each call meters output tokens (the expensive line), reads the cached prefix (cheap), and adds a small fresh input chunk. The model that takes one extra turn to land the correct diff is paying twice: once on the bill, once on the engineer's wait.
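Under the article's Opus list prices ($5/M in, $25/M out, 1.25x for a cache write, 0.1x for a cache read), that five-turn metering can be sketched as below. The per-turn split of fresh input and output is my assumption, not a measured trace:

```python
# One 5-turn agentic session, metered turn by turn. Pricing is the
# article's Opus 4.7 list rate; token shape per turn is an assumption.
IN, OUT = 5.00, 25.00                      # $ per million tokens
CACHE_WRITE, CACHE_READ = 1.25 * IN, 0.10 * IN

def session_cost(prefix_m=0.250, fresh_m=0.0016, out_m=0.0024, turns=5):
    """Turn 1 writes the 250K prefix to cache; later turns read it.
    fresh/out per turn: the 8K fresh and 12K output split across 5 turns."""
    cost = prefix_m * CACHE_WRITE          # one-time cache write
    for turn in range(turns):
        if turn > 0:
            cost += prefix_m * CACHE_READ  # cached prefix re-read, cheap
        cost += fresh_m * IN + out_m * OUT # fresh input + diff output
    return cost

print(f"5-turn session: ${session_cost():.2f}")
```

With this split the session lands around $2.40, the same low-dollar ballpark as the ~$1.90 headline; the exact figure depends on how the cached prefix grows turn by turn.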

The arithmetic, with assumptions named

Below is the same calc I run before quoting a fixed-scope engagement. Every input is editable; the structure is the part that matters. Change the engineer rate, the prefix size, or the retry rates to your team's reality and re-run.

cost-per-pr.yml

The two numbers most people get wrong: cached_prefix_tokens (Claude Code style agents push this to 200K to 400K easily on a real repo) and the retry rate. SWE-bench Pro is the closest public proxy I have seen for retry rate on real GitHub issues; it is not perfect but it is honest.
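A minimal Python version of that calc, using the article's headline inputs. The retry model (a failed first attempt pays both line items once more, one retry max) is my simplification of the yml:

```python
# Expected cost of one shipped PR: API bill plus engineer wall-clock,
# with the SWE-bench Pro failure rate as the retry probability.
ENGINEER_RATE = 150 / 60  # $/minute, $150/hour fully loaded

def cost_per_pr(api_cost, wall_minutes, retry_rate):
    """One retry max: expected attempts = 1 + retry_rate."""
    attempts = 1 + retry_rate
    return attempts * (api_cost + wall_minutes * ENGINEER_RATE)

# Well-scoped work: clean first pass, low-end wall clock from the article.
opus_clean = cost_per_pr(1.90, 10, 0.0)   # ~$26.90
ds_clean = cost_per_pr(0.10, 8, 0.0)      # ~$20.10

# Ambiguous work: SWE-bench Pro first-attempt failure rates as the proxy.
opus_hard = cost_per_pr(1.90, 10, 0.357)
ds_hard = cost_per_pr(0.10, 8, 0.45)
```

With these inputs the ambiguous-work delta comes out around $7, the same neighborhood as the article's ~$10; a heavier retry penalty (a retry costing more than one clean pass) closes the rest.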

~$10
The cost-per-PR delta on ambiguous work, after retries. Less than the price of one engineer review cycle. The right answer is not always the cheap model. (Worked calc above, $150/hour engineer, SWE-bench Pro retry rates applied.)

DeepSeek vs Opus 4.7 on a single PR, the same 1,500 lines

The two cards show the bill, the wall clock, and the engineer-time line on the same PR shape. The DeepSeek card wins outright on a clean first pass. The Opus card wins more often on the second-attempt rate, which is invisible until you track it.

One 1,500-line PR, real agentic loop

DeepSeek V3.2: cheaper per token by 60x. Throughput on DeepInfra is roughly 2x Opus. SWE-bench Pro 55.4 percent. Best on well-scoped work where the first plausible diff is usually correct.

  • API: ~$0.10 per PR with cache hit
  • Wall clock: 8 to 15 min on a clean pass
  • Throughput: 97 t/s, 1.32s TTFT (DeepInfra)
  • Retry rate (proxy): ~45 percent on hard PRs
  • Engineer time at $150/hr: ~$30 to $44 per PR

Opus 4.7: roughly 60x the per-token price and about half the throughput. SWE-bench Pro 64.3 percent. Best on ambiguous work where the first plausible diff is often wrong.

  • API: ~$1.90 per PR with caching wired correctly
  • Wall clock: 10 to 25 min
  • Throughput: 50.2 t/s, 17.24s TTFT (Anthropic API)
  • Retry rate (proxy): ~35.7 percent on hard PRs
  • Engineer time at $150/hr: ~$25 to $63 per PR

How I actually pick between them in client work

Two questions. First: would a junior engineer with detailed instructions ship this PR in one pass? If yes, DeepSeek. The work is well scoped, the test suite catches mistakes, the API bill is real money over thousands of PRs, and review time per PR is bounded.

Second: does shipping this PR require a judgment call (an ambiguous design choice, a subtle race, a behavior that has to match an undocumented contract)? If yes, Opus 4.7. The 8.9-percentage-point SWE-bench Pro gap is the difference between getting the right diff on the first attempt and burning a review cycle on a plausible-looking wrong fix. At a $150/hour loaded engineer, that one cycle costs more than 30 PRs of API bill.

For most teams I work with, the answer is both: route by difficulty. Linters, codemods, and dependency bumps go to DeepSeek. Production incidents, schema migrations, and any PR that touches the auth boundary go to Opus 4.7. The router is fifty lines of code, the savings are real, and the safety floor does not move.
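A sketch of that router. The model names are the article's; the label sets and the `touches_auth` flag are hypothetical, and a real fifty lines would key off your repo paths and issue metadata:

```python
# Difficulty router: routine work goes to the cheap model, judgment
# calls and anything on the auth boundary go to the frontier model.
CHEAP, FRONTIER = "deepseek-v3.2", "opus-4.7"

ROUTINE = {"codemod", "dependency-bump", "lint", "i18n", "rename"}
JUDGMENT = {"incident", "schema-migration", "race", "design"}

def route(labels, touches_auth=False):
    """Pick a model for a PR task from its labels."""
    if touches_auth:              # safety floor: auth work never routes cheap
        return FRONTIER
    tags = set(labels)
    if tags & JUDGMENT:
        return FRONTIER
    if tags & ROUTINE:
        return CHEAP
    return FRONTIER               # unknown work defaults to the frontier model

print(route(["dependency-bump"]))              # deepseek-v3.2
print(route(["refactor"], touches_auth=True))  # opus-4.7
```

Defaulting unknown work to the frontier model is the conservative choice: the cheap model has to earn its routing, which is how the safety floor stays put.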

Why I publish numbers like these

Most AI consultants on this part of X sell a cohort, a course, or a $25K “AI growth partner” retainer with hidden pricing. I publish my consultation rate ($75) and my project tiers ($500 to $10K+) on the homepage because the math should survive being seen. If a client can re-run my numbers on their own data and arrive at a different answer, that is a better conversation than a sales call.

For an actual engagement: I will sit with your last month of agent traffic, calculate cost-per-shipped-PR on whatever model you are running, and quote a fixed-scope migration or routing layer. The whole thing fits inside a $500 to $2,000 small integration for a single-pipeline rewrite, or the $2,000 to $10,000+ custom system tier if it includes a real eval harness.

Bring last month's agent invoice

I will run the cost-per-shipped-PR calc on your real traffic and either tell you the model swap is worth it or tell you it is not. $75 for the call. No course.

Frequently asked questions

What does a 1,500-line PR actually cost on Opus 4.7 versus DeepSeek V3.2?

On a typical agentic run with ~250,000 tokens of cached agent context (Claude Code style: tool definitions, file reads, edits, diffs) and ~12,000 output tokens, Opus 4.7 lands at roughly $1.55 to $2.10 per PR with 5-minute prompt caching wired correctly. DeepSeek V3.2 (chat or reasoner, both currently map to V3.2) lands at roughly $0.08 to $0.12 per PR with its native context cache enabled. The Opus number includes a one-time write at $6.25 per million tokens for the cached prefix and reads at $0.50 per million; the DeepSeek number uses cache hits at $0.028 per million and a flat $0.42 per million output. Verified against Anthropic's pricing page and DeepSeek's API docs on 2026-05-03.

Why does the SWE-bench Pro gap matter for cost per PR?

On SWE-bench Pro, which scores real GitHub issue resolution, Opus 4.7 lands at 64.3 percent and DeepSeek V3.2 at 55.4 percent, an 8.9-point gap. On ambiguous PRs (debugging, refactors, novel features) that gap compounds across turns: even the better model fails the first attempt on 35.7 percent of issues, so it needs at least one retry on more than a third of your PRs, and the cheaper model retries more often still. A retry doubles the wall clock and the API bill simultaneously. On well-scoped, repetitive PRs (template changes, codemods, lint fixes) the gap mostly disappears and DeepSeek's cheaper rate dominates.

What about wall-clock throughput, is DeepSeek actually faster?

On the median agentic turn, DeepSeek V3.2 runs at roughly 97 tokens/second on DeepInfra with 1.32-second time-to-first-token. Opus 4.7 in adaptive max-effort reasoning runs at roughly 50.2 tokens/second on Anthropic's own API with a 17.24-second time-to-first-token. So a single Opus turn that emits 2,000 tokens of output is about 57 seconds; a DeepSeek turn that emits 2,000 tokens is about 22 seconds. Multiply by the number of turns the agent takes, factor in that reasoning-effort modes can push DeepSeek's time-to-first-answer past 2 minutes, and the wall-clock picture is much closer than the per-token throughput suggests. Numbers from Artificial Analysis as of 2026-05-03.
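The turn arithmetic above is just time-to-first-token plus output tokens over throughput:

```python
# Wall clock for one agentic turn. Throughput and TTFT figures are the
# article's Artificial Analysis numbers as of 2026-05-03.
def turn_seconds(out_tokens, tokens_per_sec, ttft_sec):
    return ttft_sec + out_tokens / tokens_per_sec

opus_turn = turn_seconds(2000, 50.2, 17.24)      # ~57 s
deepseek_turn = turn_seconds(2000, 97.0, 1.32)   # ~22 s
```

Note how much of the Opus turn is the 17.24-second TTFT rather than the stream itself: on short outputs the TTFT dominates, on long diffs the throughput gap does.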

When is DeepSeek the right cost-per-PR choice?

When the PR is well-scoped, repetitive, and inside a codebase with strong tests. Codemods, dependency bumps, format-only refactors, schema migrations from a known template, and bulk i18n string updates. On these workloads the SWE-bench Pro gap mostly evaporates, the API bill is real money, and the engineer's review time per PR is bounded. I have used DeepSeek for batched test-name renames across 2,000 files and the bill was under a dollar.

When is Opus 4.7 the right cost-per-PR choice?

When the PR is ambiguous: debugging an intermittent race, threading a new abstraction through five subsystems, picking a non-obvious API design, or anything where the first plausible-looking fix is wrong. Those PRs are where retry rate destroys the cheap model's economics. They are also where a senior engineer's time matters most, because the engineer is in the loop reading the diff, not just merging it. The published $5/$25 per Mtok rate plus prompt caching gets a real Opus pipeline inside a small SaaS budget on these workloads.

What does the engineer's hour cost in this calculation?

I priced it at $150/hour for a US-based mid-to-senior engineer fully loaded (salary plus benefits plus overhead). At that rate an engineer minute costs $2.50, so the entire ~$1.90 Opus API bill for one PR is worth well under a minute of engineer time. If Opus ships in 12 minutes and DeepSeek ships in 22 minutes (because of an extra retry), the engineer-time delta is $25 and the API delta is roughly $1.80. Same direction. If your engineer is $300/hour fully loaded (senior IC at a frontier company), Opus wins by even more. If you are a single founder at $0/hour opportunity cost, the API bill is the only thing that matters and DeepSeek wins on price alone.

Does prompt caching change the picture for either model?

Yes. Both models lean on prefix caching to make agentic workloads affordable. Opus 4.7 charges 1.25x base for a 5-minute cache write and 0.1x base for a read, so the second turn in a session is already cheaper than re-sending. DeepSeek V3.2 dropped its cache-hit input price to $0.028 per million tokens on 2026-04-26, which is exactly 10 percent of cache-miss. Wire caching correctly on both sides before benchmarking; an uncached comparison is not a fair fight in either direction.
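The multipliers in that answer reduce to a few lines. The $0.28/M DeepSeek cache-miss rate is inferred from the stated hit price being 10 percent of it, not quoted directly:

```python
# Cache pricing arithmetic. Opus base input is the article's $5/M list
# price; multipliers are 1.25x (5-minute cache write) and 0.1x (read).
OPUS_IN = 5.00
opus_cache_write = 1.25 * OPUS_IN     # $6.25 per million tokens written
opus_cache_read = 0.10 * OPUS_IN      # $0.50 per million tokens read

DEEPSEEK_MISS = 0.28                  # $/M, inferred cache-miss input rate
deepseek_hit = 0.10 * DEEPSEEK_MISS   # $0.028 per million, the 2026-04-26 cut
```

These are the same $6.25/M write and $0.50/M read figures used in the per-PR cost answer above, so the two calculations stay consistent.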

What is the actual rule of thumb you use to pick between them?

If the PR is one a junior could ship with detailed instructions, send it to DeepSeek. If the PR needs the senior to make a judgment call, send it to Opus 4.7. The dollar difference per PR is a rounding error against a $150/hour engineer; the wrong-model tax shows up as a re-rolled PR, a wasted review cycle, and worst case a regression in production that costs more than a year of API bills.

Are these numbers going to be wrong in three months?

Probably the absolute dollars, yes. The structure, no. List prices change, throughput improves, new model versions ship. The thing that does not change: per-token list price is one input among three (engineer time, retry rate, review burden), and a page that compares only the first one is misleading you. Re-run the worked example above with the day's prices and your team's loaded engineer rate. The shape of the answer holds.