Coding agents are two workloads, not one.
Every conversation about “the right model for AI engineering” assumes a single API path. There isn't one. There are two: interactivity-mode (a human at the keyboard, sub-2-second TTFT required) and throughput-mode (nobody is watching, a 24-hour SLA is fine, the bill is the only metric). They have opposite cost curves. Most teams collapse them into one path and overpay roughly 20x on the lane that should be running on the batch API with prompt caching stacked on top.
Direct answer, verified 2026-05-04
Which lane should this call go on?
Interactive lane:
- Trigger: human at the keyboard, IDE, live chat
- SLA: sub-2-second TTFT, sub-30-second turn
- Default model: Haiku 4.5 or Sonnet 4.6
- Escalate to Opus 4.7 only on judgment calls
- Cache: 5-minute prompt cache, mandatory

Throughput lane:
- Trigger: cron, CI, webhook, “submit and forget”
- SLA: 24 hours (typical: minutes to hours)
- Default model: Opus 4.7 with max-effort reasoning
- Cache: stacks with batch, ~$0.25/Mtok cached input
- No streaming, no real-time deadline
The whole routing decision is one boolean: is a human waiting on this token stream right now. Sources: Anthropic batch processing docs, Anthropic pricing, prompt caching docs.
The thesis: every coding-agent stack has a hidden second lane
Open the dashboard for any team that has been running a coding agent for more than a quarter and you will see a bimodal traffic distribution. A spike during business hours when engineers are at the keyboard, then a long flat tail of nightly review passes, codemod runs, eval harness jobs, doc regenerations, dependency audits, and scheduled cleanup tasks. The first set of calls is interactive. The second set is not. They are both being billed at full streaming-API list price because the same client library built for the IDE is also being called from cron.
That second lane is what most architecture posts about coding agents skip. It does not show up in the demo video. It does not get a marketing page. It quietly drives 30 to 70 percent of total agent token volume and pays the highest possible rate per token, because nobody flagged it as a separate workload. The interesting engineering question is not “which model is best.” It is “which lane is this call on, and is the call on the right lane.”
What the routing layer actually looks like
One classifier in front of every agent call, two API paths behind it. The classifier is a boolean and one optional hint. Everything else is plumbing.
One classifier, two lanes, two cost curves
The four sources on the left all look like “a coding-agent call” from outside. They are not. The top two are someone staring at the cursor. The bottom two are background work. The router decides on one input (is a human waiting) and dispatches to one of two API paths. The streaming path runs at full list price because nothing else gets you sub-2-second TTFT. The batch path runs at half list price because the workload was happy to wait, and stacks with prompt caching because the cached prefix survives the batch window.
The actual code, not the diagram
Two hundred lines of plumbing total in production. The interesting part is twenty lines. Below is the version I drop into a fresh client engagement. The model picker, the queue, and the result poller are obvious extensions; this is the spine.
Reading order: the routing boolean, the two API paths behind it, the comment at the bottom. Everything else is instrumentation. The decision is small. The bill it moves is not.
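A minimal sketch of that spine, assuming the anthropic Python SDK. The function name, the queue hook, and the model ID strings are mine, stand-ins for whatever your lanes default to; the two API calls are the real ones.

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder model IDs: substitute your own lane defaults.
INTERACTIVE_MODEL = "claude-haiku-4-5"
THROUGHPUT_MODEL = "claude-opus-4-7"

def route_agent_call(messages, system_text, human_waiting: bool, model_hint=None):
    """The whole routing decision: one boolean, two API paths."""
    cached_system = [{"type": "text", "text": system_text,
                      "cache_control": {"type": "ephemeral"}}]
    if human_waiting:
        # Interactive lane: streaming API, full list price, prompt cache on.
        return client.messages.create(
            model=model_hint or INTERACTIVE_MODEL,
            max_tokens=4096,
            system=cached_system,
            messages=messages,
            stream=True,  # sub-2-second TTFT is the whole point of this lane
        )
    # Throughput lane: batch API, half list price, caching stacks on top.
    batch = client.messages.batches.create(
        requests=[{
            "custom_id": "job-0",  # real code derives this from the job queue
            "params": {
                "model": model_hint or THROUGHPUT_MODEL,
                "max_tokens": 4096,
                "system": cached_system,
                "messages": messages,
            },
        }],
    )
    return batch.id  # the result poller owns it from here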
“Input on Opus 4.7 drops from $5 per million tokens at streaming list price to roughly $0.25 per million on the batch API with prompt caching wired. Same model, same answer quality, different lane. The team running review passes on the streaming surface is paying twenty times the rate it has to.”
Anthropic pricing page and batch processing docs, verified 2026-05-04
What one batch lane round-trip actually looks like
The interactive lane is what every demo shows: stream tokens, the user reads them. The batch lane is harder to picture because the agent and the consumer of its output are decoupled in time. Below is the full shape of one job: submit, get a batch ID, poll, write the artifact, notify when ready.
One throughput-lane job, end to end
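A sketch of that shape, again assuming the anthropic Python SDK; the system prompt, the PR list, and the artifact path are stand-ins for your own stack.

```python
import time
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a code reviewer."        # stable, cached prefix in real use
open_prs = [(101, "diff --git a/parser.py ...")]  # stand-in for your PR source

# Submit: one request per open PR, all sharing the same cached prefix.
batch = client.messages.batches.create(
    requests=[{
        "custom_id": f"review-{pr}",
        "params": {
            "model": "claude-opus-4-7",  # placeholder model ID
            "max_tokens": 8192,
            "system": [{"type": "text", "text": SYSTEM_PROMPT,
                        "cache_control": {"type": "ephemeral"}}],
            "messages": [{"role": "user", "content": f"Review PR #{pr}:\n{diff}"}],
        },
    } for pr, diff in open_prs],
)

# Poll: the job resolves when it resolves, inside the 24-hour window.
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)

# Write artifacts, then notify: the engineer reads the digest in the morning.
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        with open(f"artifacts/{entry.custom_id}.txt", "w") as f:  # stand-in path
            f.write(entry.result.message.content[0].text)
print(f"digest ready: batch {batch.id}")  # stand-in for Slack/email notify
```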
No human is in this picture during the run. The engineer arrives in the morning and gets a digest with 47 review passes ready to scan. The bill is half list, the model was Opus 4.7 in max-effort reasoning, and the cached prefix (project memory plus tool definitions plus a stable system prompt) was reused across all 47 requests at cache-read rates.
Why the visible “AI consulting” layer never mentions any of this
The version of this argument you read on X is usually written by someone selling a course, a cohort, or an “AI growth partner” retainer with hidden pricing. None of those people ship a routing layer. The course playbook is generic, the cohort covers the demo path only, and a retainer that bills on outcomes has no structural incentive to drop a client's API bill by half. So the second lane stays invisible in the public discourse, and every team rebuilds the same realization in private after a quarter of streaming-API invoices.
I publish my consultation rate ($75) and my project tiers ($500 to $10,000+) on the homepage because the engineering decision should survive being seen. There is nothing proprietary about the classifier above. The value is in the implementation: which calls are actually batchable in your stack, what the cached prefix looks like in your agent loop, where the result poller writes its artifacts, and which 200-line piece of plumbing you do not have to write twice. None of that goes in a course.
How I scope this for a small team
For a five-to-fifteen-person engineering team running coding agents on a real codebase, the work is well bounded. One day to audit the last month of agent traffic and identify the second lane (it is always there; the only question is its share of spend). One day to build the router and the result poller. One day to wire prompt caching correctly on both lanes if it is not already. One day to run the eval harness on both paths and confirm there is no quality regression. That fits inside the small integration tier ($500 to $2,000) on the homepage. The work pays for itself inside a month for any team spending more than $2,000 a month on coding agents today.
For a larger setup, with multiple agent surfaces, an internal eval harness, and dollars per day in spend that justify a real queue, the work scales into the custom system tier ($2,000 to $10,000+). Everything is fixed scope. No hourly billing, no retainer, no course. The full price list is on the site, the services catalog describes each tier, and the deliverable is shipped code in your repo, not a slide deck.
Bring last month's coding-agent invoice
I will look at the bimodal traffic, identify the second lane, and either tell you the routing layer is worth the engineering days or tell you it is not. $75 for the call. No course.
Frequently asked questions
What is the actual difference between throughput-mode and interactivity-mode for a coding agent?
Interactivity-mode is when a human is staring at a token stream waiting to react. The metrics that matter are time-to-first-token (target sub-2 seconds), inter-token latency (target 80+ tokens/second perceived), and the cost of a retry the user kicks off because they got bored. Throughput-mode is when nobody is watching. The metrics that matter are total tokens processed per dollar and total wall-clock to a job-complete callback. The two have opposite cost curves: interactivity pays a premium for low TTFT, throughput pays the lowest possible per-token rate and is fine waiting up to 24 hours for an answer.
Where does Anthropic's batch API fit?
Anthropic's Message Batches API runs requests asynchronously inside a 24-hour SLA at exactly 50 percent of the standard input and output rates, on every model including Opus 4.7. The 50 percent batch discount stacks with the 5-minute prompt caching discount: a cache read is normally 0.1x the base rate (so $0.50 per million tokens on Opus 4.7), and on the batch API that read becomes 0.05x base, roughly $0.25 per million tokens. That is a 20x reduction from the $5 per million list rate. The catch is that batch jobs return when they return; you cannot stream and you cannot pin the response under a real-time deadline. Anything where the user is at the keyboard fails immediately.
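The stacking as plain arithmetic, a sketch using the rates quoted above:

```python
BASE_INPUT = 5.00   # $/Mtok, Opus 4.7 streaming list rate for input
CACHE_READ = 0.10   # cache reads bill at 0.1x the base rate
BATCH = 0.50        # the batch API bills at 0.5x across the board

streaming_cached = BASE_INPUT * CACHE_READ       # $0.50/Mtok
batch_cached = BASE_INPUT * CACHE_READ * BATCH   # $0.25/Mtok
print(BASE_INPUT / batch_cached)                 # 20.0, the "20x reduction"
```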
What kinds of coding-agent work belong on the batch lane?
Anything you can run overnight or at the end of the workday and review the next morning. Concrete examples: codemods across thousands of files, nightly review passes on every open PR, regenerating an embedding index over a repo, running an eval harness against a new model version, generating release notes from the day's commits, bulk porting test names, security review of dependencies after a major bump, regenerating boilerplate from a spec, regenerating typed API clients after an OpenAPI change, generating per-file documentation across a monorepo. None of these need a human in the loop during the run; all of them are willing to wait hours in exchange for a half-price bill.
What kinds of coding-agent work belong on the interactive lane?
Anything where a human is in the loop right now. Claude Code at the keyboard. Cursor inline completion. A live debugging session. Code review where the engineer is reading the diff as it streams. A pair-programming voice agent. Anything where TTFT above 5 seconds breaks the user's flow and where any retry the model kicks off is paid in human attention. The interactive lane runs on the streaming API at full list price; the small win here is picking a fast model (Haiku 4.5, Sonnet 4.6, DeepSeek V3.2) for low-stakes turns and reserving Opus 4.7 only for the tricky judgment calls.
What is the actual cost gap between collapsing the two lanes vs splitting them?
On a 250,000-token cached prefix, an 8,000-token fresh input, and a 12,000-token output, one Opus 4.7 turn on the interactive surface lands at roughly $1.90 with prompt caching wired correctly. The same turn on the batch API with the same caching lands at roughly $0.10 to $0.25, depending on whether you can pin the cache write across the batch window. A team running 200 batchable jobs per day on the interactive surface (review passes, eval runs, codemods, doc regen) is burning roughly $380 per day on a workload that should cost $20 to $50. That is north of $30,000 a quarter; across a year, a six-figure rounding error.
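The same numbers as a worked sketch. The $25/Mtok output rate is my assumption, chosen because it reproduces the $1.90 figure once the 1.25x cache-write premium on the first turn is counted:

```python
# $/Mtok rates; OUT is an assumption that reproduces the $1.90 figure above.
IN, OUT = 5.00, 25.00
CACHE_WRITE, CACHE_READ, BATCH = 1.25, 0.10, 0.50

prefix, fresh, output = 0.250, 0.008, 0.012  # Mtok: cached prefix, input, output

# Interactive turn that writes the 250K prefix into the cache:
interactive = prefix * IN * CACHE_WRITE + fresh * IN + output * OUT     # ~$1.90
# Batch turn reading a warm prefix, everything billed at half rate:
batch = (prefix * IN * CACHE_READ + fresh * IN + output * OUT) * BATCH  # ~$0.23

print(f"${interactive:.2f} vs ${batch:.2f}")
print(f"200 jobs/day: ${200 * interactive:.0f} vs ${200 * batch:.0f}")  # ~$380 vs ~$46
```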
Why doesn't every team already do this routing?
Three reasons. First, the routing layer has to be written; it is not a setting you toggle. Second, batch jobs need a results poller and a queue, which is real infrastructure most coding-agent stacks did not build because Claude Code, Aider, and Cursor are all interactive-first. Third, most of the visible AI-for-engineering content online is sold by course operators and “AI growth partners” who have never shipped a routing layer in production and do not write about it because it is not in their playbook. The result is that a four-line decision (is a human watching this stream right now, yes or no) gets skipped, and the bill quietly inflates.
Does prompt caching matter on the interactive lane too?
Yes, more than people think. The interactive surface in a Claude Code style loop is dominated by a long, slowly-mutating cached prefix (tool definitions, file reads, accumulated diffs, system prompt, skills). With 5-minute prompt caching wired correctly, the second turn in a session is already cheaper than the first because reads are 0.1x base. Without it, you pay full input rate on a 250K-token prefix every turn, and a single multi-turn session can cost more than a small batch job on the same task. Wire caching first; only after that does the throughput vs interactivity routing decision actually move money.
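What “wired correctly” can look like in that loop, sketched with the anthropic SDK; the tool definition and prompts are illustrative, the `cache_control` breakpoints are the real mechanism:

```python
import anthropic

client = anthropic.Anthropic()

# Stable prefix, marked with cache breakpoints: tool definitions, then system.
tools = [{
    "name": "read_file",
    "description": "Read a file from the working tree.",
    "input_schema": {"type": "object",
                     "properties": {"path": {"type": "string"}},
                     "required": ["path"]},
    "cache_control": {"type": "ephemeral"},  # breakpoint after the tool block
}]
system = [{"type": "text",
           "text": "You are a coding agent. Project memory follows.",
           "cache_control": {"type": "ephemeral"}}]  # breakpoint after system

# Turn 1 writes the prefix into cache (1.25x); turns 2+ inside the 5-minute
# window read it back at 0.1x, so the session gets cheaper after the first turn.
response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder ID for the document's Sonnet 4.6
    max_tokens=2048,
    tools=tools,
    system=system,
    messages=[{"role": "user", "content": "Why does test_parser fail on empty input?"}],
)
print(response.usage)  # cache_creation_input_tokens vs cache_read_input_tokens
```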
What is the right architecture for a small team that wants to do this without overbuilding?
A four-line router in front of the agent loop. If the call is initiated from a streaming surface (CLI prompt, IDE extension, web chat) and the user is waiting, send it to the streaming API at list price with prompt caching on. If the call is initiated from a cron, a CI hook, a webhook, or a “submit and forget” UX with a results page, queue it for the batch API. Hold the batch ID, poll the results endpoint, write the output back to a per-job artifact. Maybe 200 lines of Go or Python. Once the router exists you can vary the model per lane (Haiku 4.5 for low-stakes interactive, Opus 4.7 for everything else) without touching the agent loop.
When is it worth NOT splitting the lanes?
When the total monthly API spend on coding agents is below roughly $500 a month. The savings from routing are real but they have a fixed engineering cost (the router, the queue, the monitoring) and below that threshold the engineer-hours to ship the router beat the API savings for a year. Above $500 a month and especially above $2,000 a month, the router pays for itself inside a quarter. This is the same threshold where prompt caching wiring also stops being optional.