AI model migration scaffolding consulting: the five artifacts to build first

Most teams that hire for an AI model migration ask the wrong question. They ask "can you rewrite our prompts for the new model?" The right question is "can you give us a way to measure whether the new model is better?" The first one is a week of guessing. The second one is a piece of scaffolding you build once and reuse for every future migration.

Matthew Diakonov
11 min read

Direct answer, verified 2026-04-30

What this engagement actually is

A fixed-scope engagement that builds five reusable migration artifacts before anyone touches production prompts. Once those artifacts exist, the prompt rewriting itself is a normal pull request, and every future model swap on the same stack takes days instead of weeks. Pricing on c0nsl is on the page:

Engagement                                  Price          Notes
Scoping consult                             $75            30 to 60 min
First migration with scaffolding            $2K to $10K+   Custom System tier, fixed scope
Each later migration on the same harness    $500 to $2K    Small Integration tier, fixed scope

Source: the published rate ladder on c0nsl.com. No discovery deck, no rate held back for the call.

The thesis: migration is scaffolding, not rewriting

Every model swap I have shipped, going back to the days of switching from text-davinci to GPT-3.5, runs into the same wall. The new model is mostly better, slightly worse on a few things, and behaves differently on a long tail of inputs the team has not looked at since the prompt was written. If the team rewrites the prompt to chase the obvious wins, they ship a regression on the tail and find out when a customer complains.

The fix is not a smarter prompt. The fix is a piece of harness that sits beside the production stack, runs both models on the same inputs, scores the diffs, and gates the cutover. That harness is what scaffolding-first consulting actually delivers. Without it, you are doing migration by vibes. With it, the migration is a CSV you read like a code review.

The rest of this page names the five artifacts in the order I build them, what each one is, and how they hand off to the next.

How the artifacts connect

Before the per-artifact detail, the bird’s-eye flow. Live requests fan out to two paths. The production model serves the user. The shadow gate sends a sampled copy to the candidate model. Both responses land in the diff scorer, which compares them against the golden fixture set and the per-fixture rubric. The scorer’s output gates the rollback flag. Cutover is the moment you flip the flag.

Migration scaffolding, end to end

Live request → Prod model → serve user
Live request → Shadow gate (5% sample) → Candidate model (shadow call)
(prod response, candidate output, fixture) → Diff scorer
Diff scorer → verdict: pass / drift / regress → Rollback flag → stay or revert

Nothing in this picture is novel infrastructure. It is glue, and the glue is the work. Every artifact below is a small piece of this same diagram, scoped to fit a small team’s codebase.

Artifact 1: the golden fixture set

Capture 200 to 1,000 real production inputs with their accepted outputs, redact the PII, and commit the result to the repo (or to a private object store if your data is too sensitive to live in git). This is the artifact that makes everything else possible. Without a fixture set you are guessing about behavior. With one you can re-run any prompt change against the same 1,000 cases in a minute.

The capture itself is a one-time script. Sample from production logs across at least one full business cycle (a week is the floor; a month is better if traffic varies a lot day to day). Stratify by route or use case so the rare paths are not lost in the average. Tag each fixture with the verdict the current model gave so the scorer has a baseline to compare against. The redaction layer is non-negotiable: emails, phone numbers, full names, addresses, any field your privacy policy treats as PII. Replace with deterministic stand-ins so the prompt structure is preserved but the identifying content is not.

capture_golden_set.py
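
A minimal sketch of what that capture script tends to look like, assuming production requests sit in a SQLite log table; the table name, column names, and regexes are illustrative, not the exact script from the engagement, and in practice full names and addresses need an NER pass rather than a regex:

import hashlib
import json
import random
import re
import sqlite3

# Email and phone are regex-friendly; full names and addresses usually
# need a proper NER pass, which is part of the per-client redaction rules.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def stand_in(label, value):
    # Deterministic stand-in: the same email always maps to the same token,
    # so prompt structure and repeated references survive redaction.
    return f"<{label}:{hashlib.sha1(value.encode()).hexdigest()[:6]}>"

def redact(text):
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(lambda m, l=label: stand_in(l, m.group()), text)
    return text

def capture(db="prod_logs.db", out="golden_set.jsonl", per_route=100):
    rows = sqlite3.connect(db).execute(
        "SELECT route, input, output, verdict FROM requests "
        "WHERE ts >= datetime('now', '-7 days')"
    ).fetchall()
    by_route = {}
    for route, inp, outp, verdict in rows:
        by_route.setdefault(route, []).append((inp, outp, verdict))
    with open(out, "w") as f:
        # Stratified sample: rare routes are not lost in the average.
        for route, samples in by_route.items():
            for inp, outp, verdict in random.sample(samples, min(per_route, len(samples))):
                f.write(json.dumps({
                    "route": route,
                    "input": redact(inp),
                    "accepted_output": redact(outp),
                    "baseline_verdict": verdict,  # what the current model did
                }) + "\n")

if __name__ == "__main__":
    capture()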

Artifact 2: the parallel runner

A small async script that takes two model IDs and the fixture file, calls both models on every input, and writes the responses side by side into a results file. That is it. The runner is stateless, has no business logic in it, and lives in your repo as a pytest job or a CLI command. The point of having it as its own artifact is that it works for the next migration too: same script, different model IDs.

parallel_runner.py
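
The shape of the runner, sketched against an OpenAI-compatible async client; the client, the file names, and the ten-way concurrency cap are assumptions for the sketch, not fixed parts of the artifact:

import argparse
import asyncio
import csv
import json

from openai import AsyncOpenAI  # assumes an OpenAI-compatible endpoint

client = AsyncOpenAI()

async def call_model(model_id, prompt, sem):
    async with sem:  # cap concurrency so the run stays polite to the API
        resp = await client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

async def main(baseline, candidate, fixtures_path="golden_set.jsonl"):
    with open(fixtures_path) as f:
        fixtures = [json.loads(line) for line in f]
    sem = asyncio.Semaphore(10)
    base_outs, cand_outs = await asyncio.gather(
        asyncio.gather(*(call_model(baseline, fx["input"], sem) for fx in fixtures)),
        asyncio.gather(*(call_model(candidate, fx["input"], sem) for fx in fixtures)),
    )
    # Side-by-side results, no business logic: the diff scorer reads this next.
    with open("results.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["route", "input", "baseline", "candidate"])
        for fx, b, c in zip(fixtures, base_outs, cand_outs):
            w.writerow([fx["route"], fx["input"], b, c])

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--baseline", required=True)
    p.add_argument("--candidate", required=True)
    args = p.parse_args()
    asyncio.run(main(args.baseline, args.candidate))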

Artifact 3: the diff scorer

The scorer is where most teams underbuild. A naive scorer is a string compare, which is the wrong tool for almost any LLM output. The right scorer runs three layers. First, a structural check: did the JSON parse, are the required fields present, do tool calls have the right shape. Second, a semantic check: does the candidate output mean the same thing as the baseline. For short outputs, heuristics (length ratio, key-noun overlap) are enough. For longer outputs, a third model with a strict rubric does the judging. Third, a regression flag for known-hard fixtures: any fixture that was added because of a previous bug fix gets re-checked against the exact behavior that was fixed.

What the team has been doing without scaffolding usually looks like the left side. What the scorer does for them looks like the right side. Same data, very different signal.

Reading the diff: ad-hoc vs scored

Ad-hoc (the left side):

# Engineer pastes 30 outputs into a doc, eyeballs them
# at 11 PM the night before the cutover.

> looks ok
> looks ok
> hmm shorter, but probably fine
> looks ok
> can't tell
> looks ok
# ship it

Scored (the right side, illustrative rows from the harness CSV):

fixture_id,route,verdict,reason
0007,classify,pass,structural and semantic checks passed
0141,summarize,drift,output 40% shorter but same meaning
0312,extract,regress,required field "status" missing from JSON
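
For concreteness, a sketch of the three layers with the heuristic semantic check written out and the judge-model path left as the escalation. Every threshold, field name, and flag here (expects_json, known_hard, the 0.6 overlap cutoff) is illustrative and gets tuned per fixture set during scoping:

import json

def structural_check(output, required_fields=("status", "result")):
    # Layer 1: did the JSON parse, are the required fields present.
    # required_fields is per-route; these two are placeholders.
    try:
        data = json.loads(output)
    except (TypeError, json.JSONDecodeError):
        return False
    return all(k in data for k in required_fields)

def semantic_check(baseline, candidate):
    # Layer 2, heuristic variant for short outputs: length ratio plus
    # key-word overlap. Longer outputs escalate to a judge model with a
    # strict rubric instead of this function.
    ratio = len(candidate) / max(len(baseline), 1)
    if not 0.5 <= ratio <= 2.0:
        return False
    base = {w for w in baseline.lower().split() if len(w) > 4}
    cand = {w for w in candidate.lower().split() if len(w) > 4}
    return len(base & cand) / max(len(base), 1) >= 0.6

def score(fixture, baseline_out, candidate_out):
    if fixture.get("expects_json") and not structural_check(candidate_out):
        return "regress", "structural check failed"
    # Layer 3: fixtures pinned by a past bug fix must still match the
    # exact accepted behavior, not just mean the same thing.
    if fixture.get("known_hard") and candidate_out.strip() != fixture["accepted_output"].strip():
        return "regress", "known-hard fixture no longer matches pinned output"
    if not semantic_check(baseline_out, candidate_out):
        return "drift", "output meaning or length changed vs baseline"
    return "pass", "structural and semantic checks passed"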

Artifact 4: the shadow-traffic gate

The fixture set tells you about behavior on captured inputs. The shadow gate tells you about behavior on the inputs you have not seen yet. It is a sampled mirror: a configurable percentage of live requests (5% is the usual starting point) gets a fire-and-forget call to the candidate model. The user never sees the shadow response. The diff scorer reads it asynchronously and writes it into the same CSV that the offline harness produces.

Two things matter about the shadow gate. First, it stays sampled and short-circuited so the candidate-model bill stays small. No agent loops, no tool retries, no follow-up calls. For a small team running a $3K-per-month main inference bill, the shadow line item typically lands at $50 to $150 per month over a one to two week migration window, then drops to zero when the gate is closed after cutover. Second, the gate is built to be deleted. After cutover the entire shadow path goes away in a single PR.
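
In an async Python service (FastAPI is one of the stacks named later on this page), the whole gate can be this small. The model IDs, the queue, and the sample-rate constant below are placeholders for the sketch:

import asyncio
import random

from openai import AsyncOpenAI  # assumes an OpenAI-compatible endpoint

client = AsyncOpenAI()
PROD_MODEL = "prod-model-id"            # placeholder, read from the adapter in practice
CANDIDATE_MODEL = "candidate-model-id"  # placeholder
SHADOW_SAMPLE_RATE = 0.05               # 5% default, tuned per route
scorer_queue: asyncio.Queue = asyncio.Queue()  # drained asynchronously by the diff scorer

async def call_model(model_id, prompt):
    resp = await client.chat.completions.create(
        model=model_id, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

async def shadow_call(prompt):
    # Fire-and-forget: one plain completion, no agent loops, no tool
    # retries, no follow-up calls. The user never sees this response.
    try:
        out = await call_model(CANDIDATE_MODEL, prompt)
        await scorer_queue.put({"input": prompt, "candidate": out})
    except Exception:
        pass  # a failed shadow call must never affect the live request

async def handle_request(prompt):
    response = await call_model(PROD_MODEL, prompt)  # serve the user first
    if random.random() < SHADOW_SAMPLE_RATE:
        asyncio.create_task(shadow_call(prompt))
    return response

One file plus one call site is also what makes the delete-after-cutover PR a single small diff.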

5%

Default sample rate for the shadow gate. Big enough to surface real-world drift in a week of traffic, small enough that the candidate-model bill stays inside two figures a day for most small teams.

Default in the c0nsl scaffolding template, tuned per route during scoping

Artifact 5: the one-line rollback flag

Cutover is not a brave decision. It is a config change. The rollback flag is one boolean (or a string holding the active model ID) that the production code reads at the top of every inference call. If the harness flags a regression after cutover, you flip the flag back, the next request goes to the previous model, and you have time to look at the CSV without a war room.

The reason this is a named artifact and not an afterthought is that without it, every “migration” is actually a point-of-no-return code change with the model ID hardcoded in seven files. The first half-day of the engagement usually goes into routing every model call through one adapter so the flag actually has somewhere to live. After that, the flag itself is three lines.

model_adapter.py
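
The adapter plus flag, sketched with the flag in an environment variable; in a real engagement it usually lives in whatever config or feature-flag system the team already runs, and the names below are illustrative:

import os

from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()

def active_model():
    # The rollback flag: a string holding the active model ID, read on
    # every call. Flip it back and the next request uses the old model.
    return os.environ.get("ACTIVE_MODEL_ID", "prod-model-id")  # placeholder names

def complete(prompt):
    # Every model call in the codebase routes through this one function,
    # so the flag has exactly one place to live.
    resp = client.chat.completions.create(
        model=active_model(),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content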

What it looks like when the harness runs

The five artifacts together feel like a normal CI pipeline once they are in place. You change the candidate model ID, re-run the harness, read the CSV, ship a small prompt change if needed, re-run, open the shadow gate, watch the live diff over a few days, and flip the rollback flag in the dashboard. The whole thing reads like the demo below.

A scaffolded migration, end to end
$ python parallel_runner.py --baseline opus-4-6 --candidate opus-4-7
Ran 1,000 fixtures: baseline=opus-4-6 candidate=opus-4-7

Step 1 of 5. Run both models on the captured fixture set. About two minutes wall time on a small workload.

Pricing reality check

The reason this work fits inside a published rate ladder and not a six-figure consulting agreement is that the artifacts are small. None of them is novel. The value is in scoping, naming, and ordering them so a small team without an MLOps function can actually own the result.

For a typical engagement (one to three prompt surfaces, an existing FastAPI or Next.js stack, a non-regulated domain), the first migration with full scaffolding lands in the $2,000 to $6,000 range and takes two to four weeks of calendar time, with most of mine running closer to one week of focused work. For larger surfaces, hard-regulated data (clinics, law firms, anything where redacted text is still sensitive), or stacks where the model ID is hardcoded in many files, the price moves up the Custom System tier toward $10K+. Every subsequent migration on the same harness, including the same-team move from Opus 4.7 to whatever ships next, fits the Small Integration tier at $500 to $2,000, because the artifacts are reused.

The reason to publish this on a page rather than gate it behind a discovery call is the same reason every tier on c0nsl is on the homepage: a small operator should be able to do the ROI math themselves before booking a $75 consult, not after.

What this engagement is not

It is not an LLMOps platform sale. It is not a course on evaluation harnesses. It is not a one-page checklist you can print and hand to a junior engineer. It is a senior engineer sitting inside your codebase for a small fixed window, scoping five concrete pieces of glue to your stack, your data, and your redaction rules, and leaving you with a harness your team can run for the next model that ships.

If your situation is bigger than that (a large model fleet across many products, a procurement-led rollout, a regulator asking about a model card), you are outside the size of team I work with and the right answer is to hire an applied-AI lead internally rather than to engage a solo consultant. That answer is on the services page too, before any consult.

Bring one prompt and a model you want to migrate to.

The $75 scoping consult is enough to size the five artifacts to your stack. You leave the call with a fixed quote, a calendar window, and the specific files I would touch first.

Frequently asked questions

What does an AI model migration scaffolding consulting engagement actually deliver?

Five reusable artifacts that live in your repo after I leave: a golden fixture set (200 to 1,000 captured production inputs with their accepted outputs), a parallel runner that calls two model versions on the same input, a diff scorer that classifies each pair as pass, drift, or regress, a shadow-traffic gate that mirrors live requests to the candidate model without serving its responses, and a one-line rollback flag wired through your config. Once those five exist, the prompt rewriting that everyone else sells as 'migration consulting' becomes a normal pull request you ship in an afternoon.

Why scaffold before rewriting prompts? What goes wrong if you skip the harness?

You ship a regression you cannot see. New models change behavior on the long tail: edge-case classifications, refusal patterns, tool-call shape, response length, JSON adherence. Without a diff scorer running over a real fixture set, the regression shows up as a slow drip of customer tickets two weeks later, and you cannot tell whether it was the model swap, a prompt edit, or a downstream change. With the harness in place, the regression shows up as a row in a CSV before any user sees it. The rewriting work is the easy 20%. The 80% that protects you is the scaffolding.

How long does a scaffolding engagement take, and what does it cost at c0nsl?

First migration with full scaffolding: typically two to four weeks of calendar time, scoped inside the published Custom System tier of $2K to $10K+ on the c0nsl homepage. The range depends on how many distinct prompt surfaces you have (one classification call vs eight), whether your stack already routes through a single model adapter or has model IDs hardcoded in twelve files, and whether you need the shadow-gate piece or can get by with offline replays. Every subsequent migration on the same scaffolding (e.g. moving from Opus 4.7 to whatever ships next) usually fits the Small Integration tier of $500 to $2K, because the artifacts are reused.

Do I need an MLOps team to maintain the scaffolding after handoff?

No. The whole point of building it for a small team is that it has to run inside the same CI you already use. The diff scorer is a pytest job. The fixture set is JSON in your repo. The parallel runner is a script that takes two model IDs as arguments. The shadow gate is a feature flag and a fire-and-forget HTTP call. If your team can keep a normal test suite green, they can keep this green. The handoff doc is one page.
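
As a sketch of that CI surface, assuming the scorer has already written its CSV (the file and column names are illustrative), the pytest job can be one parametrized test:

import csv

import pytest

with open("scored_results.csv") as f:
    ROWS = list(csv.DictReader(f))

@pytest.mark.parametrize("row", ROWS, ids=lambda r: r["fixture_id"])
def test_no_regressions(row):
    # The suite stays green as long as no fixture scores "regress".
    # "drift" rows are worth a look but do not block the build.
    assert row["verdict"] != "regress", row["reason"]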

How is the golden fixture set captured? Won't it leak customer data?

Capture is done from production logs over a sampling window (usually a week of real traffic) into a local SQLite or JSONL file. Before that file lands in the repo, it goes through a redaction pass: emails, phone numbers, full names, addresses, and any field your privacy policy treats as PII are replaced with deterministic stand-ins so the prompt structure is preserved but the identifying content is not. For clinics and law firms where even redacted text is sensitive, the fixture set stays out of git entirely and lives in a private object store that the harness reads at run time. This is part of the scoping conversation, not a one-size answer.

What does the diff scorer actually score? Is it just a string compare?

No, string compare is the wrong tool for almost any LLM output. The scorer runs three layers: a structural check (did the JSON parse, were the required fields present, did tool calls have the right shape), a semantic check (does the candidate output mean the same thing as the baseline, judged by a third model with a strict rubric, or by a heuristic for shorter outputs), and a regression flag for known-hard fixtures (any input that was previously a bug fix gets re-checked against the exact behavior that was fixed). Each fixture comes back with a verdict and a reason. You read the CSV like a code review.

How does the shadow-traffic gate avoid double-billing me on the candidate model?

Two ways. First, the gate is sampled, not 100% mirror: usually 5% of live traffic, configurable per route. Second, the candidate-model response is not consumed for a user-facing answer, so you skip the most expensive parts: no agent loops, no tool retries, no follow-up calls. For a small team running a $3K-per-month main inference bill, the shadow line item typically lands at $50 to $150 a month for the duration of the migration window (a week or two), then drops to zero when the gate is flipped off after cutover.

What happens if the candidate model fails the harness on day three of shadow traffic?

The rollback flag is already wired, so 'fail' means a one-line config change reverts production to the previous model. The interesting work is then in the diff scorer's CSV, which tells you exactly which fixtures and which live samples regressed. Most failures cluster around two or three behaviors (the new model is more verbose, hits a JSON-mode quirk, refuses a borderline prompt the old one accepted). Each cluster becomes a small targeted prompt or schema change, you re-run the harness offline, and you reopen the shadow gate. The total cost of a failed first attempt is usually a day, not a week, because nothing in production ever changed.

Why is this not just a course or a template I can buy?

Two reasons. First, the redaction rules and the diff-scorer rubric are specific to your data and your product surface. Generic templates either redact too aggressively (so the harness no longer reflects real behavior) or too loosely (so customer data ends up in a third-party scoring model). Second, the shadow-traffic plumbing has to land inside whatever framework your service is built on, and there is no template that covers Next.js API routes, FastAPI workers, Cloud Run jobs, and Lambda handlers. A senior engineer scopes the specifics in a paid 30-minute consult, then quotes a fixed price. That is the engagement model published on the c0nsl homepage. The differentiator is that the rate is on the page before you ask.