AI prediction market backtest vs live. Why the curve flattens.

Every team I have helped on Polymarket or Kalshi has the same graph: a beautiful backtest equity curve, then a live curve that crawls along the floor. The instinct is to retrain the model. The answer is almost always engineering. Backtests assume instant fills at the quoted mid, infinite liquidity, instant settlement, and zero adverse selection. Live, none of those are true. Below is the gap, line by line, and what to build to close it.

Matthew Diakonov

Direct answer, verified 2026-05-04

The gap is mechanism, not model

An AI prediction market backtest beats live because the simulator silently assumes away four things: bid-ask slippage on a thin single-LP book, oracle resolution lag (UMA dispute window on Polymarket, end-of-period settlement on Kalshi), exchange position caps that compress the size at which your edge exists, and adverse selection from a news flow that reaches everyone else first. Together they typically erase 30 to 80 percent of the backtest expected return on a real strategy. None of it is fixed by retraining. All of it is fixed by infrastructure.

Sources: Polymarket docs, Kalshi trading API docs, UMA optimistic oracle docs.

Five assumptions your notebook quietly made

None of these are wrong on purpose. They are the defaults of every backtest framework I have read. The problem is each one maps to a real number you eat in production, and they compound.

Feature | Backtest (notebook) | Live (Polymarket / Kalshi)
Fill price | Quoted mid or last trade, assumed instant | Cross a thin single-LP book; 50 to 300 bps of slippage at the size where your edge exists
Fill size | Whatever you ordered, in full | Capped by visible depth; partial fills; the rest exposed to adverse selection while quoted
Settlement timing | At the moment of the resolving event | Polymarket: UMA proposal liveness + dispute window (hours, longer if challenged). Kalshi: end of contract period, not the news event
Position size | Unbounded, or whatever your notional cap is | Per-market exchange caps (lower on Kalshi for retail) plus your own oracle-dispute risk budget
News flow | Synchronous: model sees price and news together | Asynchronous: by the time your model reads the headline, faster bots have already moved the price
Gas / fees | Often zero in the simulator | Polymarket: Polygon gas per quote and cancel; on a 200-requote-a-day strategy this dominates a thin edge
Dispute risk | Outcome is whatever your label says | Ambiguous outcomes can be re-proposed, escalated, or resolved against your read of the rules

Numbers are typical ranges I have measured on real Polymarket and Kalshi books; they vary by market depth and event volatility.
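The fill-price row is easy to make concrete. A toy sketch of what crossing a thin book actually costs versus the mid your notebook assumed; the book levels, sizes, and prices here are invented for illustration, not from any real market:

```python
def walk_book(asks, size):
    """Average price paid crossing a book of (price, depth) levels, best first.
    filled < size when visible depth runs out (a partial fill)."""
    remaining, cost = size, 0.0
    for price, depth in asks:
        take = min(remaining, depth)
        cost += take * price
        remaining -= take
        if remaining <= 0:
            break
    filled = size - remaining
    avg = cost / filled if filled else 0.0
    return avg, filled

# hypothetical thin book: best bid 0.60, best ask 0.61, so mid = 0.605
asks = [(0.61, 200), (0.63, 200), (0.66, 500)]
mid = 0.605
avg, filled = walk_book(asks, 400)
slippage_bps = (avg - mid) / mid * 1e4   # roughly 248 bps vs a mid-price fill
```

The backtest books the trade at 0.605; the live fill averages 0.62. That difference, times every trade, is the left half of your flattened curve.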

What the simulator says vs what live forces you to model

The cleanest way to see the gap is to put the two execution loops side by side. Same model, same signal, same market. The left side is what almost every prediction-market backtest looks like under the hood. The right side is the loop you actually have to ship.

Same strategy, two execution loops

# what your notebook backtest actually assumes
from dataclasses import dataclass

@dataclass
class Fill:
    price: float
    size: float
    latency_ms: int

def fill(order, market_state):
    # idealized: fill at the quoted mid, full size, instantly
    return Fill(
        price=market_state.mid,
        size=order.size,
        latency_ms=0,
    )

def settle(position, outcome):
    # idealized: settlement at outcome time, no oracle, no dispute
    return position.size * outcome.payout

def pnl_loop(model, history):
    pnl = 0.0
    for tick in history:
        signal = model.predict(tick)
        order = sizing(signal, max_position=float("inf"))  # unbounded size
        f = fill(order, tick.market_state)
        # idealized: settle immediately against the final payout
        pnl += settle(f, tick.outcome) - f.size * f.price
    return pnl

The live loop is not more complicated because the language is harder. It is longer because every assumption in the backtest is a real subsystem in production: order matching, freshness clocks, the cancellation policy, the oracle wait, position sizing against exchange caps. None of that is novel research. All of it has to be built and watched.
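The cap-aware sizer is the simplest of those subsystems to pin down. A minimal sketch, assuming both the exchange cap and your own oracle-dispute budget are expressed as a maximum absolute position per market; the function name and parameters are illustrative:

```python
def size_order(desired, position, exchange_cap, dispute_budget):
    """Clamp desired order size so the post-trade position stays inside
    both the per-market exchange cap and your own oracle-dispute budget."""
    cap = min(exchange_cap, dispute_budget)   # the tighter limit binds
    room = max(0.0, cap - abs(position))      # headroom left in this market
    return min(desired, room)
```

A backtest with `max_position=float("inf")` never calls anything like this, which is exactly why its size assumptions do not survive contact with the venue.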

One quote, end to end

What actually happens between your model emitting a signal and cash hitting your wallet on Polymarket. Five round trips, three timing risks, one place where the entire trade can be undone (oracle dispute). Backtests collapse this to a single arrow.

Live order lifecycle on Polymarket

Participants: model, agent, Polygon, UMA oracle.

  • model → agent: signal, YES at 0.62
  • agent: size against caps + risk budget
  • agent → Polygon: place quote (gas paid)
  • Polygon → agent: ack + book state
  • agent: freshness clock running; cancel on stale source
  • Polygon → agent: fill at 0.61
  • agent: wait on oracle propose
  • UMA oracle: outcome proposed
  • UMA oracle: dispute window passes
  • Polygon → agent: settle USDC

Replace UMA with the Kalshi end-of-period settlement and remove gas, and you have the Kalshi version. The shape is the same: many steps where your backtest had one.

~60%: typical recovery of backtest expected return after a serious infra pass on a Polymarket strategy. The remaining gap is honest, mostly settlement risk and adverse selection that cannot be removed, only priced.

Measured across two recent client engagements, walk-forward over 8 weeks each.

What is missing from your backtest

The closer your simulator gets to this list, the smaller your live-vs-backtest gap. Each item maps to a piece of code that takes hours to days, not weeks. They are also the items that retail quant content systematically skips, because they are mechanism-specific and unsexy.

Build before you trade

  • Quote-freshness clock per data source with a hard kill switch on stale signal
  • Position sizer that respects exchange caps (Kalshi per-contract notional, Polymarket per-market) plus your own oracle-dispute budget
  • Fill simulator that replays your historical quotes against the actual visible book, not the mid
  • News-source watchdog that pauses quoting when realized vol on the underlying spikes above your model's training distribution
  • Oracle-aware PnL accounting that does not mark-to-fantasy on outcomes pending UMA proposal or Kalshi end-of-period
  • Cancel-batching policy on Polygon so gas does not eat the spread on requote-heavy markets
  • Reconciliation loop that compares every live fill to model intent within 60 seconds and alerts on drift
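The oracle-aware PnL item is the one teams most often get wrong, so here is a minimal sketch. The resolution states and the 10 percent dispute haircut are illustrative placeholders, not Polymarket's actual parameters:

```python
from enum import Enum

class Resolution(Enum):
    TRADING = "trading"     # market still open, no proposal yet
    PROPOSED = "proposed"   # UMA proposal live, dispute window open
    FINAL = "final"         # past the dispute window / Kalshi period settled

def mark_position(size, entry_price, market_price, status,
                  payout=None, dispute_haircut=0.10):
    """Oracle-aware mark: only FINAL outcomes count as realized PnL.
    Pending proposals carry a haircut for dispute risk (value illustrative)."""
    if status is Resolution.FINAL:
        return size * (payout - entry_price)      # realized
    if status is Resolution.PROPOSED:
        pending = size * (payout - entry_price)
        return pending * (1 - dispute_haircut)    # do not mark-to-fantasy
    return size * (market_price - entry_price)    # plain mark-to-market
```

The exact haircut matters less than the state machine: a proposed outcome is not a settled one, and your accounting should say so.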

Why a senior engineer beats a quant course on this work

Most paid prediction-market content sells alpha. Find the edge, scale it, retire. The work above is none of that. It is plain production engineering: clocks, kill switches, simulators, reconciliation. The kind of thing a fifteen-year cross-platform engineer who has shipped real-time IoT systems and on-chain blockchain workflows does without thinking about it. The kind of thing a course operator has never built.

I publish my consult rate at $75 and project tiers at $500 to $10K+ on the homepage because the work survives being seen. If you already have a model that works in backtest and you want it live on Polymarket or Kalshi without losing the edge to mechanism friction, the build above fits inside the Custom System tier. If you are still figuring out whether your model has an edge at all, a single $75 consult is enough to tell you which side of that line you are on, and what to do next.

Bring your backtest curve, your live curve, and your last 200 fills

$75 consult. I will tell you in 30 minutes whether the gap is your model, your fill simulator, or your news-flow latency, and roughly what closing it costs.

Frequently asked questions

Why does my AI prediction market backtest beat live by so much?

Four reasons in roughly this order. One: your backtest fills at the quoted mid or last trade, but live you cross a thin single-LP order book and pay 50 to 300 bps of slippage on the kind of size where your edge actually exists. Two: backtests collapse settlement to the moment of the resolving event, but Polymarket waits on a UMA optimistic oracle (proposal liveness plus dispute window, typically a few hours, longer if challenged) and Kalshi resolves at the end of the contract period, not at the news event. Three: position caps. Kalshi limits retail accounts to a notional position size per contract that is small relative to the size your simulator assumed. Four: adverse selection. By the time your model reads the headline, four humans and one faster bot have already moved the price.

How big is the gap, typically?

On a model with a real edge in backtest, the live shortfall in expected return is usually 30 to 80 percent before you do any infrastructure work. About half of that is bid-ask cost plus oracle-delay carry; the other half is adverse selection and position-sizing limits. After serious infra work (latency-aware quoting, news kill switches, inventory caps tuned to the mechanism), I have seen teams recover roughly two thirds of the simulator number on Polymarket-style markets. The remaining gap is honest, not a bug. Resolution risk does not disappear; it can only be priced.

What is the single biggest fix?

A latency-aware quoter that refuses to leave a quote on the book once the underlying news source has moved. Most retail prediction-market bots quote a static spread around their model price and get adversely selected on every news tick. The fix is one engineering primitive: a freshness clock per data source, a hard cancel as soon as any source clock exceeds your stale-quote SLO, and a backoff before the next requote so you do not chase the move. On thin books this single change recovers more expected value than any model improvement of less than five percentage points of accuracy.
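A minimal sketch of that primitive, with illustrative SLO and backoff numbers rather than tuned values; timestamps are passed in explicitly so the logic is testable:

```python
class FreshnessGate:
    """Per-source freshness clock with a hard cancel and post-cancel backoff."""

    def __init__(self, stale_slo_s=2.0, backoff_s=5.0):
        self.stale_slo_s = stale_slo_s    # max tolerated source age
        self.backoff_s = backoff_s        # cool-down before requoting
        self.last_seen = {}               # source name -> last update time
        self.backoff_until = 0.0

    def on_update(self, source, now):
        self.last_seen[source] = now

    def action(self, now):
        """'cancel' on any stale source, 'hold' during backoff, else 'quote'."""
        if any(now - t > self.stale_slo_s for t in self.last_seen.values()):
            self.backoff_until = now + self.backoff_s
            return "cancel"   # never leave a quote resting on stale data
        if now < self.backoff_until:
            return "hold"     # do not chase the move right after a cancel
        return "quote"
```

The gate wraps every quoting decision: the quoter only places or refreshes orders when `action` returns `"quote"`, and cancels everything resting the moment it returns `"cancel"`.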

Does this apply to Kalshi the same way as Polymarket?

Mostly. The mechanism realities differ: Kalshi is a CFTC-regulated exchange with USD funding and centralized matching, Polymarket is on Polygon with USDC and a central-limit order book contract. Kalshi has lower settlement risk because there is no oracle dispute window, but it has lower notional caps, fewer markets, and a narrower set of tradable events. Polymarket has higher notional capacity per market and broader event coverage, but you pay gas, you wait on UMA resolution, and you face dispute risk on ambiguous outcomes. Different friction, same shape: what your backtest assumed away.

Can I just use a tighter spread in backtest?

It is necessary, not sufficient. A tighter spread captures the average cost-to-fill on a representative day, but live execution is not average. The percentile that matters is the right tail: the quote you got hit on right before the resolving news. A model that backtests with a 200 bps round-trip and lives at a 250 bps round-trip on the median may still bleed on the 95th percentile of fills, where your last trade was the worst trade. The fix is to model fills with a quote-aware adverse-selection penalty, not just a fixed spread.
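A sketch of that penalty, assuming you log fill timestamps alongside a feed of news-move timestamps; the window, penalty size, and data layout are invented for illustration:

```python
def fill_cost_bps(spread_bps, fill_ts, news_moves, window_s=5.0,
                  penalty_bps=150.0):
    """Round-trip cost model: base spread plus an adverse-selection penalty
    when the fill landed within window_s seconds BEFORE a news move.
    news_moves is a list of (timestamp, move_size) pairs."""
    picked_off = any(0.0 <= ts - fill_ts <= window_s for ts, _ in news_moves)
    return spread_bps + (penalty_bps if picked_off else 0.0)
```

Run this over your historical fills and the distribution splits in two: ordinary fills near the base spread, and the right-tail fills that a fixed-spread backtest priced exactly wrong.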

What about gas, on Polymarket specifically?

Polygon gas is small per transaction (cents to a few dollars depending on congestion), which sounds negligible. The trap is round trips. A market-making strategy that requotes 200 times a day on a market where you ultimately net 50 cents of edge per round trip will spend more on gas than it earns, even at Polygon prices. Batch your cancels, lean on Polymarket order types that let you reprice without re-signing, and only requote when your model has actually moved more than a hard threshold. This is a 50-line policy change that turns a losing bot into a flat one.
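The core of that policy change can be sketched in a few lines. The gas figure, threshold, and edge numbers below are placeholders, not measured values:

```python
POLYGON_GAS_USD = 0.05   # assumption: per-requote cost, varies with congestion

def should_requote(resting_price, model_price, size,
                   threshold=0.01, edge_per_unit=0.005,
                   gas_usd=POLYGON_GAS_USD):
    """Only pay gas to move a quote when the model has actually moved past a
    hard threshold AND the expected edge on the new quote clears the gas."""
    moved = abs(model_price - resting_price) >= threshold
    clears_gas = edge_per_unit * size > gas_usd
    return moved and clears_gas
```

Everything that returns False here is a requote the naive bot would have paid for and you did not.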

What infrastructure do I actually need to run a live AI prediction market book?

Five things, in order of how much you bleed without them. A quote-freshness clock and stale-quote kill switch (fixes adverse selection). A position-cap-aware sizer that respects exchange limits and your own oracle-risk budget per market (fixes blowup risk on disputed resolutions). A news-source watchdog that pauses quoting on big moves (fixes the right-tail loss). A real fill simulator that replays your historical quotes against the actual book, not the mid (fixes calibration). A reconciliation loop that checks every fill against your model's intent within a minute and screams when they diverge (fixes silent drift). Skip any of these and the rest of the work is decoration.
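The reconciliation loop, the last item on that list, is mostly bookkeeping. A minimal sketch, assuming intents and fills share a client order id and each record carries a timestamp, side, and size (that layout is an invented convention, not any venue's schema):

```python
def reconcile(intents, fills, max_lag_s=60.0):
    """Compare every live fill to the model intent that produced it.
    intents and fills are dicts keyed by a shared client order id.
    Returns a list of (order_id, reason) drift alerts."""
    alerts = []
    for oid, f in fills.items():
        intent = intents.get(oid)
        if intent is None:
            alerts.append((oid, "fill with no recorded intent"))
            continue
        if f["ts"] - intent["ts"] > max_lag_s:
            alerts.append((oid, "reconciled too late"))
        if f["side"] != intent["side"]:
            alerts.append((oid, "side mismatch"))
        if f["size"] > intent["size"]:
            alerts.append((oid, "overfill vs intent"))
    return alerts
```

An empty alert list every minute is the boring green light; any non-empty one is the silent drift that otherwise shows up weeks later as an unexplained PnL gap.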

How do you scope a build like this for a small team?

It depends on whether you already have a working model. If you have a backtested model and need to ship it live, the work is a fixed-scope two to four week build: quoting layer, kill switches, position sizer, fill simulator. That fits the $2,000 to $10,000+ Custom System tier on the c0nsl pricing page. If you are starting from a notebook with no model yet, the engineering work waits behind the model work, and we do a $75 consult first to figure out what you actually have. I do not sell a prediction-market alpha course. I build the boring infrastructure between your model and the venue.

Is there a published benchmark I can compare my numbers to?

Not a clean one. Polymarket publishes per-market volume and there are public dashboards (Polymarket Analytics, Dune queries) that show the realized spread and depth on top markets, but neither tells you the live PnL of an AI quoting strategy because it is not posted. The best honest benchmark is your own walk-forward replay with the fill simulator described above. If your simulator agrees with live within a percentage point on a two-week window, you can trust your future backtests for sizing decisions. If it disagrees by more than three points, your simulator is the bug, not the model.