AI Agent Unit Economics: Cost-per-Correct and the Compounding-Error Tax

Most cost estimates for LLM agents are wrong in the same direction: they multiply the token count of a single model call by a price-per-token and call that the unit cost. That number is off by an order of magnitude or more for any agent that takes multiple steps, because it ignores two things that dominate real spend — the cost of being wrong and the cost of re-sending context on every step of a loop.

This piece gives you two quantities to reason with instead. The first is cost-per-correct-outcome: total tokens spent divided by the number of runs that actually produced an acceptable result. The second is the compounding-error tax: the way per-step success rate collapses over a long trajectory, which is what forces retries and human review and pushes cost-per-correct far above the naive estimate. Neither is novel math, but together they explain why an agent that looks cheap per call can be expensive per result, and they tell you exactly which lever to pull.

I’ll show the mechanism, work the arithmetic with clearly labeled illustrative numbers, anchor on current published pricing and two published efficiency results (both arXiv preprints), and then be specific about where the model stops being trustworthy.

Cost-per-call is the wrong denominator

Take a single Claude Opus 4.8 call: roughly 8,000 input tokens, 1,500 output tokens. At the current published rate — $5 per million input, $25 per million output (verified against Anthropic’s June 2026 catalog) — that’s:

8,000  / 1e6 * $5  = $0.040
1,500  / 1e6 * $25 = $0.0375
                     -------
                     $0.0775 per call

Worth one note here, because it trips up readers anchored on older numbers: Opus-tier output pricing is now $25/MTok, not the $75/MTok that the original Opus 3 launched at in 2024. If your spreadsheet still has $15/$75 for “Opus,” it’s two generations stale and your estimates are ~3x too high. Sonnet 4.6 is $3/$15 and Haiku 4.5 is $1/$5, for reference.

So one call is about eight cents. The problem is that an agent is not one call. It is a loop: read context → call model → execute a tool → feed the result back → call the model again. A “task” might be 5 calls or 50. And the cost denominator you care about is not “per call” but “per task that succeeded,” because a task that fails still burned tokens — often more tokens, because failures tend to thrash before they give up.

Define it explicitly:

cost_per_correct = total_tokens_spent_across_all_runs * blended_price
                   ----------------------------------------------------
                            count_of_acceptable_runs

If your agent succeeds 40% of the time and each attempt (success or failure) costs $8 in tokens, your cost-per-correct is not $8 — it’s $8 / 0.40 = $20, before you’ve paid a human to check which 40% were the good ones. The success rate is in the denominator, so it has leverage the per-call price never will.
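As a minimal sketch of that definition in Python (the function name and the run counts are illustrative, chosen only to reproduce the $8 and 40% figures above):

```python
def cost_per_correct(total_spend_usd: float, acceptable_runs: int) -> float:
    """Total spend across ALL runs (successes and failures) per acceptable run."""
    if acceptable_runs <= 0:
        raise ValueError("no acceptable runs: cost-per-correct is undefined")
    return total_spend_usd / acceptable_runs


# Illustrative figures: 100 attempts at $8 each, of which 40 were acceptable.
attempts = 100
cost_per_attempt_usd = 8.0
acceptable = 40

result = cost_per_correct(attempts * cost_per_attempt_usd, acceptable)
print(f"${result:.2f} per correct outcome")
```

The failed attempts stay in the numerator on purpose: that is the whole difference between this number and a per-call price.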

Why context re-send dominates the bill

Here’s the mechanical reason multi-step agents cost more than the sum of their model calls suggests: the API is stateless, so you re-send the entire conversation on every step. Step 1 sends the system prompt plus the user task. Step 2 sends all of that plus step 1’s tool call and its (often large) result. Step 3 sends everything again plus step 2. The input grows monotonically.

Walk a concrete trajectory. Suppose a fixed prefix — system prompt and tool definitions — of 4,000 tokens, and each step appends a tool call plus result averaging 1,200 tokens. Input tokens processed at each step:

step 1:  4,000
step 2:  5,200
step 3:  6,400
...
step 10: 14,800

Sum of input tokens over 10 steps = 94,000, versus the 14,800 you’d get if you (wrongly) assumed each step processed only its own new content (the 4,000-token prefix once, plus nine 1,200-token appends). That’s roughly a 6x gap between naive and actual input processing for a modest 10-step run, and it widens as steps grow because the re-sent prefix keeps getting longer. This is arithmetic, not a benchmark: it falls straight out of “resend the whole transcript each turn.”
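The same arithmetic as a short Python sketch, so you can swap in your own prefix and step sizes. The helper name is made up for this example, and "unique content" here means one reading of the naive count: the prefix once, plus each appended chunk once.

```python
def input_tokens_per_step(prefix: int, appended_per_step: int, steps: int) -> list[int]:
    """Input tokens processed at each step when the full transcript is re-sent."""
    return [prefix + appended_per_step * k for k in range(steps)]


per_step = input_tokens_per_step(prefix=4_000, appended_per_step=1_200, steps=10)
total_resent = sum(per_step)

# Naive count: the prefix once, plus the nine chunks appended after step 1.
unique_only = 4_000 + 1_200 * (10 - 1)

print("first step:", per_step[0], "last step:", per_step[-1])
print("re-sent total:", total_resent, "unique content:", unique_only)
print("ratio:", round(total_resent / unique_only, 1))
```

Because the per-step series is arithmetic, the total grows with the square of the step count while the unique content grows linearly, which is why the ratio keeps widening.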

The two real levers against this are both well-documented:

Prompt caching. If the growing prefix is cached, the re-sent portion bills at roughly 0.1x the base input rate instead of 1x, with a one-time ~1.25x write premium per cache entry (5-minute TTL). For a loop that re-sends a large stable prefix every step, this is the single highest-leverage change — it attacks the term that grows. The catch is that caching is a strict prefix match: any byte change anywhere in the prefix (a timestamp in the system prompt, a reordered tool list, an unsorted JSON dump) invalidates everything after it and silently drops you back to full price. Verify with cache_read_input_tokens in the usage object; if it’s zero across identical-prefix requests, a silent invalidator is eating your savings.
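To see how much the caching multipliers matter for a growing transcript, here is an idealized cost model in Python. It assumes every previously sent token is a cache hit on the next step and every new token is written to the cache once; real traffic will do worse (TTL expiry, minimum cacheable sizes, silent invalidators). The function name and structure are illustrative, and only the 0.1x / 1.25x multipliers and the $5 rate come from the text.

```python
BASE_INPUT_USD_PER_MTOK = 5.00   # Opus-tier input rate quoted in the text
CACHE_READ_MULT = 0.10           # cache reads bill at roughly 0.1x base
CACHE_WRITE_MULT = 1.25          # 5-minute-TTL cache writes bill at roughly 1.25x base


def loop_input_cost(prefix: int, appended: int, steps: int, cached: bool) -> float:
    """Input cost in USD for a loop that re-sends a growing transcript.

    Idealized caching: each step reads everything sent on earlier steps
    from the cache and writes only the tokens that are new at that step.
    """
    rate = BASE_INPUT_USD_PER_MTOK / 1e6
    total = 0.0
    for k in range(steps):
        context = prefix + appended * k
        if not cached:
            total += context * rate
            continue
        new_tokens = prefix if k == 0 else appended
        read_tokens = context - new_tokens
        total += read_tokens * rate * CACHE_READ_MULT
        total += new_tokens * rate * CACHE_WRITE_MULT
    return total


uncached = loop_input_cost(4_000, 1_200, 10, cached=False)
cached = loop_input_cost(4_000, 1_200, 10, cached=True)
print(f"uncached ${uncached:.3f}  cached ${cached:.3f}")
```

Under these assumptions the 10-step trajectory above drops from about $0.47 of input to about $0.13. Treat that as a best case and compare it against the cache-read counts in your real usage logs.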

Trajectory / context reduction. You don’t have to re-send everything. Old tool results and completed reasoning can be cleared or summarized before the next step. The published result worth citing here is AgentDiet (Wang et al., arXiv:2509.23586, Sept 2025), which automatically removes “waste” from agent trajectories and reports a 39.9%–59.7% reduction in input tokens and 21.1%–35.9% reduction in total computational cost while holding task success constant, across two LLMs and two benchmarks. A related approach, Agentic Plan Caching (Zhou et al., arXiv:2506.14852), reuses structured plan templates across similar tasks and reports ~50% cost reduction with ~27% lower latency. These are real, peer-circulated numbers — but note they’re measured against specific agent harnesses and benchmarks; treat them as evidence the lever works and as a target range, not as a guarantee for your workload.

The compounding-error tax

Now the part that actually moves cost-per-correct: per-step success doesn’t add, it multiplies.

If each step of an agent succeeds independently with probability p, the probability the whole n-step trajectory succeeds is p^n. This is the most important equation in agent economics, and it’s brutal:

p = 0.95 per step
n = 5    →  0.95^5  ≈ 0.77   (77% of runs succeed)
n = 20   →  0.95^20 ≈ 0.36
n = 50   →  0.95^50 ≈ 0.077

A 95%-reliable step, which sounds excellent, drops below a coin flip by about 14 steps (0.95^14 ≈ 0.49) and succeeds less than 8% of the time at 50. Drop to a still-respectable p = 0.90:

n = 5    →  0.59
n = 20   →  0.12
n = 50   →  0.005

At 90% per step, a 50-step agent essentially never completes a full run unaided.
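A small Python sketch reproduces both tables and answers the inverse question: how many steps before a run is no better than a coin flip? It leans on the same independence assumption as the tables, and the function names are my own.

```python
import math


def trajectory_success(p: float, n: int) -> float:
    """Probability that all n steps succeed, assuming independent steps."""
    return p ** n


def steps_until_below(p: float, threshold: float) -> int:
    """Smallest n with p**n <= threshold, for 0 < p < 1 and 0 < threshold < 1."""
    return math.ceil(math.log(threshold) / math.log(p))


for p in (0.95, 0.90):
    row = "  ".join(f"n={n}: {trajectory_success(p, n):.3f}" for n in (5, 20, 50))
    print(f"p={p:.2f}  {row}")
    print(f"        at or below 50% after {steps_until_below(p, 0.5)} steps")
```

At p = 0.95 the crossover is 14 steps; at p = 0.90 it is 7.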

Two honest caveats, because this is where the model is often oversold. First, steps are not independent — a good agent recovers from some errors (re-reads a file, retries a failed command), which raises effective p^n above the naive product; a bad one cascades, where one wrong step poisons the context and lowers every subsequent step’s success, making it worse than the product. The independence assumption is a useful first approximation that’s wrong in both directions depending on the harness. Second, “success” is rarely binary; partial credit muddies the curve. But the qualitative claim is robust and it’s the whole game: long-horizon reliability is dominated by step count, not by the cleverness of any single step.

This is why cost-per-correct explodes for long agents. Plug p^n into the denominator. With per-attempt cost C and a success probability s = p^n, expected cost to get one acceptable result (assuming you retry until success, with each attempt independent and costing C) is C / s. As n grows, s shrinks geometrically, so cost-per-correct grows geometrically even though C itself grows only polynomially in n (per-step input is linear in the step index, so per-attempt input is roughly quadratic). The error curve, not the token curve, is what bankrupts a naive agent.

The architectural implication is the actionable part. To lower cost-per-correct you have three moves, in order of leverage:

Reduce n. Fewer steps is the only thing that fights the exponent. Collapse three sequential tool calls into one composed call (programmatic tool calling, where the model writes a script that invokes tools and only the final result returns to context). Give the agent a higher-level tool instead of making it orchestrate primitives. Every step you remove multiplies through.
Raise p. Better prompts, better tool descriptions, a more capable model on the steps that matter. Note the tradeoff: a more expensive model raises C linearly but raises p, and p compounds. Spending 2x per step to take p from 0.90 to 0.97 is a bargain at n = 20: success goes from 0.12 to 0.54, a 4.5x improvement for 2x the per-step cost, so cost-per-correct falls by a bit more than half (from roughly 8C to under 4C). This is the one place where “use the more expensive model” is correct economics, as long as the gain in p^n outpaces the price multiple.
Add recovery, not just retries. A blind retry of a failed 50-step run pays for 50 steps again. A checkpoint that lets the agent resume from step 40 pays for 10. Recovery changes the effective C in C / s without touching n or p.
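The second and third moves can be compared with a few lines of Python. This is a sketch under two loud simplifications: with blind retries, a failed attempt is charged for all n steps (failure is only detected at the end); with checkpoints, failure is detected at the step where it happens and only that step is redone, which is the idealized limit of recovery. A flat per-step cost is assumed, and the function names are illustrative.

```python
def blind_retry_cost(step_cost: float, p: float, n: int) -> float:
    """Expected spend per success when every failure restarts from step 1.

    C / s with C = n * step_cost and s = p**n.
    """
    return (n * step_cost) / (p ** n)


def step_checkpoint_cost(step_cost: float, p: float, n: int) -> float:
    """Expected spend per success when each failed step is retried in place.

    Each step takes 1/p attempts on average, so the total is n * step_cost / p.
    """
    return (n * step_cost) / p


# Raise p: a model that costs 2x per step but lifts p from 0.90 to 0.97, n = 20.
cheap = blind_retry_cost(step_cost=1.0, p=0.90, n=20)
strong = blind_retry_cost(step_cost=2.0, p=0.97, n=20)
print(f"cheap model: {cheap:.1f}  stronger model: {strong:.1f}")

# Add recovery: same p and n, different retry strategy, n = 50.
print(f"blind retries: {blind_retry_cost(1.0, 0.90, 50):.0f}")
print(f"step checkpoints: {step_checkpoint_cost(1.0, 0.90, 50):.1f}")
```

Real harnesses land somewhere between the two retry models, since checkpoints are coarser than one per step and not every failure is caught where it occurs.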

A worked cost-per-correct, with the assumptions exposed

Let me put it together with numbers, and flag every figure as illustrative-model-input versus measured, because that distinction is exactly where these analyses usually cheat.

Scenario (illustrative assumptions, not measured): a 15-step agent on Opus 4.8, 4,000-token fixed prefix, ~1,200 tokens appended per step, ~600 output tokens per step, per-step success p = 0.93.

Input tokens processed across 15 steps, summing the growing prefix as above:

Σ input ≈ 15*4,000 + 1,200*(0+1+...+14)
        = 60,000 + 1,200*105
        = 60,000 + 126,000
        = 186,000 input tokens
Σ output = 15 * 600 = 9,000 output tokens

Without caching:

186,000 / 1e6 * $5  = $0.93
  9,000 / 1e6 * $25 = $0.225
                      ------
   per attempt C ≈ $1.16

Trajectory success s = 0.93^15 ≈ 0.34. So:

cost_per_correct = $1.16 / 0.34 ≈ $3.41
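The same scenario as a Python sketch, so each assumption is a variable you can change. All inputs are the illustrative figures stated above; nothing here is measured. Without rounding the intermediates, the result comes out near $3.43 rather than $3.41.

```python
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00

# Illustrative scenario from the text, not measured values.
steps = 15
prefix = 4_000        # fixed system prompt + tool definitions
appended = 1_200      # tokens added to the transcript per step
out_per_step = 600    # output tokens per step
p = 0.93              # assumed independent per-step success rate

input_tokens = sum(prefix + appended * k for k in range(steps))
output_tokens = steps * out_per_step

per_attempt = (input_tokens / 1e6 * INPUT_USD_PER_MTOK
               + output_tokens / 1e6 * OUTPUT_USD_PER_MTOK)
success = p ** steps
cpc = per_attempt / success

print(f"input={input_tokens:,} output={output_tokens:,}")
print(f"per attempt ${per_attempt:.3f}, success {success:.3f}, per correct ${cpc:.2f}")
```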

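The naive estimate of one call’s worth, ~$0.08, is off by ~43x. The gap factors into two pieces: ~15x between one call and one full attempt ($0.08 vs $1.16: fifteen calls, each re-sending a growing context, so that 186k input tokens are processed for ~21k tokens of unique content, about 9x), and ~3x from the compounding-error denominator (1 / 0.34).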

Now apply the two levers and watch where the savings come from. Prompt caching on the 4,000-token prefix (and the growing cached portion) realistically cuts the re-sent input to ~0.1x for the cached span. If ~140,000 of the 186,000 input tokens become cache reads, that input term drops from $0.93 to roughly 46,000/1e6*$5 + 140,000/1e6*$0.5 ≈ $0.23 + $0.07 = $0.30, plus a small one-time write premium. Per-attempt C falls to about $0.53. Trajectory reduction in AgentDiet’s measured range (call it a 40% input cut, the low end of their reported 39.9%–59.7%) trims it further. And raising p from 0.93 to 0.96 lifts s from 0.34 to 0.96^15 ≈ 0.54, a 1.6x improvement in the denominator from a small reliability gain.

Stack them and cost-per-correct moves from ~$3.41 toward roughly $0.70–$0.90 — a 4–5x improvement, with the largest single contribution from caching the re-sent prefix and the second largest from raising p, not from switching to a cheaper model. That ordering is the practical takeaway: attack the growing input term and the exponent before you reach for a smaller model that lowers p and quietly raises your cost-per-correct.
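Here is the stacking as a Python sketch, one lever at a time. Two assumptions are mine and worth flagging: the 40% trajectory cut is applied uniformly to cached and uncached input alike, and the cache write premium is ignored. Everything else is the illustrative scenario above.

```python
INPUT_RATE = 5.00 / 1e6        # USD per full-price input token
CACHE_READ_RATE = 0.50 / 1e6   # USD per cache-read input token (0.1x base)
OUTPUT_RATE = 25.00 / 1e6      # USD per output token


def per_attempt_cost(full_price_in: int, cache_read_in: int, output: int,
                     input_cut: float = 0.0) -> float:
    """Per-attempt cost in USD; input_cut is the fraction of input removed."""
    keep = 1.0 - input_cut
    input_cost = keep * (full_price_in * INPUT_RATE + cache_read_in * CACHE_READ_RATE)
    return input_cost + output * OUTPUT_RATE


s_before = 0.93 ** 15
s_after = 0.96 ** 15

baseline = per_attempt_cost(186_000, 0, 9_000) / s_before
cached = per_attempt_cost(46_000, 140_000, 9_000) / s_before
trimmed = per_attempt_cost(46_000, 140_000, 9_000, input_cut=0.40) / s_before
stacked = per_attempt_cost(46_000, 140_000, 9_000, input_cut=0.40) / s_after

for label, value in [("baseline", baseline), ("+ caching", cached),
                     ("+ 40% input cut", trimmed), ("+ p = 0.96", stacked)]:
    print(f"{label:<16} ${value:.2f} per correct")
```

Under these assumptions the stacked figure lands around $0.75, inside the range quoted above. Note that once input is cheap, the output term ($0.225 per attempt) becomes the floor that neither caching nor trimming touches.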

(The dollar figures above are a worked illustration from stated assumptions. The only externally measured inputs are the AgentDiet reduction range and the pricing. Run count_tokens and log usage on your own traffic before trusting any of the per-attempt numbers: your prefix size, step count, and p are what determine the answer, and all three are workload-specific.)

Where this model breaks

Be skeptical of your own cost-per-correct number in these cases:

You can’t define “correct” cheaply. The whole framework assumes you can label a run acceptable/unacceptable. If verification requires a human who costs more than the tokens, the denominator is dominated by review labor, not inference — and you should be modeling review cost, not token cost. The honest version of these analyses adds a human_review_cost term; I’ve left it out of the arithmetic above precisely because it’s so workload-specific that any number I put there would be invented.
Steps are strongly correlated. When errors cascade (one bad step corrupts context for all downstream steps), p^n understates the collapse and your real cost-per-correct is worse than the model predicts. When the agent recovers well, p^n overstates it. Measure the actual full-run success rate; don’t infer it from per-step rates.
Output, not input, dominates. If your agent generates long artifacts (code, reports) rather than orchestrating tools, output tokens at the higher rate can swamp the re-send effect, and caching — which only helps input — won’t move your bill. Check the input/output split before assuming caching is the answer.
Unattended runs with no spend ceiling. An agent in a retry loop with no cap can run up a bill that has nothing to do with per-correct economics and everything to do with a missing guardrail. Set a hard spend ceiling before you optimize anything else: max_tokens only caps the output of a single response, so a looping agent also needs a task-level token budget, enforced by the harness and ideally visible to the model.
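A task-level budget can be as simple as a counter the harness charges after every model call. The sketch below is one hypothetical shape for it: the class and exception names are invented for this example, and the loop is a stand-in for a real agent that would pass in the token counts reported by each response.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a task has spent more tokens than its ceiling allows."""


class TokenBudget:
    def __init__(self, max_total_tokens: int) -> None:
        self.max_total_tokens = max_total_tokens
        self.spent = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one call's usage and stop the task if the ceiling is crossed."""
        self.spent += input_tokens + output_tokens
        if self.spent > self.max_total_tokens:
            raise BudgetExceeded(
                f"spent {self.spent:,} of {self.max_total_tokens:,} tokens"
            )


budget = TokenBudget(max_total_tokens=50_000)
steps_completed = 0
try:
    for step in range(100):  # stand-in for an agent loop with no natural stop
        # In a real harness these counts would come from the response's usage data.
        budget.charge(input_tokens=4_000 + 1_200 * step, output_tokens=600)
        steps_completed += 1
except BudgetExceeded as exc:
    print(f"stopped after {steps_completed} completed steps: {exc}")
```

Because usage is only known after a call returns, this design overshoots the ceiling by at most one call. Leave that much headroom, or estimate the next call’s input before sending it.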

The discipline is simple to state and easy to skip: instrument total tokens and acceptable-run count, compute cost-per-correct from real logs, and identify whether your spend is dominated by the re-sent prefix, the output, or the failure rate. Each diagnosis points at a different fix. The naive per-call estimate points at none of them.
