Why Your AI Bill Rises as Token Prices Fall: Jevons in LLM Ops

Two numbers from 2025 sit in flat contradiction. The Stanford HAI 2025 AI Index (Ch.1, R&D) reports that the per-token cost of inference at a fixed capability level (GPT-3.5-class, 64.8 on MMLU) fell roughly 280-fold in under two years: from $20.00 per million tokens (GPT-3.5, Nov 2022) to $0.07 per million (Gemini-1.5-Flash-8B, Oct 2024). Meanwhile, Menlo Ventures’ 2025 Mid-Year LLM Market Update found enterprise model-API spend more than doubled in six months, from $3.5B to $8.4B, and Menlo’s full-year report puts total enterprise generative-AI spend at ~$11.5B (2024) rising to $37B (2025). Unit price collapsing, total spend climbing. The lines go in opposite directions.

If you operate LLM workloads, you have probably already lived the small version of this. A developer comes back from a long weekend to a $4,200 bill. A 429 rate-limit retry loop quietly accumulates 847,000 API calls and ~$3,847 in charges before a 402 Payment Required stops it (SupraWall). A Claude Max user racks up $1,800 in two days from overnight Claude Code runs. In every case the agent did exactly what it was told — retry until success, reason until done — with no hard stop. The sticker price had nothing to do with it.

This is not a billing glitch or a phase you wait out. It is structural, and it has a name.

The economics: Jevons, properly explained

In 1865, W. S. Jevons published The Coal Question and observed something counterintuitive: as steam engines became more efficient at turning coal into work, Britain burned more coal, not less. Efficiency lowered the effective price of mechanical work, which expanded the range of things worth doing with it, which grew total demand faster than efficiency reduced per-unit consumption. The improvement caused the consumption.

Total spend is just price × volume. Jevons’ paradox is the claim that when price falls and demand is elastic (price elasticity above 1), volume rises faster than price drops, so the product goes up. The paradox only bites under that elasticity condition: if demand were fixed, cheaper tokens would simply lower your bill. For LLMs in 2025–2026, demand is not fixed. It is wildly elastic, and the mechanism is more interesting than “people just use more.”
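
The elasticity condition can be made concrete with a constant-elasticity demand curve. This is a textbook simplification, not a fitted model of LLM demand; the function name and the elasticity values are illustrative assumptions.

```python
def spend_multiplier(price_drop_factor: float, elasticity: float) -> float:
    """Ratio of new total spend to old total spend.

    Assumes constant-elasticity demand: volume scales as price ** (-elasticity).
    Spend = price * volume, so spend scales as price ** (1 - elasticity).
    """
    if price_drop_factor <= 0:
        raise ValueError("price_drop_factor must be positive")
    price_ratio = 1.0 / price_drop_factor  # new price / old price
    return price_ratio ** (1.0 - elasticity)


# Hypothetical elasticities applied to the same 10x price drop.
for elasticity in (0.5, 1.0, 1.5):
    print(elasticity, round(spend_multiplier(10, elasticity), 2))
```

Under this assumption, an elasticity below 1 shrinks the bill, exactly 1 leaves it flat, and anything above 1 grows it even though every token got cheaper.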

a16z’s Guido Appenzeller named the supply side “LLMflation”: for a model of equivalent performance, inference cost drops ~10x per year, ~1000x over three years — faster than Moore’s Law. A GPT-3-quality model went from ~$60/M tokens (2021) to ~$0.06/M (2024). The non-obvious part is the demand side. Cheaper tokens don’t just mean more of the same workload; they endogenously change how you build. At $20/M you write a single tight prompt and ship the first answer. At $0.07/M you can afford to let a model think for 10,000 hidden tokens, re-read a 100K-token context on every step, and run fifteen agents in parallel — so you do. The price drop is what unlocks the deep-reasoning, large-context, multi-agent architecture that then consumes the tokens. The consumption pattern that raises your bill is caused by the price drop that was supposed to lower it.

This is why Communications of the ACM frames tokens as a utility we keep treating like software. Software has near-zero marginal cost; you provision it once. Tokens are metered consumption, like electricity, and every architectural decision is a thermostat setting. Menlo’s data shows the shift is already done: 74% of builders say most of their workloads are now inference (up from 48% a year prior), and 49% of large enterprises say most or all of their compute is inference (up from 29%).

The real cost drivers, ranked and quantified

“Agentic loops are expensive” is true and useless. Here is the mechanism, because the math compounds in ways the sticker price never shows.

1. Agentic loops and quadratic context

The structural killer is that LLM APIs are stateless and re-bill the entire conversation history on every call. An agent does not pay for “the new tokens this step.” On step N it pays for the system prompt, the tool schemas, every prior tool result, and the running transcript — all of it, again. An N-step agent that accumulates tool output therefore pays roughly O(N²) in cumulative input tokens. By iteration 15, a context can be ~80K tokens of accumulated state, re-sent on every subsequent call (Augment Code).
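
A minimal sketch of that arithmetic, assuming the context grows by a constant amount each step. Real agents grow unevenly, and the 5,000-token figures are hypothetical values chosen so that step 15 lands near the ~80K context mentioned above.

```python
def context_at_step(fixed_overhead: int, growth_per_step: int, step: int) -> int:
    """Input tokens sent on a given 1-based step when full history is re-sent."""
    return fixed_overhead + (step - 1) * growth_per_step


def cumulative_input_tokens(fixed_overhead: int, growth_per_step: int, steps: int) -> int:
    """Total input tokens billed over the whole run (sum over every step)."""
    return sum(
        context_at_step(fixed_overhead, growth_per_step, step)
        for step in range(1, steps + 1)
    )


# Hypothetical agent: 5,000 tokens of system prompt and tool schemas,
# plus 5,000 tokens of new tool output appended per step.
print(context_at_step(5_000, 5_000, 15))          # 75000
print(cumulative_input_tokens(5_000, 5_000, 15))  # 600000
print(cumulative_input_tokens(5_000, 5_000, 30))  # 2325000
```

Doubling the step count from 15 to 30 nearly quadruples the cumulative input bill, which is the quadratic behavior in practice.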

The multipliers are measured, not guessed. Anthropic’s own engineering data (via Unblocked) shows a single agent uses ~4x the tokens of a chat interaction; a multi-agent system uses ~15x. A 10-cycle Reflexion-style reflection loop can consume ~50x a single linear pass. Gartner’s cited range (Mar 2026, via oplexa) is that agentic patterns consume 5–30x the tokens of a standard chatbot per task. A rough hierarchy to budget against: plain chatbot = 1x; RAG query = 3–5x; single-step agent = 5–10x; multi-step loop = 10–20x; always-on monitoring agents, unbounded.

The same Anthropic research found token usage alone explains ~80% of the performance variance in their multi-agent system. Read that plainly: you are paying for quality in tokens. The expensive runs are often the good runs, which is exactly why naive cost-cutting degrades output and why you need outcome-based metrics rather than raw token caps.

2. Reasoning tokens — the invisible markup

The most decision-changing fact most pages omit: reasoning models bill “thinking” tokens as output, but never show them in the response. A simple query might burn a few hundred; complex planning burns 10,000+. A reply with 500 visible tokens can bill 2,000–5,000 total (EG3). Rule of thumb: multiply expected output by 3–5x for reasoning models.

This inverts pricing-table comparisons. o3 at $2/$8 per million in/out looks cheaper than a GPT-5.x at $2.50/$5 — until you add 3,000–10,000 invisible thinking tokens per request, which makes it effectively 3–10x more expensive per task than the sticker implies (TokenMix). The cheaper-looking model is frequently the costlier one. If you select models off the published input/output rates without modeling reasoning overhead, your forecast is wrong before you ship a line of code.
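
To model that overhead before choosing a model, a rough per-request estimate might look like the sketch below. The function is hypothetical, and the token counts are illustrative values picked from the ranges quoted above; only the $2/$8 rates come from the text.

```python
def request_cost(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Dollar cost of one request; hidden reasoning tokens bill at the output rate."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (
        input_tokens * input_price_per_m + billed_output * output_price_per_m
    ) / 1_000_000


# Illustrative request: 1,000 input tokens, 500 visible output tokens.
sticker = request_cost(1_000, 500, 0, 2.00, 8.00)      # what the price table implies
actual = request_cost(1_000, 500, 5_000, 2.00, 8.00)   # with 5,000 hidden thinking tokens
print(f"sticker ${sticker:.3f}, actual ${actual:.3f}, ratio {actual / sticker:.1f}x")
```

With these assumed token counts the request costs several times what the published rates suggest, so a forecast should carry a reasoning-token term for every reasoning model under comparison.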

3. Context inflation you pay before the user types

In agentic setups, the tool/function-call definitions alone can consume 30–60% of context on every call (Edwin Lisowski). RAG inflates context another 3–5x. These are fixed per-call overheads — and per the quadratic point above, a fixed overhead multiplied by every loop iteration is not fixed at all. Twenty tools you registered “just in case” are a tax on every step of every run.

4. Always-on agents

Monitoring and background agents that run 24/7 have no user-session boundary to cap them. They are the highest-consumption pattern precisely because nobody is watching the meter in real time. The weekend-bill anecdotes all fall into this category.

Why your estimate is always wrong: cost is a distribution

Here is the failure mode behind most blown budgets. Cost per agentic task is not a number; it is a heavy-tailed distribution. On SWE-bench Verified, identical coding tasks varied by up to 30x in total tokens across runs, and a full agentic run consumed ~1000x the tokens of an ordinary code-completion chat (Unblocked). Same task, same prompt, 30x spread, because the agent’s path is non-deterministic — it backtracks, re-reads, second-guesses.

If you budget on the mean, you are guaranteed to blow P95 and P99. A workflow averaging $0.40/run with a P99 of $12/run will look healthy on the dashboard and bankrupt you on the tail. Budget and alert on percentiles, not averages. Track the P95/P99 of cost-per-run per workflow, and treat the tail as the real number — because under load, the tail is where your money goes.
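
A short sketch of why the mean misleads, using a nearest-rank percentile over per-run costs. The cost sample is synthetic and exists only to show the shape of the problem.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of runs at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic sample of 100 runs: most are cheap, a few are very expensive.
run_costs = [0.20] * 90 + [1.00] * 8 + [12.00] * 2

print(f"mean ${mean(run_costs):.2f}")
for pct in (50, 95, 99):
    print(f"P{pct} ${percentile(run_costs, pct):.2f}")
```

In this sample the mean sits at about $0.50 while P99 is $12.00, so an alert keyed to the mean would never fire on the runs that dominate the invoice.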

Instrument it: measure cost-per-outcome, not cost-per-token

Per-token price is the wrong denominator. The business does not buy tokens; it buys accepted outcomes — a merged PR, a resolved ticket, a correct extraction. The metric that matters, from InfoWorld’s FinOps for agents, is CAPO — Cost per Accepted Outcome:

CAPO = (total cost of ALL runs, successful + failed) / (count of ACCEPTED outcomes only)

The numerator includes failures, retries, and abandoned runs — every token you paid for. The denominator counts only outcomes a human or downstream system actually accepted. This is brutal and correct: a workflow that succeeds 40% of the time at $0.50/run has a CAPO of $1.25, not $0.50, and a retry-storm workflow can have a CAPO an order of magnitude above its per-run cost. Track CAPO at median, P95, and P99.

Pair it with Failure Cost Share:

Failure Cost Share = (cost of failed/rejected runs) / (total cost)

This single ratio tells you where the money bleeds. High failure-cost-share means you are paying mostly for runs nobody used — retry storms, expensive timeouts, low acceptance — and the fix is reliability, not a cheaper model.

To compute either, instrument every run: tag it with tenant, workflow, model, and outcome_state (accepted / rejected / errored / timed-out); count tokens authoritatively from the API response usage field (including reasoning tokens; do not infer them from output length); and attribute cost per feature. Most teams cannot do this today: the FinOps Foundation’s 2026 data (via optimumpartners) shows only ~51% of orgs can even measure AI ROI, 73% exceeded their original AI cost projections, and 98% of FinOps practitioners now manage AI spend (up from 31% in 2024). The measurement gap is the cost problem.
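
Both ratios fall out of a list of tagged run records. The sketch below assumes a hypothetical RunRecord shape built from the tags named above, with cost already computed per run; it is not InfoWorld's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One instrumented run (hypothetical record shape)."""
    tenant: str
    workflow: str
    model: str
    outcome_state: str  # "accepted", "rejected", "errored", or "timed_out"
    cost_usd: float


def capo(runs: list[RunRecord]) -> float:
    """Cost per Accepted Outcome: cost of ALL runs / count of accepted runs."""
    total_cost = sum(run.cost_usd for run in runs)
    accepted = sum(1 for run in runs if run.outcome_state == "accepted")
    if accepted == 0:
        return float("inf")  # money spent, nothing accepted
    return total_cost / accepted


def failure_cost_share(runs: list[RunRecord]) -> float:
    """Fraction of total cost spent on runs that were not accepted."""
    total_cost = sum(run.cost_usd for run in runs)
    if total_cost == 0:
        return 0.0
    failed_cost = sum(run.cost_usd for run in runs if run.outcome_state != "accepted")
    return failed_cost / total_cost


# The 40%-success example from above: ten runs at $0.50, four accepted.
runs = [RunRecord("acme", "triage", "small", "accepted", 0.50)] * 4 + [
    RunRecord("acme", "triage", "small", "rejected", 0.50)
] * 6
print(capo(runs))                # 1.25
print(failure_cost_share(runs))  # 0.6
```

Grouping the same records by workflow or tenant before calling these functions gives the per-feature attribution described above.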

Cap it: five gateway-enforced guardrails

Caps belong at the gateway — the proxy every model call passes through — not scattered in application code where one forgotten retry loop reintroduces the failure. InfoWorld’s canonical checklist:

Loop / step limit. Cap planning, reflection, and verification cycles per run. When the cap is hit, escalate or fail loudly — do not silently continue. This is the single guardrail that would have stopped every weekend-runaway anecdote.
Tool-call cap. Cap total paid actions per run, with stricter sub-caps for expensive tools (web search, long-running automations, anything that itself calls a model). A run should not be able to fire 200 searches because the planner got stuck.
Per-run token budget. A hard token ceiling per run. Critically, summarize history instead of re-sending the full transcript — this directly attacks the O(N²) quadratic by flattening accumulated context back to a bounded summary.
Wall-clock timeout. Long work goes to a background job with its own budget; a synchronous request should never be able to run for twenty minutes accumulating tokens.
Per-tenant budgets + concurrency limits. Per-tenant ceilings with FinOps anomaly alerts cap the blast radius. One tenant’s runaway loop must not drain the shared budget.
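
The first four guardrails can be sketched as one per-run budget object that a gateway checks on every call. The class and method names are hypothetical, the limits are illustrative, and per-tenant ceilings and concurrency limits would sit in a separate layer not shown here.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a run crosses one of its hard limits."""


class RunBudget:
    """Tracks one run against step, tool-call, token, and wall-clock limits."""

    def __init__(self, max_steps, max_tool_calls, max_tokens, max_seconds,
                 clock=time.monotonic):
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.steps = 0
        self.tool_calls = 0
        self.tokens = 0

    def _enforce(self):
        # Fail loudly instead of silently continuing past a cap.
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool-call cap {self.max_tool_calls} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget {self.max_tokens} exceeded")
        if self._clock() - self._started > self.max_seconds:
            raise BudgetExceeded(f"wall-clock limit {self.max_seconds}s exceeded")

    def record_step(self, tokens_used):
        """Call once per model call with the token count from the usage field."""
        self.steps += 1
        self.tokens += tokens_used
        self._enforce()

    def record_tool_call(self):
        """Call once per paid tool invocation."""
        self.tool_calls += 1
        self._enforce()


budget = RunBudget(max_steps=3, max_tool_calls=10, max_tokens=50_000, max_seconds=120)
try:
    for _ in range(5):  # a planner that wants five steps gets three
        budget.record_step(tokens_used=4_000)
except BudgetExceeded as exc:
    print("stopped:", exc)
```

Raising an exception rather than returning a flag is a deliberate choice in this sketch: a forgotten return-value check cannot let the loop keep spending.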

On top of these, build explicit hard-stop patterns: a tool wrapper that raises after N calls; a circuit breaker on 402/429 responses that kills the loop rather than backing off and retrying forever. The 847,000-call incident happened because the retry logic treated 429 as “try again” with no global counter. A circuit breaker that trips after, say, 50 consecutive rate-limit errors turns a $3,847 weekend into a $5 alert.
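
One way such a breaker could look, as a sketch: it counts consecutive 402/429 responses and refuses all further calls once the threshold is reached. The class name is hypothetical, and counting 402 and 429 against the same threshold is an assumption; a real gateway might stop on the first 402.

```python
class CircuitOpen(RuntimeError):
    """Raised once the breaker has tripped; the caller must stop the loop."""


class RateLimitBreaker:
    """Trips after too many consecutive 402/429 responses and stays open."""

    TRIP_STATUSES = frozenset({402, 429})

    def __init__(self, max_consecutive: int = 50):
        self.max_consecutive = max_consecutive
        self.consecutive = 0
        self.is_open = False

    def record(self, status_code: int) -> None:
        """Feed every HTTP status from the model API through this method."""
        if self.is_open:
            raise CircuitOpen("breaker already open")
        if status_code in self.TRIP_STATUSES:
            self.consecutive += 1
        else:
            self.consecutive = 0  # any other response resets the streak
        if self.consecutive >= self.max_consecutive:
            self.is_open = True
            raise CircuitOpen(
                f"{self.consecutive} consecutive rate-limit or payment errors"
            )


breaker = RateLimitBreaker(max_consecutive=3)
calls_made = 0
try:
    while True:  # stand-in for a naive retry loop that always gets 429
        calls_made += 1
        breaker.record(429)
except CircuitOpen as exc:
    print(f"stopped after {calls_made} calls: {exc}")
```

The breaker is global to the run, so it plays the role of the counter the retry logic in that incident was missing.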

Architectural levers that actually move the bill

Caps prevent disasters; architecture lowers the steady-state. The measured levers (oplexa, Unblocked):

Model routing to the smallest capable model: 60–80% reduction. Tiered deployments cost ~87% less than single-model ones: $2.31 vs $18.40 per million effective tokens. Enterprises ran ~4.7 models on average in 2026 (up from 2.1 in 2024) precisely to capture this. Most requests do not need the frontier model; route by difficulty.
Context trimming: 20–40%. A controlled curated-context test cut tokens 42% and tool calls 64% versus unbounded context. Stop shipping every tool and every retrieved chunk on every call.
Semantic + prompt caching: 30–50% fewer effective calls — but only when the math works. Anthropic bills cache reads at 0.10x normal input (90% off) and cache writes at 1.25x; OpenAI’s GPT-5.x dropped cached input to $0.50/M vs $5/M, with automatic caching above a 1,024-token stable prefix and no separate write fee (Finout, ofox). The break-even on Anthropic is concrete: a cache write costs 0.25x extra (1.25x − 1.00x); each read saves 0.90x (1.00x − 0.10x). So caching pays off once your stable prefix is reused more than ~0.28 times beyond the write — effectively, the second hit on the prefix already wins. It loses only when prefixes are unique per call (no reuse) — then you pay the 1.25x write surcharge for nothing. Cache the system prompt + tool schemas + stable RAG context; never cache the volatile tail.
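
The Anthropic break-even above can be checked in a few lines. The sketch uses the 1.25x write and 0.10x read multipliers quoted in the text and assumes the cached prefix stays alive between calls; cache expiry is not modeled.

```python
def prefix_cost(calls: int, cached: bool,
                write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of sending the same prefix `calls` times, in units of one uncached send."""
    if calls <= 0:
        return 0.0
    if not cached:
        return float(calls)
    # First call writes the cache; every later call reads it.
    return write_mult + (calls - 1) * read_mult


def break_even_reads(write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cache reads needed to recover the write surcharge."""
    return (write_mult - 1.0) / (1.0 - read_mult)


print(round(break_even_reads(), 2))  # 0.28
for calls in (1, 2, 10):
    print(calls, prefix_cost(calls, cached=False), round(prefix_cost(calls, cached=True), 2))
```

A prefix sent once costs more with caching (1.25 vs 1.0), while a prefix sent twice already costs less (1.35 vs 2.0), which matches the second-hit rule.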

A planner/executor split — a cheap model decides what to do, an expensive model does only the hard step — stacks routing and context-trimming gains together and is usually the highest-leverage refactor for a loop that has gotten fat.

The strategic takeaway: don’t wait it out

The instinct is to wait for prices to fall further. They will, and your bill will rise anyway: IDC forecasts ~10x growth in agent usage and ~1000x growth in inference demand by 2027, and Goldman projects token demand rising ~24x if agentic patterns persist. That demand growth is the Jevons mechanism, not a counterweight to it. The same report stack that promises cheaper tokens promises more of them than any price decline can offset.

Meanwhile Gartner predicts more than 40% of agentic AI projects will be canceled by end of 2027, largely on escalating cost and unclear value. The projects that survive will be the ones that stopped managing the wrong variable. Per-token price is set by your vendor and falling on its own; you cannot manage it and you do not need to. What you manage is tokens-per-outcome and the loop. Cap the cycles, summarize the history, route to the small model, track CAPO at P99, and put a circuit breaker between your agent and your invoice.

Numbers to distrust. The viral “$1.2M → $7M average enterprise AI budget” figure circulates widely with no primary source — it appears to be a blog meme, not a report; do not cite it. Be wary of round “$20 → $0.40” or “1,000x” claims given without attribution. Anchor to primaries: Stanford AI Index ($20.00 → $0.07, with dates), a16z LLMflation (10x/yr), Menlo ($3.5B → $8.4B; $11.5B → $37B). When a number has no named source and date, treat it as decoration.