Open-Weight vs Closed Models in 2026: When Self-Hosting Actually Pays Off
The question “should we run an open-weight model ourselves or call a closed API?” gets answered badly because people compare the wrong numbers. They put the per-token API price next to the per-token cost of a saturated GPU and conclude self-hosting is 5-10x cheaper. That comparison holds only at a utilization you almost never hit in production. The honest answer depends on three things you can actually measure: your sustained token throughput, the GPU utilization you can realistically maintain, and whether a license or regulation forces your hand regardless of cost.
This piece gives you a break-even model you can plug your own numbers into, the specific failure modes that blow up the self-hosting case, and the legal constraints that override the economics entirely. I’m going to be explicit about what’s a hard number from a primary source and what’s an estimate, because most writing on this topic launders blog-table guesses into false precision. Where I say “check the live pricing page,” it’s because the number genuinely churns and any figure I print will be stale within a quarter.
The comparison everyone gets wrong
A closed API charges you per token, metered, with zero idle cost. A self-hosted deployment charges you for GPU-hours whether or not a request is in flight. So the unit you must compare is not $/token-at-full-load versus $/token-API. It’s total monthly GPU spend versus total monthly API spend at your actual traffic shape.
The cost of a self-hosted token is:
cost_per_token = gpu_hourly_cost / (throughput_tokens_per_sec * 3600 * utilization)
Two of those terms are where people lie to themselves. throughput_tokens_per_sec is the aggregate tokens/sec across all concurrent requests under continuous batching, not single-stream decode speed — and utilization is the fraction of wall-clock time the GPU is actually doing useful decode, averaged over the billing period including nights, weekends, and traffic troughs.
Plug in a concrete, checkable example. An 8×H100 node from a mid-tier GPU cloud runs roughly $20-30/hr as of mid-2026 (on-demand list rates from providers like Lambda, RunPod, and Crusoe sit in this band; verify current rates, they move with supply). Take $24/hr. A 70B-class dense model in FP8 on that node, served with vLLM continuous batching, sustains on the order of 2,000-4,000 aggregate output tokens/sec at healthy batch sizes — the spread is real and depends on sequence length, quantization, and how full your batches run (vLLM’s own throughput benchmarks and the PagedAttention paper, Kwon et al., SOSP 2023, document the batching mechanism behind these numbers). Take 3,000 tok/s.
At 100% utilization:
$24 / (3000 * 3600 tokens) = $24 per 10.8M tokens = $2.22 per million output tokens
That looks devastating against a frontier closed API priced anywhere from several dollars to tens of dollars per million output tokens. But 100% utilization is a fiction. Your actual cost is that number divided by your real utilization:
50% util -> $4.44 / M tokens
20% util -> $11.11 / M tokens
10% util -> $22.22 / M tokens
This is the whole game. The self-hosting advantage is entirely a function of utilization, and utilization is the variable teams are worst at estimating.
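To make the formula easy to rerun on your own numbers, here is a minimal sketch in Python. The function name and argument names are mine, not from any library; the inputs in the loop are the $24/hr and 3,000 tok/s example figures from above, which are estimates, not measurements.

```python
def cost_per_million_tokens(gpu_hourly_cost: float,
                            throughput_tok_s: float,
                            utilization: float) -> float:
    """Dollars per million output tokens for a node billed by the hour.

    throughput_tok_s is aggregate tokens/sec under batching at full load;
    utilization is the average fraction of that throughput you actually serve.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = throughput_tok_s * 3600 * utilization
    return gpu_hourly_cost / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Example node from the text: $24/hr, 3,000 aggregate tok/s (both assumed).
    for util in (1.0, 0.5, 0.2, 0.1):
        cost = cost_per_million_tokens(24.0, 3000.0, util)
        print(f"{util:>4.0%} util -> ${cost:.2f} / M tokens")
```

Swapping in your own hourly rate and a throughput figure measured under your real sequence lengths matters more than anything else in this calculation.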
The utilization trap, stated precisely
The naive version of this argument — “an idle GPU still costs money, so any non-saturated GPU pays a 10x penalty” — is wrong, and it’s worth being precise about why, because the correction is also the solution.
Continuous batching (vLLM’s scheduler on top of PagedAttention, TGI’s equivalent, TensorRT-LLM’s in-flight batching) means a GPU serving a steady trickle of requests doesn’t sit at 10% just because each individual stream is short. As long as requests arrive fast enough to keep a batch populated, the GPU stays busy interleaving them. So a single H100 at, say, 30 requests/sec of short completions can run near-saturated even though no single request saturates it. PagedAttention, which manages KV-cache memory in pages so more sequences fit in a batch, exists precisely to make this packing efficient.
The 10x penalty bites in three specific situations, not generically:
- Aggregate QPS too low to fill a batch. If your traffic is genuinely 2 requests/minute, no batching trick helps — there’s nothing to batch. The GPU idles between requests and you pay for the idle.
- Latency SLOs that forbid batching. If you must answer in under, say, 200ms time-to-first-token, you can’t wait to accumulate a batch, and you can’t let batch size grow to the point where queueing delay violates the SLO. Tight tail-latency requirements force small batches, which forces low utilization. This is the trap that catches interactive products.
- Spiky diurnal traffic on fixed hardware. Provision for the peak and you idle through the trough; provision for the mean and you drop or queue requests at peak. A 4:1 peak-to-trough ratio (ordinary for a B2B product on business hours) means your average utilization is a fraction of peak even if peak is saturated.
The correct response to low or spiky load is not to eat the 10x — it’s to not own the GPU. Right-size to a smaller card, use a serverless/per-token GPU offering (Modal, Baseten, RunPod serverless, or the per-token open-model endpoints from Together, Fireworks, DeepInfra), or just use a closed API. Owning a fixed GPU only makes sense once your load is high enough and steady enough that you’d keep it busy. That threshold is the real break-even.
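The diurnal case is easy to quantify. The sketch below assumes a hypothetical B2B day (10 hours at peak demand, 14 hours at a quarter of peak, i.e. the 4:1 ratio mentioned above) and treats demand above capacity as simply not served. Both the profile and the function names are illustrative assumptions, not measurements.

```python
def average_utilization(hourly_demand: list[float], capacity: float) -> float:
    """Mean fraction of provisioned capacity in use across the profile.

    Demand and capacity share one unit (e.g. tok/s). Demand above capacity
    is clipped, modelling requests that get dropped or queued.
    """
    served = [min(d, capacity) for d in hourly_demand]
    return sum(served) / (capacity * len(hourly_demand))


def unserved_fraction(hourly_demand: list[float], capacity: float) -> float:
    """Share of total demand that exceeds capacity."""
    overflow = sum(max(d - capacity, 0.0) for d in hourly_demand)
    return overflow / sum(hourly_demand)


# Hypothetical day: 10 business hours at peak (4 units), 14 hours at 1 unit.
profile = [4.0] * 10 + [1.0] * 14
mean_demand = sum(profile) / len(profile)

if __name__ == "__main__":
    # Provisioned for the peak: nothing dropped, but utilization is 0.5625.
    print(average_utilization(profile, capacity=4.0))
    # Provisioned for the mean: better utilization, a large share unserved.
    print(average_utilization(profile, capacity=mean_demand))
    print(unserved_fraction(profile, capacity=mean_demand))
```

Under these assumptions, a node that is fully saturated at peak still averages only about 56% utilization, before any latency-driven limits on batch size.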
One break-even model, reconciled
Earlier writing on this — including an earlier version of this article — floated three different thresholds as if interchangeable: “15-20M tokens/day,” “~16M on H100,” and “$20K/month.” Those are different units and they don’t automatically agree. Here’s the single model that connects them, with every assumption stated.
Assume you’ve decided to own hardware and you want it busy enough to beat the API. Take the 8×H100 node above: $24/hr → $17,280/month at 100% uptime. Say you can realistically hold 40% average utilization after accounting for diurnal troughs (optimistic but achievable for a steady B2B workload with off-peak batch jobs filling gaps).
Effective sustained throughput: 3000 tok/s × 0.40 = 1200 tok/s → about 103.7M output tokens/day, or ~3.1B/month.
Self-hosted cost at that utilization: $17,280 / 3,110M = $5.56 per million output tokens.
Now the break-even against an API depends entirely on the API’s price, which is why a single token/day cutoff is meaningless without naming the comparison price. Self-hosting wins only if your blended API rate exceeds ~$5.56/M output tokens at the quality tier you actually need. Against a frontier flagship at $10-15/M output, you win comfortably. Against a cheap open-model API endpoint already selling the same weights at $2-4/M output, you lose — someone else is running that GPU at higher utilization than you will, and reselling the margin.
That last point is the one most “build vs buy” analyses miss: your real competitor isn’t the frontier closed API, it’s the commodity open-model endpoint serving the exact model you’d self-host, at a utilization you can’t match. Self-hosting has to beat that, plus it has to justify the engineering and on-call cost of running inference infra.
So the honest cutoffs, all derived from the same model:
- The $/month figure ($17K for this node) is just the hardware bill — necessary, not sufficient.
- The tokens/day figure (~100M/day at 40% util on this node) is the volume you must actually serve to hit the $5.56/M cost. Serve half that volume and your cost per token doubles.
- The build-vs-rent decision is: do you have a steady workload large enough to keep an owned node busy and a per-token cost target that beats both the frontier API and the commodity open endpoint?
If your usage is well under tens of millions of tokens/day of steady traffic, the arithmetic says rent per-token. If you’re sustaining hundreds of millions of tokens/day of predictable load, owning starts winning — and the win grows with scale because the fixed engineering cost amortizes. The “memorize one number” instinct is the bug; the model is three lines of arithmetic and you should run it on your own traffic.
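The same arithmetic can be inverted to ask a more useful question: given a comparison price, what utilization do I need to hold? A rough sketch, using the example node's assumed $24/hr and 3,000 tok/s; the function names are mine.

```python
def breakeven_utilization(gpu_hourly_cost: float,
                          peak_throughput_tok_s: float,
                          api_price_per_million: float) -> float:
    """Utilization at which self-hosted cost equals the API price.

    A result above 1.0 means the node cannot beat that price even saturated.
    """
    full_load_cost = gpu_hourly_cost / (peak_throughput_tok_s * 3600) * 1_000_000
    return full_load_cost / api_price_per_million


def tokens_per_day(peak_throughput_tok_s: float, utilization: float) -> float:
    """Output tokens served per day at a given average utilization."""
    return peak_throughput_tok_s * utilization * 86_400


if __name__ == "__main__":
    # Comparison prices are illustrative points inside the ranges in the text.
    for label, price in (("frontier flagship", 12.0), ("commodity endpoint", 3.0)):
        util = breakeven_utilization(24.0, 3000.0, price)
        volume = tokens_per_day(3000.0, util)
        print(f"{label}: need {util:.0%} util, about {volume / 1e6:.0f}M tokens/day")
```

With these inputs the required utilization is under 20% against a $12/M flagship and about 74% against a $3/M commodity endpoint, which is the gap the section describes. Neither figure includes engineering or on-call cost.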
Caching cuts both ways
A common move is to claim closed APIs win on prompt caching — providers like Anthropic and OpenAI discount cached input tokens steeply (Anthropic advertises up to a 90% discount on cache reads; see Anthropic’s prompt-caching documentation). For RAG and long-system-prompt workloads where the same prefix repeats, that’s a real and large saving that tilts the math toward closed.
But it’s not a clean inversion, and it’s dishonest to present it as one. Cache writes cost a premium over the base input rate (Anthropic prices a 5-minute cache write above standard input tokens), so you only come out ahead if the cached prefix is reused enough times before it expires to amortize the write premium. Default TTLs are short (on the order of minutes unless you pay for extended caching), so bursty or low-frequency traffic keeps paying write premiums and re-warming the cache, eroding or erasing the discount. And self-hosted vLLM has had automatic prefix caching for a while — you get the same prefix-reuse benefit locally without per-write fees. Caching helps whoever has reuse density; it’s not inherently a closed-model advantage.
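The reuse-density point can be put in numbers with a simple expected-cost model. This sketch assumes every request for a given prefix is either a cache hit (billed at the read rate) or a miss that rewrites the cache (billed at the write rate). The default multipliers, 1.25x for a write and 0.10x for a read, are my assumptions about Anthropic's 5-minute tier at the time of writing; check the live pricing page before relying on them.

```python
def cached_cost_ratio(hit_rate: float,
                      write_multiplier: float = 1.25,  # assumed write premium
                      read_multiplier: float = 0.10    # assumed 90% read discount
                      ) -> float:
    """Expected cost of the cached prefix relative to uncached input (1.0).

    hit_rate is the fraction of requests arriving while the cache is warm.
    Every miss pays the write premium to re-warm the cache.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * read_multiplier + (1.0 - hit_rate) * write_multiplier


def breakeven_hit_rate(write_multiplier: float = 1.25,
                       read_multiplier: float = 0.10) -> float:
    """Hit rate above which caching is cheaper than not caching at all."""
    return (write_multiplier - 1.0) / (write_multiplier - read_multiplier)


if __name__ == "__main__":
    print(f"break-even hit rate: {breakeven_hit_rate():.1%}")
    for h in (0.0, 0.2, 0.5, 0.9):
        print(f"hit rate {h:.0%}: {cached_cost_ratio(h):.2f}x uncached cost")
```

Under those assumed multipliers the break-even hit rate is a little over 20%, so steady traffic clears it easily, while traffic whose requests mostly arrive after the TTL has expired pays more than it would with caching off.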
What actually forces the decision: licenses and law
The economics are often moot because a license clause or a regulation removes the option before you reach the spreadsheet. These are the constraints to read carefully — from the primary text, not a summary.
Open weights are not open source, and not uniformly licensed. “Open-weight” means you can download and run the weights; it says nothing about commercial terms. Apache-2.0 and MIT models (much of the Qwen and Mistral lineage, for example — check the specific model card) are genuinely permissive. Meta’s Llama licenses are not OSI-open: they carry an acceptable-use policy and, historically, an MAU threshold (the Llama 2/3 family required a separate license for products exceeding 700M monthly active users). Read the license attached to the specific checkpoint you intend to ship; vendors change terms between releases.
The Llama 4 EU restriction is the one most often overstated. The accurate version: Meta’s acceptable-use policy and license for the Llama 4 multimodal models restrict the license grant for individuals domiciled in, and companies with a principal place of business in, the EU. This is a license-grant limitation, not a regulatory ban, and its practical reach is narrower and messier than “Llama 4 is banned in the EU” — it turns on entity domicile, applies to the multimodal models specifically, and the treatment of text-only derivatives, downstream fine-tunes, and hosting-versus-end-use is contested. If this matters to you, read Meta’s current license text and AUP for the exact checkpoint and get your own counsel; don’t rely on a Substack paraphrase (including this one).
The EU AI Act splits obligations by role, and self-hosters are usually deployers, not providers. This distinction is load-bearing. The Act’s general-purpose AI (GPAI) regime, applicable since 2 August 2025, presumes that models trained with more than 10^25 FLOPs of cumulative compute carry “systemic risk” (Article 51), which triggers heavyweight obligations: adversarial testing/evaluation, serious-incident reporting, cybersecurity measures (Article 55). The critical point that bad summaries get wrong: those systemic-risk obligations bind the GPAI provider (the entity that develops and places the model on the market), not the downstream enterprise that self-hosts it.
If you download an open-weight frontier model and run it internally, you are a deployer (and, depending on what you build, possibly a downstream provider of an AI system). You do not automatically inherit the model maker’s systemic-risk provider duties like frontier adversarial testing or model-incident reporting. What you do inherit are deployer obligations — and, importantly, the chance of becoming a provider yourself if you substantially modify or fine-tune the model such that you materially change its purpose, or if you put it on the market under your own name. Re-releasing a fine-tune publicly can flip you into provider status; running it behind your own product generally does not, though high-risk use-case rules (Annex III domains: hiring, credit, etc.) can still attach deployer duties regardless. The compliance burden of self-hosting is real but it is not “you now owe everything the model maker owes.” Map your specific role against the Act’s provider/deployer definitions (Articles 3, 25, 53) before assuming.
There’s a genuine tension here that closed APIs sidestep: a hosted provider absorbs the provider-side GPAI obligations for you. Self-hosting can pull you closer to provider status, especially if you fine-tune and redistribute. That regulatory absorption is a real, underrated reason teams in regulated EU contexts pay the closed-API premium.
Data residency and confidential inference are the other forcing functions. If contracts or law require that prompts never leave your trust boundary, self-hosting on infrastructure you control is the straightforward answer. Confidential-computing GPUs (NVIDIA’s H100/H200 confidential-computing mode runs the model inside a hardware TEE with attestation, so even the host operator can’t read the weights or activations) are the more exotic option — relevant if you want a third party to run the model without seeing your data, but it carries a measured throughput overhead and added operational complexity, so reach for it only when the threat model specifically demands operator-blind execution, not as a default.
A decision procedure you can run
Strip away the model churn and the framework reduces to a sequence:
- License gate. Read the specific checkpoint’s license and AUP. If your jurisdiction, entity domicile, or scale violates it, the option is closed — stop.
- Regulatory gate. Determine whether you’re a deployer or, after your modifications, a provider under the AI Act. Decide whether data-residency or operator-blind requirements force on-prem. If law forces self-hosting, cost is secondary — budget for it.
- Traffic shape. Estimate sustained tokens/day and the peak-to-trough ratio. Spiky or low → per-token serverless or closed API; you cannot keep an owned GPU busy.
- Break-even arithmetic. For owned hardware, compute
monthly_gpu_cost / (aggregate_tok_s × util × seconds_in_month)
and compare against both the frontier API and the commodity open-model endpoint serving the same weights. You must beat the cheaper of the two.
- Operational cost. Add the fully-loaded cost of an on-call team that owns inference reliability, autoscaling, and model upgrades. This is the line item that quietly kills marginal self-hosting cases.
Run your own numbers through that. Anyone handing you a single magic threshold — “self-host above X tokens a day” — without naming the GPU, the utilization, and the comparison price is selling you a conclusion, not a model.
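For readers who prefer code to prose, the sequence above can be sketched as one function. Everything here is a simplification under stated assumptions: the two gates are reduced to booleans you have to answer yourself (ideally with counsel), a month is 30 days, traffic shape enters only through the tokens you actually serve, and all field names are hypothetical.

```python
from dataclasses import dataclass

SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, matching $24/hr -> $17,280


@dataclass
class Inputs:
    license_permits_use: bool        # license gate, answered from the license text
    law_forces_self_hosting: bool    # regulatory / data-residency gate
    sustained_tokens_per_day: float  # what you actually serve, not peak capacity
    node_peak_tok_s: float           # aggregate tok/s at full load
    monthly_gpu_cost: float
    monthly_ops_cost: float          # fully-loaded on-call and engineering
    frontier_api_per_m: float        # $/M output tokens
    commodity_api_per_m: float       # $/M output tokens, same weights


def decide(x: Inputs) -> str:
    if not x.license_permits_use:
        return "stop: the license rules this checkpoint out"
    if x.law_forces_self_hosting:
        return "self-host: required, so budget for it"
    monthly_tokens = x.sustained_tokens_per_day * 30
    if monthly_tokens <= 0:
        return "rent per-token: no sustained load"
    if monthly_tokens > x.node_peak_tok_s * SECONDS_PER_MONTH:
        return "re-run with more nodes: demand exceeds this node"
    # Tokens served = aggregate_tok_s * util * seconds_in_month, so dividing
    # by served tokens is the break-even formula with utilization folded in.
    self_cost = (x.monthly_gpu_cost + x.monthly_ops_cost) / monthly_tokens * 1e6
    target = min(x.frontier_api_per_m, x.commodity_api_per_m)
    if self_cost < target:
        return f"self-host: ${self_cost:.2f}/M beats ${target:.2f}/M"
    return f"rent per-token: ${self_cost:.2f}/M does not beat ${target:.2f}/M"
```

Feeding it the example node at 103.68M tokens/day with zero ops cost gives $5.56/M, which wins against a $12/M frontier API alone and loses once a $3/M commodity endpoint is in the comparison.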