Self-Hosting an LLM vs. Calling an API: The Real Cost Math
The question “should we self-host an open-weight model or pay for an API?” almost never has a clean answer at the model level, because the thing that actually determines the cost isn’t the model or even the GPU price. It’s utilization. A GPU you own costs the same whether it serves one request per second or a hundred; an API charges you per token and zero when idle. Everything else — quantization, batching engine, GPU generation, frontier vs. open weights — moves the breakeven point by a factor of two or three. Utilization moves it by an order of magnitude. So this piece works the cost math from the bottom up: what a served token actually costs you on owned hardware, why that number is dominated by duty cycle, and where the API genuinely wins on dollars rather than convenience.
The unit that matters: cost per million tokens, not cost per hour
GPU rental and API pricing are quoted in different units on purpose, and comparing them requires converting both to the same denominator: dollars per million output tokens ($/Mtok). Input tokens are cheaper to process than output tokens because prefill is compute-bound and highly parallel while decode is memory-bandwidth-bound and sequential, so most serious comparisons normalize on output, or on a blended input:output ratio (3:1 is a common assumption for chat workloads).
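A blended price can be sketched as a weighted average over the token mix. The function name `blended_price_per_mtok` and the example prices are illustrative assumptions, not figures from any provider's price list.

```python
def blended_price_per_mtok(input_price: float, output_price: float,
                           input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted-average $/Mtok for a workload with the given input:output token mix."""
    total_parts = input_parts + output_parts
    return (input_price * input_parts + output_price * output_parts) / total_parts


# Hypothetical prices: $0.20/Mtok input, $0.40/Mtok output, at the 3:1 chat mix.
blended = blended_price_per_mtok(0.20, 0.40)
print(f"blended: ${blended:.3f}/Mtok")
```

A blended figure is only as good as the ratio behind it, so it is worth measuring your own input:output mix rather than assuming 3:1.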
To get from $/GPU-hour to $/Mtok you need throughput:
$/Mtok = ($/GPU-hour) / (output_tokens_per_second * 3600 / 1e6)
That single equation is where most napkin math goes wrong, because output_tokens_per_second is not a property of the model. It’s a property of the model at a given batch size and sequence length on a given engine. A 70B model on one H100 might do 30–40 output tok/s for a single request and 1,500–2,500 aggregate tok/s across a saturated batch of concurrent requests. That’s a ~50x spread, and it maps directly onto a 50x spread in cost per token. Any throughput number quoted without a batch size and context length attached is close to meaningless — treat vendor “tokens per second” headlines as upper bounds measured under conditions you won’t reproduce, and derate accordingly.
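As a minimal sketch, the conversion above can be wrapped in a function and evaluated at both ends of the throughput range. The $2.00 hourly rate is a hypothetical placeholder, and 35 and 2,000 tok/s are picked from inside the ranges quoted above purely for illustration.

```python
def cost_per_mtok(gpu_hourly_usd: float, output_tok_per_s: float) -> float:
    """$/Mtok = ($/GPU-hour) / (output_tokens_per_second * 3600 / 1e6)."""
    mtok_per_hour = output_tok_per_s * 3600 / 1e6
    return gpu_hourly_usd / mtok_per_hour


HOURLY = 2.00  # hypothetical $/GPU-hour, for illustration only

single_request = cost_per_mtok(HOURLY, 35)      # one request in flight
saturated_batch = cost_per_mtok(HOURLY, 2000)   # aggregate across a full batch

print(f"single request:  ${single_request:.2f}/Mtok")
print(f"saturated batch: ${saturated_batch:.2f}/Mtok")
print(f"spread: {single_request / saturated_batch:.0f}x")
```

The hourly rate cancels out of the spread, which is the point: the ratio between the two costs is set entirely by throughput.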
The reason the aggregate number is so much higher than the per-request number is continuous batching plus paged KV-cache management. The vLLM team’s SOSP 2023 paper on PagedAttention reports 2–4x higher throughput than the previous best serving systems (Orca and FasterTransformer) at the same latency, achieved by cutting KV-cache memory waste from the 60–80% typical of contiguous allocators down to under 4% (Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP 2023). The widely quoted figure of up to ~24x is a comparison against naive HuggingFace Transformers and comes from the project’s launch announcement rather than the paper’s headline result; it measures against an unoptimized baseline almost nobody runs in production. The 2–4x over a real serving system is the number that matters if you’re choosing between engines, and it’s the one to plan around.
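To see why paging recovers so much memory, here is a toy model that compares reserving a full maximum-length slot per request against allocating fixed-size blocks on demand. It is not vLLM's allocator: the sequence lengths, the 2,048-token maximum, and the 16-token block size are all assumptions chosen for illustration, and real systems have other sources of waste.

```python
import math


def contiguous_waste(lengths: list[int], max_len: int) -> float:
    """Fraction of reserved KV slots left unused when every request reserves max_len."""
    reserved = len(lengths) * max_len
    return 1 - sum(lengths) / reserved


def paged_waste(lengths: list[int], block_size: int) -> float:
    """Fraction unused when each request holds only ceil(len / block_size) blocks."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return 1 - sum(lengths) / reserved


# Hypothetical current sequence lengths (tokens) for five in-flight requests.
lengths = [180, 420, 95, 1300, 610]

print(f"contiguous, max_len=2048: {contiguous_waste(lengths, 2048):.0%} wasted")
print(f"paged, block_size=16:     {paged_waste(lengths, 16):.1%} wasted")
```

With paging, the only waste in this model is the partially filled last block of each request, so the freed memory can hold more concurrent sequences and the batch stays fuller.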
The practical consequence: if you self-host without continuous batching and paged attention (i.e., you naively loop model.generate), your effective cost per token is several times higher than the numbers below, and you will lose the cost comparison to the API on every axis. The cost case for self-hosting assumes you are running vLLM, SGLang, TensorRT-LLM, or equivalent at high batch occupancy. If you can’t keep the batch full, stop reading and use the API.
What an owned GPU actually costs per hour
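Here’s where the common framing of “buying is worse than renting” falls apart under arithmetic. Take an H100 80GB SXM. As a share of an 8-GPU node (street price roughly $250K–$300K all-in for the node), call it ~$27K of capex attributable to one GPU. That is on the optimistic side, since a straight one-eighth share of the node would be $31K–$38K, so scale the capex line up proportionally if you want the conservative case. Amortized straight-line over a 3-year life at 24/7 availability: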
$27,000 / (3 years * 8,760 hours) = $1.03 / GPU-hour
That is the pure capital cost. Now power. An H100 SXM has a 700W TDP. At a realistic average draw near TDP under load:
700W * 730 hr/month / 1000 = 511 kWh/month
511 kWh * $0.12/kWh = $61/month (≈ $0.084/GPU-hour)
511 kWh * $0.20/kWh = $102/month (≈ $0.140/GPU-hour)
Those are the GPU’s own electricity numbers, and they are the ones to start from. Cooling and the rest of the facility are not a separate line item, and they are not a reason to double the kWh figure up front; they’re a multiplier on the GPU’s draw, captured by datacenter PUE (power usage effectiveness). A reasonably efficient facility runs PUE ≈ 1.2–1.4. Applying 1.4:
511 kWh * 1.4 = 715 kWh/month
715 kWh * $0.12 = $86/month (≈ $0.118/GPU-hour, all-in power incl. cooling)
Add hosting/colo, networking, and a slice of an ops engineer’s salary — call it $0.10–$0.20/GPU-hour as a rough but honest placeholder. Total fully-loaded cost of an owned, fully-utilized H100:
$1.03 (capex) + $0.12 (power+cooling) + ~$0.15 (ops) ≈ $1.30 / GPU-hour
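The build-up above can be reproduced with a small calculator, which makes it easy to swap in your own capex, electricity rate, or PUE. The function name and parameter names are illustrative; the $0.15 ops figure is the midpoint of the $0.10–$0.20 placeholder, not a measured value.

```python
HOURS_PER_YEAR = 8760


def owned_gpu_hourly(capex_usd: float, life_years: float, tdp_watts: float,
                     pue: float, usd_per_kwh: float, ops_hourly: float) -> dict[str, float]:
    """Break a fully-loaded $/GPU-hour into capex, power (incl. cooling), and ops."""
    capex = capex_usd / (life_years * HOURS_PER_YEAR)  # straight-line, 24/7
    power = tdp_watts / 1000 * pue * usd_per_kwh       # kW * PUE * $/kWh = $/hour
    return {"capex": capex, "power": power, "ops": ops_hourly,
            "total": capex + power + ops_hourly}


cost = owned_gpu_hourly(capex_usd=27_000, life_years=3, tdp_watts=700,
                        pue=1.4, usd_per_kwh=0.12, ops_hourly=0.15)
for name, value in cost.items():
    print(f"{name:>5}: ${value:.3f}/GPU-hour")
```

Capex dominates the total, so the amortization period and the purchase price are the two inputs worth arguing about; the electricity rate barely moves the result.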
Compare that to rental. As of mid-2026, on-demand H100 pricing spans roughly $1.40/hr (budget neoclouds like Thunder Compute) to $6.88/hr (AWS) and $12.29/hr (Azure), with a market median around $2.30–$3.10/hr; spot/interruptible drops to $0.34–$1.03/hr on marketplaces like Vast.ai and Spheron (aggregated across 15+ providers by getDeploying, Spheron, and IntuitionLabs, 2026 — these are marketplace listings, not audited rates, and the cheap end carries interruption and reliability risk). Reserved/committed pricing from mainstream providers tends to land around $2–$3/hr.
So the conclusion is the opposite of “buying is worse than renting”: at full utilization, an owned H100 at ~$1.30/hr all-in undercuts median on-demand and reserved rental, roughly ties the cheapest on-demand listings (~$1.40/hr), and sits above spot rates ($0.34–$1.03/hr), which carry interruption risk. Buying wins on raw dollars, but only if you keep it busy.
The utilization trap, which is the whole game
That “if” is everything. The capex line ($1.03/hr) accrues 24/7 whether or not a request is in flight. Power and ops scale partly with load, but capital does not. So your served-equivalent cost — cost per hour of actual useful work — is the all-in hourly cost divided by your duty cycle:
| Utilization | Served-equivalent $/GPU-hour |
|---|---|
| 100% | $1.30 |
| 60% | $2.17 |
| 40% | $3.25 |
| 25% | $5.20 |
At 25% utilization, which is generous for an internal tool with business-hours, single-region, spiky traffic, your owned H100 costs more per useful hour than renting one on demand at typical rates, and far more than the API. This is the actual reason most teams should not buy GPUs, and it has nothing to do with the sticker price. A GPU bought for $27K and run at 25% is a $27K asset doing $6,750 of work over its three-year life.
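The duty-cycle adjustment is a single division, sketched here with the rounded $1.30 all-in figure. Results can differ by a cent or two from values computed on unrounded components. Dividing the whole hourly cost by utilization is a simplifying assumption, since power and some ops costs do fall when the GPU is idle.

```python
def served_equivalent_hourly(all_in_hourly: float, utilization: float) -> float:
    """Cost per hour of useful work when the GPU is busy only `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return all_in_hourly / utilization


ALL_IN_HOURLY = 1.30  # rounded all-in $/GPU-hour for the owned H100

for u in (1.00, 0.60, 0.40, 0.25):
    print(f"{u:>4.0%}  ${served_equivalent_hourly(ALL_IN_HOURLY, u):.2f} per useful GPU-hour")
```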
Rental has the same trap, slightly muted: you can spin instances down, but only at the granularity your provider bills and your autoscaler reacts, and cold starts on a 70B model (weights load + CUDA graph capture) are tens of seconds to minutes, so you over-provision to hide them. The API has no trap at all on this axis — idle costs zero. This is the single biggest structural advantage of per-token pricing, and it’s why the API wins for the long tail of low- and spiky-volume workloads regardless of what the per-token math says.
The breakeven, then, is a utilization threshold, not a model choice. Work it backward: if your self-hosted stack delivers tokens at, say, $0.40/Mtok at 100% utilization and the comparable API charges $0.60/Mtok, self-hosting only wins once your sustained utilization is high enough that the served-equivalent cost stays under $0.60, which here means above $0.40 / $0.60 ≈ 67%. Below that line you’re paying for idle silicon.
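Working it backward amounts to one ratio. A minimal sketch, using the illustrative $0.40 and $0.60 figures from the paragraph above (the function name is an assumption):

```python
def breakeven_utilization(self_host_full_util_per_mtok: float, api_per_mtok: float) -> float:
    """Minimum sustained utilization at which self-hosting matches the API price.

    Served-equivalent cost = full-utilization cost / utilization, so the two
    sides are equal when utilization = full-utilization cost / API price.
    """
    return self_host_full_util_per_mtok / api_per_mtok


u = breakeven_utilization(0.40, 0.60)
print(f"breakeven utilization: {u:.0%}")
```

A result above 100% means the self-hosted stack cannot beat that API price at any utilization.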
Now the per-token numbers, on real prices
Convert the owned-H100 cost to tokens using a defensible throughput. A 70B-class model (Llama 3.3 70B, FP8) on a single H100 at high batch occupancy realistically sustains on the order of 1,500–2,000 aggregate output tok/s; the exact figure depends on context length, output length, and how full you keep the batch, so take the conservative end of that range, 1,500 tok/s, and treat even that as deratable:
1,500 tok/s * 3600 = 5.4M output tok/hour
$1.30/GPU-hour / 5.4 ≈ $0.24 / Mtok output (at 100% utilization)
At 40% utilization that becomes ~$0.60/Mtok output. Now the API side. DeepInfra lists Llama 3.3 70B at $0.23/Mtok input and $0.40/Mtok output, with an FP8 “Turbo” variant at $0.10/$0.32 (DeepInfra pricing, 2026; cross-checked against Artificial Analysis provider benchmarks). Together, Fireworks, and other open-weight hosts cluster in the same band.
So for the same open-weight model, a serverless API at ~$0.40/Mtok output is competitive with running it yourself, and cheaper at low-to-moderate utilization, while carrying none of the capacity-planning, on-call, or idle-cost burden. Self-hosting Llama 70B beats the API on dollars only in a specific regime: sustained high volume (roughly 54,000 or more output tokens per minute per GPU, continuously, at the 1,500 tok/s assumed here), where you can hold utilization above ~60% (the $0.24 / $0.40 breakeven) and amortize the engineering effort. The crossover is real but narrower than the self-hosting enthusiasm implies.
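To see where the crossover lands in monthly dollars, the sketch below compares a per-token API bill with always-on GPUs at a few output volumes. The volumes are hypothetical, and sizing the fleet for average load is a simplifying assumption that ignores peaks, input-token charges, and engineering time.

```python
import math

HOURS_PER_MONTH = 730


def monthly_costs(output_mtok_per_month: float, api_per_mtok: float,
                  gpu_hourly: float, tok_per_s_per_gpu: float) -> dict[str, float]:
    """Compare a per-token API bill with always-on GPUs sized for average load."""
    # Output tokens one saturated GPU can serve in a month, in Mtok.
    capacity_mtok = tok_per_s_per_gpu * 3600 * HOURS_PER_MONTH / 1e6
    gpus = max(1, math.ceil(output_mtok_per_month / capacity_mtok))
    return {
        "api": output_mtok_per_month * api_per_mtok,
        "self_host": gpus * gpu_hourly * HOURS_PER_MONTH,
        "gpus": gpus,
        "utilization": output_mtok_per_month / (gpus * capacity_mtok),
    }


for volume in (500, 2_000, 3_500):  # hypothetical Mtok of output per month
    r = monthly_costs(volume, api_per_mtok=0.40, gpu_hourly=1.30, tok_per_s_per_gpu=1500)
    print(f"{volume:>5} Mtok: API ${r['api']:,.0f}  self-host ${r['self_host']:,.0f}  "
          f"({r['gpus']} GPU, {r['utilization']:.0%} utilized)")
```

The self-hosted bill is flat per GPU while the API bill scales with volume, so the API is cheaper until the GPU is busy enough for the two lines to cross.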
The picture changes completely when you compare open weights you host against a frontier closed model. Frontier output pricing is one to two orders of magnitude higher than open-weight hosting. When a 70B open model is genuinely good enough for your task — and for extraction, classification, routing, summarization, and most structured-output jobs it often is — moving off a frontier API onto self-hosted or serverless open weights is where the large savings live. The win is the model substitution, not the hosting decision. Don’t conflate the two: “self-hosting saved us 90%” usually means “we stopped paying frontier prices for a job a 70B model does fine.”
Quantization: real lever, oversold precision
Quantization is the main knob that improves the self-hosting math, because it raises throughput (more requests fit in KV cache, kernels move fewer bytes) and lets you fit bigger models on fewer GPUs. FP8 on Hopper/Ada (native tensor-core support) roughly halves weight and KV-cache memory versus BF16, and on a memory-bandwidth-bound decode workload that translates to a meaningful throughput gain — though the multiplier depends on the engine and kernel, not just the format, so “doubles throughput” is a hope, not a guarantee.
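The memory side of this is simple arithmetic, sketched below. The layer count, KV-head count, and head dimension are assumed values for a Llama-3-70B-style architecture with grouped-query attention; check your model's config before relying on them. The weight figures ignore quantization scales, activations, and runtime overhead.

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """KV-cache bytes per token: one key and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Assumed Llama-3-70B-style shape: 80 layers, 8 KV heads (GQA), head_dim 128.
SHAPE = dict(layers=80, kv_heads=8, head_dim=128)

for name, nbytes in (("BF16", 2), ("FP8", 1), ("INT4", 0.5)):
    print(f"{name:>4}: weights ~{weight_gb(70, nbytes):.0f} GB")

for name, nbytes in (("BF16", 2), ("FP8", 1)):
    kib = kv_bytes_per_token(**SHAPE, bytes_per_elem=nbytes) / 1024
    print(f"{name:>4} KV cache: {kib:.0f} KiB per token")
```

Halving bytes per element halves both the weight footprint and the per-token cache, which is where the extra batch capacity comes from; how much of that turns into throughput still depends on the engine and kernels.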
The quality cost is task-dependent, and quoting a flat “<1–2% loss” is false precision. The empirically defensible shape: for large models (70B+), FP8 post-training quantization shows under ~0.5% MMLU delta and is often below benchmark noise on MMLU/HellaSwag/GSM8K; smaller models (≤7B) see more like 1–2%. INT8 reaches similar memory savings but is harder to get right — it clips activation outliers and tends to lose more accuracy on large language models than FP8 unless you do careful per-channel calibration. And degradation concentrates on exactly the workloads people care about most: multi-step math, scientific/numeric reasoning, long-context retrieval, and code generation regress more than chat or summarization (consistent across Spheron’s FP8 writeups, Latitude’s quantization tests, and the mixed-precision literature, e.g. arXiv:2510.16805, 2025–2026). Below 4-bit (INT3/INT2) the wheels come off — 4–12 point benchmark drops.
Practical rule: FP8 is close to free for serving most 70B production workloads and is the default to reach for. INT4/AWQ buys more memory headroom and is fine for chat-shaped tasks but should be evaluated on your eval set before trusting it on anything reasoning-heavy. Never accept a vendor’s single-benchmark quantization claim for a reasoning or long-context use case — measure it.
GPU generation and the moving target
Hardware generation shifts the math under you. Newer accelerators (B200-class) carry higher sticker and on-demand prices but deliver higher throughput per dollar on large models due to more memory bandwidth and bigger HBM, which can lower $/Mtok even at a higher $/hour. Spot/marketplace rates for newer parts also fall fast once supply catches up. The load-bearing point: never reason about $/hour in isolation. A $5/hr GPU that serves 3x the tokens of a $2/hr GPU is cheaper per token. Always convert to $/Mtok at your batch size before comparing generations — and re-derive it when you change the model, the context length, or the engine, because all three move throughput.
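The $5/hr versus $2/hr comparison can be checked by ranking candidates on $/Mtok rather than $/hour. The throughput values here are hypothetical stand-ins for numbers you would measure at your own batch size, model, and engine.

```python
from typing import NamedTuple


class Gpu(NamedTuple):
    name: str
    hourly_usd: float
    output_tok_per_s: float  # measured at your batch size, model, and engine

    @property
    def usd_per_mtok(self) -> float:
        return self.hourly_usd / (self.output_tok_per_s * 3600 / 1e6)


# Hypothetical candidates: the pricier part serves 3x the tokens.
candidates = [Gpu("older", 2.00, 1500), Gpu("newer", 5.00, 4500)]

for gpu in sorted(candidates, key=lambda g: g.usd_per_mtok):
    print(f"{gpu.name}: ${gpu.hourly_usd:.2f}/hr -> ${gpu.usd_per_mtok:.3f}/Mtok")
```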
A decision procedure that survives contact
Stop asking “self-host or API?” and answer three measurable questions instead.
- What is your sustained, not peak, token volume, and how spiky is it? Plot tokens/minute over a representative week. If the curve spends most of its time near zero with short spikes, the API’s zero-idle-cost wins and nothing else matters. Self-hosting needs a high, flat floor.
- Can you actually hold the batch full? This is an engineering capability question, not a model question. It requires a real serving engine (vLLM/SGLang/TensorRT-LLM), an autoscaler tuned to your latency SLO, and request volume high enough to saturate. If any of those is missing, your effective utilization collapses and the owned-hardware math inverts against you.
- Is the open model good enough that you’d run it whether you hosted it or not? If yes, you have two cheap paths (serverless open-weight API and self-host) and should pick on volume and ops appetite. If you genuinely need a frontier model, self-hosting isn’t even on the table: you can’t host what you don’t have the weights for.
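The first question can be answered from a tokens-per-minute log. The sketch below sizes a fleet for peak load and reports the utilization that results; the two traffic series are synthetic, the 1,500 tok/s capacity is the per-GPU figure assumed earlier, and sizing to the absolute peak with no autoscaling is a deliberately simple assumption.

```python
import math
import statistics


def traffic_profile(tokens_per_minute: list[float], gpu_tok_per_s: float) -> dict[str, float]:
    """Summarize a tokens/minute series against one GPU's saturated capacity."""
    capacity_per_min = gpu_tok_per_s * 60
    mean = statistics.fmean(tokens_per_minute)
    peak = max(tokens_per_minute)
    gpus_for_peak = max(1, math.ceil(peak / capacity_per_min))
    return {
        "mean_tpm": mean,
        "peak_to_mean": peak / mean if mean else float("inf"),
        "gpus_for_peak": gpus_for_peak,
        "utilization": mean / (gpus_for_peak * capacity_per_min),
    }


# Synthetic hour of traffic: mostly idle with a burst, versus a high flat floor.
spiky = [2_000] * 50 + [150_000] * 10
flat = [70_000] * 60

for label, series in (("spiky", spiky), ("flat", flat)):
    p = traffic_profile(series, gpu_tok_per_s=1500)
    print(f"{label}: {p['gpus_for_peak']} GPU(s) for peak, "
          f"peak/mean {p['peak_to_mean']:.1f}x, utilization {p['utilization']:.0%}")
```

A high peak-to-mean ratio is the signal to watch: it forces you to provision for the burst while paying for the idle time in between.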
The honest summary: for steady, high-volume, latency-tolerant batch and serving workloads where an open-weight model suffices and you can keep a real batching engine saturated above ~60% utilization, self-hosting on owned or reserved GPUs is cheaper per token than the equivalent API, often substantially. For everything else (spiky traffic, low or uncertain volume, frontier-quality requirements, or no appetite to run an inference stack on call) the API is cheaper and simpler, and the per-token premium is the correct price for offloading idle cost and operational risk. The numbers will keep moving as GPU generations turn over and open-weight hosts compete prices down, but the structure won’t: capital is paid whether or not you use it, and tokens are paid only when you do.