Phase 7 · Asset & Tech Architecture
AI API vs Self-Hosted GPU Calculator
Per-token API pricing is a dream until you scale — then it's a tax. Find the exact monthly volume where renting your own GPUs undercuts the API, and what you'd save.
Under the hood
The math, fully exposed
We total monthly tokens, price them both ways, and size the GPU fleet (730 hours/month):
Monthly tokens = MAU × requests/user × tokens/request
API cost = monthly tokens ÷ 1,000,000 × API price per 1M tokens
GPU capacity = throughput (tokens/sec) × 3,600 × 730 × utilization
GPUs needed = ⌈ monthly tokens ÷ GPU capacity ⌉
Self-host cost = GPUs × hourly rate × 730
Break-even volume = GPU hourly rate × 730 ÷ API price per 1M tokens × 1,000,000 (the monthly tokens where the API bill equals one GPU's monthly cost)
- APIs win when idle: you pay per token, so low or spiky volume favours the API — no idle cost, no ops.
- Utilization is everything for self-hosting: a GPU billed 24/7 only pays off if you keep it busy. Halve utilization and each GPU serves half the tokens for the same rent, so your effective cost per token roughly doubles.
- Compute is not total cost: this ignores the engineering and reliability work self-hosting adds. Add a realistic estimate of those hours before you switch.
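The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name, parameter names, and every input value below are assumptions chosen for the example.

```python
import math

HOURS_PER_MONTH = 730  # same convention as the formulas above


def compare_costs(mau, requests_per_user, tokens_per_request,
                  api_price_per_million, throughput_tps,
                  utilization, gpu_hourly_rate):
    """Price one month of traffic on a per-token API and on rented GPUs."""
    monthly_tokens = mau * requests_per_user * tokens_per_request
    api_cost = monthly_tokens / 1_000_000 * api_price_per_million

    # Tokens one GPU serves in a month at the given utilization (0 to 1).
    gpu_capacity = throughput_tps * 3600 * HOURS_PER_MONTH * utilization
    gpus_needed = math.ceil(monthly_tokens / gpu_capacity)
    self_host_cost = gpus_needed * gpu_hourly_rate * HOURS_PER_MONTH

    return {
        "monthly_tokens": monthly_tokens,
        "api_cost": api_cost,
        "gpus_needed": gpus_needed,
        "self_host_cost": self_host_cost,
        "savings": api_cost - self_host_cost,
    }


# Illustrative inputs only; substitute your own numbers.
example = dict(mau=50_000, requests_per_user=40, tokens_per_request=1_500,
               api_price_per_million=2.00, throughput_tps=1_000,
               gpu_hourly_rate=2.50)

busy = compare_costs(utilization=0.50, **example)
idle = compare_costs(utilization=0.25, **example)
print(busy)
print(idle)
```

With these made-up inputs, 50% utilization needs 3 GPUs and self-hosting comes out $525 a month ahead. At 25% utilization the fleet grows to 5 GPUs and the API is cheaper by $3,125.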
Answers
Frequently asked questions
When should I move from an AI API to self-hosted GPUs?
Roughly when your monthly token volume is high enough that the API bill for it exceeds the monthly cost of the GPUs that can serve it. APIs are cheapest at low and bursty volume because you pay nothing when idle. Once you have steady, high throughput, a rented GPU can cost a fraction of the API price per token, but only if you actually keep it busy and can absorb the engineering overhead.
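As a rough sketch of that threshold, the break-even volume is the number of monthly tokens at which the API bill equals one GPU's monthly rent. The helper names and the prices used here are illustrative assumptions, and the second function checks that a single GPU could actually serve that volume.

```python
HOURS_PER_MONTH = 730


def break_even_tokens(api_price_per_million, gpu_hourly_rate):
    """Monthly tokens at which the API bill equals one GPU's monthly rent."""
    gpu_monthly_cost = gpu_hourly_rate * HOURS_PER_MONTH
    return gpu_monthly_cost / api_price_per_million * 1_000_000


def one_gpu_can_serve(tokens, throughput_tps, utilization):
    """True if a single GPU has the monthly capacity for this many tokens."""
    capacity = throughput_tps * 3600 * HOURS_PER_MONTH * utilization
    return tokens <= capacity


# Assumed prices: $2.00 per 1M API tokens, $2.50 per GPU-hour.
volume = break_even_tokens(api_price_per_million=2.00, gpu_hourly_rate=2.50)
feasible = one_gpu_can_serve(volume, throughput_tps=1_000, utilization=0.50)
print(f"{volume:,.0f} tokens/month, one GPU can serve it: {feasible}")
```

Under these assumptions the break-even sits at 912.5 million tokens a month. If the capacity check fails, one GPU cannot carry the break-even volume, and the comparison has to be made against the cost of a larger fleet.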
How do I estimate GPU token throughput?
Throughput (tokens per second) depends on the model size, GPU, batching and quantization. A smaller open model on a mid-range GPU might do hundreds to a few thousand tokens/sec with batching; large models need bigger or multiple GPUs. Benchmark your actual model and GPU — vendor numbers assume ideal batching you may not hit in production.
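One way to get your own number is a small timing harness around whatever serves your model. The sketch below is an assumption-heavy outline: `generate_batch` is a hypothetical callable that you would wire to your inference stack and that returns how many tokens it produced, and `fake_generate_batch` is only a stand-in so the example runs without a GPU.

```python
import time


def measure_throughput(generate_batch, prompts, rounds=3):
    """Return observed tokens/sec for generate_batch over several rounds."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(rounds):
        # generate_batch is expected to return the number of tokens produced.
        total_tokens += generate_batch(prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate_batch(prompts):
    """Stand-in for a real model call: pretends each prompt yields 100 tokens."""
    time.sleep(0.01)
    return 100 * len(prompts)


tps = measure_throughput(fake_generate_batch, ["prompt a", "prompt b"], rounds=2)
print(f"observed throughput: {tps:,.0f} tokens/sec")
```

Run it with production-like prompt lengths and batch sizes, since the measured figure is what belongs in the capacity formula, not the vendor's headline number.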
What hidden costs does self-hosting add?
Plenty the sticker price hides: engineering time to build and maintain inference infrastructure, on-call and reliability work, idle GPU time when traffic dips, model updates, and scaling for spikes. This tool compares raw compute cost — add a realistic estimate of engineer-hours before deciding, because a few hours a month of senior time can erase the savings.
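To see how quickly engineering time eats into the compute savings, a short sketch helps. The function name, the 5 hours a month, and the $150 hourly cost are all assumptions for illustration; the API and self-host figures are example values, not benchmarks.

```python
def net_monthly_savings(api_cost, self_host_cost,
                        engineer_hours, engineer_hourly_cost):
    """Savings from self-hosting after adding engineering time to the GPU bill."""
    ops_cost = engineer_hours * engineer_hourly_cost
    return api_cost - (self_host_cost + ops_cost)


# Assumed example: $6,000 API bill vs $5,475 in GPU rent.
compute_only = net_monthly_savings(6_000, 5_475, engineer_hours=0,
                                   engineer_hourly_cost=150)
with_ops = net_monthly_savings(6_000, 5_475, engineer_hours=5,
                               engineer_hourly_cost=150)
print(compute_only, with_ops)
```

In this example a $525 compute saving turns into a $225 monthly loss once five hours of engineering time are counted.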
What about latency and scaling?
APIs scale instantly and absorb spikes for you; self-hosting means provisioning for peak (more idle cost) or risking queue delays. APIs also ship new models continuously. Self-hosting gives control, data residency and cost predictability at high volume. The compute math is necessary but not sufficient — weigh these operational factors alongside it.
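The cost of provisioning for peak can be sketched by sizing the fleet twice: once for average load and once for the busiest moment. The peak-to-average ratio is an assumption you would take from your own traffic data, the function names are illustrative, and for simplicity the sketch assumes each GPU can run at its full benchmarked throughput.

```python
import math

SECONDS_PER_MONTH = 730 * 3600


def gpus_for_average(monthly_tokens, throughput_tps):
    """Fleet size if traffic were perfectly flat across the month."""
    avg_tps = monthly_tokens / SECONDS_PER_MONTH
    return math.ceil(avg_tps / throughput_tps)


def gpus_for_peak(monthly_tokens, throughput_tps, peak_to_average):
    """Fleet size needed to serve the busiest moment without queueing."""
    avg_tps = monthly_tokens / SECONDS_PER_MONTH
    return math.ceil(avg_tps * peak_to_average / throughput_tps)


# Assumed inputs: 3B tokens/month, 1,000 tokens/sec per GPU, peak = 3x average.
flat_fleet = gpus_for_average(3_000_000_000, 1_000)
peak_fleet = gpus_for_peak(3_000_000_000, 1_000, peak_to_average=3)
print(flat_fleet, peak_fleet)
```

Here the flat-traffic fleet is 2 GPUs and the peak-sized fleet is 4, so half of the rented capacity exists only to absorb spikes.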