Phase 7 · Asset & Tech Architecture

AI API vs Self-Hosted GPU Calculator

Per-token API pricing is a dream until you scale — then it's a tax. Find the exact monthly volume where renting your own GPUs undercuts the API, and what you'd save.

Your inputs

Seven levers. The compute comparison re-solves on every tick.

50000 MAU

Users hitting your AI feature each month.

40/mo

Average AI calls per user.

1500 tok

Input + output tokens per call.

$3

Blended input/output API rate.

$1.5

Rental rate for one GPU instance.

2000 tok/s

Tokens/sec one GPU sustains for your model.

40%

Average share of capacity you keep busy.

Cheaper at your volume
Monthly compute cost compared.
API monthly cost
Self-host monthly cost
GPUs needed
Break-even volume

Under the hood

The math, fully exposed

We total monthly tokens, price them both ways, and size the GPU fleet (730 hours/month):

Monthly tokens = MAU × requests/user × tokens/request
API cost = monthly tokens ÷ 1,000,000 × API price
GPU capacity = throughput × 3,600 × 730 × utilization
GPUs needed = ⌈ monthly tokens ÷ GPU capacity ⌉
Self-host cost = GPUs × hourly rate × 730
Break-even = the volume where API cost passes one GPU's monthly cost
  • APIs win when idle: you pay per token, so low or spiky volume favours the API — no idle cost, no ops.
  • Utilization is everything for self-hosting: a GPU billed 24/7 only pays off if you keep it busy. Halve utilization and you nearly double your effective cost per token.
  • Compute is not total cost: this ignores the engineering and reliability work self-hosting adds. Add a realistic estimate of those hours before you switch.

Your directives

What to do next, based on your numbers

Adjust the sliders to generate tailored recommendations.

Answers

Frequently asked questions

When should I move from an AI API to self-hosted GPUs?
Roughly when your monthly token volume is high enough that the per-token API price exceeds the cost of a GPU that can serve it. APIs are cheapest at low and bursty volume because you pay nothing when idle. Once you have steady, high throughput, a rented GPU you keep busy can cost a fraction per token — but only if you actually keep it busy and can absorb the engineering overhead.
How do I estimate GPU token throughput?
Throughput (tokens per second) depends on the model size, GPU, batching and quantization. A smaller open model on a mid-range GPU might do hundreds to a few thousand tokens/sec with batching; large models need bigger or multiple GPUs. Benchmark your actual model and GPU — vendor numbers assume ideal batching you may not hit in production.
What hidden costs does self-hosting add?
Plenty the sticker price hides: engineering time to build and maintain inference infrastructure, on-call and reliability work, idle GPU time when traffic dips, model updates, and scaling for spikes. This tool compares raw compute cost — add a realistic estimate of engineer-hours before deciding, because a few hours a month of senior time can erase the savings.
What about latency and scaling?
APIs scale instantly and absorb spikes for you; self-hosting means provisioning for peak (more idle cost) or risking queue delays. APIs also ship new models continuously. Self-hosting gives control, data residency and cost predictability at high volume. The compute math is necessary but not sufficient — weigh these operational factors alongside it.