Self-hosting AI models on a budget VPS: a cost analysis
Meta description: Practical guide to running Ollama and vLLM on budget VPS instances — hardware requirements, benchmarks, and when self-hosting beats API calls.
Tags: backend, cloud, devops, docker, architecture
TL;DR
Self-hosting LLMs on budget VPS instances ($20-80/month) works for specific workloads: internal tools, batch processing, and low-concurrency applications. For anything above ~10 concurrent requests or requiring frontier-model quality, API calls remain cheaper when you factor in engineering time. I’ll cover the numbers, the hardware floor, and the decision framework.

The promise vs. the reality
Every few weeks, someone on Hacker News posts about running Llama on a $5 VPS and “never paying for API calls again.” Having built production systems around both approaches, I can tell you the reality is messier than that. Self-hosting makes sense in specific scenarios, but the break-even math is less favorable than most people assume.
Hardware floor: what you actually need
Most teams get this wrong by underestimating memory. LLM inference is memory-bound, not compute-bound: a model's parameter count directly dictates your RAM floor.
| Model | Parameters | Min RAM (Q4 Quantized) | Min RAM (FP16) | Recommended VPS |
|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 3 GB | 8 GB | 8 GB / 4 vCPU |
| Llama 3.1 8B | 8B | 5 GB | 16 GB | 16 GB / 4 vCPU |
| Mistral 7B | 7.3B | 5 GB | 15 GB | 16 GB / 4 vCPU |
| Llama 3.1 70B | 70B | 40 GB | 140 GB | GPU instance required |
| Qwen2.5 32B | 32B | 20 GB | 64 GB | 64 GB / dedicated GPU |
The sweet spot for budget VPS is the 7-8B parameter range at Q4 quantization. Anything larger and you’re paying $200+/month for a GPU instance, which destroys the cost argument entirely.
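The table's Q4 floors follow from a simple rule of thumb: bytes per parameter (0.5 for Q4, 2 for FP16) times parameter count, plus headroom for the KV cache and runtime. Here's a minimal sketch; the 1.2× overhead factor is my own working assumption, not a universal constant:

```python
def ram_floor_gb(params_billion: float, bits_per_param: float,
                 overhead: float = 1.2) -> float:
    """Estimate minimum RAM in GB to serve a model.

    bits_per_param: ~4 for Q4 quantization, 16 for FP16.
    overhead: fudge factor for KV cache and runtime (assumption).
    """
    bytes_total = params_billion * 1e9 * (bits_per_param / 8)
    return bytes_total * overhead / 1e9

# Llama 3.1 8B at Q4: ~4.8 GB, in line with the ~5 GB floor in the table
print(round(ram_floor_gb(8, 4), 1))  # → 4.8
```

Run it against any model you're considering before picking a VPS tier; the answer is usually "more RAM than you hoped."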

Ollama vs. vLLM: choosing your runtime
Ollama is the “Docker of LLMs.” Dead simple to set up, great for single-user or low-concurrency workloads. One command and you’re running inference.
```bash
# That's literally it
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
docker exec -it ollama ollama pull llama3.1:8b-instruct-q4_K_M
```
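Once the container is up, Ollama serves a plain HTTP API on port 11434. A stdlib-only sketch against its `/api/generate` endpoint (the model tag matches the pull above; error handling is deliberately minimal):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> dict:
    # Ollama /api/generate body; stream=False returns one JSON object
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3.1:8b-instruct-q4_K_M",
             host: str = "http://localhost:11434") -> str:
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    # CPU inference is slow; budget a generous timeout
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]

# generate("Summarize this ticket: ...")  # requires a running Ollama container
```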
vLLM is what you reach for in production. It implements PagedAttention for efficient KV-cache management, supports continuous batching, and handles concurrent requests significantly better.
```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --quantization awq
```
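A practical bonus of the `vllm-openai` image: it speaks the OpenAI chat-completions wire format, so existing client code mostly just needs a new base URL. A stdlib sketch (port and model name match the docker command above):

```python
import json
import urllib.request

def build_chat_request(model: str, user_msg: str, max_tokens: int = 256) -> dict:
    # OpenAI-style chat completion body, accepted by vLLM's server
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
    }

def chat(user_msg: str, base_url: str = "http://localhost:8000") -> str:
    body = json.dumps(
        build_chat_request("meta-llama/Llama-3.1-8B-Instruct", user_msg)
    ).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]

# chat("Classify this log line: ...")  # requires a running vLLM server
```

That compatibility is also your migration escape hatch: swap the base URL back to a hosted API if self-hosting stops paying off.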
The numbers tell the story:
| Metric | Ollama (CPU, 8B Q4) | vLLM (GPU, 8B AWQ) | Claude API (Sonnet) |
|---|---|---|---|
| Tokens/sec (single request) | 8-15 | 40-80 | 80-160 |
| Tokens/sec (10 concurrent) | 1-3 per req | 25-50 per req | 80-160 per req |
| Time-to-first-token | 1-4s | 0.1-0.3s | 0.3-0.8s |
| Monthly cost (typical VPS) | $24-48 | $80-250 (GPU) | Usage-dependent |
CPU inference with Ollama gives you roughly 10 tokens/second on a 4-vCPU machine. That’s adequate for internal tools, background summarization, or a personal assistant. It falls apart under concurrency.
The cost break-even analysis
Time for math. Common workload: 500 requests/day, averaging 500 input tokens and 300 output tokens each.
API cost (Claude Sonnet): at $3/M input and $15/M output tokens, that's ~$3/day, so roughly $90/month.
Self-hosted (Ollama on Hetzner CX42, 16GB, $24/month): Fixed $24/month, but each request takes 20-40 seconds on CPU. At 500 requests/day, you need ~4 hours of sequential processing. Feasible if batched, but latency-sensitive workloads won’t tolerate it.
Self-hosted (vLLM on GPU, ~$150/month): Fixed cost, handles the load comfortably, but you’re paying more than the API and getting a less capable model.
The break-even point where self-hosting wins is roughly 2,000+ requests/day with relaxed latency requirements and an 8B model that meets your quality bar.
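This is easy to model for your own workload. A sketch using the numbers above; the $3/$15 per-million-token rates are assumed Sonnet-class pricing, so plug in whatever your provider currently charges:

```python
def api_cost_per_month(reqs_per_day: int, in_tokens: int, out_tokens: int,
                       in_rate: float = 3.0, out_rate: float = 15.0) -> float:
    """Monthly API bill; rates in $ per million tokens (assumed pricing)."""
    daily = reqs_per_day * (in_tokens * in_rate + out_tokens * out_rate) / 1e6
    return daily * 30

def break_even_reqs_per_day(fixed_monthly: float, in_tokens: int,
                            out_tokens: int, in_rate: float = 3.0,
                            out_rate: float = 15.0) -> float:
    """Requests/day where a fixed-cost server matches the API bill."""
    per_req = (in_tokens * in_rate + out_tokens * out_rate) / 1e6
    return fixed_monthly / (per_req * 30)

print(round(api_cost_per_month(500, 500, 300)))       # → 90  ($/month)
print(round(break_even_reqs_per_day(150, 500, 300)))  # → 833 (req/day)
```

Note the raw token-cost crossover against a $150/month GPU box lands around ~830 requests/day; the 2,000+ figure leaves headroom for maintenance time and the quality gap, which is where the real cost hides.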
When self-hosting actually makes sense
Self-hosting wins in a few scenarios:
- Data sovereignty. Regulated industries where prompts cannot leave your infrastructure. No amount of API pricing comparison matters if compliance says no.
- High-volume, low-quality-bar tasks. Classification, extraction, summarization where a 7B model performs adequately. At 10K+ requests/day, the economics flip hard in your favor.
- Predictable budgets. Fixed monthly cost vs. variable API billing. Finance teams love this, even when total cost is slightly higher.
Self-hosting loses when you need frontier-model reasoning, low latency under concurrency, or when your engineering team’s time maintaining infrastructure exceeds the API savings. That last one bites people more often than they expect.
Production checklist
If you decide to self-host, don’t skip these:
- Health checks and auto-restart. LLM processes crash under memory pressure. Use Docker restart policies and liveness probes.
- Request queuing. Put a queue (Redis, BullMQ) in front of your inference server. CPU inference cannot handle burst traffic.
- Monitoring. Track tokens/second, queue depth, and memory usage. OOM kills are your primary failure mode.
- Model pinning. Pin your quantized model hash. Upstream quantization changes can silently alter output quality.
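On the queuing point: even a bounded in-process queue beats letting burst traffic hit the inference server directly. A minimal asyncio sketch; Redis or BullMQ would replace this in a real deployment, and `fake_inference` is a stand-in for an actual call to Ollama:

```python
import asyncio

async def fake_inference(prompt: str) -> str:
    # Stand-in for a real inference call (assumption for this sketch)
    await asyncio.sleep(0.01)
    return f"summary of: {prompt}"

async def worker(queue: asyncio.Queue, results: list) -> None:
    while True:
        prompt = await queue.get()
        results.append(await fake_inference(prompt))
        queue.task_done()

async def main() -> list:
    # Bounded queue: put() blocks when full, applying backpressure upstream
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)
    results: list = []
    # CPU inference is effectively serial, so run a single worker
    task = asyncio.create_task(worker(queue, results))
    for i in range(5):
        await queue.put(f"request {i}")
    await queue.join()  # wait until every queued request is processed
    task.cancel()
    return results

print(len(asyncio.run(main())))  # → 5
```

The bounded `maxsize` is the important bit: when the queue fills, producers wait instead of piling requests onto a server that's already saturated.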
Takeaways
Start with the API, then migrate specific workloads. Profile your actual usage for two weeks. Only self-host workloads that are high-volume, latency-tolerant, and where an 8B model meets your quality threshold.
Budget VPS means CPU-only means Ollama. If you’re on a $20-50/month VPS, use Ollama with a Q4-quantized 7-8B model. Accept the 10 tok/s ceiling and design your system around async processing.
Factor in engineering cost honestly. If your team spends 8 hours setting up, tuning, and maintaining a self-hosted deployment, that’s $800-2,000 in engineering time. You need months of sustained savings to recoup that.