Self-hosting AI models on a budget VPS: a cost analysis
Meta description: Practical guide to running Ollama and vLLM on budget VPS instances — hardware requirements, benchmarks, and when self-hosting beats API calls.
Tags: backend, cloud, devops, docker, architecture
TL;DR
Self-hosting LLMs on budget VPS instances ($20-80/month) works for specific workloads: internal tools, batch processing, and low-concurrency applications. For anything above ~10 concurrent requests or requiring frontier-model quality, API calls remain cheaper when you factor in engineering time. I’ll cover the numbers, the hardware floor, and the decision framework.

The promise vs. the reality
Every few weeks, someone on Hacker News posts about running Llama on a $5 VPS and “never paying for API calls again.” Having built production systems around both approaches, I can tell you the reality is messier than that. Self-hosting makes sense in specific scenarios, but the break-even math is less favorable than most people assume.
Hardware floor: what you actually need
Most teams get this wrong by underestimating memory. LLM inference is memory-bound, not compute-bound: a model's parameter count directly dictates your RAM floor.
| Model | Parameters | Min RAM (Q4 Quantized) | Min RAM (FP16) | Recommended VPS |
|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 3 GB | 8 GB | 8 GB / 4 vCPU |
| Llama 3.1 8B | 8B | 5 GB | 16 GB | 16 GB / 4 vCPU |
| Mistral 7B | 7.3B | 5 GB | 15 GB | 16 GB / 4 vCPU |
| Llama 3.1 70B | 70B | 40 GB | 140 GB | GPU instance required |
| Qwen2.5 32B | 32B | 20 GB | 64 GB | 64 GB / dedicated GPU |
The sweet spot for budget VPS is the 7-8B parameter range at Q4 quantization. Anything larger and you’re paying $200+/month for a GPU instance, which destroys the cost argument entirely.
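The table's Q4 floors follow from a simple rule of thumb: bytes per parameter (0.5 for Q4, 2 for FP16) times parameter count, plus headroom for the KV cache and runtime. Here's a minimal sketch; the 1.2× overhead factor is my own working assumption, not a universal constant:

```python
def ram_floor_gb(params_billion: float, bits_per_param: float,
                 overhead: float = 1.2) -> float:
    """Estimate minimum RAM in GB to serve a model.

    bits_per_param: ~4 for Q4 quantization, 16 for FP16.
    overhead: fudge factor for KV cache and runtime (assumption).
    """
    bytes_total = params_billion * 1e9 * (bits_per_param / 8)
    return bytes_total * overhead / 1e9

# Llama 3.1 8B at Q4: ~4.8 GB, in line with the ~5 GB floor in the table
print(round(ram_floor_gb(8, 4), 1))  # → 4.8
```

Run it against any model you're considering before picking a VPS tier; the answer is usually "more RAM than you hoped."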

Ollama vs. vLLM: choosing your runtime
Ollama is the “Docker of LLMs.” Dead simple to set up, great for single-user or low-concurrency workloads. One command and you’re running inference.
```bash
# That's literally it
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
docker exec -it ollama ollama pull llama3.1:8b-instruct-q4_K_M
```
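Once the container is up, Ollama serves a plain HTTP API on port 11434. A stdlib-only sketch against its `/api/generate` endpoint (the model tag matches the pull above; error handling is deliberately minimal):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> dict:
    # Ollama /api/generate body; stream=False returns one JSON object
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3.1:8b-instruct-q4_K_M",
             host: str = "http://localhost:11434") -> str:
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    # CPU inference is slow; budget a generous timeout
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]

# generate("Summarize this ticket: ...")  # requires a running Ollama container
```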
vLLM is what you reach for in production. It implements PagedAttention for efficient KV-cache management, supports continuous batching, and handles concurrent requests significantly better.
```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --quantization awq
```
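A practical bonus of the `vllm-openai` image: it speaks the OpenAI chat-completions wire format, so existing client code mostly just needs a new base URL. A stdlib sketch (port and model name match the docker command above):

```python
import json
import urllib.request

def build_chat_request(model: str, user_msg: str, max_tokens: int = 256) -> dict:
    # OpenAI-style chat completion body, accepted by vLLM's server
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
    }

def chat(user_msg: str, base_url: str = "http://localhost:8000") -> str:
    body = json.dumps(
        build_chat_request("meta-llama/Llama-3.1-8B-Instruct", user_msg)
    ).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]

# chat("Classify this log line: ...")  # requires a running vLLM server
```

That compatibility is also your migration escape hatch: swap the base URL back to a hosted API if self-hosting stops paying off.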
The numbers tell the story:
| Metric | Ollama (CPU, 8B Q4) | vLLM (GPU, 8B AWQ) | Claude API (Sonnet) |
|---|---|---|---|
| Tokens/sec (single request) | 8-15 | 40-80 | 80-160 |
| Tokens/sec (10 concurrent) | 1-3 per req | 25-50 per req | 80-160 per req |
| Time-to-first-token | 1-4s | 0.1-0.3s | 0.3-0.8s |
| Monthly cost (typical VPS) | $24-48 | $80-250 (GPU) | Usage-dependent |
CPU inference with Ollama gives you roughly 10 tokens/second on a 4-vCPU machine. That’s adequate for internal tools, background summarization, or a personal assistant. It falls apart under concurrency.
The cost break-even analysis
Time for math. Common workload: 500 requests/day, averaging 500 input tokens and 300 output tokens each.
API cost (Claude Sonnet): at $3/M input and $15/M output tokens, that's ~$3/day, so roughly $90/month.
Self-hosted (Ollama on Hetzner CX42, 16GB, $24/month): Fixed $24/month, but each request takes 20-40 seconds on CPU. At 500 requests/day, you need ~4 hours of sequential processing. Feasible if batched, but latency-sensitive workloads won’t tolerate it.
Self-hosted (vLLM on GPU, ~$150/month): Fixed cost, handles the load comfortably, but you’re paying more than the API and getting a less capable model.
The break-even point where self-hosting wins is roughly 2,000+ requests/day with relaxed latency requirements and an 8B model that meets your quality bar.
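This is easy to model for your own workload. A sketch using the numbers above; the $3/$15 per-million-token rates are assumed Sonnet-class pricing, so plug in whatever your provider currently charges:

```python
def api_cost_per_month(reqs_per_day: int, in_tokens: int, out_tokens: int,
                       in_rate: float = 3.0, out_rate: float = 15.0) -> float:
    """Monthly API bill; rates in $ per million tokens (assumed pricing)."""
    daily = reqs_per_day * (in_tokens * in_rate + out_tokens * out_rate) / 1e6
    return daily * 30

def break_even_reqs_per_day(fixed_monthly: float, in_tokens: int,
                            out_tokens: int, in_rate: float = 3.0,
                            out_rate: float = 15.0) -> float:
    """Requests/day where a fixed-cost server matches the API bill."""
    per_req = (in_tokens * in_rate + out_tokens * out_rate) / 1e6
    return fixed_monthly / (per_req * 30)

print(round(api_cost_per_month(500, 500, 300)))       # → 90  ($/month)
print(round(break_even_reqs_per_day(150, 500, 300)))  # → 833 (req/day)
```

Note the raw token-cost crossover against a $150/month GPU box lands around ~830 requests/day; the 2,000+ figure leaves headroom for maintenance time and the quality gap, which is where the real cost hides.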
When self-hosting actually makes sense
Self-hosting wins in a few scenarios:
- Data sovereignty. Regulated industries where prompts cannot leave your infrastructure. No amount of API pricing comparison matters if compliance says no.
- High-volume, low-quality-bar tasks. Classification, extraction, summarization where a 7B model performs adequately. At 10K+ requests/day, the economics flip hard in your favor.
- Predictable budgets. Fixed monthly cost vs. variable API billing. Finance teams love this, even when total cost is slightly higher.
Self-hosting loses when you need frontier-model reasoning, low latency under concurrency, or when your engineering team’s time maintaining infrastructure exceeds the API savings. That last one bites people more often than they expect.
Production checklist
If you decide to self-host, don’t skip these:
- Health checks and auto-restart. LLM processes crash under memory pressure. Use Docker restart policies and liveness probes.
- Request queuing. Put a queue (Redis, BullMQ) in front of your inference server. CPU inference cannot handle burst traffic.
- Monitoring. Track tokens/second, queue depth, and memory usage. OOM kills are your primary failure mode.
- Model pinning. Pin your quantized model hash. Upstream quantization changes can silently alter output quality.
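On the queuing point: even a bounded in-process queue beats letting burst traffic hit the inference server directly. A minimal asyncio sketch; Redis or BullMQ would replace this in a real deployment, and `fake_inference` is a stand-in for an actual call to Ollama:

```python
import asyncio

async def fake_inference(prompt: str) -> str:
    # Stand-in for a real inference call (assumption for this sketch)
    await asyncio.sleep(0.01)
    return f"summary of: {prompt}"

async def worker(queue: asyncio.Queue, results: list) -> None:
    while True:
        prompt = await queue.get()
        results.append(await fake_inference(prompt))
        queue.task_done()

async def main() -> list:
    # Bounded queue: put() blocks when full, applying backpressure upstream
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)
    results: list = []
    # CPU inference is effectively serial, so run a single worker
    task = asyncio.create_task(worker(queue, results))
    for i in range(5):
        await queue.put(f"request {i}")
    await queue.join()  # wait until every queued request is processed
    task.cancel()
    return results

print(len(asyncio.run(main())))  # → 5
```

The bounded `maxsize` is the important bit: when the queue fills, producers wait instead of piling requests onto a server that's already saturated.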
Takeaways
Start with the API, then migrate specific workloads. Profile your actual usage for two weeks. Only self-host workloads that are high-volume, latency-tolerant, and where an 8B model meets your quality threshold.
Budget VPS means CPU-only means Ollama. If you’re on a $20-50/month VPS, use Ollama with a Q4-quantized 7-8B model. Accept the 10 tok/s ceiling and design your system around async processing.
Factor in engineering cost honestly. If your team spends 8 hours setting up, tuning, and maintaining a self-hosted deployment, that’s $800-2,000 in engineering time. You need months of sustained savings to recoup that.