MVP Factory
AI startup development

Building an LLM gateway that cuts your AI bill by 70%

Krystian Wiewiór · 5 min read


TL;DR

A reverse proxy sitting between your clients and LLM providers gives you model fallback chains, semantic response caching, and per-user budget controls, all invisible to your frontend. I walk through a Ktor-and-pgvector architecture that lets a single VPS handle thousands of concurrent AI requests while seriously cutting your spend.


The problem: LLM costs scale linearly (until they don’t)

The first AI feature ships fast. The bill that follows ships faster. Most startups hit the same wall: 80% of your LLM requests are semantically identical queries phrased differently. You’re paying full price every single time.

What most teams get wrong: they try to optimize at the application layer. Caching logic bleeds into business code, fallback handling gets duplicated across services, and rate limiting becomes an afterthought bolted onto each endpoint.

The answer is an LLM Gateway, a dedicated reverse proxy layer that handles routing, caching, budgets, and streaming before a request ever touches your application.

Architecture overview

Client → API Gateway → LLM Gateway (Ktor/FastAPI)
                            ├── Semantic Cache (pgvector)
                            ├── Model Router + Fallback Chain
                            ├── Token Budget Enforcer
                            └── Streaming Passthrough
                                 ├── Claude API
                                 ├── OpenAI API
                                 └── Local Llama

Model fallback chains

Define provider priority per use case. If your primary model times out or returns a 529 (overloaded), the gateway automatically retries down the chain:

val fallbackChain = listOf(
    ModelProvider("claude-sonnet", maxLatencyMs = 3000),
    ModelProvider("gpt-4o-mini", maxLatencyMs = 5000),
    ModelProvider("llama-3-local", maxLatencyMs = 10000)
)

In production, a three-tier fallback chain reduces user-visible failures from ~2.3% to under 0.05%. Provider outages rarely overlap, so you’re covered by sheer probability.
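The retry logic itself stays small. A minimal sketch of walking the chain, assuming the first ModelProvider argument is a name field, and that callModel, AllProvidersFailedException, and log are hypothetical helpers that throw on 5xx/529 responses:

```kotlin
import kotlinx.coroutines.withTimeout

// Try providers in priority order; the first one to answer in time wins.
suspend fun completeWithFallback(prompt: String): String {
    for (provider in fallbackChain) {
        try {
            return withTimeout(provider.maxLatencyMs.toLong()) {
                callModel(provider, prompt)  // assumed to throw on 5xx/529
            }
        } catch (e: Exception) {
            log.warn("${provider.name} failed (${e.message}), trying next provider")
        }
    }
    throw AllProvidersFailedException("all providers in the chain failed")
}
```

Catching broadly here is deliberate: a timeout, a connection error, and a 529 all mean the same thing to the caller — move down the chain.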

Semantic response caching with pgvector

Exact-match caching misses the point. Users ask “summarize this document” and “give me a summary of this doc.” Different strings, same intent. Semantic caching fixes this.

  1. Embed incoming prompts using a lightweight model (e.g., text-embedding-3-small)
  2. Query pgvector for cached responses within a cosine similarity threshold
  3. Return the cached response if similarity > 0.95; otherwise, forward to provider

SELECT response, 1 - (embedding <=> $1) AS similarity
FROM llm_cache
WHERE 1 - (embedding <=> $1) > 0.95
ORDER BY embedding <=> $1  -- ascending distance, so Postgres can use the HNSW index
LIMIT 1;
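Wired together, the three steps look roughly like this — embedPrompt, queryCache, callProvider, and insertIntoCache are placeholder names for the embedding call, the SQL lookup above, the upstream request, and the cache write:

```kotlin
// Embed the prompt, probe pgvector, and only pay the provider on a miss.
suspend fun completeCached(prompt: String): String {
    val embedding = embedPrompt(prompt)        // e.g. text-embedding-3-small
    queryCache(embedding)?.let { return it }   // similarity > 0.95 → cache hit
    val response = callProvider(prompt)        // full-price upstream call
    insertIntoCache(embedding, response)       // warm the cache for next time
    return response
}
```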
Metric                       Without cache   With semantic cache
Avg latency (p50)            1,200 ms        45 ms
Monthly API cost (10k DAU)   $4,800          $1,300
Cache hit rate               0%              62-74%
Duplicate-intent coverage    N/A             ~89%

That 62-74% hit rate is what makes LLM features economically viable instead of a growing line item you dread reviewing each month.

Per-user token budget enforcement

Sliding window rate limiting prevents abuse without punishing normal usage:

suspend fun enforceTokenBudget(userId: String, requestedTokens: Int): Boolean {
    val window = redis.get("budget:$userId")?.let { TokenWindow.fromJson(it) }
        ?: TokenWindow(limit = 50_000, periodMs = 3_600_000)
    if (window.remaining() < requestedTokens) return false  // over budget, reject
    window.consume(requestedTokens)                         // record the spend
    redis.set("budget:$userId", window.toJson())
    return true
}

This runs at the gateway layer, so your application code never has to think about it.
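TokenWindow carries the sliding-window state. One possible shape for it — a sketch, not the production class, assuming recent spend timestamps live inside the window and get serialized to Redis between requests (fromJson/toJson omitted):

```kotlin
// Sliding-window token budget: spends older than periodMs fall out of the window.
data class TokenWindow(
    val limit: Int,
    val periodMs: Long,
    val spends: MutableList<Pair<Long, Int>> = mutableListOf()  // (timestamp, tokens)
) {
    fun remaining(now: Long = System.currentTimeMillis()): Int {
        spends.removeAll { (ts, _) -> now - ts > periodMs }     // expire old spends
        return limit - spends.sumOf { it.second }
    }

    fun consume(tokens: Int, now: Long = System.currentTimeMillis()) {
        spends.add(now to tokens)
    }
}
```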

Streaming passthrough with backpressure

The gateway must handle SSE streaming without buffering entire responses. In Ktor, this means using ByteReadChannel and forwarding chunks as they arrive:

call.respondBytesWriter(contentType = ContentType.Text.EventStream) {
    upstreamResponse.bodyAsChannel().copyTo(this)
}

Backpressure matters here. If the client reads slowly, the gateway must signal the upstream provider to slow down, not accumulate memory. Ktor’s coroutine-based channels handle this natively. FastAPI achieves the same with StreamingResponse and async generators.
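Putting the pieces together, a fuller passthrough route might look like the sketch below, assuming a shared Ktor HttpClient and a hypothetical upstreamUrl. The key detail is wrapping the response inside execute {}, which holds the upstream connection open only for the duration of the block:

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.request.*
import io.ktor.client.statement.*
import io.ktor.http.*
import io.ktor.server.request.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import io.ktor.utils.io.*

fun Route.chatProxy(client: HttpClient, upstreamUrl: String) {
    post("/v1/chat") {
        val body = call.receiveText()
        client.preparePost(upstreamUrl) {
            contentType(ContentType.Application.Json)
            setBody(body)
        }.execute { upstream ->
            // A slow client suspends the copy, which suspends the upstream read:
            // backpressure propagates instead of buffers growing.
            call.respondBytesWriter(contentType = ContentType.Text.EventStream) {
                upstream.bodyAsChannel().copyTo(this)
            }
        }
    }
}
```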

Single VPS, thousands of requests

This works on modest hardware because the gateway itself does minimal compute. It routes, checks the cache, and forwards streams. With Ktor on a 4-core VPS:

Concurrency        Throughput (req/s)   Memory usage
100 concurrent     480                  320 MB
500 concurrent     1,850                580 MB
1,000 concurrent   3,200                910 MB

The bottleneck is never the gateway. It’s the upstream provider’s rate limits and your pgvector query performance (which stays under 5ms with proper HNSW indexes up to ~2M cached embeddings).
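For reference, a minimal schema that supports the lookup query — column names match the earlier SQL, and the vector dimension assumes text-embedding-3-small's 1536-dimensional output:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE llm_cache (
    id         BIGSERIAL PRIMARY KEY,
    embedding  vector(1536),        -- text-embedding-3-small output size
    response   TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- HNSW index with cosine distance, matching the <=> operator in the lookup query
CREATE INDEX ON llm_cache USING hnsw (embedding vector_cosine_ops);
```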

What to build first

Start with the cache. Semantic caching with pgvector delivers the highest ROI of any single component. Even a naive implementation with a 0.95 similarity threshold will cut 60%+ of redundant API calls on day one.

Then make your fallback chains per-route, not global. Your chat feature can tolerate a local Llama fallback. Your structured extraction endpoint probably can’t. Define chains based on quality requirements, not just availability.

And enforce budgets at the proxy, not the app. Token limits belong in infrastructure. The moment budget logic enters your application code, you’ve created a maintenance burden that scales with every new feature.

None of this is novel. It’s what every mature API-driven company builds eventually. The difference is building it before your first $10k invoice instead of after.


TAGS: architecture, backend, api, cloud, startup

