
Streaming LLM Tokens to 10K Concurrent Users: Backpressure, Coroutine Channels, and the SSE Fan-Out Architecture That Scales Without Melting Your Server

Krystian Wiewiór · 5 min read

TL;DR

Streaming LLM tokens via Server-Sent Events sounds simple until 10,000 clients connect and half of them are on slow networks. The architecture that survives: coroutine-per-connection with structured concurrency, bounded Channel buffers for backpressure, graceful connection draining during deploys, and careful memory budgeting. On a 4GB container, your real ceiling is around 8,000-12,000 concurrent SSE connections, if you get the buffer math right.


The problem: fan-out at token speed

LLM APIs emit tokens every 20-80ms. When you proxy those tokens to thousands of concurrent users via SSE, every connection becomes a long-lived coroutine holding an open HTTP response. One slow client that can’t consume fast enough will bloat your buffers, and without backpressure, you’re one GC pause away from an OOM kill.

I’ve built production systems that sit between LLM providers and end users, and the naive approach (unbounded lists, no draining strategy, fire-and-forget writes) collapses around 2,000 connections. Most teams get three things wrong: buffer sizing, backpressure policy, and deploy-time draining. Let me walk through each.

Coroutine channels as the backbone

The core pattern is a bounded Channel<String> per SSE connection, fed by a shared upstream coroutine consuming the LLM stream:

import kotlinx.coroutines.channels.*

val upstream = Channel<String>(capacity = 64) // shared LLM token source

fun fanOut(clients: List<Channel<String>>, token: String) {
    for (client in clients) {
        // Non-blocking send: returns a failed ChannelResult instead of suspending
        client.trySend(token).onFailure {
            // Client buffer full: apply your backpressure policy
            client.close() // or drop oldest, depending on SLA
        }
    }
}

Each client gets its own bounded channel (I recommend 32-128 slots). When a slow client fills its buffer, trySend fails immediately. No blocking the upstream, no cascading stalls.
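
Wired into a server, the per-connection side looks roughly like the sketch below. This is a minimal sketch assuming Ktor; registry and tokenStream are hypothetical names, with registry standing in for whatever shared collection fanOut iterates over.

import io.ktor.http.ContentType
import io.ktor.server.application.call
import io.ktor.server.response.respondTextWriter
import io.ktor.server.routing.Route
import io.ktor.server.routing.get
import kotlinx.coroutines.channels.Channel
import java.util.concurrent.CopyOnWriteArrayList

val registry = CopyOnWriteArrayList<Channel<String>>() // hypothetical shared list read by fanOut

fun Route.tokenStream() {
    get("/stream") {
        val inbox = Channel<String>(capacity = 64) // this client's bounded buffer
        registry.add(inbox)
        try {
            call.respondTextWriter(contentType = ContentType.Text.EventStream) {
                for (token in inbox) {        // suspends until fanOut delivers a token
                    write("data: $token\n\n") // SSE wire format: data line plus blank line
                    flush()                   // push each token out immediately
                }
            }
        } finally {
            registry.remove(inbox) // stop fanning out to this client
            inbox.close()
        }
    }
}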

Why bounded channels beat unbounded queues

| Approach | Memory under load | Slow client impact | Failure mode |
| --- | --- | --- | --- |
| Unbounded list per client | Grows without limit | Heap exhaustion | OOM kill, all clients die |
| Single shared channel | Bounded | Slowest client blocks all | Head-of-line blocking |
| Bounded channel per client | Predictable ceiling | Only that client affected | Graceful disconnect |

The numbers make this obvious. With unbounded queues, a single stalled client accumulating 50,000 tokens at ~40 bytes each eats 2MB of raw payload, and closer to double that once per-string JVM object overhead is counted. A few hundred slow mobile clients like that will eat most of your heap.

The memory math for a 4GB container

This arithmetic determines your actual concurrency ceiling:

| Component | Per-connection cost | At 10K connections |
| --- | --- | --- |
| Coroutine stack | ~1-2 KB | 10-20 MB |
| Bounded channel (64 slots × 40 B) | ~2.5 KB | 25 MB |
| Ktor/Netty response buffer | ~8 KB | 80 MB |
| Connection metadata + headers | ~1 KB | 10 MB |
| Total per connection | ~13 KB | ~130 MB |

The static footprint looks tiny (~130 MB at 10K), but the real pressure comes from churn: 10K streams each receiving 12-50 tokens per second allocate tens of megabytes of short-lived strings every second, and the GC needs heap headroom to absorb that without long pauses. On a 4GB container with ~2.5GB available heap (after JVM overhead, metaspace, GC headroom), you land at roughly 12,000 connections before pressure mounts. In practice, I target 8,000-10,000 to leave room for burst traffic and GC breathing room. If you need more, scale horizontally. Don't increase buffer sizes.

Connection draining for zero-downtime deploys

During rolling deployments, you can’t just kill 10,000 open SSE connections. The strategy:

  1. Stop accepting new connections. Remove the pod from the load balancer.
  2. Send a custom SSE event (event: reconnect) telling clients to reconnect. They’ll hit a healthy pod.
  3. Set a drain deadline (30 seconds) and forcibly close any remaining connections after it expires.
  4. Use structured concurrency so coroutineScope ensures all child coroutines complete or cancel cleanly, as the drain routine below shows.

import kotlinx.coroutines.withTimeoutOrNull
import kotlin.time.Duration

suspend fun drainConnections(clients: List<ClientSession>, deadline: Duration) {
    withTimeoutOrNull(deadline) {
        // Ask every client to reconnect against a healthy pod
        clients.forEach { it.sendEvent("reconnect", """{"reason":"deploy"}""") }
        // Wait for clients to hang up on their own
        clients.forEach { it.awaitDisconnect() }
    }
    // Force-close any stragglers still attached after the deadline
    clients.forEach { it.close() }
}

Without this, Kubernetes will SIGTERM your pod, the TCP connections will reset, and users will see a broken stream with no retry hint.
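
To make steps 1-4 actually fire during a rollout, hook the drain into the server's shutdown path. Here's a hedged sketch assuming Ktor, where the ApplicationStopPreparing event fires once the engine begins shutting down (e.g. after SIGTERM reaches the process); installDraining and sessions are hypothetical names:

import io.ktor.server.application.Application
import io.ktor.server.application.ApplicationStopPreparing
import kotlinx.coroutines.runBlocking
import kotlin.time.Duration.Companion.seconds

fun Application.installDraining(sessions: () -> List<ClientSession>) {
    environment.monitor.subscribe(ApplicationStopPreparing) {
        // Runs before Ktor tears down the engine and its open responses
        runBlocking { drainConnections(sessions(), 30.seconds) }
    }
}

Pair this with a preStop delay or a readiness flip so the load balancer stops routing new connections before the drain begins.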

Structured concurrency does the heavy lifting

Every SSE connection runs inside a coroutineScope tied to the request lifecycle. When a client disconnects, the coroutine cancels. When the server drains, all child coroutines cancel cooperatively. No leaked coroutines, no zombie connections, no thread pool exhaustion.

This is where Kotlin’s structured concurrency model pays off compared to thread-per-connection or callback-based approaches. Cancellation propagates automatically through the entire coroutine tree, which means you don’t write cleanup logic. You just structure your scopes correctly and the runtime handles the rest.
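
As a concrete illustration (not the article's exact handler), here is a minimal request-scoped sketch; tokens is an assumed upstream Flow and ClientSession is the same session type used in the drain example. Cancelling the scope, whether from a disconnect or a drain, cancels both children:

import kotlinx.coroutines.*
import kotlinx.coroutines.channels.*
import kotlinx.coroutines.flow.*

suspend fun handleSse(session: ClientSession, tokens: Flow<String>) = coroutineScope {
    val inbox = Channel<String>(capacity = 64)
    launch {
        tokens.collect { inbox.send(it) } // producer child: suspends if the buffer fills
        inbox.close()                     // upstream done: lets the writer loop finish
    }
    launch {
        for (t in inbox) session.sendEvent("token", t) // writer child: drains until closed
    }
    // coroutineScope only returns once both children complete or cancel
}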

What to actually do

Budget ~13-15 KB per SSE connection and run the memory math before choosing your container size. On 4GB, plan for 8K-10K connections max with comfortable headroom, and scale horizontally beyond that.

Use bounded channels (32-128 slots) per client with trySend for non-blocking fan-out. Drop or disconnect slow clients rather than letting unbounded buffers eat your heap. I know disconnecting clients feels aggressive, but the alternative is an OOM that disconnects everyone.
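
If your SLA favors keeping slow clients connected over delivering every token, kotlinx.coroutines also supports the drop-oldest policy mentioned earlier, set at channel construction:

import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.channels.Channel

// Drop-oldest variant: trySend always succeeds, and a lagging client silently
// loses its oldest buffered tokens instead of being disconnected.
val inbox = Channel<String>(
    capacity = 64,
    onBufferOverflow = BufferOverflow.DROP_OLDEST,
)

Note that for raw completion text this garbles the output, since dropped tokens are simply missing words, so it suits progress or status events better than token streams.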

Implement connection draining from day one. Send a reconnect event, set a deadline, and lean on structured concurrency to guarantee cleanup. This will save you the first time you push a hotfix under load. Retrofitting it after an incident is miserable.

The architecture isn’t complex. It’s disciplined. Bounded buffers, predictable memory, cooperative cancellation. That’s what keeps your server running at 10K concurrent streams instead of melting under them.

