
Streaming LLM Tokens to 10K Concurrent Users: Backpressure, Coroutine Channels, and the SSE Fan-Out Architecture That Scales Without Melting Your Server

Krystian Wiewiór · 5 min read

TL;DR

Streaming LLM tokens via Server-Sent Events sounds simple until 10,000 clients connect and half of them are on slow networks. The architecture that survives: coroutine-per-connection with structured concurrency, bounded Channel buffers for backpressure, graceful connection draining during deploys, and careful memory budgeting. On a 4GB container, your real ceiling is around 8,000-12,000 concurrent SSE connections, if you get the buffer math right.


The problem: fan-out at token speed

LLM APIs emit tokens every 20-80ms. When you proxy those tokens to thousands of concurrent users via SSE, every connection becomes a long-lived coroutine holding an open HTTP response. One slow client that can’t consume fast enough will bloat your buffers, and without backpressure, you’re one GC pause away from an OOM kill.

I’ve built production systems that sit between LLM providers and end users, and the naive approach (unbounded lists, no draining strategy, fire-and-forget writes) collapses around 2,000 connections. Most teams get three things wrong: buffer sizing, backpressure policy, and deploy-time draining. Let me walk through each.

Coroutine channels as the backbone

The core pattern is a bounded Channel<String> per SSE connection, fed by a shared upstream coroutine consuming the LLM stream:

import kotlinx.coroutines.channels.*

val upstream = Channel<String>(capacity = 64) // shared LLM token source

fun fanOut(clients: List<Channel<String>>, token: String) {
    for (client in clients) {
        // Non-blocking send: returns a failed ChannelResult instead of suspending
        client.trySend(token).onFailure {
            // Client buffer full: apply your backpressure policy
            client.close() // or drop oldest, depending on SLA
        }
    }
}

Each client gets its own bounded channel (I recommend 32-128 slots). When a slow client fills its buffer, trySend fails immediately. No blocking the upstream, no cascading stalls.
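
Wired into a server, the per-connection side looks roughly like the sketch below. This is a minimal sketch assuming Ktor; registry and tokenStream are hypothetical names, with registry standing in for whatever shared collection fanOut iterates over.

import io.ktor.http.ContentType
import io.ktor.server.application.call
import io.ktor.server.response.respondTextWriter
import io.ktor.server.routing.Route
import io.ktor.server.routing.get
import kotlinx.coroutines.channels.Channel
import java.util.concurrent.CopyOnWriteArrayList

val registry = CopyOnWriteArrayList<Channel<String>>() // hypothetical shared list read by fanOut

fun Route.tokenStream() {
    get("/stream") {
        val inbox = Channel<String>(capacity = 64) // this client's bounded buffer
        registry.add(inbox)
        try {
            call.respondTextWriter(contentType = ContentType.Text.EventStream) {
                for (token in inbox) {        // suspends until fanOut delivers a token
                    write("data: $token\n\n") // SSE wire format: data line plus blank line
                    flush()                   // push each token out immediately
                }
            }
        } finally {
            registry.remove(inbox) // stop fanning out to this client
            inbox.close()
        }
    }
}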

Why bounded channels beat unbounded queues

| Approach | Memory under load | Slow client impact | Failure mode |
| --- | --- | --- | --- |
| Unbounded list per client | Grows without limit | Heap exhaustion | OOM kill, all clients die |
| Single shared channel | Bounded | Slowest client blocks all | Head-of-line blocking |
| Bounded channel per client | Predictable ceiling | Only that client affected | Graceful disconnect |

The numbers make this obvious. With unbounded queues, a single stalled client accumulating 50,000 tokens at ~40 bytes each eats 2MB of raw payload, and closer to double that once per-string JVM object overhead is counted. A few hundred slow mobile clients like that will eat most of your heap.

The memory math for a 4GB container

This arithmetic determines your actual concurrency ceiling:

| Component | Per-connection cost | At 10K connections |
| --- | --- | --- |
| Coroutine stack | ~1-2 KB | 10-20 MB |
| Bounded channel (64 slots × 40 B) | ~2.5 KB | 25 MB |
| Ktor/Netty response buffer | ~8 KB | 80 MB |
| Connection metadata + headers | ~1 KB | 10 MB |
| Total per connection | ~13 KB | ~130 MB |

The static footprint looks tiny (~130 MB at 10K), but the real pressure comes from churn: 10K streams each receiving 12-50 tokens per second allocate tens of megabytes of short-lived strings every second, and the GC needs heap headroom to absorb that without long pauses. On a 4GB container with ~2.5GB available heap (after JVM overhead, metaspace, GC headroom), you land at roughly 12,000 connections before pressure mounts. In practice, I target 8,000-10,000 to leave room for burst traffic and GC breathing room. If you need more, scale horizontally. Don't increase buffer sizes.

Connection draining for zero-downtime deploys

During rolling deployments, you can’t just kill 10,000 open SSE connections. The strategy:

  1. Stop accepting new connections. Remove the pod from the load balancer.
  2. Send a custom SSE event (event: reconnect) telling clients to reconnect. They’ll hit a healthy pod.
  3. Set a drain deadline (30 seconds) and forcibly close any remaining connections after it expires.
  4. Use structured concurrency so coroutineScope ensures all child coroutines complete or cancel cleanly, as the drain routine below shows.

import kotlinx.coroutines.withTimeoutOrNull
import kotlin.time.Duration

suspend fun drainConnections(clients: List<ClientSession>, deadline: Duration) {
    withTimeoutOrNull(deadline) {
        // Ask every client to reconnect against a healthy pod
        clients.forEach { it.sendEvent("reconnect", """{"reason":"deploy"}""") }
        // Wait for clients to hang up on their own
        clients.forEach { it.awaitDisconnect() }
    }
    // Force-close any stragglers still attached after the deadline
    clients.forEach { it.close() }
}

Without this, Kubernetes will SIGTERM your pod, the TCP connections will reset, and users will see a broken stream with no retry hint.
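
To make steps 1-4 actually fire during a rollout, hook the drain into the server's shutdown path. Here's a hedged sketch assuming Ktor, where the ApplicationStopPreparing event fires once the engine begins shutting down (e.g. after SIGTERM reaches the process); installDraining and sessions are hypothetical names:

import io.ktor.server.application.Application
import io.ktor.server.application.ApplicationStopPreparing
import kotlinx.coroutines.runBlocking
import kotlin.time.Duration.Companion.seconds

fun Application.installDraining(sessions: () -> List<ClientSession>) {
    environment.monitor.subscribe(ApplicationStopPreparing) {
        // Runs before Ktor tears down the engine and its open responses
        runBlocking { drainConnections(sessions(), 30.seconds) }
    }
}

Pair this with a preStop delay or a readiness flip so the load balancer stops routing new connections before the drain begins.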

Structured concurrency does the heavy lifting

Every SSE connection runs inside a coroutineScope tied to the request lifecycle. When a client disconnects, the coroutine cancels. When the server drains, all child coroutines cancel cooperatively. No leaked coroutines, no zombie connections, no thread pool exhaustion.

This is where Kotlin’s structured concurrency model pays off compared to thread-per-connection or callback-based approaches. Cancellation propagates automatically through the entire coroutine tree, which means you don’t write cleanup logic. You just structure your scopes correctly and the runtime handles the rest.
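
As a concrete illustration (not the article's exact handler), here is a minimal request-scoped sketch; tokens is an assumed upstream Flow and ClientSession is the same session type used in the drain example. Cancelling the scope, whether from a disconnect or a drain, cancels both children:

import kotlinx.coroutines.*
import kotlinx.coroutines.channels.*
import kotlinx.coroutines.flow.*

suspend fun handleSse(session: ClientSession, tokens: Flow<String>) = coroutineScope {
    val inbox = Channel<String>(capacity = 64)
    launch {
        tokens.collect { inbox.send(it) } // producer child: suspends if the buffer fills
        inbox.close()                     // upstream done: lets the writer loop finish
    }
    launch {
        for (t in inbox) session.sendEvent("token", t) // writer child: drains until closed
    }
    // coroutineScope only returns once both children complete or cancel
}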

What to actually do

Budget ~13-15 KB per SSE connection and run the memory math before choosing your container size. On 4GB, plan for 8K-10K connections max with comfortable headroom, and scale horizontally beyond that.

Use bounded channels (32-128 slots) per client with trySend for non-blocking fan-out. Drop or disconnect slow clients rather than letting unbounded buffers eat your heap. I know disconnecting clients feels aggressive, but the alternative is an OOM that disconnects everyone.
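
If your SLA favors keeping slow clients connected over delivering every token, kotlinx.coroutines also supports the drop-oldest policy mentioned earlier, set at channel construction:

import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.channels.Channel

// Drop-oldest variant: trySend always succeeds, and a lagging client silently
// loses its oldest buffered tokens instead of being disconnected.
val inbox = Channel<String>(
    capacity = 64,
    onBufferOverflow = BufferOverflow.DROP_OLDEST,
)

Note that for raw completion text this garbles the output, since dropped tokens are simply missing words, so it suits progress or status events better than token streams.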

Implement connection draining from day one. Send a reconnect event, set a deadline, and lean on structured concurrency to guarantee cleanup. This will save you the first time you push a hotfix under load. Retrofitting it after an incident is miserable.

The architecture isn’t complex. It’s disciplined. Bounded buffers, predictable memory, cooperative cancellation. That’s what keeps your server running at 10K concurrent streams instead of melting under them.

