
Streaming LLM responses to mobile: SSE vs WebSockets

Krystian Wiewiór · 5 min read


Tags: kotlin, jetpackcompose, backend, architecture, api


TL;DR

For most mobile AI chat features, Server-Sent Events (SSE) beats WebSockets. Simpler reconnection, HTTP/2 multiplexing, better battery behavior. Pair it with Kotlin Flow buffering on the backend and client-side token batching in Compose to avoid per-character recomposition jank. The hard part isn’t the happy path. It’s what happens on flaky networks.


The protocol decision: SSE vs WebSockets

This is where most teams start debating, and where most teams overthink it. Here’s what the tradeoffs actually look like for mobile LLM streaming:

| Factor | SSE | WebSocket |
| --- | --- | --- |
| Direction | Server → Client (unidirectional) | Bidirectional |
| Reconnection | Built-in (Last-Event-ID) | Manual implementation |
| HTTP/2 multiplexing | Yes, shares connection pool | No, dedicated TCP socket |
| Battery impact | Lower (idle HTTP conn) | Higher (persistent frame pings) |
| Proxy/CDN compatibility | Excellent | Often problematic |
| Mobile network switching | Graceful (HTTP retry semantics) | Connection drops, full re-handshake |

LLM streaming is inherently unidirectional. The client sends a prompt, then receives tokens. You don’t need bidirectional framing for that. SSE gives you automatic reconnection with Last-Event-ID, which matters a lot on mobile where network transitions (Wi-Fi to cellular) happen constantly.
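Resumption works because each event can carry an id: field, and the client echoes the newest one back in a Last-Event-ID header when it reconnects. Production clients (OkHttp's EventSource, Ktor's SSE support) track this for you; purely to show the mechanics, here is a stripped-down sketch of the parsing and id tracking, with deliberately simplified field handling:

```kotlin
data class SseEvent(val id: String?, val data: String)

// Minimal sketch of SSE frame parsing with Last-Event-ID tracking.
// Real clients also handle `event:`, `retry:`, comments, and partial
// frames split across network reads; this only shows the resume mechanics.
class SseParser {
    var lastEventId: String? = null  // echo as Last-Event-ID on reconnect
        private set

    fun parse(raw: String): List<SseEvent> =
        raw.split("\n\n")              // a blank line terminates an event
            .filter { it.isNotBlank() }
            .map { block ->
                var id: String? = null
                val data = StringBuilder()
                for (line in block.lines()) {
                    when {
                        line.startsWith("id:") ->
                            id = line.removePrefix("id:").trim()
                        line.startsWith("data:") -> {
                            if (data.isNotEmpty()) data.append('\n')
                            data.append(line.removePrefix("data:").trim())
                        }
                    }
                }
                if (id != null) lastEventId = id
                SseEvent(id, data.toString())
            }
}
```

On a network switch, the client reopens the stream with `Last-Event-ID: ${parser.lastEventId}` and the server replays only the tokens the client never saw.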

I’ve only reached for WebSockets when I needed server-push and client-push at the same time, like collaborative editing or multiplayer features. For AI chat, SSE wins and it’s not close.

The Ktor backend: flows and backpressure

On the Ktor side, respondSseEvents paired with a Kotlin Flow wrapping your LLM client is the obvious choice:

get("/chat/stream") {
    // SSE clients issue GETs (and reconnect with GETs), so the prompt
    // travels as a query parameter, not a request body.
    val prompt = call.request.queryParameters["message"].orEmpty()
    call.respondSseEvents(
        llmClient.streamTokens(prompt)
            .buffer(Channel.BUFFERED)  // 64-element default
            .map { token ->
                ServerSentEvent(data = token)
            }
    )
}

That buffer(Channel.BUFFERED) matters more than it looks. Without it, a slow mobile client creates backpressure that propagates all the way to your LLM API connection. With the buffer, the backend absorbs token bursts while the client catches up. For structured JSON responses arriving mid-stream, I accumulate tokens into a StringBuilder and only emit parse-ready chunks:

fun Flow<String>.chunkedJson(): Flow<String> = flow {
    val buffer = StringBuilder()
    collect { token ->
        buffer.append(token)
        // hasCompleteJsonFragment() is an app-specific completeness check
        // (e.g. balanced braces outside string literals), elided here.
        if (buffer.hasCompleteJsonFragment()) {
            emit(buffer.toString())
            buffer.clear()
        }
    }
    if (buffer.isNotEmpty()) emit(buffer.toString())
}

This avoids the client trying to parse {"name": "Jo, which is a surprisingly common source of crashes in production.
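What might that completeness check look like? A naive sketch, assuming whole top-level JSON objects and no nested parsing concerns; in production you'd lean on a streaming JSON parser rather than hand-rolled brace counting:

```kotlin
// Naive completeness check for a buffered JSON fragment: the fragment is
// considered parse-ready once every brace opened outside a string literal
// has been closed. A sketch only; it ignores arrays, numbers at top level,
// and other cases a real streaming parser would handle.
fun StringBuilder.hasCompleteJsonFragment(): Boolean {
    var depth = 0
    var inString = false
    var escaped = false
    var sawBrace = false
    for (c in this) {
        when {
            escaped -> escaped = false           // skip the escaped character
            inString && c == '\\' -> escaped = true
            c == '"' -> inString = !inString
            !inString && c == '{' -> { depth++; sawBrace = true }
            !inString && c == '}' -> depth--
        }
    }
    return sawBrace && depth == 0 && !inString
}
```

With this check, `{"name": "Jo` stays buffered (open brace, unterminated string) while `{"name": "John"}` is emitted downstream.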

Compose client: batching to kill jank

Most teams get this wrong. Emitting every single token as a state update causes per-character recomposition in Compose. At 50-80 tokens/second from a fast LLM, that’s 50-80 recompositions per second on Text(), and you will see frame drops.

The fix is batching with a time window:

@Composable
fun StreamingMessage(tokenFlow: Flow<String>) {
    val message = remember { mutableStateOf("") }

    LaunchedEffect(tokenFlow) {
        tokenFlow
            // chunked(durationMillis) is a custom time-window operator
            // (kotlinx.coroutines has no built-in time-based chunking);
            // ~48ms ≈ 3 frames at 60fps.
            .chunked(durationMillis = 48)
            .collect { batch ->
                message.value += batch.joinToString("")
            }
    }

    Text(text = message.value)
}

Batching tokens into ~48ms windows means you recompose roughly 20 times per second, which is smooth enough visually and well within Compose’s performance budget.
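Since kotlinx.coroutines ships no time-window chunking operator, the `chunked(durationMillis)` call above assumes a small custom one. A minimal sketch using `channelFlow` with a ticker coroutine that flushes the accumulated batch once per window:

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Time-window batching: collect upstream values into a list and emit the
// batch every `durationMillis`. Empty windows emit nothing.
fun <T> Flow<T>.chunked(durationMillis: Long): Flow<List<T>> = channelFlow {
    val batch = mutableListOf<T>()
    val lock = Mutex()

    // Ticker coroutine: flush whatever accumulated during the last window.
    val flusher = launch {
        while (isActive) {
            delay(durationMillis)
            val out = lock.withLock {
                if (batch.isEmpty()) null else batch.toList().also { batch.clear() }
            }
            if (out != null) send(out)
        }
    }

    this@chunked.collect { value ->
        lock.withLock { batch.add(value) }
    }

    // Upstream finished: stop the ticker and flush the tail so no
    // tokens are dropped.
    flusher.cancelAndJoin()
    val tail = lock.withLock { batch.toList().also { batch.clear() } }
    if (tail.isNotEmpty()) send(tail)
}
```

Because every append and every flush happens under the mutex, token order is preserved end to end; only the emission cadence changes.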

Graceful degradation: the production reality

Mobile networks are hostile. Your streaming architecture needs layered defenses.

First, timeout with partial results. If the SSE connection stalls for more than 10 seconds, surface whatever tokens have arrived so far with a “response interrupted” indicator. Don’t leave the user staring at a spinner.

Second, exponential backoff with jitter. On reconnection, use Last-Event-ID to resume where you left off. Add jitter to prevent thundering herd when a cell tower comes back online and 10,000 devices reconnect at once.

Third, fall back to non-streaming. If three SSE attempts fail, make a standard POST request that returns the complete response. The user loses the token animation but still gets their answer.
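The delay schedule in the second step is a small piece of arithmetic. A sketch of "full jitter" backoff, where the delay is drawn uniformly up to an exponentially growing cap; the base and cap values here are illustrative, not from any spec:

```kotlin
import kotlin.math.pow
import kotlin.random.Random

// Full-jitter backoff: delay is uniform in [0, min(cap, base * 2^attempt)].
// Drawing the whole delay at random (rather than adding a small jitter term)
// spreads reconnecting devices evenly across the window.
fun backoffDelayMillis(
    attempt: Int,                // 0-based reconnection attempt
    baseMillis: Long = 500,
    capMillis: Long = 30_000,
): Long {
    val ceiling = (baseMillis * 2.0.pow(attempt)).toLong().coerceAtMost(capMillis)
    return Random.nextLong(ceiling + 1)
}
```

A reconnect loop would sleep `backoffDelayMillis(attempt)` before each retry, resume with Last-Event-ID, and switch to the non-streaming fallback once `attempt` hits the cutoff.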

sealed class StreamState {
    data class Streaming(val tokens: String) : StreamState()
    data class Interrupted(val partial: String) : StreamState()
    data class Fallback(val complete: String) : StreamState()
    data class Error(val message: String) : StreamState()
}

Model your UI state around these cases. Every when branch in your Compose UI should handle all four.

What to take away

Pick SSE over WebSockets for LLM streaming to mobile. The built-in reconnection, HTTP/2 multiplexing, and battery efficiency make it the right default. Only reach for WebSockets if you genuinely need bidirectional communication.

Buffer on the server, batch on the client. Use Channel.BUFFERED in your Ktor Flow pipeline to absorb token bursts. On the Compose side, batch tokens into ~48ms windows to keep recompositions to roughly 20 per second: imperceptible to users, and a massive reduction in wasted recomposition work.

And design for failure from the start. Timeout with partial results, exponential backoff with Last-Event-ID resume, and a non-streaming fallback. The happy path is easy. Production-readiness lives in how you handle the failures.

