MVP Factory
ai startup development

Mobile WebSocket tuning that stops silent message loss

Krystian Wiewiór · 5 min read


TL;DR

Most mobile WebSocket implementations silently drop 3-8% of messages during network transitions (doze mode, cell handoffs, app backgrounding). The root cause is almost always a mismatch between WebSocket-level ping/pong frames, TCP keep-alive timers, and mobile OS power states. I’ll walk through the reconnection state machine we built in Kotlin that brought our delivery rate from ~94% to 99.97% on lossy networks, with the actual latency numbers.


The silent failure most teams never measure

What most teams get wrong about WebSocket reliability on mobile: they test on WiFi, in the foreground, on a charged device. Production users are on congested LTE, walking into elevators, with battery saver enabled. The gap between those two worlds is enormous.

In my experience building production messaging systems, the first thing that breaks is not the connection itself. It’s your awareness that the connection is dead. A TCP socket can appear open for minutes after the actual network path has failed. This is the dead connection problem, and it’s where the interaction between three distinct keepalive mechanisms matters most.

| Mechanism | Layer | Default Interval | Who Sends | Mobile OS Behavior |
|---|---|---|---|---|
| TCP Keep-Alive | Transport | 2 hours (Linux) | Kernel | Suspended in Doze mode |
| WebSocket Ping/Pong | Application | None (optional) | App/Server | Suspended when app backgrounded |
| HTTP/Proxy Timeout | Infrastructure | 60-120s | Load balancer | Unaware of mobile state |

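The mismatch is obvious. TCP keep-alive defaults to two hours, which is effectively useless for mobile. Meanwhile, your load balancer can kill an idle connection after as little as 60 seconds. And the two mechanisms that could keep the connection warm both stop when the device goes quiet: app-level pings are suspended once the app is backgrounded, and TCP keepalives are suspended when Android enters Doze mode.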

The result: your app thinks it’s connected, the server has already cleaned up the session, and messages land in a void.

The reconnection state machine

Naive retry logic (while(true) { connect(); delay(5000); }) is how you get thundering herds after an outage and duplicate message delivery during partial failures. You need a deterministic state machine.

enum class ConnectionState {
    DISCONNECTED,
    CONNECTING,
    CONNECTED,
    WAITING_FOR_RETRY,
    BACKING_OFF,
    DRAINING_QUEUE
}

The states most implementations miss are BACKING_OFF and DRAINING_QUEUE. When a reconnection succeeds, you can’t immediately resume normal operation. You must first drain any queued messages in order, confirming delivery of each before sending the next. Skipping this step is where that 3-8% loss hides.
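As a rough illustration of how these states could fit together, here is a minimal transition function. The event names are hypothetical, and so is the split between BACKING_OFF (the delay timer is running) and WAITING_FOR_RETRY (the delay has elapsed and the client waits for a usable network); the enum alone doesn't say how the two differ, so treat that part as one possible reading rather than the actual design.

```kotlin
enum class ConnectionState {
    DISCONNECTED, CONNECTING, CONNECTED,
    WAITING_FOR_RETRY, BACKING_OFF, DRAINING_QUEUE
}

// Hypothetical events, named for illustration only.
enum class ConnectionEvent {
    CONNECT_REQUESTED, SOCKET_OPENED, QUEUE_DRAINED, CONNECTION_LOST,
    BACKOFF_ELAPSED, NETWORK_AVAILABLE, DISCONNECT_REQUESTED
}

// Pure function: same state and event always yield the same next state.
// Events that make no sense in the current state leave it unchanged.
fun transition(state: ConnectionState, event: ConnectionEvent): ConnectionState {
    if (event == ConnectionEvent.DISCONNECT_REQUESTED) return ConnectionState.DISCONNECTED
    return when (state) {
        ConnectionState.DISCONNECTED ->
            if (event == ConnectionEvent.CONNECT_REQUESTED) ConnectionState.CONNECTING else state
        ConnectionState.CONNECTING -> when (event) {
            // A fresh socket never goes straight to CONNECTED: drain the queue first.
            ConnectionEvent.SOCKET_OPENED -> ConnectionState.DRAINING_QUEUE
            ConnectionEvent.CONNECTION_LOST -> ConnectionState.BACKING_OFF
            else -> state
        }
        ConnectionState.DRAINING_QUEUE -> when (event) {
            ConnectionEvent.QUEUE_DRAINED -> ConnectionState.CONNECTED
            ConnectionEvent.CONNECTION_LOST -> ConnectionState.BACKING_OFF
            else -> state
        }
        ConnectionState.CONNECTED ->
            if (event == ConnectionEvent.CONNECTION_LOST) ConnectionState.BACKING_OFF else state
        ConnectionState.BACKING_OFF ->
            if (event == ConnectionEvent.BACKOFF_ELAPSED) ConnectionState.WAITING_FOR_RETRY else state
        ConnectionState.WAITING_FOR_RETRY ->
            if (event == ConnectionEvent.NETWORK_AVAILABLE) ConnectionState.CONNECTING else state
    }
}
```

Keeping the transitions in one pure function means every path can be unit tested without a socket, and a stray SOCKET_OPENED while already CONNECTED is simply ignored instead of restarting the drain.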

Heartbeat tuning that actually works

Through production testing across ~200K daily active mobile connections, we converged on these intervals:

| Parameter | Value | Rationale |
|---|---|---|
| App-level ping interval | 25s | Below typical LB idle timeout (60s) |
| Ping timeout (pong expected) | 10s | Aggressive enough to detect dead connections |
| TCP keep-alive interval | 30s | Overridden from 2h default via socket options |
| Initial reconnect delay | 500ms | Fast enough for transient drops |
| Max backoff ceiling | 30s | Prevents multi-minute gaps |
| Jitter range | 0-50% of delay | Prevents thundering herd |

The 25-second ping interval is deliberate. Many teams set this to 30 or 60 seconds, but intermediate proxies and carrier NAT tables can be surprisingly aggressive. We measured one major US carrier expiring NAT mappings at 28 seconds on their LTE network. That was a fun one to debug.
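To make the ping interval and pong timeout concrete, here is a small sketch of the dead-connection check as plain logic with an injected clock. `HeartbeatConfig`, `HeartbeatMonitor`, and their method names are hypothetical; only the 25s, 10s, and 60s values come from the numbers above. A real client would call these methods from its WebSocket callbacks and a timer.

```kotlin
data class HeartbeatConfig(
    val pingIntervalMs: Long = 25_000,  // app-level ping interval
    val pongTimeoutMs: Long = 10_000,   // how long to wait for the pong
    val lbIdleTimeoutMs: Long = 60_000  // assumed load balancer idle timeout
) {
    init {
        // A ping interval at or above the idle timeout lets the proxy drop us first.
        require(pingIntervalMs < lbIdleTimeoutMs) {
            "ping interval must be shorter than the load balancer idle timeout"
        }
    }
}

class HeartbeatMonitor(private val config: HeartbeatConfig = HeartbeatConfig()) {
    private var pendingPingSentAt: Long? = null // null when no ping is outstanding
    private var lastActivityAt: Long = 0

    fun onConnected(nowMs: Long) {
        lastActivityAt = nowMs
        pendingPingSentAt = null
    }

    // True when the interval has elapsed and no ping is already in flight.
    fun shouldSendPing(nowMs: Long): Boolean =
        pendingPingSentAt == null && nowMs - lastActivityAt >= config.pingIntervalMs

    fun onPingSent(nowMs: Long) {
        pendingPingSentAt = nowMs
    }

    fun onPong(nowMs: Long) {
        pendingPingSentAt = null
        lastActivityAt = nowMs
    }

    // The connection counts as dead once an outstanding ping has gone unanswered too long.
    fun isDead(nowMs: Long): Boolean {
        val sentAt = pendingPingSentAt ?: return false
        return nowMs - sentAt > config.pongTimeoutMs
    }
}
```

With these values the worst-case detection time is about 35 seconds (25s until the next ping plus 10s waiting for the pong), which is why timeout detection alone still leaves a window where messages can be lost.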

Exponential backoff with jitter

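import kotlin.math.pow
import kotlin.random.Random

const val INITIAL_DELAY_MS = 500L
const val MAX_BACKOFF_MS = 30_000L

// Exponential backoff capped at MAX_BACKOFF_MS, plus 0-50% random jitter on top.
fun nextDelay(attempt: Int): Long {
    // Cap in Double before converting so a large attempt count cannot overflow Long.
    val exponential = minOf(
        MAX_BACKOFF_MS.toDouble(),
        INITIAL_DELAY_MS * 2.0.pow(attempt.coerceAtLeast(0))
    ).toLong()
    val jitter = (exponential * Random.nextDouble(0.0, 0.5)).toLong()
    return exponential + jitter
}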

Without jitter, a server restart causes every client to reconnect at exactly the same intervals, creating predictable load spikes. In our load tests, removing jitter turned a 12-second recovery into a 45-second cascading failure. Don’t skip the jitter.
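It can help to see the range each attempt can land in. The sketch below computes the lower and upper bound of the delay per attempt, assuming the 500ms initial delay, 30s ceiling, and 0-50% additive jitter from the tuning table. `delayBounds` and the constant names are illustrative helpers, not part of the function above.

```kotlin
import kotlin.math.pow

// Assumed values from the tuning table.
const val BASE_DELAY_MS = 500L
const val CEILING_MS = 30_000L

// Bounds of the jittered delay for one attempt: the capped exponential value,
// and that value plus half of itself. The upper bound is never quite reached,
// because the random factor's upper limit (0.5) is exclusive.
fun delayBounds(attempt: Int): LongRange {
    val base = minOf(CEILING_MS.toDouble(), BASE_DELAY_MS * 2.0.pow(attempt)).toLong()
    return base..(base + base / 2)
}

// Bounds for attempts 0 until `attempts`, handy for printing or plotting.
fun backoffSchedule(attempts: Int): List<LongRange> = (0 until attempts).map(::delayBounds)
```

For example, attempt 0 falls between 500 and 750 ms and attempt 3 between 4,000 and 6,000 ms. One consequence worth knowing: because the jitter is added after the cap, a delay at the 30s ceiling can stretch to just under 45s. If 30s has to be a hard limit, the cap would need to be applied after the jitter instead.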

The trust angle: know what your dependencies do

This week’s jqwik incident, where a developer embedded a prompt injection in their library that instructed AI coding agents to delete application output, is a good reminder that hidden behaviors in dependencies cause real damage. The same principle applies to WebSocket libraries. Most OkHttp and Ktor WebSocket wrappers give you a clean API but leave keepalive configuration at OS defaults. Those defaults are designed for servers, not for a phone riding the subway. If you’re not explicitly configuring socket-level options, you’re trusting defaults that were never tuned for your environment.

Handling Android Doze and app backgrounding

When Android enters Doze mode, network access is batched into maintenance windows. Your ping timer fires, but the actual packet doesn’t leave the device. When the maintenance window opens, a stale ping goes out, the server has already timed you out, and you get a close frame. Or worse, nothing at all.

The fix: listen for ACTION_DEVICE_IDLE_MODE_CHANGED broadcasts and treat Doze entry as a controlled disconnect. Preemptively move to DISCONNECTED state, queue outbound messages, and reconnect immediately on Doze exit. This single change moved our measured delivery rate from 94.2% to 99.6%.
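A minimal sketch of that controlled disconnect, kept free of Android classes so it can be tested on its own. `Transport` and `DozeAwareConnection` are hypothetical names. On Android, `onIdleModeChanged` would be called from a BroadcastReceiver registered for `PowerManager.ACTION_DEVICE_IDLE_MODE_CHANGED`, passing the value of `PowerManager.isDeviceIdleMode`. The sketch also assumes `connect()` completes synchronously, which a real client would replace with a callback or coroutine.

```kotlin
// Hypothetical abstraction over the WebSocket client.
interface Transport {
    fun connect()
    fun disconnect()
    fun send(message: String)
}

class DozeAwareConnection(private val transport: Transport) {
    private val pending = ArrayDeque<String>() // outbound messages held while offline

    var connected = false
        private set

    val pendingCount: Int get() = pending.size

    // Call with true on Doze entry and false on Doze exit.
    fun onIdleModeChanged(isIdle: Boolean) {
        if (isIdle) {
            // Close deliberately instead of letting the server time the session out.
            if (connected) transport.disconnect()
            connected = false
        } else {
            transport.connect()
            connected = true
            // Flush held messages in the order they were queued.
            while (pending.isNotEmpty()) transport.send(pending.removeFirst())
        }
    }

    fun send(message: String) {
        if (connected) transport.send(message) else pending.addLast(message)
    }
}
```

The flush here is fire-and-forget to keep the example short; it does not wait for the server to confirm each message.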

The remaining 0.37 percentage points, from 99.6% to 99.97%, came from proper DRAINING_QUEUE handling and server-side message deduplication using idempotency keys.
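The two pieces work as a pair: the client resends anything it has no confirmation for, and the server uses the idempotency key to apply each message only once. Here is a rough sketch of both sides. `Outbound`, `OutboundQueue`, and `Deduplicator` are hypothetical names, and `sendAndAwaitAck` stands in for whatever acknowledgement mechanism your protocol uses.

```kotlin
data class Outbound(val idempotencyKey: String, val payload: String)

// Client side: messages buffered while disconnected, drained in order after reconnect.
class OutboundQueue {
    private val queue = ArrayDeque<Outbound>()

    val size: Int get() = queue.size

    fun enqueue(message: Outbound) {
        queue.addLast(message)
    }

    // Sends queued messages one at a time. A message is removed only after the
    // callback reports a confirmed delivery; on the first failure the drain stops
    // so ordering is preserved. Returns true once the queue is empty.
    fun drain(sendAndAwaitAck: (Outbound) -> Boolean): Boolean {
        while (queue.isNotEmpty()) {
            if (!sendAndAwaitAck(queue.first())) return false
            queue.removeFirst()
        }
        return true
    }
}

// Server side: remembers which keys were already applied, so a retried
// message is acknowledged again but not processed twice.
class Deduplicator {
    private val seen = HashSet<String>()

    // Returns true the first time a key is seen, false for a duplicate.
    fun accept(message: Outbound): Boolean = seen.add(message.idempotencyKey)
}
```

The case this covers is a lost acknowledgement: the server applied the message, the client never heard back, and the next drain sends it again. Without the key check that retry becomes a duplicate. The in-memory set grows without bound, so a production version would need some expiry policy for old keys.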

What to do about it

  1. Override TCP keep-alive at the socket level. The 2-hour default is useless on mobile. Set it to 30 seconds and pair it with a 25-second application-level ping.
  2. Build a state machine, not a retry loop. Include DRAINING_QUEUE as a first-class state and confirm delivery of buffered messages before resuming normal flow.
  3. Treat OS power states as network events. Proactively disconnect on Doze entry and reconnect on exit instead of waiting for timeout detection, which can take 30+ seconds and silently drop messages in the meantime.

Tags: kotlin, android, mobile, architecture, backend

