MVP Factory
ai startup development

Kotlin Coroutine Structured Concurrency Pitfalls in Production: SupervisorScope, Exception Propagation, and the Cancellation Architecture That Prevents Silent Data Loss

Krystian Wiewiór · 5 min read

TL;DR

Structured concurrency in Kotlin coroutines is more than “launch and forget with extra steps.” In production, the choice between coroutineScope and supervisorScope determines whether a single failing child cancels your entire operation or fails in isolation. Catching CancellationException without rethrowing it, even accidentally, breaks cancellation propagation and can cause silent data loss. This post covers the exact failure modes, how Job hierarchies interact with Retrofit, Room, and Ktor, and the cancellation-safe patterns that have saved us from partial writes across backend and Android systems.


The exception propagation model most teams get wrong

Most teams treat coroutineScope and supervisorScope as interchangeable wrappers. They aren’t. They’re fundamentally different cancellation architectures.

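| Behavior | coroutineScope | supervisorScope |
|---|---|---|
| Child failure propagation | Cancels all siblings and the scope itself | Affects only the failed child |
| Use case | All-or-nothing operations | Independent parallel tasks |
| Exception surfacing | Rethrown to the caller once the siblings are cancelled | Must be handled per child |
| Partial completion risk | Low: siblings are cancelled, but writes they already committed are not rolled back | Yes, by design |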

Roughly 60-70% of coroutine-related bugs I catch in code reviews trace back to using coroutineScope where supervisorScope was needed, or the reverse. One backend service processing ~50K events/hour saw cascade failures drop by 94% after switching a fan-out pipeline from coroutineScope to supervisorScope. A single malformed event had been killing its entire batch.

// WRONG: One bad enrichment kills all siblings
coroutineScope {
    events.map { event ->
        async { enrichAndStore(event) }
    }.awaitAll()
}

// RIGHT: Isolate independent event processing
supervisorScope {
    events.map { event ->
        async {
            try {
                enrichAndStore(event)
            } catch (e: CancellationException) {
                throw e // runCatching would swallow this, so use try/catch
            } catch (e: Exception) {
                logger.error("Failed: ${event.id}", e)
            }
        }
    }.awaitAll()
}
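When the caller needs to know which events failed, and not only see them in the logs, each child can return its own outcome. The following is a minimal sketch under stated assumptions: `Event`, `Outcome`, `enrich`, and `processAll` are hypothetical names introduced for illustration, and `enrich` merely stands in for `enrichAndStore`.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.supervisorScope

// Hypothetical event type, for illustration only.
data class Event(val id: String, val payload: String)

// Per-child outcome so the caller can see exactly what failed.
sealed interface Outcome {
    data class Ok(val id: String) : Outcome
    data class Failed(val id: String, val cause: Exception) : Outcome
}

// Stand-in for enrichAndStore: rejects malformed (blank) payloads.
suspend fun enrich(event: Event) {
    require(event.payload.isNotBlank()) { "malformed event ${event.id}" }
}

suspend fun processAll(events: List<Event>): List<Outcome> = supervisorScope {
    events.map { event ->
        async {
            try {
                enrich(event)
                Outcome.Ok(event.id)
            } catch (e: CancellationException) {
                throw e // cancellation is never reported as a Failed outcome
            } catch (e: Exception) {
                Outcome.Failed(event.id, e)
            }
        }
    }.awaitAll()
}

fun main(): Unit = runBlocking {
    val outcomes = processAll(
        listOf(Event("a", "ok"), Event("b", ""), Event("c", "ok"))
    )
    val failedIds = outcomes.filterIsInstance<Outcome.Failed>().map { it.id }
    println("failed: $failedIds")
}
```

Returning outcomes keeps the decision about retries or dead-lettering with the caller instead of burying it inside each child.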

The CancellationException trap

This one’s a silent killer. A generic catch (e: Exception) block also catches CancellationException, and if you don’t rethrow it, the coroutine carries on as if nothing happened. Cancellation no longer unwinds the coroutine: code after the catch keeps running inside an already-cancelled Job, the parent stays blocked waiting for a child that should have stopped, and you get partial writes with no error logs.

// DANGEROUS: Silently breaks cancellation propagation
try {
    repository.saveAll(records)
} catch (e: Exception) {
    // CancellationException is caught here too, so cancellation stops propagating
    logger.error("Save failed", e)
}

// CORRECT: Always rethrow CancellationException
try {
    repository.saveAll(records)
} catch (e: CancellationException) {
    throw e // preserve the cancellation contract
} catch (e: Exception) {
    logger.error("Save failed", e)
}
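If this pattern shows up in many places, it can be wrapped in a small helper. The sketch below assumes a hypothetical function named `runCatchingCancellable`; it is not part of the Kotlin standard library or kotlinx.coroutines. Unlike `runCatching`, which catches every Throwable, it rethrows CancellationException and leaves Errors alone.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.runBlocking

// Hypothetical helper: like runCatching, but cancellation always propagates.
// Being inline, the block may call suspend functions when used in a coroutine.
inline fun <T> runCatchingCancellable(block: () -> T): Result<T> =
    try {
        Result.success(block())
    } catch (e: CancellationException) {
        throw e // preserve the cancellation contract
    } catch (e: Exception) {
        Result.failure(e)
    }

fun main(): Unit = runBlocking {
    // An ordinary failure is captured as a Result.
    val failed = runCatchingCancellable<Int> { error("disk full") }
    println(failed.exceptionOrNull()?.message)

    // A cancellation is not captured; it escapes the helper.
    try {
        runCatchingCancellable<Unit> { throw CancellationException("scope cancelled") }
    } catch (e: CancellationException) {
        println("cancellation propagated")
    }
}
```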

I’ve measured this directly: in an Android app with Room database writes, swallowed CancellationException during ViewModel.onCleared() caused ~3% of writes to commit partially without any error signal. Users saw stale or corrupted state with zero crash reports. Silent data loss with no observability. The worst kind of bug.

How Retrofit, Room, and Ktor interact with job cancellation

Library integration is where structured concurrency gets tricky. Each framework cooperates with cancellation differently:

| Library | Cancellation behavior | Risk |
|---|---|---|
| Retrofit (suspend) | Cancels underlying OkHttp call | In-flight request abandoned; server may still process it |
| Room (suspend) | Transaction rolled back on cancellation | Safe, but @Transaction scope matters |
| Ktor Client | Cancels and closes connection | Response body partially read if mid-stream |
| Ktor Server | Request scope cancelled on client disconnect | Handler must check isActive for long operations |
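The “check isActive” requirement exists because cancellation is cooperative: a loop that never suspends never notices that its Job was cancelled. Here is a minimal sketch using only kotlinx.coroutines, with no Ktor API involved; `checksumAll` is a hypothetical CPU-bound function standing in for a long-running handler body.

```kotlin
import kotlinx.coroutines.*

// Hypothetical long-running, non-suspending work split into chunks.
suspend fun checksumAll(chunks: List<ByteArray>): Long {
    var sum = 0L
    for (chunk in chunks) {
        // Throws CancellationException if the surrounding Job was cancelled.
        currentCoroutineContext().ensureActive()
        for (b in chunk) sum += b
    }
    return sum
}

fun main(): Unit = runBlocking {
    val chunk = ByteArray(4096) { 1 }
    val job = launch(Dispatchers.Default) {
        checksumAll(List(200_000) { chunk })
    }
    delay(10)
    job.cancel() // the loop stops at its next ensureActive() check
    job.join()
    println("cancelled=${job.isCancelled}")
}
```

Checking once per chunk rather than once per byte keeps the overhead small while still bounding how long cancelled work can keep running.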

For any I/O that must complete (flushing analytics, writing state, sending acknowledgements), use withContext(NonCancellable):

suspend fun processAndAcknowledge(message: Message) {
    val result = process(message) // cancellable
    
    // This MUST complete even if the parent scope is cancelled
    withContext(NonCancellable) {
        database.markProcessed(message.id)
        messageQueue.acknowledge(message.deliveryTag)
    }
}

Use this sparingly. Every NonCancellable block is a contract that says “this keeps running after cancellation, and the parent will wait for it,” so keep it tight: idempotent cleanup and acknowledgements only.
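The same tool matters for cleanup in a finally block. Once a coroutine has been cancelled, any suspending call it makes throws CancellationException straight away, so suspending cleanup never runs to completion unless it is wrapped. A minimal sketch, assuming a hypothetical `Lease` resource and `withLease` helper:

```kotlin
import kotlinx.coroutines.*

// Hypothetical resource whose release is itself a suspending call.
class Lease(val id: String) {
    var released = false
        private set

    suspend fun release() {
        delay(10) // simulated I/O; would throw at once in a cancelled coroutine
        released = true
    }
}

suspend fun withLease(lease: Lease, work: suspend () -> Unit) {
    try {
        work() // cancellable
    } finally {
        // Runs to completion even if work() was cancelled.
        withContext(NonCancellable) {
            lease.release()
        }
    }
}

fun main(): Unit = runBlocking {
    val lease = Lease("demo")
    val job = launch { withLease(lease) { delay(60_000) } }
    delay(50)           // let the work start
    job.cancelAndJoin() // join waits for the finally block to finish
    println("released=${lease.released}")
}
```

Without the `withContext(NonCancellable)` wrapper, the `delay` inside `release()` would throw immediately and `released` would stay false.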

The pattern that prevents silent failures

Here’s the architecture I use for any write path where data loss matters:

  1. Pick the right scope. coroutineScope for all-or-nothing, supervisorScope for independent fan-out. There’s no safe default here; you have to think about what partial failure means for your specific case.
  2. Never swallow CancellationException. Rethrow it explicitly before any other catch block.
  3. Wrap mandatory cleanup in withContext(NonCancellable). Database acks, queue commits, metric flushes. Keep these blocks small.
  4. Make operations idempotent. Assume every write may execute twice under cancellation races, because sometimes it will.
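Step 4 is the one most often left abstract, so here is a minimal sketch of an idempotent write. `ProcessedStore` is a hypothetical in-memory stand-in for a table with a unique key on the message id; a real store would enforce the same rule with a unique constraint or an upsert.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical in-memory stand-in for a table keyed by message id.
class ProcessedStore {
    private val processed = ConcurrentHashMap.newKeySet<String>()

    // Returns true only for the first call with a given id; repeats are no-ops.
    fun markProcessed(messageId: String): Boolean = processed.add(messageId)

    fun count(): Int = processed.size
}

fun main() {
    val store = ProcessedStore()
    println(store.markProcessed("msg-1")) // true: first write is applied
    println(store.markProcessed("msg-1")) // false: the retry changes nothing
    println(store.count())                // 1
}
```

Because the second call has no effect, a write that is replayed after a cancellation race leaves the store in the same state as a single write.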

This applies whether you’re building a Ktor backend processing event streams or an Android app syncing user data. On the Android side specifically, viewModelScope survives configuration changes but is cancelled as soon as the ViewModel is cleared, and work launched in an Activity’s lifecycleScope is cancelled whenever a configuration change recreates the Activity, so long-running work gets cancelled more often than most developers realize. I use tools like HealthyDesk to force regular breaks during deep debugging sessions; ironically, stepping away is often when the cancellation race condition finally clicks in my head.

What to do right now

  1. Audit every catch (e: Exception) in coroutine code. Add an explicit catch (e: CancellationException) { throw e } before it. This single change fixes the most common class of silent coroutine failures.

  2. Start from coroutineScope and opt into supervisorScope deliberately. Neither is safe by default, but an operation that fails as a whole is easier to detect and retry than one that silently completes in part. Only reach for supervisorScope when you’ve designed each child to handle its own failure and the parent can tolerate partial results.

  3. Wrap mandatory completion logic in withContext(NonCancellable) and keep it idempotent. If a database write or queue acknowledgement has to happen, protect it, but make sure it’s safe to execute twice. Under cancellation races, it might.

