Ktor Connection Pooling with Coroutine-Per-Request: HikariCP Tuning, Connection Leak Detection, and the Dispatcher Architecture That Handles 50K RPM on a Single $20 VPS
TL;DR
Ktor’s coroutine-per-request model and JDBC’s thread-blocking nature create a mismatch that causes thread starvation under load. The fix: a dedicated limitedParallelism dispatcher sized to your HikariCP pool, leak detection tuned for coroutine suspension times, and pool sizing derived from PostgreSQL’s connection formula (adapted for SSDs). I applied this to a production Ktor service on a single $20 VPS and went from request timeouts at 12K RPM to stable throughput at 50K RPM.
The problem: coroutines lie about concurrency
Most teams get this wrong about Ktor and JDBC: they assume that because coroutines are lightweight, you can fire thousands of database calls concurrently. You can’t. JDBC is a blocking API. Every connection.prepareStatement().executeQuery() call blocks the underlying thread until the database responds.
When you use Dispatchers.IO (backed by 64 threads by default) and your HikariCP pool has 10 connections, you get a mismatch that will bite you:
| Resource | Default count | What happens under load |
|---|---|---|
| Coroutines in flight | Thousands | All try to acquire a connection |
| Dispatchers.IO threads | 64 | 64 threads block waiting for connections |
| HikariCP connections | 10 | Only 10 threads actually make progress |
| Remaining 54 threads | Blocked | Starved from handling non-DB IO work |
The result: your file reads, HTTP client calls, and other IO operations stall because 54 threads are parked waiting for a database connection they won’t get for hundreds of milliseconds. I’ve seen this kill more production Ktor services than any other misconfiguration.
The fix: a limited-parallelism dispatcher
Create a dedicated dispatcher that caps database concurrency to match your connection pool size:
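    import kotlinx.coroutines.Dispatchers
    import kotlinx.coroutines.withContext

    object DatabaseDispatcher {
        // Keep this equal to HikariCP's maximumPoolSize (12 in this setup).
        val dispatcher = Dispatchers.IO.limitedParallelism(12)
    }

    // Runs a blocking JDBC block on the capped dispatcher.
    suspend fun <T> dbQuery(block: () -> T): T =
        withContext(DatabaseDispatcher.dispatcher) {
            block()
        }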
At most 12 threads ever block on JDBC calls. The rest of Dispatchers.IO stays free for actual IO work.
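To see the cap in action, here is a minimal sketch that fires 200 fake queries at a limited dispatcher and records how many ever run at once. It assumes kotlinx.coroutines is on the classpath; `dbDispatcher`, `measurePeakConcurrency`, and the `Thread.sleep` stand-in for a JDBC call are illustrative names, not part of the production service.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for DatabaseDispatcher.dispatcher; 12 mirrors maximumPoolSize.
// Note: limitedParallelism is marked experimental in older kotlinx.coroutines releases.
val dbDispatcher = Dispatchers.IO.limitedParallelism(12)

suspend fun <T> dbQuery(block: () -> T): T =
    withContext(dbDispatcher) { block() }

// Launches `callers` concurrent fake queries and returns the highest number
// that were observed running at the same time.
fun measurePeakConcurrency(callers: Int): Int {
    val running = AtomicInteger(0)
    val peak = AtomicInteger(0)
    runBlocking {
        repeat(callers) {
            launch {
                dbQuery {
                    val now = running.incrementAndGet()
                    peak.updateAndGet { maxOf(it, now) }
                    Thread.sleep(20) // simulated blocking JDBC call (assumption)
                    running.decrementAndGet()
                }
            }
        }
    } // runBlocking returns only after every launched child has finished
    return peak.get()
}

fun main() {
    // The peak should never exceed the dispatcher's parallelism of 12.
    println("peak concurrent queries: ${measurePeakConcurrency(200)}")
}
```

The other 188 callers are suspended, not parked on threads, which is why the rest of Dispatchers.IO stays available while they wait.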
Pool sizing: the PostgreSQL formula, adapted
PostgreSQL’s canonical formula is:
connections = (cores * 2) + effective_spindle_count
This predates SSDs. The effective_spindle_count term models rotational disk latency, the time a thread spends waiting on physical disk seeks. With SSDs, IO wait drops by an order of magnitude, so the term effectively becomes 0 or 1. Use the formula as a starting point, then benchmark.
On a $20 VPS (typically 2 vCPUs, SSD):
connections = (2 * 2) + 1 = 5
That feels low, but the PostgreSQL wiki and HikariCP’s own pool-sizing guidance make the same argument: a small pool with queued requests outperforms a large pool with contention. I run 10-12 connections because my workload includes some longer analytical queries, but the principle holds: more connections do not mean more throughput.
    import com.zaxxer.hikari.HikariConfig

    val hikariConfig = HikariConfig().apply {
        maximumPoolSize = 12             // matches the dispatcher's parallelism
        minimumIdle = 4
        idleTimeout = 600_000            // 10 minutes
        connectionTimeout = 3_000        // fail fast at 3s
        maxLifetime = 1_800_000          // 30 minutes
        leakDetectionThreshold = 8_000   // tuned for coroutines
    }
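The sizing arithmetic from this section can be written down as a small helper, which makes it easy to log the baseline next to whatever value you end up benchmarking your way to. This is only a sketch: `baselinePoolSize` is a hypothetical name, and the default spindle term of 1 is the SSD assumption from above, not a measured value.

```kotlin
// Baseline from the formula: (cores * 2) + effective_spindle_count.
// spindleCount defaults to 1, the SSD assumption used in this article.
fun baselinePoolSize(cores: Int, spindleCount: Int = 1): Int {
    require(cores > 0) { "cores must be positive" }
    require(spindleCount >= 0) { "spindleCount must not be negative" }
    return cores * 2 + spindleCount
}

fun main() {
    // 2 vCPUs with the SSD assumption, as in the worked example
    println(baselinePoolSize(cores = 2)) // prints 5

    // Baseline for whatever machine this runs on
    println(baselinePoolSize(Runtime.getRuntime().availableProcessors()))
}
```

Treat the result as a floor to start benchmarking from, not as the final maximumPoolSize.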
Leak detection for suspended coroutines
HikariCP’s default leak detection threshold is 0 (disabled). Most guides recommend 2,000ms. In a coroutine context, that fires false positives constantly. A coroutine holding a connection can be suspended (yielding its thread) while waiting on downstream logic, legitimately holding the connection for 3-5 seconds.
I set leakDetectionThreshold = 8000 after profiling actual connection hold times:
| Percentile | Connection hold time | Notes |
|---|---|---|
| p50 | 12ms | Simple CRUD |
| p90 | 85ms | Joins + serialization |
| p99 | 2,400ms | Complex aggregations |
| p99.9 | 5,100ms | Coroutine suspended mid-transaction |
An 8-second threshold catches genuine leaks (a forgotten connection.close() in an error path) without alerting on legitimate coroutine suspension.
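One way to turn profiled hold times into a threshold is sketched below. The function names, the nearest-rank percentile method, and the 1.5x headroom factor are all assumptions for illustration; the only library fact relied on is that HikariCP's documentation gives 2000 ms as the lowest acceptable value for enabling leak detection.

```kotlin
import kotlin.math.ceil

// HikariCP documents 2000 ms as the lowest acceptable leak detection threshold.
const val HIKARI_MIN_LEAK_THRESHOLD_MS = 2_000L

// Nearest-rank percentile over recorded connection hold times (milliseconds).
fun percentile(samplesMs: List<Long>, p: Double): Long {
    require(samplesMs.isNotEmpty()) { "need at least one sample" }
    require(p > 0.0 && p <= 100.0) { "p must be in (0, 100]" }
    val sorted = samplesMs.sorted()
    val rank = ceil(p / 100.0 * sorted.size).toInt().coerceIn(1, sorted.size)
    return sorted[rank - 1]
}

// Adds headroom on top of the worst legitimate hold time, rounds up to a
// whole second, and never goes below HikariCP's minimum.
// The 1.5x default headroom is a hypothetical choice, not a HikariCP rule.
fun suggestLeakThresholdMs(worstLegitimateHoldMs: Long, headroom: Double = 1.5): Long {
    require(worstLegitimateHoldMs >= 0) { "hold time must not be negative" }
    require(headroom >= 1.0) { "headroom must be at least 1.0" }
    val raw = worstLegitimateHoldMs * headroom
    val roundedUp = ceil(raw / 1000.0).toLong() * 1000L
    return roundedUp.coerceAtLeast(HIKARI_MIN_LEAK_THRESHOLD_MS)
}

fun main() {
    // Using the p99.9 hold time from the table above (5,100 ms)
    println(suggestLeakThresholdMs(5_100)) // prints 8000
}
```

In practice you would feed `percentile` with hold times recorded around each connection checkout and pass its p99.9 result into `suggestLeakThresholdMs`.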
Production results
Real metrics from a Ktor service running on a Hetzner CX22 (2 vCPU, 4GB RAM, $20/mo).
I load tested with k6 over 15-minute sustained runs at incrementally increasing concurrency (50, 100, 200, 400 virtual users). The workload was roughly 80% reads (single-row lookups, paginated lists) and 20% writes (inserts and updates). Error rate at 50K RPM was 0.02%, all transient connection resets, no application errors. Each configuration was tested three times; the median run is reported.
| Metric | Before (naive) | After (tuned) |
|---|---|---|
| Max stable RPM | ~12,000 | ~50,000 |
| p99 latency | 4,200ms | 180ms |
| HikariCP wait time (p99) | 2,800ms | 35ms |
| IO thread starvation events/hr | ~340 | 0 |
| Connection leaks detected/day | 12 (false positives) | 0 |
All I did was constrain concurrency to match the actual bottleneck. The limited-parallelism dispatcher acts as a backpressure mechanism: when all 12 slots are occupied, the 13th coroutine suspends without blocking a thread and resumes when a slot opens. No extra hardware. Just less fighting with the concurrency model.
Beyond databases
This same pattern, matching dispatcher parallelism to pool size, applies to any shared resource with fixed capacity: HTTP client pools, rate-limited APIs, file handle limits. Your $20 VPS has more headroom than you’d expect once you stop over-subscribing bounded resources.
Takeaways
- Create a dedicated database dispatcher using Dispatchers.IO.limitedParallelism(n), where n matches your HikariCP maximumPoolSize. Don’t let unbounded coroutines compete for a bounded connection pool on the general IO dispatcher.
- Size your pool using the PostgreSQL formula as a baseline, not intuition. On a 2-core VPS, 10-12 connections is the ceiling. A larger pool increases lock contention inside PostgreSQL and actually reduces throughput. The original formula’s spindle term is an artifact of rotational storage; benchmark on SSDs to find your true ceiling.
- Set leak detection to roughly 3-4x your p99 connection hold time, and above your p99.9 (here, 8 seconds against a 2.4-second p99 and a 5.1-second p99.9). Profile first, then configure. The default disabled setting hides real leaks, and the commonly recommended 2-second threshold floods logs with false positives in coroutine-heavy services.