TL;DR

JVM cold starts have been the tax Kotlin developers pay for choosing serverless. Three approaches now compete to eliminate that tax: AWS SnapStart (Firecracker VM snapshots), CRaC (Coordinated Restore at Checkpoint), and GraalVM native image. Each carries Kotlin-specific gotchas: stale lazy delegates, zombie coroutine dispatchers, and serialization landmines. Combined with an AppCDS archive pipeline, you can consistently land under 200ms cold starts without abandoning the JVM. What follows is what works, what breaks, and what the numbers actually say.

The cold start problem in numbers

In my experience building production serverless systems, a vanilla Kotlin Lambda on a standard JVM runtime routinely hits 3-6 seconds on cold start. That time breaks down roughly like this:

Phase	Typical duration
Container init + JVM bootstrap	~800-1500ms
Class loading	~1000-2500ms
Dependency injection / framework init	~500-2000ms
Handler first invocation	~100-300ms

For an API Gateway-backed function with a 29-second timeout, burning 5 seconds before your code even runs is a non-starter. Class loading and framework initialization dominate. Every approach below attacks those two phases differently.
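To make the breakdown concrete, here is a small sketch that sums the ranges from the table and works out how much of the worst case goes to class loading plus framework init. The `Phase` type and `worstCaseShare` helper are illustrative names only, and the figures are the rough ranges quoted above, not measurements.

```kotlin
// Phase durations in milliseconds, copied from the table above as (low, high) ranges.
data class Phase(val name: String, val lowMs: Int, val highMs: Int)

val phases = listOf(
    Phase("Container init + JVM bootstrap", 800, 1500),
    Phase("Class loading", 1000, 2500),
    Phase("Dependency injection / framework init", 500, 2000),
    Phase("Handler first invocation", 100, 300),
)

// Share of the worst-case total taken by the named phases, as a whole percentage.
fun worstCaseShare(names: Set<String>): Int {
    val total = phases.sumOf { it.highMs }
    val selected = phases.filter { it.name in names }.sumOf { it.highMs }
    return selected * 100 / total
}

fun main() {
    val low = phases.sumOf { it.lowMs }    // 800 + 1000 + 500 + 100
    val high = phases.sumOf { it.highMs }  // 1500 + 2500 + 2000 + 300
    println("Total cold start: $low-$high ms")
    val share = worstCaseShare(setOf("Class loading", "Dependency injection / framework init"))
    println("Class loading + framework init: $share% of the worst case")
}
```

The ranges add up to roughly 2.4 to 6.3 seconds, which lines up with the 3-6 second figure, and the two dominant phases account for about 71% of the worst case.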

Three approaches compared

AWS SnapStart: snapshot the warm JVM

SnapStart takes a Firecracker microVM snapshot after your Lambda’s init phase completes — classes loaded, singletons initialized, connection pools warmed. On cold start, it restores from that snapshot instead of replaying initialization.

In practice, cold starts drop to the 200-400ms range for typical Kotlin workloads. The tradeoff is that your snapshot is a point-in-time freeze. Anything stateful at init time (randomness seeds, ephemeral credentials, open sockets) resurrects as stale state.

CRaC: checkpoint/restore with application cooperation

CRaC, an OpenJDK project, takes a similar snapshot approach but gives your application explicit lifecycle hooks (beforeCheckpoint / afterRestore). You register Resource implementations that clean up and reinitialize state around the checkpoint boundary.

You get more control than SnapStart, but you own the orchestration. CRaC also requires a compatible JDK build (Azul Zulu with CRaC support, or the upstream OpenJDK CRaC branch), which limits deployment flexibility. Expect 150-350ms cold starts.
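A minimal sketch of what that cooperation looks like, under some loud assumptions: `CheckpointAware` is a local stand-in for `org.crac.Resource` so the example runs without the CRaC library, and `ConnectionHolder` with its string "connection" is purely hypothetical. With the real library you would implement `Resource` (whose callbacks take a `Context` argument) and register the instance via `Core.getGlobalContext().register(...)`.

```kotlin
// Local stand-in for org.crac.Resource so the sketch runs without the CRaC library.
// The real interface passes a Context argument to both callbacks.
interface CheckpointAware {
    fun beforeCheckpoint()
    fun afterRestore()
}

// Hypothetical connection holder: nothing here is a real driver API.
class ConnectionHolder(private val connect: () -> String) : CheckpointAware {
    var connection: String? = connect()
        private set

    override fun beforeCheckpoint() {
        // Drop the live connection so no open socket ends up in the snapshot.
        connection = null
    }

    override fun afterRestore() {
        // Re-establish the connection in the restored process.
        connection = connect()
    }
}

fun main() {
    var generation = 0
    val holder = ConnectionHolder { "conn-${++generation}" }
    println(holder.connection)   // connection opened during init
    holder.beforeCheckpoint()    // what the runtime would call before the snapshot
    holder.afterRestore()        // what the runtime would call after restore
    println(holder.connection)   // a fresh connection, not the one from init
}
```

The point of the pattern is that the object owning the state also owns its teardown and rebuild, so the checkpoint never captures a live socket.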

GraalVM native image: compile it all away

Native image eliminates the JVM entirely by ahead-of-time compiling your Kotlin bytecode to a platform-specific binary. Class loading overhead drops to near zero.

Sub-100ms cold starts are achievable. But the cost is steep: reflection-heavy frameworks need extensive configuration, binary size grows, and build times can exceed 5 minutes for non-trivial applications. I’d only reach for this when the other two options genuinely aren’t fast enough.

How they stack up

Factor	SnapStart	CRaC	GraalVM native
Cold start (typical)	200-400ms	150-350ms	50-150ms
Build complexity	Low	Medium	High
Kotlin coroutines support	Partial (gotchas)	Partial (needs hooks)	Limited (reflection config)
Framework compatibility	Broad	Moderate	Narrow
Memory footprint	Standard JVM	Standard JVM	50-70% reduction
AWS Lambda support	Native	Custom runtime	Custom runtime

Kotlin-specific gotchas that will bite you

Most teams underestimate how Kotlin’s idioms interact with checkpoint/restore. These are the ones I’ve seen cause real production incidents.

1. `lazy` delegates restore stale state

val config: Config by lazy { loadFromSSM() } // Loaded at init, frozen in snapshot

After a SnapStart or CRaC restore, that lazy value is already initialized, with credentials or config that may have rotated since the snapshot was taken. The fix: use a custom resettable wrapper (call it ResettableLazy; the Kotlin standard library has no such type) or, for CRaC, implement Resource to invalidate lazy holders in afterRestore.
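One way such a wrapper could look, as a sketch rather than a drop-in: `ResettableLazy` here is a hand-rolled property delegate, not a standard library class, and the string initializer stands in for `loadFromSSM()`. The `reset()` call is what you would invoke from a restore hook.

```kotlin
import kotlin.reflect.KProperty

// Hypothetical helper: a lazy-like delegate whose cached value can be discarded.
class ResettableLazy<T>(private val initializer: () -> T) {
    private object Unset
    private var value: Any? = Unset

    @Synchronized
    @Suppress("UNCHECKED_CAST")
    operator fun getValue(thisRef: Any?, property: KProperty<*>): T {
        // Run the initializer only when nothing is cached.
        if (value === Unset) value = initializer()
        return value as T
    }

    // Call from a restore hook so the next read reloads instead of reusing snapshot state.
    @Synchronized
    fun reset() {
        value = Unset
    }
}

fun main() {
    var loads = 0
    val holder = ResettableLazy { "config-v${++loads}" } // stand-in for loadFromSSM()
    val config: String by holder

    println(config)  // first read runs the initializer
    println(config)  // cached, initializer not called again
    holder.reset()   // simulate afterRestore
    println(config)  // initializer runs again
}
```

Keeping a reference to the delegate itself (`holder` above) is the design choice that matters: `by lazy { ... }` gives you no handle to invalidate, whereas this version does.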

2. Coroutine dispatcher pools die on restore

Dispatchers.Default and Dispatchers.IO maintain thread pools that don’t survive a checkpoint cleanly. After restore, threads in the pool may be in an undefined state. In practice, this manifests as coroutines that silently hang.

// Before checkpoint: warm pool of 64 IO threads
// After restore: pool references dead threads
val result = withContext(Dispatchers.IO) { 
    // May hang indefinitely
    fetchData()
}

The workaround: reinitialize dispatchers post-restore, or use a custom CoroutineDispatcher backed by a fresh executor created in afterRestore. Neither option is pretty, but a silently hanging Lambda is worse.
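A sketch of the second option, kept to the JDK so it runs without extra dependencies. `RestorableExecutor` is a hypothetical helper: an `Executor` that forwards to whichever thread pool is current, so the pool can be replaced after restore. With kotlinx.coroutines on the classpath, wrapping it once with `asCoroutineDispatcher()` should give you a dispatcher that survives the swap; that part is shown only as a comment.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executor
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicReference

// Hypothetical helper: an Executor whose backing thread pool can be replaced after restore.
class RestorableExecutor(private val threads: Int) : Executor {
    private val pool = AtomicReference<ExecutorService>(Executors.newFixedThreadPool(threads))

    // Every task goes to whichever pool is current at submission time.
    override fun execute(command: Runnable) {
        pool.get().execute(command)
    }

    // Call from a restore hook: swap in a fresh pool, then retire the old one.
    fun recreate() {
        val old = pool.getAndSet(Executors.newFixedThreadPool(threads))
        old.shutdown()
    }

    fun shutdown() {
        pool.get().shutdown()
    }
}

fun main() {
    val executor = RestorableExecutor(threads = 4)
    // With kotlinx.coroutines available, wrap it once and reuse the dispatcher:
    //   val restorableIo = executor.asCoroutineDispatcher()
    //   withContext(restorableIo) { fetchData() }

    executor.recreate() // simulate afterRestore
    val done = CountDownLatch(1)
    executor.execute(Runnable { done.countDown() })
    println(done.await(1, TimeUnit.SECONDS)) // whether the task ran on the fresh pool
    executor.shutdown()
}
```

Because callers hold the wrapper rather than the pool, nothing else in the codebase needs to know a restore happened.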

3. Kotlin serialization and reflection caches

kotlinx.serialization builds internal caches of serializer lookups, and any lookup that goes through reflection has to be registered with GraalVM native image at build time. Miss one, and you get a runtime failure (a ClassNotFoundException or a missing-serializer error) that only appears in production under specific payload shapes. It is the kind of bug that passes every test and explodes on the first unusual request.

The AppCDS pipeline that ties it all together

Application Class Data Sharing generates a shared archive of pre-parsed class metadata. Combined with SnapStart, it eliminates the class loading phase almost entirely.

# Step 1: Run with tracing to capture loaded classes
java -XX:DumpLoadedClassList=classes.lst -jar app.jar

# Step 2: Generate the CDS archive
java -Xshare:dump -XX:SharedClassListFile=classes.lst \
     -XX:SharedArchiveFile=app-cds.jsa -jar app.jar

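# Step 3: Run with the archive
# -Xshare:on makes the JVM exit if the archive cannot be used;
# -Xshare:auto (the default) falls back to normal class loading instead
java -Xshare:on -XX:SharedArchiveFile=app-cds.jsa -jar app.jar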

Integrating this into CI is where teams stall. Your integration tests already take ten, twenty, thirty minutes per push. Adding an AppCDS generation step stretches that further. The pragmatic move: generate the archive in a dedicated CI stage that only runs when dependencies change, not on every commit. Cache the .jsa file as a build artifact.
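One way to decide "dependencies changed" is to key the cached archive on a hash of the files that define them. The sketch below assumes a hypothetical `cdsCacheKey` helper and uses `build.gradle.kts` as an example input; which files actually belong in the key depends on your build.

```kotlin
import java.security.MessageDigest

// Derive a cache key from the contents of dependency-defining files.
// The archive only needs regenerating when this key changes.
fun cdsCacheKey(dependencyFiles: Map<String, ByteArray>): String {
    val digest = MessageDigest.getInstance("SHA-256")
    // Sort by path so the key does not depend on map iteration order.
    for ((path, bytes) in dependencyFiles.toSortedMap()) {
        digest.update(path.toByteArray())
        digest.update(bytes)
    }
    val hex = digest.digest().joinToString("") { "%02x".format(it) }
    return "app-cds-" + hex.take(16)
}

fun main() {
    // Hypothetical inputs; a real pipeline would read these files from the checkout.
    val before = mapOf("build.gradle.kts" to "kotlin 2.0".toByteArray())
    val sourceOnlyChange = before // application code changed, build file did not
    val bumped = mapOf("build.gradle.kts" to "kotlin 2.1".toByteArray())
    println(cdsCacheKey(before) == cdsCacheKey(sourceOnlyChange)) // same key, reuse the cached .jsa
    println(cdsCacheKey(before) == cdsCacheKey(bumped))           // different key, regenerate
}
```

The archive is also tied to the JDK build that produced it, and HotSpot validates it against the classpath it was dumped from, so the JDK version probably belongs in the key too. It is worth confirming in CI that the cached archive still loads after a code-only change.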

What to actually do

Start with SnapStart + AppCDS if you’re on AWS Lambda. It’s the lowest-effort path to sub-300ms cold starts and requires zero custom runtime work. But audit every lazy delegate and singleton for stale state — this is the part people skip and regret.

If your Kotlin service manages connection pools, coroutine dispatchers, or cached credentials, move to CRaC. Its explicit lifecycle hooks prevent the class of bugs that SnapStart introduces silently. You’ll write more boilerplate, but you’ll also sleep better.

Reserve GraalVM native image for cold-start-critical, framework-light functions. The build complexity and compatibility constraints are only justified when you absolutely need sub-100ms starts and can commit to maintaining reflection configuration as your codebase evolves. Most teams don’t need this, and that’s fine.

The JVM cold start problem isn’t unsolvable — it’s an engineering tradeoff. Pick the approach that matches your team’s operational maturity, not the one with the most impressive benchmark slide.

TAGS: kotlin, serverless, backend, cloud, architecture
