On-Device LLM Inference via KMP and llama.cpp: Memory-Mapped Model Loading, ANE/NNAPI Accelerator Delegation, and the Thermal Budget Patterns That Make 3B-Parameter Models Production-Ready on Mobile
TL;DR
You can run 3B-parameter models on flagship phones today, but only if you respect the memory, thermal, and accelerator constraints that make mobile a different beast from server inference. This post walks through a KMP shared module architecture wrapping llama.cpp via cinterop and JNI, covering mmap-based model loading, hardware accelerator delegation, quantization format selection, and the thermal throttling patterns that separate demos from production.
Why on-device, why now
Paul Graham once argued that a language’s power is partly defined by what it lets you do that others don’t. The same applies to platform capabilities. On-device inference gives you zero-latency responses, offline functionality, and data privacy guarantees no API call can match. We always knew mobile hardware would get here. The tooling just needed to catch up.
With llama.cpp’s maturity and KMP’s ability to share business logic across iOS and Android, it has. Here’s the architecture.
The KMP bridge: cinterop and JNI
The shared module exposes a single LlmEngine interface. On iOS, you bridge to llama.cpp through Kotlin/Native’s cinterop, generating Kotlin bindings from the C headers directly. On Android, you go through JNI with a thin C++ wrapper.
```kotlin
expect class LlmEngine {
    fun loadModel(path: String, config: ModelConfig): Boolean
    fun generate(prompt: String, params: GenerationParams): Flow<String>
    fun currentThermalState(): ThermalState
}
```
The expect/actual pattern keeps your feature layer completely platform-agnostic. Your app code never touches llama.cpp directly.
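On the Android side, the `actual` implementation reduces to a thin JNI facade over the C++ wrapper. A sketch under stated assumptions: the library name `llama_bridge` and the `native*` entry points are illustrative names for the wrapper, not llama.cpp's own API, and `ModelConfig`, `GenerationParams`, and `ThermalState` come from the shared module.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow

// Android `actual`: a thin JNI facade over a C++ wrapper around llama.cpp.
actual class LlmEngine {
    companion object {
        init { System.loadLibrary("llama_bridge") } // hypothetical .so name
    }

    // Implemented in the C++ wrapper; names are illustrative, not llama.cpp's API.
    private external fun nativeLoadModel(path: String, contextLength: Int, useMmap: Boolean): Long
    private external fun nativeStartGeneration(handle: Long, prompt: String)
    private external fun nativeNextToken(handle: Long): String?

    private var handle: Long = 0L

    actual fun loadModel(path: String, config: ModelConfig): Boolean {
        handle = nativeLoadModel(path, config.contextLength, config.useMmap)
        return handle != 0L // 0 signals a native-side load failure
    }

    actual fun generate(prompt: String, params: GenerationParams): Flow<String> = flow {
        nativeStartGeneration(handle, prompt)
        while (true) {
            emit(nativeNextToken(handle) ?: break) // stream tokens as they decode
        }
    }

    actual fun currentThermalState(): ThermalState =
        ThermalState.NOMINAL // real impl queries PowerManager.currentThermalStatus
}
```

The iOS `actual` has the same shape, but calls the cinterop-generated bindings directly instead of going through `external` JNI functions.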
Memory-mapped model loading: avoiding OOM kills
Most teams get model loading on mobile wrong. They try to read the entire model into heap memory. A Q4_K_M quantized 3B model is roughly 1.8-2.0 GB on disk. Loading that into the app’s memory space on a device with 6 GB total RAM is a guaranteed OOM kill.
The solution is mmap. llama.cpp supports memory-mapped file access natively, letting the OS page model weights in and out of physical RAM on demand. Your resident memory footprint stays manageable because the kernel evicts pages under pressure instead of killing your process.
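In llama.cpp's C API this is the `use_mmap` flag on `llama_model_params` (on by default). Through cinterop on the iOS side, the load path looks roughly like the sketch below; the names mirror `llama.h`, but treat the exact binding syntax as an assumption, since cinterop's generated struct accessors differ slightly from plain C.

```kotlin
// Kotlin/Native cinterop sketch (iOS `actual`). Binding names mirror llama.h;
// exact generated-binding syntax may differ.
fun loadMapped(path: String): Boolean {
    val params = llama_model_default_params().apply {
        use_mmap = true   // weights stay file-backed; the kernel pages them in on demand
        use_mlock = false // don't pin pages -- let the OS evict them under pressure
    }
    val model = llama_model_load_from_file(path, params)
    return model != null
}
```

With `use_mlock = false`, clean pages of the weight file can be dropped and re-faulted later, which is exactly the behavior that keeps resident memory below the OOM-kill threshold.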
Quantization: Q4_K_M vs Q5_K_S
Quantization format selection is a direct tradeoff between quality, speed, and memory pressure.
| Format | Model Size (3B) | Peak RAM | Tokens/sec (Pixel 8) | Tokens/sec (iPhone 15 Pro) | Perplexity Delta |
|---|---|---|---|---|---|
| Q4_K_M | ~1.8 GB | ~2.1 GB | ~12-15 t/s | ~18-22 t/s | +0.3-0.5 |
| Q5_K_S | ~2.2 GB | ~2.5 GB | ~9-12 t/s | ~14-18 t/s | +0.1-0.2 |
Q4_K_M is the sweet spot for mobile. The perplexity difference is negligible for most structured output tasks (JSON generation, classification, short-form text), and you gain meaningful headroom on both memory and throughput. Reserve Q5_K_S for use cases where output quality is non-negotiable and you can guarantee flagship hardware.
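One way to encode the table above in the shared module is a capability check at startup. A minimal sketch: the peak-RAM numbers come from the table, and the 500 MB headroom default is an assumption you should tune per device class.

```kotlin
// Peak RAM per quantization format for a 3B model, from profiling (see table).
enum class QuantFormat(val peakRamMb: Int) {
    Q4_K_M(2100),
    Q5_K_S(2500),
}

// Pick the heaviest format that still leaves headroom on this device,
// or null if even Q4_K_M won't fit.
fun selectFormat(availableRamMb: Int, headroomMb: Int = 500): QuantFormat? =
    QuantFormat.entries
        .sortedByDescending { it.peakRamMb }
        .firstOrNull { it.peakRamMb + headroomMb <= availableRamMb }
```

On a device reporting 2.7 GB available this selects Q4_K_M; Q5_K_S only wins when roughly 3 GB or more is free, which in practice means flagship hardware.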
Hardware accelerator delegation
On iOS, you can delegate matrix operations to the Apple Neural Engine through CoreML integration. llama.cpp supports Metal acceleration out of the box, and ANE delegation via CoreML conversion can push throughput significantly higher on the A17/M-series silicon.
On Android, NNAPI delegation and GPU compute (via Vulkan or OpenCL) are available, though in my experience the gains are more variable across the fragmented Android device ecosystem. Pixel 8’s Tensor G3 handles GPU delegation well; mid-range Snapdragon chips can actually regress in performance with NNAPI due to driver overhead. Profile per-device and fall back to CPU gracefully.
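Because accelerator gains vary so much per device, the delegation strategy is best expressed as an ordered fallback chain evaluated against a measured CPU baseline. A sketch with a hypothetical probe callback (`Backend` and the probe signature are illustrative, not llama.cpp API):

```kotlin
enum class Backend { GPU, NNAPI, CPU }

// Probe each candidate with a short warm-up generation and keep the first
// backend that beats the CPU baseline by a meaningful margin. A backend that
// is unavailable (or crashes the probe) reports null.
fun chooseBackend(
    baselineCpuTps: Double,
    probe: (Backend) -> Double?,  // measured tokens/sec, or null if unavailable
    minSpeedup: Double = 1.2,
): Backend {
    for (candidate in listOf(Backend.GPU, Backend.NNAPI)) {
        val tps = probe(candidate) ?: continue
        if (tps >= baselineCpuTps * minSpeedup) return candidate
    }
    return Backend.CPU  // graceful fallback; never ship a regression below CPU
}
```

The `minSpeedup` margin guards against exactly the mid-range Snapdragon case above: an accelerator that only matches (or trails) the CPU is not worth its driver overhead.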
Thermal throttling: the problem nobody demos
Sustained inference generates heat. After 60-90 seconds of continuous generation on most devices, you will hit thermal throttling, and your token rate can drop 40-60%.
The pattern that works: monitor thermal state through platform APIs (ProcessInfo.ThermalState on iOS, PowerManager.THERMAL_STATUS_* on Android) and implement adaptive generation. When thermal pressure rises, increase the delay between tokens or reduce n_predict. It is the same pacing discipline behind any sustained mobile workload: sustained output requires deliberate throttling.
```kotlin
// Returns adjusted params, or null to signal "suspend generation and notify the user".
fun adaptForThermals(params: GenerationParams): GenerationParams? =
    when (currentThermalState()) {
        ThermalState.NOMINAL  -> params.copy(throttleMs = 0)
        ThermalState.FAIR     -> params.copy(throttleMs = 15)
        ThermalState.SERIOUS  -> params.copy(throttleMs = 50, nPredict = 128)
        ThermalState.CRITICAL -> null
    }
```
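On Android, the shared `ThermalState` enum maps from PowerManager's thermal status codes. A sketch using the integer values of `PowerManager.THERMAL_STATUS_*` (0 = NONE through 6 = SHUTDOWN, API 29+), kept as raw ints here so the mapping is testable off-device; the four-bucket grouping is one reasonable choice, not a platform mandate.

```kotlin
enum class ThermalState { NOMINAL, FAIR, SERIOUS, CRITICAL }

// Map PowerManager.THERMAL_STATUS_* (API 29+) to the shared enum:
// NONE/LIGHT -> NOMINAL, MODERATE -> FAIR, SEVERE -> SERIOUS,
// CRITICAL and above -> CRITICAL.
fun fromAndroidThermalStatus(status: Int): ThermalState = when (status) {
    0, 1 -> ThermalState.NOMINAL   // THERMAL_STATUS_NONE, _LIGHT
    2    -> ThermalState.FAIR      // THERMAL_STATUS_MODERATE
    3    -> ThermalState.SERIOUS   // THERMAL_STATUS_SEVERE
    else -> ThermalState.CRITICAL  // _CRITICAL, _EMERGENCY, _SHUTDOWN
}
```

Register for updates with `PowerManager.addThermalStatusListener` rather than polling, so the adaptive logic reacts within a token or two of a status change.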
Structured output parsing
For app-integrated features, you need structured output, not freeform text. Use constrained grammar sampling (llama.cpp’s GBNF grammars) to force valid JSON output. Parse it in the shared KMP layer using kotlinx.serialization. This eliminates retry loops and makes on-device LLM output as reliable as any API response.
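As a concrete example, a minimal GBNF grammar that constrains output to a one-field JSON object (the field name `label` and the permitted characters are illustrative; adapt the shape to your schema):

```
root   ::= "{" ws "\"label\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 _-]* "\""
ws     ::= [ \t\n]*
```

With the grammar attached to the sampler, every candidate token is filtered against it, so the decoded string parses cleanly with kotlinx.serialization as long as generation runs to completion rather than being cut off by n_predict.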
What to remember
- Always use mmap for model loading. Never allocate model weights on the heap. Let the OS manage paging, and your app survives memory pressure instead of getting killed.
- Default to Q4_K_M for mobile. The quality-to-resource tradeoff favors it on every metric that matters for on-device use. Only step up to Q5_K_S when you've confirmed hardware headroom and have a quality-critical use case.
- Instrument thermal state from day one. Demos run for 10 seconds; production runs for minutes. Adaptive throttling based on real thermal data is what separates a prototype from a shippable feature.