# Embedding Local LLMs in Your Mobile App: llama.cpp via KMP, 4-Bit Quantization Tradeoffs, and the Streaming Architecture That Keeps Your UI at 60fps
## TL;DR
You can ship a working LLM inside your mobile app today using llama.cpp with Kotlin Multiplatform bindings. Q4_K_M quantization is the right choice for most production use cases: 80-90% of full-model quality at one-third the memory footprint. The hard part isn’t inference itself but building a streaming token pipeline that keeps your UI at 60fps. This post covers model selection, quantization tradeoffs with real benchmarks, GPU delegation on both platforms, and a coroutine-based architecture I’ve used in production.
## Why on-device inference matters now
Cloud LLM calls add 200-800ms of latency, require connectivity, and create data privacy problems that are hard to solve in regulated industries. llama.cpp has gotten good. 7B-parameter models fit in 4GB of RAM. On-device inference works in production now.
For me, the turning point was GGUF format stabilization and Metal/NNAPI GPU delegation becoming reliable enough to actually ship.
## Model selection and quantization: the numbers
Most teams screw up quantization in one of two ways: they over-optimize for size (Q2_K, quality is terrible) or refuse to quantize at all (F16, won’t fit on any phone). The benchmarks make the choice obvious.
### Benchmarks: Mistral 7B on iPhone 15 Pro (6GB RAM) and Pixel 8 Pro (12GB RAM)
| Quantization | Model Size | Peak RAM | Tokens/sec (iOS Metal) | Tokens/sec (Android NNAPI) | Perplexity (wiki2) |
|---|---|---|---|---|---|
| F16 | 14.5 GB | OOM (both) | - | - | 5.79 |
| Q8_0 | 7.7 GB | OOM (iOS) | - | 8.2 | 5.80 |
| Q5_K_S | 5.1 GB | 5.8 GB | 18.4 | 14.1 | 5.86 |
| Q4_K_M | 4.4 GB | 4.9 GB | 22.7 | 17.3 | 5.92 |
| Q4_0 | 3.8 GB | 4.3 GB | 24.1 | 19.8 | 6.18 |
| Q2_K | 2.7 GB | 3.2 GB | 28.3 | 22.6 | 6.97 |
Q4_K_M is what you should ship. You lose ~1% perplexity versus Q5_K_S while gaining 23% faster inference on iOS and staying comfortably under the 5GB dirty memory limit that triggers iOS jetsam kills. Q5_K_S works on flagship Android devices with 12GB+ RAM, but on iOS it leaves you no headroom.
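If you ship to a range of devices, the table above can drive model selection at install time. A minimal sketch, using the benchmarked peak-RAM figures; the 1GB headroom reserve is my assumption, not a platform constant:

```kotlin
// Quantization variants with benchmarked peak RAM (GB), largest first.
enum class Quant(val peakRamGb: Double) {
    Q5_K_S(5.8), Q4_K_M(4.9), Q4_0(4.3), Q2_K(3.2)
}

// Largest quantization whose benchmarked peak RAM fits under the device's
// RAM minus a safety reserve; null means even Q2_K won't fit.
fun pickQuant(deviceRamGb: Double, headroomGb: Double = 1.0): Quant? =
    Quant.entries.firstOrNull { it.peakRamGb <= deviceRamGb - headroomGb }

fun main() {
    println(pickQuant(12.0)) // Pixel 8 Pro class -> Q5_K_S
    println(pickQuant(6.0))  // iPhone 15 Pro class -> Q4_K_M
}
```

Note how the 1GB reserve reproduces the recommendation above: a 6GB iPhone lands on Q4_K_M because Q5_K_S at 5.8GB leaves no headroom.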
## Memory-mapped loading: staying alive on iOS
iOS enforces hard dirty memory limits. Exceed them and your app dies without warning. The fix is mmap-based model loading, which llama.cpp supports natively. Memory-mapped pages count as clean memory (backed by the file on disk), not dirty memory. This is the difference between an app that ships and one that crashes.
Your KMP expect/actual layer should configure this explicitly:
```kotlin
// commonMain
expect class LlamaModel {
    fun load(path: String, config: ModelConfig): InferenceSession
}

data class ModelConfig(
    val useMmap: Boolean = true,
    val useGpu: Boolean = true,
    val gpuLayers: Int = 99, // offload all layers
    val contextSize: Int = 2048
)
```
On iOS, the actual implementation calls llama.cpp’s C API via cinterop with use_mmap = true. On Android, JNI bindings do the same. Setting gpuLayers = 99 offloads everything possible to Metal or NNAPI. In practice this means 28-32 of 32 layers on recent devices, with embedding and output layers staying on CPU.
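Both `actual` implementations ultimately flatten `ModelConfig` into the primitives a C ABI call can take. A pure-Kotlin sketch of that flattening (`ModelConfig` is redeclared so the snippet stands alone; mapping a disabled GPU to zero offloaded layers matches llama.cpp's `n_gpu_layers = 0` convention):

```kotlin
data class ModelConfig(
    val useMmap: Boolean = true,
    val useGpu: Boolean = true,
    val gpuLayers: Int = 99,
    val contextSize: Int = 2048
)

// Flatten the config into primitives for the cinterop/JNI boundary:
// (use_mmap, n_gpu_layers, n_ctx). A disabled GPU becomes zero layers.
fun nativeArgs(config: ModelConfig): Triple<Boolean, Int, Int> =
    Triple(
        config.useMmap,
        if (config.useGpu) config.gpuLayers else 0,
        config.contextSize
    )

fun main() {
    println(nativeArgs(ModelConfig()))               // (true, 99, 2048)
    println(nativeArgs(ModelConfig(useGpu = false))) // (true, 0, 2048)
}
```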
## The streaming architecture that keeps you at 60fps
Token generation runs at 17-25 tokens/sec. If you collect tokens on the main thread or batch UI updates naively, you will drop frames.
```kotlin
fun streamInference(prompt: String): Flow<String> = callbackFlow {
    val session = model.createSession()
    session.onToken { token ->
        trySend(token) // non-blocking; never suspend inside the native callback
    }
    session.infer(prompt) // blocks this coroutine until generation completes
    close()
    awaitClose { session.cancel() }
}.flowOn(Dispatchers.Default) // keep the blocking infer() off the main thread
```

```kotlin
// In your ViewModel
viewModelScope.launch {
    streamInference(prompt)
        .runningFold("") { text, token -> text + token } // accumulate first
        .conflate() // if the UI falls behind, jump to the newest snapshot
        .collect { text ->
            _uiState.update { it.copy(text = text) }
        }
}
```

callbackFlow bridges the C callback into coroutine-land. The runningFold/conflate pair is the part people get wrong: conflation keeps only the latest emitted value, so conflating raw tokens (the tempting `buffer(Channel.CONFLATED)` one-liner) silently drops text whenever recomposition can't keep up. Accumulate into the full string first and conflation becomes harmless: a busy UI skips intermediate snapshots and renders the newest one, with no backpressure and nothing lost. In Compose, this triggers one recomposition per collected snapshot, and Compose's smart diffing keeps frame time under 12ms in my profiling.
For long sessions, I run inference on a single-threaded view of Dispatchers.Default to avoid contention with other coroutines. llama.cpp isn't thread-safe per session, so every call into a given session has to be serialized onto one thread.
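That single-threaded view is one `limitedParallelism(1)` call. A self-contained sketch demonstrating the serialization guarantee with a plain unsynchronized counter (the real payload would be your session calls, not increments):

```kotlin
import kotlinx.coroutines.*

// A view of Dispatchers.Default that runs at most one task at a time,
// without dedicating an extra OS thread. Everything dispatched to it
// executes sequentially, with happens-before between tasks.
@OptIn(ExperimentalCoroutinesApi::class)
val inferenceDispatcher = Dispatchers.Default.limitedParallelism(1)

// n concurrent increments of a plain Int: safe only because the
// dispatcher serializes them.
fun serializedCount(n: Int): Int = runBlocking {
    var counter = 0
    List(n) { launch(inferenceDispatcher) { counter++ } }.joinAll()
    counter
}

fun main() {
    println(serializedCount(100)) // 100
}
```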
## GPU delegation in practice
Metal on iOS is mature and gives a consistent 1.3-1.5x speedup over CPU-only inference. NNAPI on Android is messier. Qualcomm Adreno GPUs handle it well, but I’ve seen regressions on older Mali GPUs. My recommendation: default to GPU on iOS, and on Android, run a 10-token benchmark at first launch to decide. Cache the result.
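The decision itself is trivial once you have the two measured rates. A sketch, where the 1.15x threshold is my assumption for "meaningfully faster" rather than anything llama.cpp prescribes; small GPU wins aren't worth the driver-variance risk on older Mali parts:

```kotlin
enum class Backend { CPU, GPU }

// Keep the GPU only if the first-launch probe shows a real speedup over
// CPU; otherwise fall back and cache the decision.
fun chooseBackend(
    cpuTokensPerSec: Double,
    gpuTokensPerSec: Double,
    minSpeedup: Double = 1.15 // assumed threshold, tune per device fleet
): Backend =
    if (gpuTokensPerSec >= cpuTokensPerSec * minSpeedup) Backend.GPU
    else Backend.CPU

fun main() {
    println(chooseBackend(cpuTokensPerSec = 11.0, gpuTokensPerSec = 17.3)) // GPU
    println(chooseBackend(cpuTokensPerSec = 12.0, gpuTokensPerSec = 12.5)) // CPU
}
```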
This kind of adaptive initialization shows up everywhere in mobile. Background-aware resource management matters for any app doing real work on-device; your LLM inference needs to yield gracefully when your app is backgrounded.
## Takeaways
- Ship Q4_K_M. It's the best balance of quality, speed, and memory safety across both platforms. Only go with Q5_K_S if you're Android-only and targeting flagships.
- Always use mmap model loading on iOS. Without it, you'll hit jetsam limits with any model above 3GB. Validate dirty memory footprint with Xcode's memory gauge, not just Instruments allocations.
- Stream tokens through a conflated coroutine pipeline. Don't batch, don't poll, and don't collect on the main dispatcher. Let Compose's recomposition handle the rendering cadence. Your job is to never block the frame.
On-device LLM inference works. The tooling is there. What separates apps that ship from apps that crash is the boring stuff: memory management, threading discipline, and knowing where iOS and Android disagree. Get those right and you can do things your cloud-dependent competitors can’t.