Running LLMs On-Device in Android: GGUF Models, NNAPI, and the Real Performance Tradeoffs
TL;DR
Running LLMs on-device in Android is viable today, but only if you pick the right quantization format, manage memory aggressively on mid-range hardware, and measure latency the way users actually experience it. GGUF Q4_K_M is the format I’d recommend for most production apps: ~1.5 GB RAM for a 3B parameter model, 8-12 tokens/sec on Snapdragon 8 Gen 2, and acceptable quality degradation. NNAPI delegation sounds great on paper but introduces unpredictable latency variance across OEMs. I learned all of this shipping to 200K+ devices.
Why on-device matters now
Paul Graham wrote about superlinear returns, how in certain domains effort compounds rather than scaling linearly. On-device inference works like that for mobile. Every millisecond you shave off round-trip latency doesn’t just improve UX linearly; it unlocks entirely new interaction patterns. Autocomplete at 50ms feels like typing. At 500ms, it feels like waiting. That gap between cloud and on-device isn’t a speed difference. It’s a different product.
Model quantization: choosing your format
The quantization format decision is the single highest-leverage choice you’ll make. I’ve tried most of them in production. The major options compare like this on a Pixel 8 Pro (Tensor G3) running a 3B parameter LLaMA-class model:
| Format | Model Size | RAM Usage | Tokens/sec | Quality (Perplexity) | Cold Start |
|---|---|---|---|---|---|
| FP16 (baseline) | 6.0 GB | 7.2 GB | 2.1 | 8.2 | 14.3s |
| GGUF Q8_0 | 3.2 GB | 4.1 GB | 5.4 | 8.4 | 8.1s |
| GGUF Q4_K_M | 1.7 GB | 2.1 GB | 11.2 | 8.9 | 4.2s |
| GGUF Q4_0 | 1.5 GB | 1.9 GB | 12.8 | 9.6 | 3.8s |
| QLoRA INT4 | 1.6 GB | 2.3 GB | 9.1 | 9.1 | 5.7s |
Q4_K_M delivers 5x the throughput of FP16 with only an 8.5% perplexity increase. That’s a good trade. Q4_0 is slightly faster but the quality cliff is real: users in our A/B tests reported more nonsensical completions. The perplexity numbers don’t fully capture how bad those feel in practice.
NNAPI and GPU delegates: the OEM fragmentation problem
Most teams get this wrong the same way. They benchmark on a Pixel and ship to the world. NNAPI is an abstraction layer, and the quality of vendor implementations varies wildly.
val options = LlmInference.LlmInferenceOptions.builder()
.setModelPath(modelPath)
.setMaxTokens(256)
.setPreferredBackend(Backend.GPU) // Seems simple, right?
.build()
What this actually does depends entirely on the SoC:
| Chipset | GPU Delegate | NNAPI Status | P95 Latency Variance |
|---|---|---|---|
| Snapdragon 8 Gen 2 | Adreno 740, solid | Stable | ±12% |
| Tensor G3 | Mali-G715, good | Stable | ±15% |
| Dimensity 9200 | Mali-G715, good | Partial ops | ±38% |
| Exynos 2400 | Xclipse 940, inconsistent | Unstable | ±52% |
| Snapdragon 6 Gen 1 | Adreno 710, CPU fallback frequent | Partial | ±61% |
That P95 variance on Exynos and mid-range Snapdragon will wreck your user experience. I’d go with a tiered strategy:
fun selectBackend(chipset: ChipsetInfo): Backend {
return when {
chipset.isSnapdragon8Series() -> Backend.GPU
chipset.isTensorG3OrNewer() -> Backend.GPU
chipset.totalRamGb >= 8 -> Backend.CPU // 4 threads, predictable
else -> Backend.CPU // 2 threads, conservative
}
}
Falling back to CPU with controlled thread counts gives you worse peak throughput but far better latency consistency. Users perceive variance as jank. They’re much more forgiving of steady-but-slower output than they are of unpredictable stuttering.
Memory pressure: the mid-range reality
63% of Android devices globally have 6 GB of RAM or less. After the OS, launcher, and background services take their share, you’re often working with 1.5-2 GB of available memory. That changes everything about how you load and manage models.
What actually works in production:
- Memory-map the model file instead of loading it entirely into RAM. GGUF supports mmap natively, which lets the OS page in weights on demand.
- Monitor
onTrimMemoryaggressively. Release KV cache atTRIM_MEMORY_RUNNING_LOWand unload the model entirely atTRIM_MEMORY_COMPLETE. - Pre-warm selectively. Load the model when the user navigates to the relevant feature, not at app start. Eager loading sounds smart until you’re fighting the OS for memory before the user even needs inference.
Benchmarking that reflects reality
Synthetic throughput benchmarks (tokens/sec on a fresh device, nothing else running) are misleading. They’ll make you feel great and then your users will tell you the app is slow. Measure these instead:
- Time-to-first-token (TTFT), which is what users actually wait for. Target under 400ms.
- P95 latency, not mean. One bad inference ruins the experience.
- Thermal throttle recovery. After 60 seconds of continuous inference, throughput drops 20-40% on most devices. Your benchmark needs to capture that tail.
- Memory-pressure scenarios. Run benchmarks with YouTube and Chrome in the background. That is what your users’ phones actually look like.
What I’d actually do
Use GGUF Q4_K_M as your default quantization. It’s the best balance of size, speed, and quality for 3B models on mobile that I’ve found. Only go to Q8 if your use case demands near-baseline accuracy.
Don’t trust NNAPI across OEMs. Build a chipset allowlist for GPU delegation and default to CPU inference with controlled threading everywhere else. Predictable latency beats peak throughput every time.
Benchmark under memory pressure on mid-range hardware. Your Pixel 9 Pro test results are irrelevant to 60%+ of your users. Run your benchmarks on a Redmi Note 13 with Spotify playing. That’s your real performance floor.