
Running Gemma 3 on-device: memory, quantization, and KMP

Krystian Wiewiór · 5 min read


Tags: kotlin, kmp, multiplatform, android, ios


TL;DR

Shipping Gemma 3 on-device works today, but only if you respect the memory ceiling of mid-range hardware. 4-bit quantization (Q4_K_M) fits inside a 4GB RAM budget with acceptable quality loss for most generative tasks. 8-bit is a luxury reserved for flagships. We built a shared KMP inference layer using expect/actual over MediaPipe LLM Inference API on Android and CoreML on iOS, sharing prompt orchestration, token streaming, and memory management across both platforms. This is what we learned shipping it to production.


The memory wall is real

Paul Graham wrote that one principle of making new things is to “start with something small and get it working.” With on-device LLMs, “small” is not optional. It is the entire constraint.

Gemma 3 comes in multiple sizes. For mobile, you’re looking at the 1B and 4B parameter variants. On a device with 4-6GB total RAM, where the OS and foreground app already consume 2-3GB, this is what you’re working with:

| Model Variant | Quantization | Disk Size | Peak RAM | Tokens/sec (Snapdragon 8 Gen 2) | Tokens/sec (A16 Bionic) |
| --- | --- | --- | --- | --- | --- |
| Gemma 3 1B | FP16 | ~2.0 GB | ~2.4 GB | 12 | 15 |
| Gemma 3 1B | Q8_0 | ~1.1 GB | ~1.4 GB | 18 | 22 |
| Gemma 3 1B | Q4_K_M | ~0.6 GB | ~0.8 GB | 26 | 30 |
| Gemma 3 4B | Q8_0 | ~4.2 GB | ~4.8 GB | 6 | 8 |
| Gemma 3 4B | Q4_K_M | ~2.3 GB | ~2.8 GB | 11 | 14 |

On a mid-range device with 4GB RAM, the 1B Q4_K_M variant is your only safe bet. The 4B model at Q4 technically loads on 6GB devices, but you’re flirting with OOM kills. In production, “technically loadable” means “crashes for 15% of your users.”
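The selection logic falls out of the table directly: at startup, check free memory and take the largest variant whose peak RAM fits under a headroom factor. A minimal sketch, with peak-RAM figures from the table above; the names and the 85% headroom factor are our illustration, not a production API:

```kotlin
// Hypothetical variant picker. Peak-RAM figures come from the table above;
// the 0.85 headroom factor mirrors the memory guard described later.
data class Variant(val name: String, val peakRamBytes: Long)

private const val MiB = 1024L * 1024

// Ordered smallest to largest so we can take the biggest model that fits.
val variants = listOf(
    Variant("Gemma 3 1B Q4_K_M", 800 * MiB),
    Variant("Gemma 3 1B Q8_0", 1_400 * MiB),
    Variant("Gemma 3 4B Q4_K_M", 2_800 * MiB),
)

// Returns null when even the smallest variant would blow the budget.
fun pickVariant(freeRamBytes: Long): Variant? =
    variants.lastOrNull { it.peakRamBytes <= (freeRamBytes * 0.85).toLong() }
```

With 1.5 GB free, this picks the 1B Q4_K_M variant; with 400 MB free, it returns null and the feature stays off.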

4-bit vs 8-bit: the quality tradeoff

Teams fixate on benchmarks like perplexity and ignore task-specific quality. We ran blind evaluations across three tasks (summarization, structured JSON extraction, and freeform reply generation), scoring output quality on a 1-5 scale across 500 prompts:

| Task | Q8_0 (avg score) | Q4_K_M (avg score) | Delta |
| --- | --- | --- | --- |
| Summarization | 4.2 | 3.9 | -7.1% |
| JSON extraction | 4.6 | 4.4 | -4.3% |
| Reply generation | 3.8 | 3.3 | -13.2% |

For structured tasks, 4-bit quantization is nearly indistinguishable from 8-bit. For creative or open-ended generation, the gap widens noticeably. Our rule of thumb: if the output has a schema, Q4 is fine. If it’s freeform prose, budget for Q8 or gate the feature to flagship devices.
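That rule of thumb is small enough to encode directly. A sketch, where `TaskKind` and the nullable return (null meaning the feature is gated off) are our illustration:

```kotlin
// Sketch of the schema-vs-freeform rule above; not a production API.
enum class Quantization { Q4_K_M, Q8_0, FP16 }
enum class TaskKind { SUMMARIZATION, JSON_EXTRACTION, FREEFORM }

fun quantFor(task: TaskKind, isFlagship: Boolean): Quantization? = when {
    task != TaskKind.FREEFORM -> Quantization.Q4_K_M // schema-bound: Q4 is fine
    isFlagship -> Quantization.Q8_0                   // freeform prose wants Q8
    else -> null                                      // gate the feature off
}
```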

The KMP inference layer

Most teams build the inference integration twice. MediaPipe’s LLM Inference API on Android and CoreML on iOS have fundamentally different APIs, but the orchestration logic (prompt templating, token streaming, memory guards, retry policy) is identical.

We defined a shared interface in common KMP code:

// commonMain
expect class OnDeviceInference {
    suspend fun loadModel(config: ModelConfig): Result<Unit>
    fun streamTokens(prompt: String): Flow<String>
    fun estimateMemoryRequired(config: ModelConfig): Long
    fun unloadModel()
}

data class ModelConfig(
    val modelPath: String,
    val quantization: Quantization,
    val maxTokens: Int = 512,
    val temperature: Float = 0.7f
)

enum class Quantization { Q4_K_M, Q8_0, FP16 }

The actual implementations are thin wrappers. On Android, streamTokens delegates to MediaPipe’s LlmInference.generateResponseAsync() and maps callbacks into a callbackFlow. On iOS, the actual wraps CoreML’s prediction API with a Kotlin/Native Flow emitter.
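Stripped of coroutines, the push-to-stream shape on Android looks roughly like this. The listener interface and fake engine below are stand-ins for MediaPipe's partial-result callback, not the real SDK types; in production the collector is a `callbackFlow` that calls `trySend` per token and `close()` when the `done` flag arrives:

```kotlin
// Stand-in for MediaPipe's partial-result callback shape: the SDK pushes
// (partialResult, done) pairs; production code bridges them into a Flow.
fun interface PartialResultListener {
    fun onPartialResult(partial: String, done: Boolean)
}

// Fake engine that pushes a canned token stream through the listener.
fun fakeGenerateAsync(prompt: String, listener: PartialResultListener) {
    val tokens = listOf("Hello", ", ", "world")
    tokens.forEachIndexed { i, token ->
        listener.onPartialResult(token, done = i == tokens.lastIndex)
    }
}

// Collects pushed tokens in order; in production this is Flow<String>.
fun collectTokens(prompt: String): List<String> = buildList {
    fakeGenerateAsync(prompt) { partial, _ -> add(partial) }
}
```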

The shared logic that matters lives in commonMain:

// commonMain - shared orchestration
class InferenceOrchestrator(
    private val inference: OnDeviceInference,
    private val memoryMonitor: MemoryMonitor
) {
    suspend fun generate(prompt: String, config: ModelConfig): Flow<String> {
        val available = memoryMonitor.availableMemoryBytes()
        val required = inference.estimateMemoryRequired(config)

        if (required > available * 0.85) {
            return flowOf("[ERROR: insufficient memory]")
        }

        inference.loadModel(config).getOrElse {
            return flowOf("[ERROR: model load failed]")
        }
        return inference.streamTokens(prompt)
            .onCompletion { inference.unloadModel() }
    }
}

That 85% threshold is not arbitrary. Crossing 90% of available memory triggers aggressive OS garbage collection on Android and jetsam termination on iOS. We learned this the hard way — a 12% crash rate that vanished once we added the guard.

Memory lifecycle management

Loading the model on every request costs roughly 800ms for Q4 1B. We cache the loaded model and evict on onTrimMemory(RUNNING_LOW) on Android and the equivalent didReceiveMemoryWarning on iOS, both surfaced through a shared MemoryMonitor expect/actual. The shared code owns the eviction policy; the platform code just fires the signal.
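A sketch of that split, with names that are our illustration: the platform side only forwards the trim signal, and the common side owns whether to evict.

```kotlin
// Shared eviction policy sketch. Platform code calls onTrimSignal() from
// onTrimMemory(RUNNING_LOW) on Android or didReceiveMemoryWarning on iOS;
// common code decides whether anything actually needs unloading.
class ModelCache(private val unload: () -> Unit) {
    var loaded = false
        private set

    fun markLoaded() { loaded = true }

    fun onTrimSignal() {
        if (loaded) {
            unload()       // e.g. delegate to OnDeviceInference.unloadModel()
            loaded = false
        }
    }
}
```

Keeping the policy in common code means both platforms evict under identical rules, and the ~800ms reload cost is paid only after a genuine memory-pressure event.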

What we actually shipped with

Default to Q4_K_M for the 1B model on any device under 6GB RAM. Reserve Q8 for flagships and gate it behind a runtime memory check, not a device allowlist that rots within months.

Build your inference integration as an expect/actual KMP layer from day one. The platform-specific code is under 200 lines per platform. The shared prompt orchestration, memory management, and streaming logic is 10x that. Don’t write it twice.

Instrument memory headroom aggressively. Ship telemetry for available-memory-at-inference-time. Our data showed a bimodal distribution: users either had 1.5GB free or 400MB. That insight drove our decision to make the feature opt-in on low-memory devices rather than degrading silently.

On-device inference is a real shipping feature now. But it stays that way only if you treat the memory wall as a first-class engineering constraint.

