
Adaptive Bitrate Model Loading on Android: Dynamic GGUF Shard Selection Based on Runtime Memory Pressure and Thermal State

Krystian Wiewiór · 4 min read

The problem with static model loading

Most Android on-device inference implementations pick a single GGUF quantization level at startup and pray. Load Q8_0 on a Pixel 7 with 8GB RAM and you get great quality, until the user switches to a background music app and ActivityManager.getMemoryInfo() reports lowMemory = true. Load Q4_K_M defensively and you leave performance on the table for flagship devices.

What most teams get wrong: they treat model loading as a one-shot decision. In production, device conditions are non-stationary.

Architecture overview

The adaptive loader has three components:

Component             | Responsibility                                           | Android API
----------------------|----------------------------------------------------------|--------------------------------------------------------------------
MemoryPressureMonitor | Tracks available RAM, triggers level changes             | ActivityManager.getMemoryInfo(), ComponentCallbacks2.onTrimMemory()
ThermalStateObserver  | Monitors thermal throttling state                        | PowerManager.addThermalStatusListener() (API 29+)
ShardOrchestrator     | Manages shard selection, swap logic, KV cache migration  | Custom implementation over llama.cpp JNI bindings

The core idea: treat quantization tiers exactly like video bitrate tiers in HLS/DASH. Step down gracefully under pressure, step up when headroom returns.

Shard tier definitions

enum class GgufTier(
    val fileName: String,
    val estimatedRamMb: Int, // estimate for a 7B model; see the buffer note below
    val qualityScore: Float  // relative output quality ranking (HIGH = best)
) {
    HIGH("model-q8_0.gguf", 7200, 0.95f),
    MEDIUM("model-q5_k_s.gguf", 4800, 0.88f),
    LOW("model-q4_k_m.gguf", 3400, 0.82f)
}

These RAM estimates are for a 7B parameter model. In practice, the actual footprint varies by ~8-12% depending on context length and batch size, so always add a buffer.
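A quick sketch of what that buffer looks like in code. bufferedRamMb is a hypothetical helper, not part of the loader above, and the 12% default margin is an assumption to tune against measurements on your target devices:

// Hypothetical helper: pad the static estimate by a configurable margin
// to absorb the ~8-12% footprint variance from context length and batch size.
fun GgufTier.bufferedRamMb(marginPercent: Int = 12): Int =
    estimatedRamMb + (estimatedRamMb * marginPercent) / 100

// Usage: GgufTier.HIGH.bufferedRamMb() == 8064 -- compare headroom against
// this, not against the raw 7200.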

Memory pressure monitor

import android.app.ActivityManager
import android.content.Context
import androidx.core.content.getSystemService

class MemoryPressureMonitor(private val context: Context) {
    // The KTX getSystemService extension returns a nullable; fail fast here.
    private val activityManager =
        requireNotNull(context.getSystemService<ActivityManager>())

    // Headroom above the OS low-memory threshold, in MB.
    fun availableHeadroomMb(): Long {
        val memInfo = ActivityManager.MemoryInfo()
        activityManager.getMemoryInfo(memInfo)
        return (memInfo.availMem - memInfo.threshold) / (1024 * 1024)
    }

    fun recommendTier(): GgufTier {
        val headroom = availableHeadroomMb()
        return when {
            headroom > 8000 -> GgufTier.HIGH
            headroom > 5500 -> GgufTier.MEDIUM
            else -> GgufTier.LOW
        }
    }
}
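The monitor above only polls. The architecture table also lists ComponentCallbacks2.onTrimMemory(), which is push-based: the OS calls it on pressure transitions, often before a poll would notice. A minimal sketch of wiring it in; the level-to-tier mapping is an assumption to tune:

import android.content.ComponentCallbacks2
import android.content.res.Configuration

// Sketch: push-based complement to the polling monitor above.
class TrimMemoryCallbacks(
    private val onPressure: (GgufTier) -> Unit
) : ComponentCallbacks2 {

    override fun onTrimMemory(level: Int) {
        when (level) {
            ComponentCallbacks2.TRIM_MEMORY_RUNNING_CRITICAL,
            ComponentCallbacks2.TRIM_MEMORY_COMPLETE -> onPressure(GgufTier.LOW)
            ComponentCallbacks2.TRIM_MEMORY_RUNNING_LOW -> onPressure(GgufTier.MEDIUM)
        }
    }

    override fun onConfigurationChanged(newConfig: Configuration) = Unit
    override fun onLowMemory() = onPressure(GgufTier.LOW)
}

// Register once, e.g. in Application.onCreate():
// registerComponentCallbacks(TrimMemoryCallbacks { tier -> /* request swap */ })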

Thermal state observer

Android API 29+ actually gives us something useful here. PowerManager.THERMAL_STATUS_MODERATE and above should trigger an immediate downshift. Thermal throttling murders inference throughput before it kills your process.

import android.content.Context
import android.os.Build
import android.os.PowerManager
import androidx.core.content.getSystemService
import java.util.concurrent.Executors
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow

class ThermalStateObserver(context: Context) {
    private val powerManager = requireNotNull(context.getSystemService<PowerManager>())
    private val _thermalState = MutableStateFlow(PowerManager.THERMAL_STATUS_NONE)
    val thermalState: StateFlow<Int> = _thermalState.asStateFlow()

    init {
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
            // The listener fires on status transitions, not continuously.
            powerManager.addThermalStatusListener(Executors.newSingleThreadExecutor()) { status ->
                _thermalState.value = status
            }
        }
    }

    fun shouldDownshift(): Boolean =
        _thermalState.value >= PowerManager.THERMAL_STATUS_MODERATE
}

The numbers back this up. On a sustained inference workload, a Snapdragon 8 Gen 2 hitting THERMAL_STATUS_MODERATE typically sees 30-40% throughput degradation on Q8_0. Dropping to Q5_K_S recovers most of that because you reduce both memory bandwidth pressure and compute load.
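Wiring the observer to the swap logic can be as simple as collecting the StateFlow and re-evaluating on each transition. A sketch: the coroutine scope is assumed to be owned by the inference session, and ShardOrchestrator is defined in the next section:

import android.os.PowerManager
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.launch

// Sketch: re-evaluate the active tier on every thermal status transition.
fun bindThermal(
    scope: CoroutineScope, // assumed: tied to the inference session lifecycle
    observer: ThermalStateObserver,
    orchestrator: ShardOrchestrator
) {
    scope.launch {
        observer.thermalState.collect { status ->
            if (status >= PowerManager.THERMAL_STATUS_MODERATE) {
                orchestrator.evaluateAndSwap() // downshift before throttling bites
            }
        }
    }
}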

Mid-session shard swapping with KV cache migration

This is the hard part. Naively, swapping shards means discarding the KV cache and losing conversational context. The workaround: serialize the KV cache from the active llama.cpp context, unload the current shard, load the new shard, then deserialize the KV cache into the new context.

class ShardOrchestrator(
    private val memoryMonitor: MemoryPressureMonitor,
    private val thermalObserver: ThermalStateObserver
) {
    private var activeTier: GgufTier = GgufTier.MEDIUM
    private var llamaContext: Long = 0L // JNI pointer to the llama.cpp context

    suspend fun evaluateAndSwap() {
        val targetTier = when {
            // Thermal pressure wins: step down exactly one tier, clamped at LOW.
            thermalObserver.shouldDownshift() ->
                GgufTier.entries[minOf(activeTier.ordinal + 1, GgufTier.entries.lastIndex)]
            // Otherwise follow the memory-based recommendation (up or down).
            else -> memoryMonitor.recommendTier()
        }

        if (targetTier != activeTier) {
            // Preserve conversational state across the swap.
            val kvCacheBytes = LlamaBridge.serializeKvCache(llamaContext)
            LlamaBridge.freeContext(llamaContext)
            llamaContext = LlamaBridge.loadModel(targetTier.fileName)
            LlamaBridge.deserializeKvCache(llamaContext, kvCacheBytes)
            activeTier = targetTier
        }
    }
}
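LlamaBridge is the custom JNI layer from the architecture table; its surface isn't shown above. A plausible minimal shape, where the Kotlin names and signatures are assumptions mapping onto real llama.cpp entry points:

// Assumed JNI surface (illustrative names). Each external fun is a thin
// native wrapper over llama.cpp.
object LlamaBridge {
    init {
        System.loadLibrary("llama_bridge") // hypothetical library name
    }

    // Wraps llama_load_model_from_file + llama_new_context_with_model.
    external fun loadModel(fileName: String): Long

    // Wraps llama_copy_state_data (KV cache, logits, RNG state).
    external fun serializeKvCache(contextPtr: Long): ByteArray

    // Wraps llama_set_state_data on a freshly created context.
    external fun deserializeKvCache(contextPtr: Long, state: ByteArray)

    // Wraps llama_free.
    external fun freeContext(contextPtr: Long)

    // Assumed additions wrapping llama_n_ctx / llama_n_embd, used for the
    // compatibility check below.
    external fun nCtx(contextPtr: Long): Int
    external fun nEmbd(contextPtr: Long): Int
}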

One caveat that will bite you: KV cache dimensions differ across quantization levels in some configurations. If your GGUF shards all share the same base architecture and context length (which they should if they were generated from the same source model), the KV cache is compatible. Verify this in testing: mismatched cache dimensions will produce garbage output or segfault through the JNI layer.
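One way to make that verification cheap at runtime rather than test-time only. A sketch built on the assumed nCtx/nEmbd bridge additions above; on mismatch it drops the cache instead of migrating:

// Sketch: snapshot the old context's shape before freeing it, then only
// restore the KV cache if the new context matches.
fun swapVerified(oldContext: Long, target: GgufTier): Long {
    val expectedCtx = LlamaBridge.nCtx(oldContext)
    val expectedEmbd = LlamaBridge.nEmbd(oldContext)
    val state = LlamaBridge.serializeKvCache(oldContext)
    LlamaBridge.freeContext(oldContext)

    val fresh = LlamaBridge.loadModel(target.fileName)
    if (LlamaBridge.nCtx(fresh) == expectedCtx &&
        LlamaBridge.nEmbd(fresh) == expectedEmbd
    ) {
        LlamaBridge.deserializeKvCache(fresh, state)
    }
    // else: start cold -- losing context beats garbage output or a segfault
    return fresh
}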

Tier comparison under pressure

Scenario                              | Q8_0    | Q5_K_S  | Q4_K_M
--------------------------------------|---------|---------|--------
RAM usage (7B model)                  | ~7.2 GB | ~4.8 GB | ~3.4 GB
Tokens/sec (Snapdragon 8 Gen 2, cool) | ~12     | ~18     | ~24
Tokens/sec (thermally throttled)      | ~7      | ~14     | ~20
Perplexity delta vs FP16              | +0.05   | +0.12   | +0.18

The throughput advantage of the lower quantization tiers widens under thermal constraints, which is exactly when you need it.

What to do with all this

Treat quantization selection as a runtime decision, not a build-time one. Ship all three GGUF shards in your APK (or download them on demand via Play Asset Delivery) and let device conditions drive the choice.

Prioritize thermal state over memory pressure. Memory warnings give you seconds to react; thermal throttling gives you milliseconds of degraded performance before the OS intervenes. Wire PowerManager.addThermalStatusListener() first.

Invest in KV cache serialization early. Mid-session shard swapping without cache migration destroys the user experience. The JNI work to expose llama.cpp’s llama_copy_state_data / llama_set_state_data is non-trivial but pays off immediately.


TAGS: android, kotlin, architecture, mobile, kmp

