Adaptive Bitrate Model Loading on Android: Dynamic GGUF Shard Selection Based on Runtime Memory Pressure and Thermal State
The problem with static model loading
Most Android on-device inference implementations pick a single GGUF quantization level at startup and pray. Load Q8_0 on a Pixel 7 with 8GB RAM and you get great quality, until the user switches to a background music app and ActivityManager.getMemoryInfo() reports lowMemory = true. Load Q4_K_M defensively and you leave performance on the table for flagship devices.
What most teams get wrong: they treat model loading as a one-shot decision. In production, device conditions are non-stationary.
Architecture overview
The adaptive loader has three components:
| Component | Responsibility | Android API |
|---|---|---|
| MemoryPressureMonitor | Tracks available RAM, triggers level changes | ActivityManager.getMemoryInfo(), ComponentCallbacks2.onTrimMemory() |
| ThermalStateObserver | Monitors thermal throttling state | PowerManager.addThermalStatusListener() (API 29+) |
| ShardOrchestrator | Manages shard selection, swap logic, KV cache migration | Custom implementation over llama.cpp JNI bindings |
The core idea: treat quantization tiers exactly like video bitrate tiers in HLS/DASH. Step down gracefully under pressure, step up when headroom returns.
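The ABR analogy implies asymmetric switching: downshift immediately under pressure, but upshift only after sustained headroom, so the loader doesn't oscillate between tiers. A minimal sketch of such a hysteresis policy, independent of any Android API (`TierPolicy` and its parameters are illustrative, not part of the loader described below):

```kotlin
// ABR-style tier policy with hysteresis: step down immediately under
// pressure, step up only after `upshiftPatience` consecutive healthy
// evaluations. Tier 0 is the highest-quality tier.
class TierPolicy(private val tierCount: Int, private val upshiftPatience: Int = 3) {
    var currentTier: Int = tierCount - 1  // start conservative: lowest quality
        private set
    private var healthyStreak = 0

    fun onEvaluation(underPressure: Boolean) {
        if (underPressure) {
            healthyStreak = 0
            currentTier = minOf(currentTier + 1, tierCount - 1)  // downshift fast
        } else if (++healthyStreak >= upshiftPatience) {
            healthyStreak = 0
            currentTier = maxOf(currentTier - 1, 0)              // upshift slowly
        }
    }
}
```

The asymmetry mirrors HLS/DASH players: a bad switch down costs quality for a while; a bad switch up can stall the whole session.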
Shard tier definitions
```kotlin
enum class GgufTier(
    val fileName: String,
    val estimatedRamMb: Int,
    val qualityScore: Float
) {
    HIGH("model-q8_0.gguf", 7200, 0.95f),
    MEDIUM("model-q5_k_s.gguf", 4800, 0.88f),
    LOW("model-q4_k_m.gguf", 3400, 0.82f)
}
```
These RAM estimates are for a 7B parameter model. In practice, the actual footprint varies by ~8-12% depending on context length and batch size, so always add a buffer.
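That buffer can be made explicit before comparing an estimate against headroom. A minimal helper (the 15% default is an assumption chosen to cover the observed 8-12% variance with margin, not a measured constant):

```kotlin
// Pad a tier's estimated footprint by a safety buffer before comparing
// it against available headroom. The 15% default covers the observed
// 8-12% context/batch variance with some margin (assumed value).
fun fitsInHeadroom(
    estimatedRamMb: Int,
    headroomMb: Long,
    bufferFraction: Double = 0.15
): Boolean = headroomMb >= (estimatedRamMb * (1 + bufferFraction)).toLong()
```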
Memory pressure monitor
```kotlin
import android.app.ActivityManager
import android.content.Context
import androidx.core.content.getSystemService

class MemoryPressureMonitor(private val context: Context) {
    // The ktx getSystemService<T>() extension returns a nullable type;
    // ActivityManager is always present, so fail fast if it's missing.
    private val activityManager =
        requireNotNull(context.getSystemService<ActivityManager>())

    /** Available RAM above the kernel's low-memory threshold, in MB. */
    fun availableHeadroomMb(): Long {
        val memInfo = ActivityManager.MemoryInfo()
        activityManager.getMemoryInfo(memInfo)
        return (memInfo.availMem - memInfo.threshold) / (1024 * 1024)
    }

    fun recommendTier(): GgufTier {
        val headroom = availableHeadroomMb()
        return when {
            headroom > 8000 -> GgufTier.HIGH
            headroom > 5500 -> GgufTier.MEDIUM
            else -> GgufTier.LOW
        }
    }
}
```
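Polling `getMemoryInfo()` covers the steady state; the component table above also lists `ComponentCallbacks2.onTrimMemory()`, which is how the OS pushes pressure events to the app between polls. A sketch of the push side (`onPressure` is a hypothetical hook, not part of the monitor class above), registered via `context.registerComponentCallbacks(...)`:

```kotlin
import android.content.ComponentCallbacks2
import android.content.res.Configuration

// Push-based complement to the polling monitor: the OS invokes
// onTrimMemory() when system-wide memory pressure changes.
// TRIM_MEMORY_RUNNING_LOW (and every higher level) is a strong signal
// to downshift without waiting for the next poll.
class TrimMemoryCallbacks(private val onPressure: () -> Unit) : ComponentCallbacks2 {
    override fun onTrimMemory(level: Int) {
        if (level >= ComponentCallbacks2.TRIM_MEMORY_RUNNING_LOW) onPressure()
    }
    override fun onConfigurationChanged(newConfig: Configuration) = Unit
    override fun onLowMemory() = onPressure()
}
```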
Thermal state observer
Android API 29+ actually gives us something useful here. PowerManager.THERMAL_STATUS_MODERATE and above should trigger an immediate downshift. Thermal throttling murders inference throughput before it kills your process.
```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager
import androidx.core.content.getSystemService
import java.util.concurrent.Executors
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow

class ThermalStateObserver(context: Context) {
    private val powerManager =
        requireNotNull(context.getSystemService<PowerManager>())
    private val _thermalState = MutableStateFlow(PowerManager.THERMAL_STATUS_NONE)
    val thermalState: StateFlow<Int> = _thermalState.asStateFlow()

    init {
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
            // Dedicated executor keeps thermal callbacks off the main thread.
            powerManager.addThermalStatusListener(
                Executors.newSingleThreadExecutor()
            ) { status ->
                _thermalState.value = status
            }
        }
    }

    fun shouldDownshift(): Boolean =
        _thermalState.value >= PowerManager.THERMAL_STATUS_MODERATE
}
```
The numbers back this up. On a sustained inference workload, a Snapdragon 8 Gen 2 hitting THERMAL_STATUS_MODERATE typically sees 30-40% throughput degradation on Q8_0. Dropping to Q5_K_S recovers most of that because you reduce both memory bandwidth pressure and compute load.
Mid-session shard swapping with KV cache migration
This is the hard part. Naively, swapping shards means discarding the KV cache and losing conversational context. The workaround: serialize the KV cache from the active llama.cpp context, unload the current shard, load the new shard, then deserialize the KV cache into the new context.
```kotlin
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

class ShardOrchestrator(
    private val memoryMonitor: MemoryPressureMonitor,
    private val thermalObserver: ThermalStateObserver
) {
    private var activeTier: GgufTier = GgufTier.MEDIUM
    private var llamaContext: Long = 0L  // JNI pointer into llama.cpp
    private val swapMutex = Mutex()      // never run two swaps concurrently

    suspend fun evaluateAndSwap() = swapMutex.withLock {
        val targetTier = if (thermalObserver.shouldDownshift()) {
            // Thermal pressure: step down one tier (toward LOW), clamped
            // at the last entry.
            GgufTier.entries[minOf(activeTier.ordinal + 1, GgufTier.entries.lastIndex)]
        } else {
            memoryMonitor.recommendTier()
        }
        if (targetTier != activeTier) {
            // Preserve conversational state across the shard swap.
            val kvCacheBytes = LlamaBridge.serializeKvCache(llamaContext)
            LlamaBridge.freeContext(llamaContext)
            llamaContext = LlamaBridge.loadModel(targetTier.fileName)
            LlamaBridge.deserializeKvCache(llamaContext, kvCacheBytes)
            activeTier = targetTier
        }
    }
}
```
One caveat that will bite you: KV cache dimensions differ across quantization levels in some configurations. If your GGUF shards all share the same base architecture and context length (which they should, if generated from the same source model), the KV cache is compatible. Verify this in testing: mismatched cache dimensions will produce garbage output or segfault through the JNI layer.
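One defensive pattern is to check cache geometry before attempting migration and fall back to a cold swap (discarding the cache) when it differs. A sketch, assuming hypothetical `LlamaBridge` accessors (`kvHeadCount`, `embeddingDim`, and `contextLength` are illustrative names, not real bindings — expose whatever your JNI layer can read from the loaded model's metadata):

```kotlin
// Guard against mismatched KV cache geometry before migrating state.
// All three accessors are hypothetical JNI bindings over the loaded
// model's metadata; a mismatch in any of them means the serialized
// cache cannot be restored into the new context.
fun canMigrateKvCache(oldCtx: Long, newCtx: Long): Boolean =
    LlamaBridge.kvHeadCount(oldCtx) == LlamaBridge.kvHeadCount(newCtx) &&
    LlamaBridge.embeddingDim(oldCtx) == LlamaBridge.embeddingDim(newCtx) &&
    LlamaBridge.contextLength(oldCtx) == LlamaBridge.contextLength(newCtx)
```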
Tier comparison under pressure
| Scenario | Q8_0 | Q5_K_S | Q4_K_M |
|---|---|---|---|
| RAM usage (7B model) | ~7.2 GB | ~4.8 GB | ~3.4 GB |
| Tokens/sec (Snapdragon 8 Gen 2, cool) | ~12 | ~18 | ~24 |
| Tokens/sec (thermally throttled) | ~7 | ~14 | ~20 |
| Perplexity delta vs FP16 | +0.05 | +0.12 | +0.18 |
The throughput advantage of lower quantization tiers widens under thermal constraints, which is exactly when you need it.
What to do with all this
Treat quantization selection as a runtime decision, not a build-time one. Ship all three GGUF shards in your APK (or download them on demand via Play Asset Delivery) and let device conditions drive the choice.
Prioritize thermal state over memory pressure. Memory warnings give you seconds to react; thermal throttling gives you milliseconds of degraded performance before the OS intervenes. Wire PowerManager.addThermalStatusListener() first.
Invest in KV cache serialization early. Mid-session shard swapping without cache migration destroys the user experience. The JNI work to expose llama.cpp’s llama_copy_state_data / llama_set_state_data is non-trivial but pays off immediately.
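For reference, the `LlamaBridge` surface used throughout might look like the following set of `external` declarations. This is a sketch of an assumed JNI layer, not llama.cpp's actual Java API; the native side would map the serialize/deserialize calls onto `llama_copy_state_data` / `llama_set_state_data`:

```kotlin
// Hypothetical JNI surface backing the snippets above. Library name
// and all signatures are assumptions for this sketch.
object LlamaBridge {
    init { System.loadLibrary("llama_bridge") }  // assumed .so name

    external fun loadModel(fileName: String): Long        // returns context pointer
    external fun freeContext(ctx: Long)
    external fun serializeKvCache(ctx: Long): ByteArray   // llama_copy_state_data
    external fun deserializeKvCache(ctx: Long, state: ByteArray)  // llama_set_state_data
}
```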