
Running Vision-Language Models On-Device in Android

Krystian Wiewiór · 5 min read


TL;DR: Running vision-language models on Android requires splitting inference across two delegates — GPU for the CLIP vision encoder, NNAPI/CPU for the language decoder — with INT4 quantization on the LM head and INT8 on the vision tower. Combined with a CameraX frame buffer pipeline and structured Kotlin coroutine streaming, you can achieve real-time image understanding without melting the device.


The dual-model problem

Vision-language models like LLaVA and MobileVLM aren’t single models. They’re two models stitched together: a CLIP-family vision encoder that converts images into embedding vectors, and a language model decoder that consumes those embeddings to generate text. In my experience building production on-device ML systems, this dual-model reality is where most teams hit their first wall.

On a server, you throw both at a beefy GPU and move on. On a Snapdragon 8 Gen 3 or Tensor G4 with shared memory, thermal budgets, and a camera preview that users expect at 60fps, you need a different strategy entirely.

The split-delegate architecture

The core insight: the vision encoder and language decoder have different computational profiles and should run on different hardware delegates.

| Component | Optimal Delegate | Quantization | Typical Latency (Pixel 8 Pro) | Memory Footprint |
|---|---|---|---|---|
| CLIP Vision Encoder | GPU Delegate | INT8 | ~40-80ms per frame | ~150-300MB |
| Language Decoder (1.3B-3B params) | NNAPI / CPU | INT4 (GPTQ/AWQ) | ~200-500ms per token | ~800MB-1.5GB |
| Projection Layer | CPU | FP16 | <5ms | Negligible |

The vision encoder is dense matrix math. It maps cleanly onto GPU shader cores via TFLite’s GPU delegate. The language decoder, with its autoregressive token-by-token generation, benefits less from GPU parallelism and often runs better on NNAPI or even optimized CPU paths with XNNPACK.
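In code, the split is just two Interpreter instances with different Options. A minimal TFLite sketch, where the model buffers, thread count, and delegate choices are placeholders to tune per device:

```kotlin
import java.nio.MappedByteBuffer
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate

// Sketch of the split: one Interpreter per component, each with its own
// delegate. Thread count and the XNNPACK choice are illustrative.
fun buildInterpreters(
    visionModel: MappedByteBuffer,
    decoderModel: MappedByteBuffer
): Pair<Interpreter, Interpreter> {
    // Vision encoder: dense matmuls map cleanly onto GPU shader cores.
    val visionOptions = Interpreter.Options().apply {
        addDelegate(GpuDelegate())
    }
    // Language decoder: autoregressive decode gains little from GPU
    // parallelism; XNNPACK-accelerated CPU (or NnApiDelegate on
    // supported SoCs) is usually the better fit.
    val decoderOptions = Interpreter.Options().apply {
        setUseXNNPACK(true)
        setNumThreads(4)
    }
    return Interpreter(visionModel, visionOptions) to
        Interpreter(decoderModel, decoderOptions)
}
```

Benchmark both assignments on your actual target SoCs before committing; delegate behavior varies widely between GPU drivers.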

Quantization trade-offs

Most teams get this wrong: they apply the same quantization strategy to both components.

The vision tower is sensitive to aggressive quantization. Dropping CLIP to INT4 measurably degrades embedding quality, which cascades into worse language output. INT8 symmetric quantization preserves visual fidelity with minimal accuracy loss.

The language decoder, conversely, tolerates INT4 well, especially with group-wise quantization (GPTQ with 128-group size or AWQ). The perplexity increase is marginal, but the memory savings are real: a 3B-parameter decoder drops from ~6GB (FP16) to ~1.5GB (INT4).
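The arithmetic behind those numbers is worth keeping handy. A back-of-envelope sketch:

```kotlin
// Back-of-envelope weight memory: params × bits / 8. This ignores
// activations, the KV cache, and group-quantization scale overhead,
// so treat it as a floor, not a budget.
fun weightBytes(params: Long, bitsPerWeight: Int): Long =
    params * bitsPerWeight / 8

fun main() {
    val decoder = 3_000_000_000L  // 3B-parameter decoder
    println(weightBytes(decoder, 16))  // 6000000000 → ~6 GB in FP16
    println(weightBytes(decoder, 4))   // 1500000000 → ~1.5 GB in INT4
}
```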

CameraX frame buffer pipeline

Feeding camera frames into the vision encoder requires careful buffer management. The goal: capture frames without blocking the preview.

import android.graphics.Bitmap
import androidx.camera.core.ImageAnalysis
import androidx.camera.core.ImageProxy
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.flow.map
import kotlinx.coroutines.flow.receiveAsFlow
import org.tensorflow.lite.Interpreter

class VLMFrameAnalyzer(
    private val visionEncoder: Interpreter
) : ImageAnalysis.Analyzer {

    // Capacity 1 + DROP_OLDEST: the channel only ever holds the freshest frame.
    private val frameChannel = Channel<Bitmap>(
        capacity = 1,
        onBufferOverflow = BufferOverflow.DROP_OLDEST
    )

    override fun analyze(imageProxy: ImageProxy) {
        val bitmap = imageProxy.toBitmap()  // CameraX extension function
        frameChannel.trySend(bitmap)
        imageProxy.close()  // always close immediately to release the buffer
    }

    fun embeddings(): Flow<FloatArray> = frameChannel.receiveAsFlow()
        .map { bitmap ->
            val input = preprocessForCLIP(bitmap, 224)
            val output = Array(1) { FloatArray(768) }
            visionEncoder.run(input, output)
            output[0]
        }
        .flowOn(Dispatchers.Default)  // flowOn affects upstream operators, so it must come after map

The critical detail: DROP_OLDEST on the channel. Under sustained inference, you will fall behind real-time. Dropping stale frames is correct behavior. Users want the model to reason about what the camera sees now, not 400ms ago.
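The preprocessForCLIP helper above is left undefined; one plausible shape for it, assuming NHWC float32 input and CLIP's published per-channel normalization constants (center-crop omitted for brevity):

```kotlin
import android.graphics.Bitmap
import android.graphics.Color
import java.nio.ByteBuffer
import java.nio.ByteOrder

// CLIP's published normalization constants (RGB mean and std).
private val MEAN = floatArrayOf(0.48145466f, 0.4578275f, 0.40821073f)
private val STD = floatArrayOf(0.26862954f, 0.26130258f, 0.27577711f)

fun preprocessForCLIP(bitmap: Bitmap, size: Int): ByteBuffer {
    val scaled = Bitmap.createScaledBitmap(bitmap, size, size, true)
    // 1 × size × size × 3 channels × 4 bytes per float, NHWC layout.
    val buffer = ByteBuffer.allocateDirect(size * size * 3 * 4)
        .order(ByteOrder.nativeOrder())
    for (y in 0 until size) {
        for (x in 0 until size) {
            val px = scaled.getPixel(x, y)
            buffer.putFloat((Color.red(px) / 255f - MEAN[0]) / STD[0])
            buffer.putFloat((Color.green(px) / 255f - MEAN[1]) / STD[1])
            buffer.putFloat((Color.blue(px) / 255f - MEAN[2]) / STD[2])
        }
    }
    buffer.rewind()
    return buffer
}
```

A per-pixel getPixel loop is fine as a sketch; in production you'd batch via getPixels or push the resize/normalize onto the GPU delegate itself.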

Streaming pipeline with Kotlin coroutines

The full pipeline connects CameraX → vision encoder → projection → language decoder as a structured coroutine flow:

fun runVLMPipeline(
    analyzer: VLMFrameAnalyzer,
    decoder: LanguageDecoder,
    prompt: String
): Flow<String> = analyzer.embeddings()
    .sample(500)  // limit to ~2 inferences/sec
    .map { embeddings -> decoder.generate(prompt, embeddings) }
    .flowOn(Dispatchers.Default)

The sample(500) operator is your thermal throttling knob. Dual-model workloads push SoC temperatures up fast, and sampling at 500ms intervals keeps most devices under their thermal limits during sustained use.
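The knob need not be fixed, either. A sketch of adapting the interval to the device's reported thermal status (the status codes mirror android.os.PowerManager; the interval values are illustrative, not benchmarked):

```kotlin
// Thermal status codes as reported by PowerManager (API 29+):
// THERMAL_STATUS_NONE = 0, _LIGHT = 1, _MODERATE = 2, _SEVERE = 3, ...
fun sampleIntervalMs(thermalStatus: Int): Long = when {
    thermalStatus <= 1 -> 500L   // NONE / LIGHT: ~2 inferences per second
    thermalStatus == 2 -> 1000L  // MODERATE: back off to 1 per second
    else -> 2000L                // SEVERE and above: crawl, or pause entirely
}
```

Subscribe to status changes with PowerManager.addThermalStatusListener and feed the result into whatever replaces the fixed sample(500) in your pipeline.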

Memory pressure management

Running two models simultaneously on a device with 8-12GB total RAM (shared with the OS, other apps, and the camera HAL) takes discipline:

  • Lazy-load the language decoder. Keep only the vision encoder resident during camera preview. Load the decoder on first query.
  • Memory-map model weights via TFLite’s MappedByteBuffer. This lets the OS page out inactive segments under pressure.
  • Monitor ComponentCallbacks2 and downgrade gracefully: drop to vision-only mode on TRIM_MEMORY_RUNNING_LOW.
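The last point can be sketched as a registered callback; register it with Context.registerComponentCallbacks, and note that the release hook here is a placeholder for whatever lifecycle your decoder wrapper exposes:

```kotlin
import android.content.ComponentCallbacks2
import android.content.res.Configuration

// Sketch of graceful degradation under memory pressure.
class VLMMemoryCallbacks(
    private val releaseDecoder: () -> Unit
) : ComponentCallbacks2 {

    override fun onTrimMemory(level: Int) {
        if (level >= ComponentCallbacks2.TRIM_MEMORY_RUNNING_LOW) {
            // Drop to vision-only mode; the memory-mapped decoder
            // can be reloaded lazily on the next query.
            releaseDecoder()
        }
    }

    override fun onConfigurationChanged(newConfig: Configuration) = Unit
    override fun onLowMemory() = Unit
}
```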

What to ship with

Split your delegates intentionally. GPU for the vision encoder, NNAPI/CPU for the language decoder. Don’t run both on the same delegate. You’ll hit contention and get worse throughput than splitting.

Quantize asymmetrically. INT8 for the vision tower to preserve embedding quality, INT4 for the language decoder to fit in memory. Test embedding cosine similarity against FP16 baselines before shipping.

Design for thermal steady-state, not peak throughput. Sample frames, throttle inference frequency, and instrument ThermalStatusListener. The fastest model is worthless if the device throttles to half speed after 30 seconds.

On-device VLMs are viable today. But only if you respect the hardware constraints instead of fighting them.


Tags: android, kotlin, architecture, mobile, jetpackcompose

