MVP Factory

On-Device RAG for Android: Running Embedding Models, Vector Search in SQLite, and the Retrieval Architecture That Keeps Sensitive Data Off the Wire

Krystian Wiewiór · 5 min read

TL;DR

You can build a fully offline retrieval-augmented generation pipeline on Android. No server round-trips, no data leaving the device. The stack: quantized embedding models via ONNX Runtime Mobile, vector indexing in SQLite with sqlite-vec, smart chunking that respects mobile memory constraints, and a local LLM inference loop wired into a streaming Compose UI. Below is the architecture I’ve been refining, along with where most teams go wrong when they attempt it.

Why on-device RAG matters

Server-side RAG is well understood. But the moment you deal with medical records, financial documents, legal contracts, or enterprise data subject to compliance, pushing embeddings or raw text to a cloud endpoint becomes a liability. On-device RAG keeps sensitive data off the wire entirely.

The tradeoff is real: constrained memory, limited compute, no beefy GPU cluster to fall back on. But modern quantized models have made this viable on flagship and even mid-range Android hardware. The numbers back it up.

The architecture stack

| Layer | Component | Role |
|---|---|---|
| Embedding | ONNX Runtime Mobile (INT8 quantized) | Convert text chunks to dense vectors |
| Indexing | sqlite-vec (SQLite extension) | Store and query vectors with fast nearest-neighbor search |
| Chunking | Custom chunker with overlap | Split documents into memory-safe segments |
| Generation | Local LLM (GGUF via llama.cpp bindings) | Generate responses from retrieved context |
| UI | Jetpack Compose with streaming tokens | Display results incrementally |

Embedding: ONNX Runtime Mobile

The embedding model is the bottleneck that matters most. You need a model small enough to run in under 200ms per chunk on-device, but accurate enough to produce meaningful retrieval results.

Take a model like all-MiniLM-L6-v2 (384-dimensional output, ~22M parameters), export it to ONNX format, then apply INT8 dynamic quantization. This shrinks the model from ~90MB to ~23MB while preserving most retrieval quality.

class OnDeviceEmbedder(context: Context) {
    private val env: OrtEnvironment = OrtEnvironment.getEnvironment()

    private val session: OrtSession = env.createSession(
        context.assets.open("minilm-quantized.onnx").readBytes(),
        OrtSession.SessionOptions().apply {
            // Don't busy-wait between ops; spinning wastes battery on mobile.
            addConfigEntry("session.intra_op.allow_spinning", "0")
            setIntraOpNumThreads(2)
        }
    )

    // WordPiece tokenizer matching the model's vocab (implementation omitted).
    private val tokenizer = WordPieceTokenizer(context.assets.open("vocab.txt"))

    fun embed(text: String): FloatArray {
        val ids: LongArray = tokenizer.encode(text)
        // Shape [1, seqLen]: the model expects a leading batch dimension.
        // Depending on the export, attention_mask / token_type_ids may also be required.
        val inputTensor = OnnxTensor.createTensor(env, arrayOf(ids))
        session.run(mapOf("input_ids" to inputTensor)).use { result ->
            // Mean-pool per-token embeddings into a single 384-dim sentence vector.
            return meanPooling(result)
        }
    }
}
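The meanPooling helper is referenced above but not shown. A minimal sketch of the pooling math, assuming the model's output tensor has already been unpacked into an array of per-token embeddings and that an attention mask marks padding positions with zeros:

```kotlin
// Average the per-token embeddings into a single sentence vector,
// counting only non-padding positions (mask == 1).
fun meanPooling(tokenEmbeddings: Array<FloatArray>, attentionMask: IntArray): FloatArray {
    val dim = tokenEmbeddings[0].size
    val pooled = FloatArray(dim)
    var count = 0
    for (i in tokenEmbeddings.indices) {
        if (attentionMask[i] == 0) continue  // skip padding tokens
        count++
        for (d in 0 until dim) pooled[d] += tokenEmbeddings[i][d]
    }
    for (d in 0 until dim) pooled[d] /= count
    return pooled
}
```

If you plan to rank with cosine distance, L2-normalize the pooled vector once here rather than at query time.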

Limit intraOpNumThreads to 2. Mobile CPUs thermal-throttle fast. Saturating all cores gives you a burst of speed followed by a cliff. Two threads sustain consistent throughput.

Vector search: sqlite-vec in SQLite

sqlite-vec, the successor to sqlite-vss by Alex Garcia, is a lightweight SQLite extension built specifically for vector search. Unlike sqlite-vss, which pulled in Faiss as a dependency, sqlite-vec is a zero-dependency, single-C-file implementation. On mobile, that difference is huge: smaller binary, simpler build, no native library headaches.

CREATE VIRTUAL TABLE doc_embeddings USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[384]
);

-- Query: find top-5 nearest chunks
SELECT chunk_id, distance
FROM doc_embeddings
WHERE embedding MATCH ?
ORDER BY distance
LIMIT 5;

For corpora under 50,000 chunks, which covers most on-device use cases, brute-force search in sqlite-vec runs in single-digit milliseconds. You get the full transactional guarantees of SQLite for free: atomic writes, crash recovery, single-file portability.
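On the Kotlin side, the query can be wired up roughly as follows. This is a sketch, not a drop-in implementation: the stock Android framework SQLite cannot load extensions, so it assumes a bundled SQLite build that can (such as requery's sqlite-android fork, whose rawQuery accepts arbitrary bind args), with sqlite-vec loaded, and it assumes vectors are bound as little-endian float32 blobs:

```kotlin
import io.requery.android.database.sqlite.SQLiteDatabase
import java.nio.ByteBuffer
import java.nio.ByteOrder

data class Hit(val chunkId: Long, val distance: Double)

// Serialize a FloatArray as the little-endian float32 blob sqlite-vec expects.
fun FloatArray.toBlob(): ByteArray {
    val buf = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN)
    forEach(buf::putFloat)
    return buf.array()
}

fun search(db: SQLiteDatabase, queryVector: FloatArray, k: Int): List<Hit> {
    val hits = mutableListOf<Hit>()
    db.rawQuery(
        "SELECT chunk_id, distance FROM doc_embeddings " +
            "WHERE embedding MATCH ? ORDER BY distance LIMIT ?",
        arrayOf(queryVector.toBlob(), k)
    ).use { c ->
        while (c.moveToNext()) hits += Hit(c.getLong(0), c.getDouble(1))
    }
    return hits
}
```

Verify the blob-binding behavior against the sqlite-vec and SQLite-build versions you actually ship; the vector encoding contract is the part most worth a test.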

Chunking for mobile memory

This is where most teams go wrong: they port their server-side chunking strategy directly to mobile. A 512-token chunk with 50-token overlap is fine when you have 64GB of RAM. On a device with 6-8GB shared between the OS, your app, and every background process, you need to be more disciplined.

What I recommend:

  • Chunk size: 256 tokens max
  • Overlap: 32 tokens (12.5%)
  • Strategy: sentence-boundary-aware splitting, never breaking mid-sentence
  • Budget: keep total indexed corpus under 10,000 chunks to bound SQLite DB size to ~15-20MB
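The policy above can be sketched as a greedy, sentence-boundary-aware chunker. This is illustrative: the naive regex sentence splitter and the whitespace token counter are stand-ins, and in production you would count tokens with the same tokenizer the embedder uses:

```kotlin
// Greedy chunker: pack whole sentences until the token budget is hit, then
// start the next chunk by carrying trailing sentences forward as overlap.
fun chunk(
    text: String,
    maxTokens: Int = 256,
    overlapTokens: Int = 32,
    tokenCount: (String) -> Int = { it.split(Regex("\\s+")).count { w -> w.isNotEmpty() } }
): List<String> {
    val sentences = text.split(Regex("(?<=[.!?])\\s+")).filter { it.isNotBlank() }
    val chunks = mutableListOf<String>()
    var current = mutableListOf<String>()
    var budget = 0
    for (sentence in sentences) {
        val cost = tokenCount(sentence)
        if (budget + cost > maxTokens && current.isNotEmpty()) {
            chunks += current.joinToString(" ")
            // Carry trailing sentences forward until ~overlapTokens are covered.
            val overlap = mutableListOf<String>()
            var carried = 0
            for (s in current.asReversed()) {
                if (carried >= overlapTokens) break
                overlap.add(0, s)
                carried += tokenCount(s)
            }
            current = overlap
            budget = carried
        }
        current += sentence
        budget += cost
    }
    if (current.isNotEmpty()) chunks += current.joinToString(" ")
    return chunks
}
```

Because sentences are never split, a chunk can slightly exceed the budget when overlap plus a long sentence collide; that is usually an acceptable trade for clean boundaries.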

Wiring the inference loop

The retrieval-to-generation pipeline in a coroutine-based architecture:

fun ragQuery(query: String): Flow<String> = flow {
    val queryVector = embedder.embed(query)
    val topChunks = vectorDb.search(queryVector, k = 5)
    val context = topChunks.joinToString("\n\n") { it.text }

    val prompt = """
        |Given the following context, answer the question.
        |Context: $context
        |Question: $query
    """.trimMargin()

    // Stream tokens straight through to the collector.
    emitAll(localLlm.generate(prompt))
}.flowOn(Dispatchers.Default) // keep blocking embed/search work off the main thread

On the Compose side, collect this flow into a mutableStateOf string. Each emitted token appends to the displayed text, giving users that streaming feel without any network dependency.
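A minimal sketch of that collection, assuming a hypothetical RagViewModel that exposes the ragQuery flow (the names are illustrative, not from a real API):

```kotlin
@Composable
fun AnswerScreen(viewModel: RagViewModel, query: String) {
    var answer by remember { mutableStateOf("") }

    // Re-run whenever the query changes; each emitted token appends to the
    // state, so Compose recomposes and the answer grows incrementally.
    LaunchedEffect(query) {
        answer = ""
        viewModel.ragQuery(query).collect { token -> answer += token }
    }

    Text(text = answer)
}
```

For long answers, appending to a plain String is fine at 8-15 tokens/sec; only consider a more elaborate buffer if profiling shows recomposition cost.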

Performance expectations

| Operation | Typical latency (flagship SoC) |
|---|---|
| Embed single chunk (INT8, 256 tokens) | 30-80 ms |
| Vector search (10K chunks, brute-force) | 2-8 ms |
| LLM first token (3B param, Q4 quantized) | 500 ms-1.5 s |
| LLM token throughput | 8-15 tokens/sec |

These ranges reflect what I’ve seen on recent Snapdragon 8-series and Tensor G-series hardware. Mid-range chipsets will sit at the slower end or beyond these ranges.

What to take away from this

Start with the embedding model, not the LLM. Retrieval quality gates everything downstream. A mediocre LLM with excellent retrieval outperforms a strong LLM with poor retrieval. Get your embedding pipeline right first, measure recall, then layer in generation.
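Measuring recall needs nothing fancy. A recall@k helper over a small hand-labeled query set, sketched in Kotlin (the labeled set is an assumption on your side; the article's pipeline doesn't provide one):

```kotlin
// Fraction of queries whose labeled relevant chunk appears in the top-k results.
fun recallAtK(
    results: Map<String, List<Long>>,  // query -> ranked chunk ids returned
    relevant: Map<String, Long>,       // query -> the chunk a human marked correct
    k: Int
): Double {
    val hits = relevant.count { (query, chunkId) ->
        results[query].orEmpty().take(k).contains(chunkId)
    }
    return hits.toDouble() / relevant.size
}
```

Even 30-50 labeled queries are enough to catch a bad chunking or quantization decision before you ever wire in the LLM.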

Use sqlite-vec over sqlite-vss for mobile. Zero dependencies, smaller binary, simpler cross-compilation. For the corpus sizes that are realistic on-device, brute-force search is fast enough. You don’t need HNSW complexity on a phone.

Respect the thermal envelope. Cap ONNX Runtime threads, batch your embedding work with delays between batches, and profile on real mid-range devices, not just your development flagship. Thermal throttling is your true constraint, not peak FLOPS.
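The batching-with-delays advice can be sketched as a small indexing loop. This assumes kotlinx.coroutines and treats the embed and store callbacks as stand-ins for the embedder and SQLite writer described earlier; the batch size and cool-down values are starting points to tune on real hardware, not measured optima:

```kotlin
import kotlinx.coroutines.delay

// Index chunks in small batches with a cool-down pause between them, so the
// SoC can shed heat instead of throttling mid-run.
suspend fun indexChunks(
    chunks: List<String>,
    embed: (String) -> FloatArray,
    store: (Int, FloatArray) -> Unit,
    batchSize: Int = 16,
    coolDownMs: Long = 250
) {
    chunks.chunked(batchSize).forEachIndexed { batchIdx, batch ->
        batch.forEachIndexed { i, text ->
            store(batchIdx * batchSize + i, embed(text))
        }
        delay(coolDownMs)  // yield and let the thermal governor catch up
    }
}
```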

