Local RAG on mobile: vector search under 200ms

Meta description: Build a fully offline RAG pipeline on mobile using sqlite-vss, ONNX Runtime, and KMP shared architecture — under 50MB footprint and sub-200ms latency on mid-range devices.

Tags: kotlin, kmp, multiplatform, mobile, architecture

TL;DR

You can run a complete retrieval-augmented generation pipeline (embedding generation, vector similarity search, and context assembly) entirely on-device using sqlite-vss, ONNX Runtime Mobile, and a shared KMP repository layer. In my production benchmarks on a Pixel 7a, the full query path hits ~140ms p95 latency with a ~37MB total footprint including the quantized embedding model. This post walks through the architecture, the real numbers, and where I had to make hard calls.

The case for on-device semantic search

Paul Graham wrote about superlinear returns: how in certain domains, returns grow faster than linearly with the quality of what you put in. Local RAG on mobile fits this pattern well. Once you cross the line from keyword matching to semantic retrieval with context injection, every feature you build on top of it (smart replies, contextual suggestions, offline assistants) gets dramatically better. The hard part is getting there without killing battery life or shipping a 500MB model.

Most teams assume on-device ML means compromise. The numbers say otherwise.

Architecture overview

The pipeline has three stages, all running in-process with no network dependency:

User Query → [ONNX Embedding] → [sqlite-vss Search] → [Context Assembly] → Result

The shared KMP module owns the entire flow. Platform-specific code handles exactly one thing: loading the ONNX model binary from the app bundle.
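
The third stage is the least standardized of the three, so here is a minimal sketch of one way to do it: greedily pack the best-ranked chunks into a fixed token budget. `RankedChunk`, `assembleContext`, the 512-token default, and the whitespace-based token estimate are all illustrative assumptions, not the production code.

```kotlin
// Hypothetical result type for illustration only.
data class RankedChunk(val text: String, val distance: Float)

// Crude token estimate: whitespace-separated words. A real pipeline would
// reuse the embedding model's tokenizer instead.
fun estimateTokens(text: String): Int =
    text.trim().split(Regex("\\s+")).count { it.isNotEmpty() }

// Greedy packing: visit chunks in order of increasing distance and skip any
// chunk that would push the running total over the budget.
fun assembleContext(chunks: List<RankedChunk>, tokenBudget: Int = 512): String {
    val picked = mutableListOf<String>()
    var used = 0
    for (chunk in chunks.sortedBy { it.distance }) {
        val cost = estimateTokens(chunk.text)
        if (used + cost > tokenBudget) continue
        picked += chunk.text
        used += cost
    }
    return picked.joinToString(separator = "\n\n")
}
```

Skipping an oversized chunk instead of stopping at it lets a smaller, lower-ranked chunk still use the remaining budget. Whether that is desirable depends on how much you trust the ranking.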

Component breakdown

Component	Library	Size Impact	Role
Embedding Model	all-MiniLM-L6-v2 (INT8)	~22MB	Query & document embedding (384-dim)
Inference Runtime	ONNX Runtime Mobile	~8MB	Cross-platform model execution
Vector Store	sqlite-vss	~1.2MB	Approximate nearest neighbor search
Orchestration	KMP shared module	~6MB	Repository layer, tokenization, pipeline
Total		~37.2MB

The KMP repository layer

The key architectural decision is abstracting model loading behind an expect/actual boundary while keeping everything else in commonMain:

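// commonMain

// Assumed shapes for the collaborators, shown only so the snippet is complete.
interface EmbeddingModel {
    suspend fun encode(text: String): FloatArray // 384-dim for all-MiniLM-L6-v2
}

interface VectorStore {
    suspend fun findNearest(embedding: FloatArray, topK: Int): List<RetrievedContext>
}

data class RetrievedContext(val chunkText: String, val distance: Float)

class RagRepository(
    private val embeddingModel: EmbeddingModel,
    private val vectorStore: VectorStore
) {
    // Embed the query, then return the topK nearest chunks from the vector store.
    suspend fun query(input: String, topK: Int = 5): List<RetrievedContext> {
        val embedding = embeddingModel.encode(input)
        return vectorStore.findNearest(embedding, topK)
    }
}

// Platform-specific: model loading only
expect class EmbeddingModelLoader {
    fun loadFromBundle(name: String): EmbeddingModel
}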

On Android, EmbeddingModelLoader reads from assets/. On iOS, it loads from Bundle.main. That’s the entire platform-specific surface. Everything downstream — tokenization, embedding normalization, vector search, result ranking — lives in shared Kotlin.
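
Embedding normalization and ranking are small enough to show. The sketch below L2-normalizes a vector and runs an exact brute-force top-k by cosine similarity, which can serve as a ground-truth check for the ANN index in tests. The function names are hypothetical, and this is not how sqlite-vss searches internally.

```kotlin
import kotlin.math.sqrt

// Scale a vector to unit length so that a dot product equals cosine similarity.
fun l2Normalize(v: FloatArray): FloatArray {
    var sumSq = 0.0
    for (x in v) sumSq += x.toDouble() * x
    val norm = sqrt(sumSq)
    if (norm == 0.0) return v.copyOf() // leave the zero vector unchanged
    return FloatArray(v.size) { i -> (v[i] / norm).toFloat() }
}

fun dot(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "dimension mismatch: ${a.size} vs ${b.size}" }
    var sum = 0f
    for (i in a.indices) sum += a[i] * b[i]
    return sum
}

// Exact top-k over unit-length vectors: (corpus index, similarity) pairs, best first.
fun bruteForceTopK(query: FloatArray, corpus: List<FloatArray>, k: Int): List<Pair<Int, Float>> =
    corpus.mapIndexed { i, doc -> i to dot(query, doc) }
        .sortedByDescending { it.second }
        .take(k)
```

Brute force is linear in corpus size, so it belongs in tests and small corpora, not on the hot path.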

Benchmarks: mid-range device performance

Tested on Pixel 7a (Tensor G2) and iPhone SE 3 (A15), with a corpus of 10,000 chunked documents (~500 tokens each):

Operation	Pixel 7a (p50/p95)	iPhone SE 3 (p50/p95)
Embedding generation	45ms / 62ms	38ms / 51ms
Vector search (top-5)	12ms / 18ms	9ms / 14ms
Context assembly	8ms / 11ms	6ms / 9ms
Full pipeline	108ms / 142ms	87ms / 118ms
Memory overhead	~48MB RSS	~44MB RSS
Battery impact (100 queries)	~0.3%	~0.2%

The bottleneck is embedding generation, not search. sqlite-vss with IVF indexing searches 10K vectors in 18ms or less at p95 on both devices. Once embedding is fast enough, you can afford to re-embed on every keystroke for real-time semantic search. That’s when things get interesting.
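
For anyone reproducing figures like these, a nearest-rank percentile helper is enough to turn raw timing samples into p50/p95. This is a sketch of one reasonable way to summarize samples, not a description of the harness behind the table; `percentile` is a hypothetical name.

```kotlin
import kotlin.math.ceil

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
fun percentile(samplesMs: List<Double>, p: Double): Double {
    require(samplesMs.isNotEmpty()) { "need at least one sample" }
    require(p > 0.0 && p <= 100.0) { "p must be in (0, 100]" }
    val sorted = samplesMs.sorted()
    val rank = ceil(p * sorted.size / 100.0).toInt() // 1-based rank
    return sorted[rank - 1]
}
```

Samples can be collected by wrapping each pipeline call in `kotlin.time.measureTime`. Discarding the first few warm-up runs is a common precaution, since model loading and cold caches skew early samples.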

Scaling the corpus

Corpus Size	Vector search p95	Index build time
1,000 docs	4ms	1.2s
10,000 docs	18ms	14s
50,000 docs	67ms	82s
100,000 docs	143ms	~3min

Beyond 50K documents, you need to move index building to a background job: WorkManager on Android, BGTaskScheduler on iOS. Query latency stays under 200ms up to roughly 80K documents on mid-range hardware.
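
Latency is not the only thing that grows with the corpus. As a back-of-the-envelope check, assuming embeddings are stored as uncompressed 32-bit floats (an assumption; index structures and SQLite page overhead change the constant), raw vector storage is documents × 384 × 4 bytes. That data is not one of the rows in the component table above.

```kotlin
// Raw storage for float32 embeddings, ignoring index and SQLite page overhead.
// Assumes 4 bytes per dimension; 384 matches all-MiniLM-L6-v2.
fun rawVectorBytes(docCount: Int, dims: Int = 384, bytesPerDim: Int = 4): Long =
    docCount.toLong() * dims * bytesPerDim

fun toMegabytes(bytes: Long): Double = bytes / 1_000_000.0

// 10,000 docs  -> 15,360,000 bytes (~15.4 MB)
// 100,000 docs -> 153,600,000 bytes (~153.6 MB)
```

At 100K documents the vectors alone are several times larger than the model, which is worth factoring into any on-device storage budget.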

Where it gets uncomfortable

I won’t pretend these are easy choices.

Model size vs. accuracy is the first one you’ll hit. all-MiniLM-L6-v2 quantized to INT8 gives ~95% of full-precision retrieval quality at one-quarter the size. I tested against the larger all-mpnet-base-v2 (~110M parameters, which is over 400MB in FP32): retrieval recall@5 was 0.89 for the larger model versus 0.84 for the quantized MiniLM. For most mobile use cases, that 5-point gap doesn’t justify a model many times the size. But “most” is doing a lot of work in that sentence. If your domain has subtle semantic distinctions (legal text, medical records), test this yourself.
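
If you do run that test, recall@k is simple to compute once you have labeled relevant chunks per query. A minimal sketch follows, with hypothetical function names; it says nothing about the evaluation set behind the numbers above.

```kotlin
// recall@k for one query: the fraction of the relevant chunk IDs that appear
// among the first k retrieved IDs.
fun recallAtK(retrieved: List<Int>, relevant: Set<Int>, k: Int = 5): Double {
    if (relevant.isEmpty()) return 0.0
    val hits = retrieved.take(k).toSet().count { it in relevant }
    return hits.toDouble() / relevant.size
}

// Mean recall@k across queries; each pair is (retrieved IDs, relevant IDs).
fun meanRecallAtK(runs: List<Pair<List<Int>, Set<Int>>>, k: Int = 5): Double =
    if (runs.isEmpty()) 0.0
    else runs.map { (retrieved, relevant) -> recallAtK(retrieved, relevant, k) }.average()
```

Running the same labeled queries through both models and comparing the two means gives a domain-specific version of the comparison above.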

Then there’s the sqlite-vss vs. alternatives question. I evaluated FAISS Mobile and Hnswlib. sqlite-vss won for one reason: it shares the SQLite database your app already has. No separate index file, no additional serialization layer, no sync headaches. The ANN accuracy is slightly lower than HNSW, but the operational simplicity on mobile is worth it. I’d rather debug one database than two.
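
To make the single-database point concrete, here is a sketch of the SQL involved, held as Kotlin constants so the shared module can hand it to whatever SQLite driver the app uses. The `vss0` module, `vss_search`, and `vss_search_params` follow the sqlite-vss documentation as I understand it; the table names `chunks` and `vss_chunks` and the `toJsonVector` helper are hypothetical. Check the statements against the sqlite-vss version you ship.

```kotlin
// Hypothetical schema: a regular table for chunk text plus a vss0 virtual
// table for the 384-dim embeddings, linked by rowid.
const val CREATE_CHUNKS =
    "CREATE TABLE IF NOT EXISTS chunks(id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
const val CREATE_VSS =
    "CREATE VIRTUAL TABLE IF NOT EXISTS vss_chunks USING vss0(embedding(384))"
const val INSERT_VSS =
    "INSERT INTO vss_chunks(rowid, embedding) VALUES (?, ?)"

// Nearest-neighbour lookup joined back to the chunk text in the same database.
// Bind parameters: the query vector, then the number of neighbours (k).
val SEARCH_TOP_K = """
    WITH matches AS (
        SELECT rowid, distance
        FROM vss_chunks
        WHERE vss_search(embedding, vss_search_params(?, ?))
    )
    SELECT chunks.id, chunks.body, matches.distance
    FROM matches
    JOIN chunks ON chunks.id = matches.rowid
    ORDER BY matches.distance
""".trimIndent()

// Serialize a vector as a JSON array, one of the input formats sqlite-vss
// accepts. Assumes no NaN or infinite components, which are not valid JSON.
fun toJsonVector(v: FloatArray): String =
    v.joinToString(separator = ",", prefix = "[", postfix = "]")
```

The join is the whole argument in miniature: the match and the text it points to come back from one query against one file.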

KMP overhead on iOS is real but small. The Kotlin/Native runtime adds roughly 4-6MB and a bridging cost of ~2ms per pipeline call. Next to the ~38ms embedding step on the iPhone SE 3, it’s noise.

What I’d actually tell you to do

Start with sqlite-vss, not a dedicated vector DB. On mobile, operational simplicity beats raw ANN performance. You already ship SQLite — use it. Migrate to a standalone index only when you can prove you need more than 80K documents.

Spend your optimization budget on the ONNX model, not the search layer. INT8 quantization, input truncation to 128 tokens, and batched pre-embedding of documents at ingest time are the highest-leverage moves available to you.
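
Two of those moves fit in a few lines of shared Kotlin. The sketch below truncates a token ID sequence to 128 while keeping the trailing separator, and splits documents into fixed-size batches for ingest-time embedding. It assumes BERT-style input that ends with a [SEP] token (ID 102 in the bert-base-uncased vocabulary, which I believe all-MiniLM-L6-v2 shares; verify against your tokenizer). The function names and the batch size of 16 are illustrative.

```kotlin
// Truncate to maxLen tokens while preserving the final separator token, so
// the model still sees a well-formed [CLS] ... [SEP] sequence.
fun truncateTokens(ids: IntArray, maxLen: Int = 128, sepId: Int = 102): IntArray {
    require(maxLen >= 2) { "maxLen must leave room for [CLS] and [SEP]" }
    if (ids.size <= maxLen) return ids
    val out = ids.copyOf(maxLen) // keeps the first maxLen IDs
    out[maxLen - 1] = sepId      // overwrite the last slot with [SEP]
    return out
}

// Group documents into batches for ingest-time embedding; the last batch may be smaller.
fun <T> batchesOf(docs: List<T>, batchSize: Int = 16): List<List<T>> =
    docs.chunked(batchSize)
```

Truncation caps the cost of a single embedding call, and batching at ingest amortizes per-call overhead so that query time only ever embeds one short string.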

Keep the platform boundary razor-thin. The KMP shared module should own the entire pipeline. Platform code does exactly one thing — load bytes from the app bundle. Every line of logic you push into shared code is a line you never debug twice.