MVP Factory

Structured output from on-device LLMs on Android with GBNF

Krystian Wiewiór · 6 min read

Meta description: Learn how to enforce JSON schema output and build offline agent loops with on-device LLMs on Android using GBNF grammars, llama.cpp, and Kotlin coroutines.

Tags: kotlin, android, llm, structured-output, on-device-ai, jetpack-compose


TL;DR

Getting raw text out of an on-device LLM is the easy part. What actually matters is structured output: guaranteed-valid JSON that your app can parse without crossing its fingers. Combine GBNF grammars in llama.cpp with a coroutine-based agent loop, and you get multi-step reasoning features that run entirely offline, stay under thermal budgets, and never drop a frame. This post covers the full stack: grammar-constrained decoding, function-calling dispatch, and the architecture that ties it together.


Why structured output matters

I’ve spent enough time building production systems to know that the gap between “model generates text” and “model drives application logic” comes down to one thing: parseable output. A chatbot can tolerate freeform text. An agent loop that dispatches tool calls cannot.

The ReAct pattern (Yao et al., 2023) showed that interleaving reasoning and action steps produces far better results from language models, but each step depends on structured contracts between the reasoning trace and the tool dispatch layer. On the server side, you get this from API-level JSON mode. On-device, you have to enforce it yourself.

Most teams try to prompt-engineer their way to valid JSON. That works 80-something percent of the time. The rest of the time, it crashes your app.
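To make that failure mode concrete, here is a minimal sketch of the retry wrapper most teams end up writing around a prompt-only model call. The `generate` lambda and the crude validity check are hypothetical stand-ins, purely for illustration:

```kotlin
// Hypothetical retry wrapper around a prompt-only model call.
// Each failed parse costs a full extra generation, which is where
// the 2-3x latency multiplier on failure comes from.
fun generateJsonWithRetry(
    generate: () -> String,   // stand-in for an actual model invocation
    maxAttempts: Int = 3
): Result<String> {
    repeat(maxAttempts) {
        val output = generate().trim()
        // Crude validity check; real code would parse the JSON fully.
        if (output.startsWith("{") && output.endsWith("}")) {
            return Result.success(output)
        }
    }
    return Result.failure(
        IllegalStateException("No valid JSON after $maxAttempts attempts")
    )
}
```

This is exactly the code that grammar-constrained decoding lets you delete.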

Grammar-constrained decoding with GBNF

llama.cpp supports GBNF grammars: formal grammar definitions that constrain token sampling at decode time. Instead of hoping the model outputs valid JSON, you guarantee it.

A simplified example:

root   ::= "{" ws members ws "}"
members ::= pair ("," ws pair)*
pair   ::= string ws ":" ws value
value  ::= string | number | "true" | "false" | "null"
string ::= "\"" [a-zA-Z0-9_ ]+ "\""
number ::= [0-9]+
ws     ::= [ \t\n]*

Note: This grammar covers flat key-value objects for illustration purposes. It doesn’t handle nested objects or arrays. For production use, see the full tool-call grammar below, or refer to the GBNF guide in the llama.cpp repository for comprehensive JSON grammars with recursive value definitions.

The principle is what matters: every sampled token is validated against the grammar state machine. If a token would violate the grammar, its logit is masked to negative infinity before softmax. The model literally cannot produce output that breaks your schema.
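As a toy illustration of that masking step (not llama.cpp's actual implementation), assume the grammar state machine has already told us which token ids are currently legal:

```kotlin
import kotlin.math.exp

// Toy sketch of grammar-constrained sampling: tokens the grammar
// forbids get their logit set to negative infinity, so softmax
// assigns them exactly zero probability. Assumes the grammar always
// permits at least one token.
fun maskedSoftmax(logits: FloatArray, allowed: Set<Int>): FloatArray {
    val masked = FloatArray(logits.size) { i ->
        if (i in allowed) logits[i] else Float.NEGATIVE_INFINITY
    }
    val maxLogit = masked.maxOrNull()!!  // subtract max for numerical stability
    val exps = masked.map { exp((it - maxLogit).toDouble()) }
    val sum = exps.sum()
    return FloatArray(logits.size) { i -> (exps[i] / sum).toFloat() }
}
```

Forbidden tokens come out with probability exactly zero, so no amount of sampling temperature can break the schema.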

Defining a tool-call schema

For function calling, you want something tighter than “any JSON.” Define a grammar that matches your specific tool schema:

val toolCallGrammar = """
root    ::= "{" ws "\"tool\"" ws ":" ws tool-name ws "," 
             ws "\"args\"" ws ":" ws "{" ws args ws "}" ws "}"
tool-name ::= "\"search\"" | "\"calculate\"" | "\"summarize\""
args    ::= pair ("," ws pair)*
pair    ::= string ws ":" ws value
value   ::= string | number | bool | object
object  ::= "{" ws (pair ("," ws pair)*)? ws "}"
bool    ::= "true" | "false"
string  ::= "\"" [^"]* "\""
number  ::= "-"? [0-9]+ ("." [0-9]+)?
ws      ::= [ \t\n]*
""".trimIndent()

This handles nested objects in arguments and restricts tool names to your known set. The model can’t hallucinate a tool that doesn’t exist.

The performance trade-off

We tested this on a Pixel 8 with Llama 3.2 3B quantized to Q4_K_M, across roughly 1,000 structured extraction calls:

Approach | Valid JSON rate | Latency overhead | Recovery cost
--- | --- | --- | ---
Prompt-only | ~80-85% | None | Retry (2-3x latency)
Regex post-filter | ~90% | Minimal | Partial retry
GBNF grammar | 100% | ~5-8% decode time | None

Your results will vary with model size and grammar complexity, but the direction is consistent: constrained decoding eliminates an entire class of runtime failures. That 5-8% overhead is a bargain. No retries, no error handling for malformed output, no defensive parsing.
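A rough back-of-envelope check on that claim, using the table's numbers. With per-call success probability p and retries assumed independent (which is optimistic), the expected number of attempts is 1/p:

```kotlin
// Expected latency multiplier from retry-on-invalid-JSON, assuming
// independent attempts each succeeding with probability p: E[attempts] = 1/p.
fun expectedLatencyMultiplier(p: Double): Double = 1.0 / p

// At ~85% validity, prompt-only pays roughly an 18% expected latency
// overhead from retries alone, with high variance on the unlucky calls.
// The grammar's fixed 5-8% decode overhead is both cheaper and deterministic.
```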

The coroutine-based agent loop

With structured output guaranteed, you can build a proper agent loop. Nothing fancy here:

sealed class AgentAction {
    data class ToolCall(val tool: String, val args: Map<String, Any>) : AgentAction()
    data class FinalAnswer(val text: String) : AgentAction()
}

suspend fun agentLoop(
    prompt: String,
    llamaEngine: LlamaEngine,
    maxSteps: Int = 5
): String = withContext(Dispatchers.Default) {
    var context = prompt
    repeat(maxSteps) { step ->
        val output = llamaEngine.generate(
            prompt = context,
            grammar = toolCallGrammar,
            maxTokens = 256
        )
        when (val action = parseAction(output)) {
            is AgentAction.ToolCall -> {
                val result = dispatch(action.tool, action.args)
                context += "\nObservation: $result\nThought:"
            }
            is AgentAction.FinalAnswer -> return@withContext action.text
        }
        ensureActive()
    }
    "Max steps reached"
}

A few things worth calling out:

  • Dispatchers.Default, not Dispatchers.IO. Inference is CPU-bound, not I/O-bound. You want the shared thread pool sized to core count.
  • ensureActive() at each step. If the user navigates away, cancel the loop instead of burning battery.
  • Bounded steps. An unbounded agent loop on a mobile device is a thermal throttling event waiting to happen.
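The loop leans on parseAction, which I haven't defined. A minimal sketch, assuming the flat shape the tool-call grammar guarantees (string-valued args, no nesting); the AgentAction type from above is repeated so the sketch stands alone. A real implementation should still prefer a proper JSON parser such as kotlinx.serialization:

```kotlin
// AgentAction as defined earlier, repeated for self-containment.
sealed class AgentAction {
    data class ToolCall(val tool: String, val args: Map<String, Any>) : AgentAction()
    data class FinalAnswer(val text: String) : AgentAction()
}

// Because the grammar guarantees the {"tool": ..., "args": {...}} shape,
// even regex-based extraction is safe here. Output without a "tool" key
// (e.g. produced under a different final-answer grammar) is treated as
// the final answer.
fun parseAction(output: String): AgentAction {
    val tool = Regex("\"tool\"\\s*:\\s*\"([^\"]+)\"").find(output)
        ?.groupValues?.get(1)
        ?: return AgentAction.FinalAnswer(output)
    val argsBlock = Regex("\"args\"\\s*:\\s*\\{([^}]*)\\}").find(output)
        ?.groupValues?.get(1).orEmpty()
    val args: Map<String, Any> = Regex("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"")
        .findAll(argsBlock)
        .associate { it.groupValues[1] to it.groupValues[2] }
    return AgentAction.ToolCall(tool, args)
}
```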

Keeping 60fps and respecting thermal budgets

The agent loop runs on a background dispatcher, but you need two more mechanisms to play nice with the device:

// PowerManager.getCurrentThermalStatus() requires API 29+
val thermalStatus = context.getSystemService<PowerManager>()
    ?.currentThermalStatus ?: PowerManager.THERMAL_STATUS_NONE

if (thermalStatus >= PowerManager.THERMAL_STATUS_MODERATE) {
    llamaEngine.setThreadCount(2)  // shed CPU load before the next step
    delay(200)                     // pace the loop to let the SoC cool
}

Thermal status | Thread count | Step delay | Token limit
--- | --- | --- | ---
NONE / LIGHT | 4 | 0ms | 256
MODERATE | 2 | 200ms | 128
SEVERE | 1 | 500ms | 64
CRITICAL | Pause loop | - | -
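That table can be expressed as a small policy function. The ThermalPolicy type is my own; the numbers mirror the table and should be tuned against your actual device matrix:

```kotlin
// Mirrors the ordering of Android's PowerManager.THERMAL_STATUS_*
// constants without requiring the Android runtime.
enum class ThermalStatus { NONE, LIGHT, MODERATE, SEVERE, CRITICAL }

data class ThermalPolicy(
    val threadCount: Int,
    val stepDelayMs: Long,
    val tokenLimit: Int,
    val pause: Boolean = false
)

// One place that decides how aggressively inference may run
// at each thermal level.
fun policyFor(status: ThermalStatus): ThermalPolicy = when (status) {
    ThermalStatus.NONE, ThermalStatus.LIGHT ->
        ThermalPolicy(threadCount = 4, stepDelayMs = 0, tokenLimit = 256)
    ThermalStatus.MODERATE ->
        ThermalPolicy(threadCount = 2, stepDelayMs = 200, tokenLimit = 128)
    ThermalStatus.SEVERE ->
        ThermalPolicy(threadCount = 1, stepDelayMs = 500, tokenLimit = 64)
    ThermalStatus.CRITICAL ->
        ThermalPolicy(threadCount = 0, stepDelayMs = 0, tokenLimit = 0, pause = true)
}
```

Centralizing the policy keeps the agent loop free of thermal branching: it just asks for the current policy at the top of each step.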

The UI thread stays clean because all inference is off-main-thread. Emit partial results via StateFlow and let Jetpack Compose recompose on collection. No LiveData, no callbacks, no frame drops.

When on-device agents make sense

I want to be honest about the limits. This architecture doesn’t replace server-side agents. Models in the 1-7B parameter range handle structured extraction, classification, and simple multi-step reasoning well. They struggle with complex planning or knowledge-intensive tasks. Choose on-device when you need offline capability, latency under 500ms for short generations, or when user data must never leave the device.


What to take from this

Use GBNF grammars, not prompts, to enforce structured output. The 5-8% decode overhead eliminates an entire class of runtime errors and removes retry logic from your codebase.

Bound your agent loop and monitor thermal state. Mobile devices are not servers. Cap iteration count, reduce thread count under thermal pressure, and always respect coroutine cancellation.

Treat the on-device model as a structured-output engine, not a chatbot. Define tight grammars matching your tool schemas, dispatch deterministically, and keep the reasoning chain short. That’s where small models actually shine.

