Gemini Nano On-Device Function Calling for Android: Structured Output, Token Budget Constraints, and the Architecture That Makes Offline AI Agents Practical
TL;DR
Gemini Nano now supports on-device function calling and structured JSON output, which makes offline-capable AI agents viable on Android. But the 32K context window and quantized model hallucinations demand a different architecture than cloud-first approaches. This post covers what that architecture looks like in production, where it breaks, and the WorkManager + Room pipeline that keeps everything reliable.
The on-device shift after I/O 2026
Google’s expansion of Gemini Nano capabilities at I/O 2026 moved on-device AI past smart autocomplete into genuine agent territory. Function calling and structured output are the two features that matter. Your app can now define tool schemas, send them to a local model, and receive structured JSON actions, all without a network round-trip.
In my experience building production systems, “works in the demo” and “works on a user’s mid-range phone” are separated by a canyon of architectural decisions. I’ll walk through the ones that matter.
The 32K token budget problem
Cloud Gemini Flash gives you a context window of 1M+ tokens. Gemini Nano gives you roughly 32K tokens on-device, and that budget must cover your system prompt, tool definitions, conversation history, and the model’s response. Most teams get this wrong by porting their cloud schemas directly.
A cloud-friendly tool schema for a calendar agent might define 15 tools with rich descriptions. On-device, you need to be surgical.
    // Bad: verbose schema that eats your token budget
    val cloudSchema = Tool(
        name = "create_calendar_event",
        description = "Creates a new calendar event with the specified title, " +
            "date, time, duration, location, attendees, recurrence pattern, " +
            "reminder settings, and optional notes...",
        parameters = /* 12 parameters with long descriptions */
    )

    // Good: minimal schema optimized for on-device budget
    val nanoSchema = Tool(
        name = "cal_create",
        description = "Create event",
        parameters = listOf(
            Param("title", "string", required = true),
            Param("iso_time", "string", required = true),
            Param("dur_min", "int", required = true)
        )
    )
The numbers tell the story. A trimmed schema set of 5 tools consumes roughly 800-1,200 tokens, leaving headroom for meaningful conversation context. A verbose 15-tool schema can eat 4,000+ tokens before a single user message.
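To make the budget concrete, here is a rough sketch of a pre-flight check that trims conversation history to fit. It assumes a crude heuristic of about four characters per token, which is only an approximation and not how Gemini Nano's tokenizer actually counts. `estimateTokens`, `PromptBudget`, and the 2,000-token response reserve are hypothetical names and numbers chosen for illustration.

```kotlin
// Rough pre-flight budget check. The 4-chars-per-token ratio is an
// assumption for illustration, not the real tokenizer.
const val CONTEXT_LIMIT = 32_000
const val CHARS_PER_TOKEN = 4

fun estimateTokens(text: String): Int =
    (text.length + CHARS_PER_TOKEN - 1) / CHARS_PER_TOKEN  // round up

data class PromptBudget(
    val systemPrompt: String,
    val toolSchemas: List<String>,    // serialized tool definitions
    val history: List<String>,        // oldest message first
    val responseReserve: Int = 2_000  // tokens held back for the model's reply
) {
    // Tokens that are spent before any conversation history is added.
    val fixedTokens: Int
        get() = estimateTokens(systemPrompt) +
            toolSchemas.sumOf { estimateTokens(it) } +
            responseReserve

    // Keep the most recent messages that still fit; drop the oldest first.
    fun trimmedHistory(): List<String> {
        var remaining = CONTEXT_LIMIT - fixedTokens
        val kept = ArrayDeque<String>()
        for (message in history.asReversed()) {
            val cost = estimateTokens(message)
            if (cost > remaining) break
            remaining -= cost
            kept.addFirst(message)
        }
        return kept.toList()
    }
}
```

Reserving space for the response up front is a deliberate choice in this sketch: it is easier to drop old history than to recover from a reply that was cut off mid-JSON.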
On-device vs. cloud: architecture tradeoffs
| Dimension | Gemini Nano (on-device) | Gemini Flash (cloud) |
|---|---|---|
| Context window | ~32K tokens | 1M+ tokens |
| Latency (first token) | 80-200ms | 300-800ms (network dependent) |
| Function call reliability | Degrades with schema complexity | Stable across complex schemas |
| Structured JSON consistency | Requires validation + retry | Generally reliable |
| Availability | Always-on, no network needed | Requires connectivity |
| Cost per call | Zero marginal | Per-token API pricing |
The latency advantage matters for interactive mobile UX. But the reliability gap is the architectural challenge you actually need to design around.
Handling hallucinations in structured output
A quantized on-device model hallucinates more than its cloud counterpart. For function calling, this shows up as malformed JSON, invented parameter names, or calls to tools that don’t exist in your schema.
The defense is a three-layer validation pipeline:
    fun parseAgentAction(raw: String): AgentAction? {
        // Layer 1: Extract JSON from response (model may wrap it in markdown)
        val json = JsonExtractor.findFirst(raw) ?: return null

        // Layer 2: Validate against registered tool schemas
        val parsed = try {
            toolRegistry.parse(json)
        } catch (e: SchemaValidationException) {
            null
        }

        // Layer 3: Semantic bounds checking
        return parsed?.takeIf { action ->
            semanticValidator.isReasonable(action)
            // e.g., dur_min in 1..480, title.length < 200
        }
    }
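In practice, Layer 1 resolves roughly half of all failures. In those cases the model returns a valid function call but wraps it in explanatory text or a markdown fence, so the raw response fails to parse as JSON even though the call itself is fine. A solid JSON extractor is table stakes here.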
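The post does not show how `JsonExtractor.findFirst` works, so here is one possible sketch under a hypothetical name, `findFirstJsonObject`. It scans for the first balanced `{...}` span and ignores braces that appear inside string literals. It does not check that the span is valid JSON; that remains Layer 2's job.

```kotlin
// Hypothetical stand-in for JsonExtractor.findFirst: returns the first
// balanced {...} span in the text, or null if none is found.
// It does not validate the JSON; schema validation happens in Layer 2.
fun findFirstJsonObject(raw: String): String? {
    var start = raw.indexOf('{')
    while (start != -1) {
        var depth = 0
        var inString = false
        var escaped = false
        for (i in start until raw.length) {
            val c = raw[i]
            if (inString) {
                when {
                    escaped -> escaped = false   // skip the escaped character
                    c == '\\' -> escaped = true
                    c == '"' -> inString = false
                }
            } else {
                when (c) {
                    '"' -> inString = true
                    '{' -> depth++
                    '}' -> {
                        depth--
                        if (depth == 0) return raw.substring(start, i + 1)
                    }
                }
            }
        }
        // No balanced object starts at this '{'; try the next candidate.
        start = raw.indexOf('{', start + 1)
    }
    return null
}
```

Tracking string state matters because event titles and notes can legitimately contain braces, and a naive depth counter would cut the object short.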
The WorkManager + Room offline pipeline
Where on-device function calling really earns its keep is offline operation. A user on an airplane says “schedule a team sync for Tuesday at 2pm.” Gemini Nano parses the intent and produces a structured action locally. But the calendar API requires connectivity.
The architecture is straightforward:
- Gemini Nano produces a structured AgentAction
- Room database persists the action with status PENDING
- WorkManager enqueues a OneTimeWorkRequest with network constraints
- An executor processes the action when connectivity returns, updating status to COMPLETED or FAILED
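    import androidx.room.Entity
    import androidx.room.PrimaryKey
    import androidx.work.Constraints
    import androidx.work.NetworkType
    import androidx.work.OneTimeWorkRequestBuilder
    import androidx.work.WorkManager
    import androidx.work.workDataOf

    @Entity(tableName = "agent_actions")
    data class AgentAction(
        @PrimaryKey(autoGenerate = true) val id: Long = 0,
        val toolName: String,
        val paramsJson: String,
        val status: ActionStatus = ActionStatus.PENDING,
        val createdAt: Long = System.currentTimeMillis()
    )

    // Queue for execution when network is available.
    // Note: action.id must be the row ID returned by the Room insert;
    // on a freshly constructed AgentAction it is still 0.
    val request = OneTimeWorkRequestBuilder<ActionExecutorWorker>()
        .setConstraints(
            Constraints.Builder()
                .setRequiredNetworkType(NetworkType.CONNECTED)
                .build()
        )
        .setInputData(workDataOf("action_id" to action.id))
        .build()

    WorkManager.getInstance(context).enqueue(request)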
This pattern gives you immediate user feedback (“Got it, I’ll create that event when you’re back online”) while the action waits safely for connectivity. Room provides durability across process death, and WorkManager persists the request and reruns it with exponential backoff (its default policy) whenever the worker returns Result.retry().
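The executor itself is not shown above. As a framework-free sketch of the decision it has to make, the snippet below maps the outcome of the remote call to a new action status and to a retry decision. Everything here is an assumption for illustration: `ApiResult`, `Decision`, `WorkOutcome`, and `MAX_ATTEMPTS` are hypothetical, and `WorkOutcome` merely mirrors the success, retry, and failure results a WorkManager worker can return.

```kotlin
// Pure-Kotlin model of the executor's decision logic (no Android types).
enum class ActionStatus { PENDING, COMPLETED, FAILED }

// Mirrors the three results a worker can report: success, retry, failure.
enum class WorkOutcome { SUCCESS, RETRY, FAILURE }

// Hypothetical classification of what the remote API call returned.
sealed class ApiResult {
    object Ok : ApiResult()
    data class TransientError(val reason: String) : ApiResult()  // e.g. timeout
    data class PermanentError(val reason: String) : ApiResult()  // e.g. bad request
}

data class Decision(val status: ActionStatus, val outcome: WorkOutcome)

const val MAX_ATTEMPTS = 5  // illustrative cap, not a WorkManager default

// `attempt` is zero-based: 0 on the first run, 1 on the first retry, and so on.
fun decide(result: ApiResult, attempt: Int): Decision = when (result) {
    is ApiResult.Ok ->
        Decision(ActionStatus.COMPLETED, WorkOutcome.SUCCESS)
    is ApiResult.PermanentError ->
        Decision(ActionStatus.FAILED, WorkOutcome.FAILURE)
    is ApiResult.TransientError ->
        if (attempt + 1 >= MAX_ATTEMPTS) {
            Decision(ActionStatus.FAILED, WorkOutcome.FAILURE)  // give up
        } else {
            Decision(ActionStatus.PENDING, WorkOutcome.RETRY)   // stay queued
        }
}
```

Separating permanent from transient errors keeps a malformed action from being retried forever, and capping attempts ensures every row eventually leaves PENDING.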
What I’d do from here
Shrink your tool schemas aggressively. Budget 1,200 tokens maximum for tool definitions. Use short names, minimal descriptions, and cap at 5 tools per agent context. Swap tool sets dynamically based on user intent rather than loading everything at once.
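One way to swap tool sets by intent is a cheap router that runs before the model call. The sketch below uses naive keyword matching purely for illustration; the tool names, keyword lists, and `selectTools` function are all hypothetical, and a production router would likely need something more robust.

```kotlin
data class ToolSpec(val name: String, val description: String)

const val MAX_TOOLS = 5  // cap per agent context, as recommended above

// Hypothetical tool groups keyed by intent; names are illustrative only.
val toolSets: Map<String, List<ToolSpec>> = mapOf(
    "calendar" to listOf(
        ToolSpec("cal_create", "Create event"),
        ToolSpec("cal_list", "List events")
    ),
    "notes" to listOf(
        ToolSpec("note_add", "Add note"),
        ToolSpec("note_find", "Find note")
    )
)

val intentKeywords: Map<String, List<String>> = mapOf(
    "calendar" to listOf("schedule", "meeting", "event", "calendar"),
    "notes" to listOf("note", "remember", "write down")
)

// Naive keyword router: picks the intent with the most keyword hits and
// returns its tools, or an empty list when nothing matches.
fun selectTools(userMessage: String): List<ToolSpec> {
    val text = userMessage.lowercase()
    val best = intentKeywords
        .mapValues { (_, words) -> words.count { it in text } }
        .filterValues { it > 0 }
        .maxByOrNull { it.value }
        ?.key
    return best?.let { toolSets[it] }.orEmpty().take(MAX_TOOLS)
}
```

Only the selected group is serialized into the prompt, so the schema cost stays flat as the total number of tools in the app grows.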
Build validation in layers. JSON extraction, schema validation, semantic bounds checking. The on-device model will produce malformed output at a meaningful rate, and your app should recover gracefully rather than crash.
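Graceful recovery usually means a bounded retry around generation and validation. The sketch below is one possible shape, not the post's implementation: `generate` stands in for the on-device model call, `parse` stands in for the three-layer parser, and the retry hint wording is made up.

```kotlin
// Bounded retry around generation + validation. `generate` and `parse` are
// placeholders for the model call and the three-layer parser.
fun <T : Any> generateWithRetry(
    prompt: String,
    maxAttempts: Int = 3,
    generate: (String) -> String,
    parse: (String) -> T?
): T? {
    var currentPrompt = prompt
    repeat(maxAttempts) {
        val raw = generate(currentPrompt)
        parse(raw)?.let { return it }  // first valid action wins
        // Nudge the model on the next attempt; wording is illustrative.
        currentPrompt = "$prompt\nRespond with a single JSON object only."
    }
    return null  // caller falls back to a clarifying question or a no-op
}
```

Returning null instead of throwing leaves the fallback to the caller, which can ask the user to rephrase rather than surface an error.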
Adopt the WorkManager + Room queue pattern early, even if your initial use case is online-only. This architecture lets you go offline with zero refactoring. The persistence layer also doubles as an audit log of every agent action, which is useful for debugging and for showing users what the agent did on their behalf.