Gemini Nano On-Device Function Calling for Android: Structured Output, Token Budget Constraints, and the Architecture That Makes Offline AI Agents Practical

TL;DR

Gemini Nano now supports on-device function calling and structured JSON output, which makes offline-capable AI agents viable on Android. But the 32K context window and quantized model hallucinations demand a different architecture than cloud-first approaches. This post covers what that architecture looks like in production, where it breaks, and the WorkManager + Room pipeline that keeps everything reliable.

The on-device shift after I/O 2026

Google’s expansion of Gemini Nano capabilities at I/O 2026 moved on-device AI past smart autocomplete into genuine agent territory. Function calling and structured output are the two features that matter. Your app can now define tool schemas, send them to a local model, and receive structured JSON actions, all without a network round-trip.

In my experience building production systems, “works in the demo” and “works on a user’s mid-range phone” are separated by a canyon of architectural decisions. I’ll walk through the ones that matter.

The 32K token budget problem

Cloud Gemini Flash gives you a context window of 1M+ tokens. Gemini Nano gives you roughly 32K tokens on-device. That budget must cover your system prompt, tool definitions, conversation history, and the model’s response. Most teams get this wrong by porting their cloud schemas directly.

A cloud-friendly tool schema for a calendar agent might define 15 tools with rich descriptions. On-device, you need to be surgical.

// Bad: verbose schema that eats your token budget
val cloudSchema = Tool(
    name = "create_calendar_event",
    description = "Creates a new calendar event with the specified title, " +
        "date, time, duration, location, attendees, recurrence pattern, " +
        "reminder settings, and optional notes...",
    parameters = emptyList() // stands in for 12 parameters with long descriptions
)

// Good: minimal schema optimized for on-device budget
val nanoSchema = Tool(
    name = "cal_create",
    description = "Create event",
    parameters = listOf(
        Param("title", "string", required = true),
        Param("iso_time", "string", required = true),
        Param("dur_min", "int", required = true)
    )
)

The numbers tell the story. A trimmed schema set of 5 tools consumes roughly 800-1,200 tokens, leaving headroom for meaningful conversation context. A verbose 15-tool schema can eat 4,000+ tokens before a single user message.
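To make that arithmetic concrete, here is a rough sketch of a budget check. It assumes about 4 characters per token, which is a common rule of thumb rather than Gemini Nano's real tokenizer, and it ignores the JSON punctuation a serialized schema carries, so real counts will run higher. `ToolSpec`, `estimateTokens`, and `remainingBudget` are hypothetical names, not part of any SDK.

```kotlin
// Hypothetical, simplified stand-in for a tool definition.
data class ToolSpec(
    val name: String,
    val description: String,
    val paramNames: List<String>
)

// Assumption: ~4 characters per token. A heuristic, not the model's tokenizer.
const val CHARS_PER_TOKEN = 4

// Ceiling division, so a 5-character string counts as 2 tokens.
fun estimateTokens(text: String): Int =
    (text.length + CHARS_PER_TOKEN - 1) / CHARS_PER_TOKEN

fun estimateToolTokens(tool: ToolSpec): Int =
    estimateTokens(tool.name) +
        estimateTokens(tool.description) +
        tool.paramNames.sumOf { estimateTokens(it) }

// Tokens left for conversation history and the response after fixed costs.
fun remainingBudget(
    contextWindow: Int,
    systemPrompt: String,
    tools: List<ToolSpec>
): Int = contextWindow -
    estimateTokens(systemPrompt) -
    tools.sumOf { estimateToolTokens(it) }
```

A check like this is most useful as a build-time or startup guard: if the estimate for tool definitions drifts past your budget, you find out before users do. Swap in a real token count if the runtime exposes one.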

On-device vs. cloud: architecture tradeoffs

| Dimension | Gemini Nano (on-device) | Gemini Flash (cloud) |
| --- | --- | --- |
| Context window | ~32K tokens | 1M+ tokens |
| Latency (first token) | 80-200 ms | 300-800 ms (network dependent) |
| Function call reliability | Degrades with schema complexity | Stable across complex schemas |
| Structured JSON consistency | Requires validation + retry | Generally reliable |
| Availability | Always-on, no network needed | Requires connectivity |
| Cost per call | Zero marginal | Per-token API pricing |

The latency advantage matters for interactive mobile UX. But the reliability gap is the architectural challenge you actually need to design around.

Handling hallucinations in structured output

A quantized on-device model hallucinates more than its cloud counterpart. For function calling, this shows up as malformed JSON, invented parameter names, or calls to tools that don’t exist in your schema.

The defense is a three-layer validation pipeline:

fun parseAgentAction(raw: String): AgentAction? {
    // Layer 1: Extract JSON from response (model may wrap it in markdown)
    val json = JsonExtractor.findFirst(raw) ?: return null

    // Layer 2: Validate against registered tool schemas
    val parsed = try {
        toolRegistry.parse(json)
    } catch (e: SchemaValidationException) {
        null
    }

    // Layer 3: Semantic bounds checking
    return parsed?.takeIf { action ->
        // e.g., dur_min in 1..480, title.length < 200
        semanticValidator.isReasonable(action)
    }
}

In practice, Layer 1 catches roughly half of all failures: the model returns a valid function call but wraps it in explanatory text, so the raw response does not parse until the JSON is pulled out. A solid JSON extractor is table stakes here.
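One way to implement that extraction step is a brace-matching scan. The sketch below is an assumption about what a helper like `JsonExtractor.findFirst` might do, not the actual implementation behind the snippet above: it returns the first balanced `{...}` span and ignores braces that appear inside string literals. It does not check that the span is valid JSON; that stays Layer 2's job.

```kotlin
object JsonExtractor {
    // Returns the first balanced {...} substring, or null if there is none.
    fun findFirst(raw: String): String? {
        var start = raw.indexOf('{')
        while (start >= 0) {
            val end = matchingBrace(raw, start)
            if (end >= 0) return raw.substring(start, end + 1)
            // This '{' never closes; try the next candidate.
            start = raw.indexOf('{', start + 1)
        }
        return null
    }

    // Index of the '}' that closes the '{' at `open`, or -1 if it never closes.
    private fun matchingBrace(s: String, open: Int): Int {
        var depth = 0
        var inString = false
        var escaped = false
        for (i in open until s.length) {
            val c = s[i]
            when {
                escaped -> escaped = false               // char after a backslash
                inString && c == '\\' -> escaped = true
                c == '"' -> inString = !inString
                inString -> Unit                         // braces in strings don't count
                c == '{' -> depth++
                c == '}' -> { depth--; if (depth == 0) return i }
            }
        }
        return -1
    }
}
```

Scanning for balanced braces rather than using a greedy regex matters when the model adds trailing commentary that itself contains a `}`; the scan stops at the first complete object.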

The WorkManager + Room offline pipeline

Where on-device function calling really earns its keep is offline operation. A user on an airplane says “schedule a team sync for Tuesday at 2pm.” Gemini Nano parses the intent and produces a structured action locally. But the calendar API requires connectivity.

The architecture is straightforward:

1. Gemini Nano produces a structured AgentAction
2. Room database persists the action with status PENDING
3. WorkManager enqueues a OneTimeWorkRequest with network constraints
4. An executor processes the action when connectivity returns, updating status to COMPLETED or FAILED

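import androidx.room.Entity
import androidx.room.PrimaryKey
import androidx.work.Constraints
import androidx.work.NetworkType
import androidx.work.OneTimeWorkRequestBuilder
import androidx.work.WorkManager
import androidx.work.workDataOf

// Lifecycle states from the steps above
enum class ActionStatus { PENDING, COMPLETED, FAILED }

@Entity(tableName = "agent_actions")
data class AgentAction(
    @PrimaryKey(autoGenerate = true) val id: Long = 0,
    val toolName: String,
    val paramsJson: String,
    val status: ActionStatus = ActionStatus.PENDING,
    val createdAt: Long = System.currentTimeMillis()
)

// Queue for execution when network is available.
// action.id must be the row ID generated by the Room insert; an AgentAction
// that has not been inserted yet still carries the default id of 0.
val request = OneTimeWorkRequestBuilder<ActionExecutorWorker>()
    .setConstraints(
        Constraints.Builder()
            .setRequiredNetworkType(NetworkType.CONNECTED)
            .build()
    )
    .setInputData(workDataOf("action_id" to action.id))
    .build()

WorkManager.getInstance(context).enqueue(request)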

This pattern gives you immediate user feedback (“Got it, I’ll create that event when you’re back online”) while guaranteeing eventual execution. Room provides durability across process death, and WorkManager retries with exponential backoff (its default policy) whenever the worker returns Result.retry().
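WorkManager only reschedules a job when the worker asks for it, so the executor has to decide between retrying and giving up. The sketch below isolates that decision in plain Kotlin so it can be unit-tested without Android. `ExecutionOutcome`, `decideOutcome`, and the `maxAttempts` cutoff are hypothetical names and values, not WorkManager API. In a real `ActionExecutorWorker`, the three outcomes would presumably map to `Result.success()`, `Result.retry()`, and `Result.failure()`, with `runAttemptCount` supplying the attempt number.

```kotlin
enum class ExecutionOutcome { SUCCESS, RETRY, FAIL }

// Runs the action and classifies the result.
// Assumption: an IOException signals a transient network problem worth
// retrying; any other exception is treated as permanent (bad parameters,
// a request the remote API rejects).
fun decideOutcome(
    attempt: Int,          // zero-based, like WorkManager's runAttemptCount
    maxAttempts: Int = 5,  // hypothetical cutoff before giving up
    execute: () -> Unit
): ExecutionOutcome =
    try {
        execute()
        ExecutionOutcome.SUCCESS
    } catch (e: java.io.IOException) {
        if (attempt + 1 < maxAttempts) ExecutionOutcome.RETRY else ExecutionOutcome.FAIL
    } catch (e: Exception) {
        ExecutionOutcome.FAIL
    }
```

Under this scheme the Room row would move to COMPLETED on SUCCESS, to FAILED on FAIL, and stay PENDING on RETRY, so the database always reflects what the user should be told.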

What I’d do from here

Shrink your tool schemas aggressively. Budget 1,200 tokens maximum for tool definitions. Use short names, minimal descriptions, and cap at 5 tools per agent context. Swap tool sets dynamically based on user intent rather than loading everything at once.
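Swapping tool sets by intent could look something like the sketch below. The keyword matching is deliberately naive and purely an assumption for illustration (a real app might use a classifier or the current screen instead). `ToolGroup`, `selectTools`, and every tool name other than `cal_create` are hypothetical.

```kotlin
// Hypothetical grouping of tool names by the intent keywords that activate them.
data class ToolGroup(val keywords: Set<String>, val toolNames: List<String>)

// The per-context cap suggested in this post.
const val MAX_TOOLS_PER_CONTEXT = 5

val toolGroups = listOf(
    ToolGroup(
        setOf("schedule", "meeting", "event", "calendar"),
        listOf("cal_create", "cal_list", "cal_delete")
    ),
    ToolGroup(
        setOf("remind", "reminder", "todo"),
        listOf("rem_create", "rem_list")
    ),
    ToolGroup(
        setOf("message", "text", "reply"),
        listOf("msg_send")
    )
)

// Picks the tools whose group keywords appear in the user's message,
// in group order, truncated to the per-context cap.
fun selectTools(
    userMessage: String,
    groups: List<ToolGroup> = toolGroups
): List<String> {
    val words = userMessage.lowercase().split(Regex("\\W+")).toSet()
    return groups
        .filter { group -> group.keywords.any { it in words } }
        .flatMap { it.toolNames }
        .distinct()
        .take(MAX_TOOLS_PER_CONTEXT)
}
```

Only the schemas for the selected names are then serialized into the prompt, which keeps the tool-definition cost roughly constant no matter how many tools the app supports in total.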

Build validation in layers. JSON extraction, schema validation, semantic bounds checking. The on-device model will produce malformed output at a meaningful rate, and your app should recover gracefully rather than crash.
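Layers 2 and 3 might be sketched as follows, operating on a call that has already been decoded from JSON. Everything here is hypothetical scaffolding (`ParsedCall`, `ParamSpec`, `validate`, and the shape of the registry); the bounds mirror the examples given earlier, a duration of 1 to 480 minutes and titles under 200 characters.

```kotlin
// A function call already decoded from JSON (decoding is out of scope here).
data class ParsedCall(val tool: String, val params: Map<String, Any?>)

data class ParamSpec(val name: String, val required: Boolean)

// Hypothetical registry: tool name -> allowed parameters.
val registry: Map<String, List<ParamSpec>> = mapOf(
    "cal_create" to listOf(
        ParamSpec("title", required = true),
        ParamSpec("iso_time", required = true),
        ParamSpec("dur_min", required = true)
    )
)

// Returns a list of problems; an empty list means the call passed both layers.
fun validate(call: ParsedCall): List<String> {
    // Layer 2: the tool must exist, with no invented or missing parameters.
    val specs = registry[call.tool] ?: return listOf("unknown tool: ${call.tool}")
    val known = specs.map { it.name }.toSet()
    val problems = mutableListOf<String>()
    (call.params.keys - known).forEach { problems += "unknown param: $it" }
    specs.filter { it.required && call.params[it.name] == null }
        .forEach { problems += "missing param: ${it.name}" }
    if (problems.isNotEmpty()) return problems

    // Layer 3: semantic bounds, checked only for calls that match the schema.
    if (call.tool == "cal_create") {
        val title = call.params["title"] as? String
        val duration = (call.params["dur_min"] as? Number)?.toInt()
        if (title == null || title.isBlank() || title.length >= 200) {
            problems += "title out of bounds"
        }
        if (duration == null || duration !in 1..480) {
            problems += "dur_min out of bounds"
        }
    }
    return problems
}
```

Returning the problems as data, rather than throwing, makes it straightforward to log the failure mode and to feed a short error message back to the model for a retry.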

Adopt the WorkManager + Room queue pattern early, even if your initial use case is online-only. This architecture lets you go offline with zero refactoring. The persistence layer also doubles as an audit log of every agent action, which is useful for debugging and for showing users what the agent did on their behalf.
