Apple Foundation Models SDK with Claude Code: Building Hybrid On-Device/Cloud AI Pipelines for iOS Apps in Swift

The on-device shift

Apple’s Foundation Models framework gives us a Swift-native API for running language models directly on Apple Silicon. No network round-trip, no data leaving the device, no token costs. For a certain class of tasks — summarization of short text, simple extraction, classification — it removes real friction that previously made these features hard to ship.

But most teams I’ve talked to treat it as an either/or decision: on-device or cloud. The architecture that actually works in production is both, with a routing layer that picks a provider for each request.

Architecture: the tiered inference pipeline

The core idea is a protocol-based adapter pattern that abstracts the inference provider behind a unified interface.

protocol AIProvider {
    func generate(prompt: String, maxTokens: Int) async throws -> String
    func stream(prompt: String) -> AsyncThrowingStream<String, Error>
    var estimatedCapabilityTier: CapabilityTier { get }
}

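enum CapabilityTier: Int, Comparable {
    case basic = 0      // Classification, short extraction
    case standard = 1   // Summarization, simple generation
    case advanced = 2   // Multi-step reasoning, long-form analysis

    // Raw-value enums do not get a synthesized Comparable conformance,
    // so order the tiers by raw value explicitly.
    static func < (lhs: CapabilityTier, rhs: CapabilityTier) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}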

Your on-device provider wraps Apple’s LanguageModelSession. Your cloud provider wraps a client for Anthropic’s Claude API. The routing layer decides which one handles each request.

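struct InferenceRouter {
    let onDevice: AIProvider
    let cloud: AIProvider
    // Largest request, in estimated tokens, that stays on-device.
    let onDeviceTokenLimit = 512
    
    func route(task: AITask) async throws -> String {
        // Stay on-device only when the task is simple enough and small enough.
        if task.requiredTier <= onDevice.estimatedCapabilityTier
            && task.estimatedTokens < onDeviceTokenLimit {
            return try await onDevice.generate(
                prompt: task.prompt, maxTokens: task.estimatedTokens
            )
        }
        // Everything else escalates to the cloud provider.
        return try await cloud.generate(
            prompt: task.prompt, maxTokens: task.estimatedTokens
        )
    }
}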
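To see the router make a decision end to end, here is a minimal sketch that swaps the real providers for stubs. AITask's fields, StubProvider, and the Sendable annotations are my own assumptions for illustration; the post doesn't define them, and the protocol is trimmed to generate only.

```swift
enum CapabilityTier: Int, Comparable, Sendable {
    case basic = 0, standard, advanced
    static func < (lhs: Self, rhs: Self) -> Bool { lhs.rawValue < rhs.rawValue }
}

protocol AIProvider: Sendable {
    var estimatedCapabilityTier: CapabilityTier { get }
    func generate(prompt: String, maxTokens: Int) async throws -> String
}

// Hypothetical task description; field names follow the router's usage.
struct AITask: Sendable {
    let prompt: String
    let requiredTier: CapabilityTier
    let estimatedTokens: Int
}

// Stand-in provider that labels its output so you can see who answered.
struct StubProvider: AIProvider {
    let name: String
    let estimatedCapabilityTier: CapabilityTier
    func generate(prompt: String, maxTokens: Int) async throws -> String {
        "\(name): \(prompt)"
    }
}

struct InferenceRouter: Sendable {
    let onDevice: any AIProvider
    let cloud: any AIProvider
    var onDeviceTokenLimit = 512

    func route(task: AITask) async throws -> String {
        // On-device only when the task is both simple enough and small enough.
        let fitsOnDevice = task.requiredTier <= onDevice.estimatedCapabilityTier
            && task.estimatedTokens < onDeviceTokenLimit
        let provider = fitsOnDevice ? onDevice : cloud
        return try await provider.generate(
            prompt: task.prompt, maxTokens: task.estimatedTokens
        )
    }
}
```

Stubs like this also make the routing rules unit-testable without a device model or a network connection.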

Where each provider wins

Factor	Apple on-device	Claude API
Latency	~50-200ms (no network)	500ms-3s (network dependent)
Privacy	Full — data never leaves device	Data sent to Anthropic servers
Cost per request	Zero	Token-based pricing
Reasoning depth	Limited — best for short, constrained tasks	Strong multi-step reasoning
Context window	Small (4,096 tokens per session)	Up to 200K tokens
Availability	Works offline, but only on devices with Apple Intelligence enabled	Requires connectivity
Structured output	`@Generable` macro support	Tool use and JSON mode

The split is pretty intuitive once you see it laid out. Anything that fits in a short context and needs a quick answer — sentiment classification, entity extraction, auto-complete suggestions — on-device wins on every axis except raw capability. The moment you need chain-of-thought reasoning, long document analysis, or nuanced generation, send it to Claude.
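One way to make that split explicit is to give every task kind a required tier up front. This is a sketch only: TaskKind and its cases are hypothetical names that mirror the examples above, and the on-device ceiling of .standard is an assumption you would tune after testing on real hardware.

```swift
enum CapabilityTier: Int, Comparable {
    case basic = 0, standard, advanced
    static func < (lhs: Self, rhs: Self) -> Bool { lhs.rawValue < rhs.rawValue }
}

// Hypothetical task catalogue; extend it as features are added.
enum TaskKind: CaseIterable {
    case sentimentClassification, entityExtraction, autocomplete
    case shortSummary
    case longDocumentAnalysis, chainOfThoughtReasoning, nuancedGeneration

    var requiredTier: CapabilityTier {
        switch self {
        case .sentimentClassification, .entityExtraction, .autocomplete:
            return .basic
        case .shortSummary:
            return .standard
        case .longDocumentAnalysis, .chainOfThoughtReasoning, .nuancedGeneration:
            return .advanced
        }
    }
}

// Assumption: the on-device model is trusted up to .standard.
let onDeviceCeiling = CapabilityTier.standard

// Task kinds that must escalate to the cloud provider.
let cloudOnly = TaskKind.allCases.filter { $0.requiredTier > onDeviceCeiling }
```

Keeping the mapping in one switch means a new task kind will not compile until someone decides which tier it belongs to.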

Streaming with Combine for responsive UI

Both providers can stream tokens back. Wrapping them in a Combine pipeline keeps your UI responsive regardless of which provider is active:

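// Assumes routeStreaming(task:) returns a Combine publisher of text chunks
// (AnyPublisher<String, Error>). The AsyncThrowingStream from the provider
// protocol would need to be bridged to a publisher first.
func streamResponse(for task: AITask) -> AnyPublisher<String, Error> {
    let stream = router.routeStreaming(task: task)
    
    return stream
        .receive(on: DispatchQueue.main)                          // deliver on the main thread for UI updates
        .scan("") { accumulated, chunk in accumulated + chunk }  // emit the running text so far
        .eraseToAnyPublisher()
}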

Your SwiftUI view subscribes to a single publisher. It doesn’t care whether tokens are coming from Apple Silicon or from Claude’s API. The adapter handles that.
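If your router hands back the AsyncThrowingStream from the provider protocol instead of a publisher, the same running-text accumulation works without Combine. This is a sketch under that assumption; accumulated(_:) is a hypothetical helper, not part of either SDK.

```swift
// Turns a stream of text chunks into a stream of the running text so far,
// the structured-concurrency counterpart of Combine's scan("") { $0 + $1 }.
func accumulated(
    _ chunks: AsyncThrowingStream<String, Error>
) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        let task = Task {
            var text = ""
            do {
                for try await chunk in chunks {
                    text += chunk
                    continuation.yield(text)
                }
                continuation.finish()
            } catch {
                // Forward provider errors to whoever is iterating.
                continuation.finish(throwing: error)
            }
        }
        // Stop pulling from the provider when the consumer goes away.
        continuation.onTermination = { _ in task.cancel() }
    }
}
```

A view model can then iterate with for try await text in accumulated(stream) and assign each value to the property its view observes, hopping to the main actor for the assignment.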

Token budget management

In production, you need guardrails. A simple budget manager prevents runaway cloud costs:

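import Foundation

actor TokenBudgetManager {
    // In-memory only: persist these values if the count must survive relaunches.
    private var dailyCloudTokensUsed: Int = 0
    private var currentDay = Calendar.current.startOfDay(for: Date())
    private let dailyLimit: Int = 100_000
    
    func canUseCloud(estimatedTokens: Int) -> Bool {
        resetIfNewDay()
        return dailyCloudTokensUsed + estimatedTokens <= dailyLimit
    }
    
    func recordUsage(_ tokens: Int) {
        resetIfNewDay()
        dailyCloudTokensUsed += tokens
    }
    
    // Start a fresh count once the calendar day changes.
    private func resetIfNewDay() {
        let today = Calendar.current.startOfDay(for: Date())
        guard today != currentDay else { return }
        currentDay = today
        dailyCloudTokensUsed = 0
    }
}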

When the cloud budget runs out, a router that checks the budget before each cloud call degrades gracefully to on-device only. Users still get responses, just simpler ones. That’s far better than a hard failure or a surprise bill.
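Here is one way the degradation could be wired up. It is a sketch, not the only design: TokenBudget, reserve(_:), Provider, and chooseProvider are hypothetical names. I fold the check and the recording into a single actor method so two concurrent requests cannot both pass the check before either records its usage.

```swift
actor TokenBudget {
    private var used = 0
    private let dailyLimit: Int

    init(dailyLimit: Int) { self.dailyLimit = dailyLimit }

    // Check and record in one actor call; returns false when the
    // request would push usage past the limit.
    func reserve(_ tokens: Int) -> Bool {
        guard used + tokens <= dailyLimit else { return false }
        used += tokens
        return true
    }
}

enum Provider { case onDevice, cloud }

// Prefers the cloud only when the task wants it and the budget allows it;
// otherwise falls back to on-device.
func chooseProvider(
    wantsCloud: Bool, estimatedTokens: Int, budget: TokenBudget
) async -> Provider {
    guard wantsCloud else { return .onDevice }
    return await budget.reserve(estimatedTokens) ? .cloud : .onDevice
}
```

Reserving by estimate means the count can drift from what the API actually bills, so you may want to reconcile it against the usage figures the API reports once the response arrives.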

The privacy boundary

This is the architectural decision that matters most. Define a clear data classification, separate from the capability tiers used for routing:

Tier 1, on-device only: health data, financial records, personal messages. Anything covered by privacy regulations or where users reasonably expect their data to stay local.
Tier 2, cloud-eligible: generic content generation, public data analysis, non-personal queries.

Encode this in your routing logic, not in your feature code. The feature layer asks for “summarize this text.” The router checks the data classification before picking a provider. This keeps privacy enforcement centralized and auditable.
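A sketch of what that centralized check might look like. DataSensitivity, RoutingRequest, and provider(for:onDeviceCeiling:) are hypothetical names, and I'm assuming sensitive requests should get a best-effort on-device answer rather than an error.

```swift
enum CapabilityTier: Int, Comparable {
    case basic = 0, standard, advanced
    static func < (lhs: Self, rhs: Self) -> Bool { lhs.rawValue < rhs.rawValue }
}

// The two data tiers from the text, kept separate from CapabilityTier.
enum DataSensitivity { case onDeviceOnly, cloudEligible }

enum Provider { case onDevice, cloud }

// What the feature layer hands to the router: what it needs, and what
// kind of data it is passing along.
struct RoutingRequest {
    let sensitivity: DataSensitivity
    let requiredTier: CapabilityTier
}

func provider(for request: RoutingRequest, onDeviceCeiling: CapabilityTier) -> Provider {
    // Privacy is checked first and is never overridden by capability needs.
    if request.sensitivity == .onDeviceOnly { return .onDevice }
    // Cloud-eligible data escalates only when the task exceeds the device.
    return request.requiredTier <= onDeviceCeiling ? .onDevice : .cloud
}
```

Because the privacy check is the first line of a single function, an audit only has to read one place to confirm that Tier 1 data cannot reach the network.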

What to do with all this

Build the adapter layer now. Even if you only use one provider today, the protocol-based abstraction is nearly free and saves you a rewrite later. You swap providers without touching feature code.

Route by capability tier, not by gut feeling. Classify each AI task by its complexity requirements. Let the router decide based on token estimates, required reasoning depth, and privacy constraints.

Treat on-device as your baseline and cloud as your escalation path. Design for offline-first AI. When the network is unavailable or the token budget is spent, your app still works. Cloud inference is an enhancement, not a dependency.

I think the hybrid approach is the right default for most iOS apps shipping AI features today. It’s not a compromise between two options. It’s genuinely better than either one alone because you’re optimizing across latency, cost, privacy, and capability at the same time instead of picking one axis and hoping the others work out.
