offgrid:AI: Shipping Fully Offline LLM Inference on iOS
Building an AI assistant that runs entirely on-device — no cloud API, no server costs, no data transmission — required solving model storage, memory constraints, inference speed, and battery life simultaneously.
Stack
SwiftUI · llama.cpp · Core ML
Platform
iOS · On App Store
Performance
18–22% battery/hr sustained
Data sent
0 bytes to any server
Context
In 2024, every AI assistant app on iOS required an active internet connection and transmitted user prompts to cloud infrastructure. The market assumption was that language model inference was too compute-intensive to run on a mobile device. The use cases for a genuinely offline AI assistant were real and unserved: field workers without reliable connectivity, travelers in areas with high data costs, privacy-conscious users who would not send prompts to a cloud API, and emergency preparedness scenarios where connectivity cannot be assumed.
Problem
The technical barriers to production-viable on-device LLM inference on iOS in 2024 were not theoretical — they were real constraints that had to be solved simultaneously:
- Model size: a usable language model is 3–16 GB. That's a significant portion of a device's storage.
- Memory: LLM inference requires holding model weights and the KV cache in memory simultaneously. The iPhone's unified memory architecture helps, but context window size is directly limited by available RAM.
- Battery: sustained inference draws significant CPU and Neural Engine power. An app that drains 50% battery per hour is not useful.
- Apple Foundation Models framework: not available until iOS 26. A cross-version strategy was required.
- App Store: Apple's guidelines restrict some model hosting patterns. Approval required deliberate preparation.
Architecture
Inference Engine: llama.cpp
llama.cpp was the only production-viable path for local LLM inference on iOS prior to Apple Foundation Models. It provides a C/C++ implementation of LLaMA inference with GGUF format support, optimized for the NEON instruction set used by Apple Silicon. The Swift integration layer wraps the C API, manages model lifecycle (load on first use, unload to free memory when backgrounded), and bridges llama.cpp's token callback to Swift's AsyncStream<String>.
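The load-on-first-use / unload-when-backgrounded lifecycle can be sketched as an actor guarding the model handle. This is a minimal sketch, not the shipped implementation: `loadModel()` and `freeModel(_:)` are illustrative placeholders standing in for the llama.cpp C loading and freeing calls.

```swift
import Foundation

// Sketch: serialize model lifecycle behind an actor (names are illustrative).
actor ModelLifecycle {
    private var model: OpaquePointer?

    // Load on first use; subsequent calls reuse the loaded handle.
    func handle() async throws -> OpaquePointer {
        if let model { return model }
        let loaded = try await loadModel()   // placeholder for llama.cpp model loading
        model = loaded
        return loaded
    }

    // Called when the app is backgrounded: free the weights to release memory.
    func didEnterBackground() {
        if let model { freeModel(model) }    // placeholder for llama.cpp cleanup
        model = nil
    }
}
```

The actor guarantees that load and unload never race, which matters because freeing a model mid-generation would crash the C layer.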
Quantization Strategy
The quantization-quality trade-off defines the user experience. Models below Q4 produce noticeably degraded output — users perceive the quality drop. Models above Q5 exceed practical on-device storage for most users. The app ships with Q4_K_M (approximately 4.5 GB) as the primary model and Q5_K_M (approximately 5.5 GB) as an optional higher-quality variant.
Quant · Size · Output quality · Battery drain
Q4_K_M · ~4.5 GB · Good · 18–22%/hr
Q5_K_M · ~5.5 GB · Very good · 20–25%/hr
Q8_0 · ~8.5 GB · Near-full · 28–35%/hr
Battery-Aware Scheduling
LLM inference is not interruptible at arbitrary points — a token generation in progress must complete. The battery scheduler observes two signals: UIDevice.current.batteryLevel and ProcessInfo.processInfo.thermalState. If battery drops below 15% during generation, the current response completes and a UI warning is shown before the next request. If thermal state is .serious or .critical, inference CPU thread count is halved — reducing throughput but preventing the device from throttling the processor mid-generation.
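The two-signal policy above can be sketched as follows. Thresholds match the text; the type and property names are illustrative, not the shipped API.

```swift
import UIKit

// Sketch of the battery/thermal scheduling policy (names are illustrative).
final class InferenceScheduler {
    private let maxThreads = ProcessInfo.processInfo.activeProcessorCount

    init() {
        // batteryLevel reports -1 unless monitoring is enabled.
        UIDevice.current.isBatteryMonitoringEnabled = true
    }

    /// Halve inference threads under .serious or .critical thermal pressure.
    var threadCount: Int {
        switch ProcessInfo.processInfo.thermalState {
        case .serious, .critical: return max(1, maxThreads / 2)
        default:                  return maxThreads
        }
    }

    /// Below 15%: let the in-flight response finish, warn before the next one.
    var shouldWarnBeforeNextRequest: Bool {
        let level = UIDevice.current.batteryLevel
        return level >= 0 && level < 0.15
    }
}
```

Note that the scheduler never aborts a generation in progress; both signals only change behavior at request boundaries or between tokens.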
Model Storage & Download UX
Models are stored in the app's documents directory using FileManager — they survive app updates, are excluded from iCloud backup (to avoid consuming the user's iCloud storage quota), and are not purged by the system's storage reclamation. The download UX is a first-run flow, not a gate: the user sees the exact download size before committing. Downloads use URLSession background download tasks with progress tracking and automatic resume on failure.
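The backup exclusion is set per file via `URLResourceValues`. A minimal sketch of installing a downloaded model, assuming the download task has already delivered a temporary file URL (the function name is illustrative):

```swift
import Foundation

// Sketch: move a downloaded model into Documents and exclude it from iCloud backup.
func installModel(from tempURL: URL, named name: String) throws -> URL {
    let docs = try FileManager.default.url(
        for: .documentDirectory, in: .userDomainMask,
        appropriateFor: nil, create: true
    )
    var dest = docs.appendingPathComponent(name)

    // Replace any partial or stale copy from a previous install.
    if FileManager.default.fileExists(atPath: dest.path) {
        try FileManager.default.removeItem(at: dest)
    }
    try FileManager.default.moveItem(at: tempURL, to: dest)

    // Keep the multi-GB model out of the user's iCloud backup quota.
    var values = URLResourceValues()
    values.isExcludedFromBackup = true
    try dest.setResourceValues(&values)
    return dest
}
```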
Implementation: Token Streaming to SwiftUI
llama.cpp produces tokens via a C callback. Bridging that to SwiftUI's reactive update model requires an AsyncStream that emits each token as it's generated:
// Bridge llama.cpp token callback to Swift AsyncStream
actor InferenceEngine {
    private var model: OpaquePointer?
    private var context: OpaquePointer?

    func generate(prompt: String) -> AsyncStream<String> {
        AsyncStream { continuation in
            Task.detached(priority: .userInitiated) { [weak self] in
                guard let self else {
                    continuation.finish()   // don't leave the stream dangling
                    return
                }
                let tokens = await self.tokenize(prompt)
                for token in await self.generateTokens(from: tokens) {
                    // Check thermal state before each token
                    let thermal = ProcessInfo.processInfo.thermalState
                    if thermal == .critical {
                        await self.throttleInference()
                    }
                    let piece = await self.tokenToPiece(token)
                    continuation.yield(piece)
                }
                continuation.finish()
            }
        }
    }
}
// In SwiftUI
struct ChatView: View {
    let engine: InferenceEngine
    let userMessage: String
    @State private var response = ""

    var body: some View {
        Text(response)
            .task {
                for await token in await engine.generate(prompt: userMessage) {
                    response += token
                }
            }
    }
}
Outcome
Shipped on the App Store with full offline inference. Users install a 4–5 GB model once and run open-ended conversations, document summarization, and code explanation entirely on-device — without an internet connection, without paying per API call, without their prompts being transmitted anywhere.
- Live on the App Store — approved in standard review time with the offline inference architecture
- Battery consumption on iPhone 15 Pro: 18–22% per hour at sustained Q4_K_M inference
- 0 bytes transmitted to any server during inference — zero network entitlement required at inference time
- Resumable model download: interruptions don't require starting over from scratch
- Zero cloud infrastructure costs — no API, no server, no rate limits
- Architecture reserves a migration path to Apple Foundation Models (iOS 26+) — the llama.cpp layer is swappable
"The technical constraint that defined the architecture: you cannot trade inference quality for model size beyond a threshold — below Q4, users notice degraded output. The solution lives in the quantization-quality curve, not at the extremes."
Key Technical Learnings
KV cache size is the real memory constraint
The model weights load once and stay largely static. The KV cache grows with context length — a 4K token context at Q4 can add 500 MB of memory pressure. Limit context window aggressively for chat use cases; rolling summarization is more practical than unlimited context.
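The ~500 MB figure can be sanity-checked from the cache's shape. This sketch assumes typical 7B-class model dimensions (32 layers, 4096-dim embeddings) and a 4-bit cache; the actual numbers vary by model:

```swift
// KV cache bytes ≈ 2 (K and V) × layers × contextTokens × embeddingDim × bytes/element
let layers = 32, ctx = 4096, embd = 4096   // assumed 7B-class dimensions
let bytesPerElement = 0.5                   // 4-bit quantized cache
let kvBytes = 2.0 * Double(layers * ctx * embd) * bytesPerElement
// ≈ 537 MB — in line with the ~500 MB figure above, and it grows linearly with context
```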
Background download tasks, not foreground
A 4 GB model download in the foreground blocks the app and fails if the user switches away. URLSession background download tasks continue even when the app is backgrounded, and resume automatically if the connection drops. This is the only viable model for large asset downloads.
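Configuring such a session is a few lines. A minimal sketch — the session identifier and delegate are illustrative:

```swift
import Foundation

// Sketch: background URLSession for a resumable multi-GB model download.
let config = URLSessionConfiguration.background(withIdentifier: "model-download") // identifier is illustrative
config.isDiscretionary = false          // start now; don't let the system defer it
config.sessionSendsLaunchEvents = true  // relaunch the app when the transfer finishes

let session = URLSession(configuration: config,
                         delegate: downloadDelegate,  // your URLSessionDownloadDelegate
                         delegateQueue: nil)
session.downloadTask(with: modelURL).resume()

// On failure, the error's userInfo can carry resume data
// (NSURLSessionDownloadTaskResumeData); continue with
// session.downloadTask(withResumeData:) instead of restarting.
```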
Thermal state is more actionable than battery level
Battery level tells you about future capacity; thermal state tells you about current load. When ProcessInfo.thermalState is .serious, the device is already throttling. Reducing inference threads before reaching .critical produces better sustained throughput than waiting for iOS to forcibly throttle the process.
Design for Foundation Models migration from day one
The inference interface is abstracted behind a protocol. llama.cpp is one concrete implementation. When Apple Foundation Models became available on iOS 26, adding a Foundation Models implementation required changing only the protocol conformance — the rest of the app was inference-engine agnostic.
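A hypothetical shape for that protocol boundary — the names below are illustrative, not the shipped API:

```swift
// Sketch: engine-agnostic inference interface (names are illustrative).
protocol InferenceBackend {
    func loadModel() async throws
    func unloadModel() async
    func generate(prompt: String) -> AsyncStream<String>
}

// llama.cpp is one conformance, available on every supported iOS version.
// A Foundation Models conformance can be selected behind an availability
// check at app launch, leaving the rest of the app untouched.
```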
Technical FAQ
What's the difference between llama.cpp and Apple Foundation Models for iOS?
How does the app handle the App Store's large asset size concerns?
Could this be built using Apple Foundation Models instead of llama.cpp today?
Adding on-device AI to an existing iOS app?
The On-Device AI Integration service covers model selection, Swift integration, inference architecture, and production deployment — the same stack used in offgrid:AI.