Apple On-Device LLM: What Runs Locally, What Doesn't, and How to Build Around It

Apple's on-device LLM stack is capable, but not every Apple Intelligence request stays local. This guide maps Core ML, Foundation Models, Private Cloud Compute, hardware limits, and the architecture choices that keep sensitive data on-device.

By Ehsan Azish · 3NSOFTS · June 2026 · 9 min read · iOS 18.1+, macOS 15.1+, Core ML, Apple Foundation Models

Apple's on-device LLM story in 2026 is more capable than most founders realize — and more constrained than most marketing copy admits. If you're building a privacy-sensitive iOS or macOS app and need a clear map of what actually runs on the device versus what quietly phones home, this article covers it.

What "On-Device" Actually Means on Apple Hardware

Apple uses the term "on-device" across several distinct systems that behave very differently. Treating them as interchangeable is the first mistake teams make.

Three layers are worth separating:

Core ML: your own models, packaged in .mlpackage format and running entirely on the CPU, GPU, or Neural Engine. No network dependency. No Apple server involvement. You ship the model weights inside the app bundle or download them once and store them locally.

Apple Foundation Models: the on-device language models introduced with Apple Intelligence. Inference runs locally on A17 Pro and newer A-series chips and on M-series chips. Requests that Apple Intelligence escalates to Apple's servers go to Private Cloud Compute, not to iCloud Private Relay, which is a separate network-privacy service. Apple describes that path as passing through a third-party relay so that Apple cannot associate the request with your identity.

Private Cloud Compute: Apple's server-side extension for tasks the on-device model can't handle. Still Apple infrastructure, still privacy-preserving by design, but not local inference. Data leaves the device.

That boundary matters for compliance. Health records, legal documents, and financial data in regulated verticals often cannot leave the device at all — not even to privacy-preserving servers.

What Runs Locally in 2026

Core ML Models

Any model compiled with Core ML Tools runs fully on-device. This includes classification, regression, text embedding, named entity recognition, and generative models small enough to fit within memory constraints.

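Small quantized classification and embedding models, with weights in the 1–4 bit range, can run in under 10 ms per inference on the Apple Neural Engine. That is fast enough for real-time text analysis, document classification, and on-screen content understanding without any perceptible delay. Actual latency depends on model size and chip generation, so measure on the devices you target.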

The practical ceiling for a locally bundled model is roughly 2–4 GB on iPhone, higher on iPad Pro and Mac. Beyond that, you're either streaming model weights or routing to Private Cloud Compute.

Apple Foundation Models API

The FoundationModels framework (iOS 26 and macOS 26 onward; Apple Intelligence itself shipped with iOS 18.1 and macOS 15.1) provides structured output generation, guided decoding, and tool calling, all running on the on-device model. The model is approximately 3B parameters, optimized for Apple Silicon.

It handles summarization, classification, structured extraction from free text, and light reasoning tasks. It is not a replacement for GPT-4-class reasoning on complex multi-step problems.

The API enforces guardrails Apple controls. You cannot fine-tune the Foundation Model or modify its weights. You work within its capabilities as shipped.

What Does Not Run Locally

Private Cloud Compute handles requests that exceed the on-device model's capability. Apple routes these automatically when the on-device model determines it cannot satisfy the request with sufficient quality.

You cannot force a request to stay on-device through the Apple Intelligence API alone. If your compliance requirement is absolute — zero bytes off the device — you need Core ML with your own model weights. The Foundation Models API does not provide that guarantee.

Siri requests that are handed off to an external model, the Apple Intelligence system-level features that escalate to Private Cloud Compute, and any third-party LLM integration (OpenAI, Anthropic, Gemini) are cloud-dependent by definition.

The Architecture Decision

The choice between Core ML and Apple Foundation Models is not primarily about capability. It's about control and compliance.

| | Core ML (your model) | Apple Foundation Models |
|---|---|---|
| Data leaves device | Never | Only via Private Cloud Compute |
| Model customization | Full control | None |
| Fine-tuning on user data | Yes, with on-device training | No |
| Inference latency | Sub-10ms for quantized models | Typically 200–800ms for generation |
| App Store compliance | Your responsibility | Apple-managed |
| Minimum hardware | A12 Bionic and later | A17 Pro / M-series |

For health, fintech, and legal apps where the privacy requirement is absolute, Core ML with your own model is the only architecture that provides a verifiable guarantee. The on-device AI guide for Core ML and Apple platforms covers the implementation specifics in detail.
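The decision in the table can be written down as a small routing function. The following is a minimal sketch, not a framework API: `InferenceBackend`, `FeatureRequirements`, `DeviceCapabilities`, and `chooseBackend` are hypothetical names introduced here, and the two capability flags are assumed to be filled in by your own hardware checks.

```swift
/// Hypothetical backends an app might choose between.
enum InferenceBackend: Equatable {
    case coreMLCustomModel      // your own weights, bundled or downloaded once
    case appleFoundationModels  // Apple's system language model
    case unavailable            // no acceptable option on this device
}

/// Hypothetical description of what a feature needs.
struct FeatureRequirements {
    /// True when policy demands that zero bytes of user data leave the device.
    var requiresStrictLocalProcessing: Bool
    /// True when the feature needs free-form generation rather than classification.
    var needsGeneration: Bool
}

/// Hypothetical description of the current device, populated by the app.
struct DeviceCapabilities {
    var supportsFoundationModels: Bool
    var supportsCoreML: Bool
}

func chooseBackend(for feature: FeatureRequirements,
                   on device: DeviceCapabilities) -> InferenceBackend {
    // Strict-local features only ever use the app's own Core ML model.
    if feature.requiresStrictLocalProcessing {
        return device.supportsCoreML ? .coreMLCustomModel : .unavailable
    }
    // Otherwise prefer the system model for generation when hardware allows.
    if feature.needsGeneration && device.supportsFoundationModels {
        return .appleFoundationModels
    }
    // Everything else falls back to the app's own model.
    return device.supportsCoreML ? .coreMLCustomModel : .unavailable
}
```

Keeping the compliance check first means a privacy-critical feature can never be routed to the system model by accident, even on hardware that supports it.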

Building Around the Constraints

When Your Model Needs to Be Smaller Than You'd Like

Quantization is the primary tool. INT4 quantization via Core ML Tools reduces model size by roughly 4x versus FP16 (about 8x versus FP32), with acceptable accuracy loss for most classification and embedding tasks. A model that was 8 GB in FP16 becomes about 2 GB. A model that was 2 GB becomes about 500 MB, small enough to bundle in an app.
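The size arithmetic follows directly from bits per weight: going from 16-bit to 4-bit weights is a 4x reduction, and from 32-bit to 4-bit is 8x. A rough sketch, assuming size is dominated by the weights and ignoring quantization metadata such as scales or lookup tables (both function names are hypothetical):

```swift
/// Approximate weight storage in bytes for a model with the given
/// parameter count and bits per weight. Ignores quantization metadata.
func estimatedModelBytes(parameterCount: Int, bitsPerWeight: Int) -> Int {
    // Multiply before dividing so sub-byte widths (4-bit, 2-bit) do not truncate to zero.
    return parameterCount * bitsPerWeight / 8
}

/// Size reduction factor when moving from one weight precision to another.
func compressionRatio(fromBits: Int, toBits: Int) -> Double {
    return Double(fromBits) / Double(toBits)
}
```

For example, a 4-billion-parameter model comes to 8,000,000,000 bytes at 16 bits per weight and 2,000,000,000 bytes at 4 bits per weight, which matches the 8 GB to 2 GB case in decimal gigabytes.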

The trade-off is task-specific. Quantization hits harder on tasks requiring precise numerical reasoning than on text classification. Testing on the actual target hardware matters more than benchmarks on developer machines.

When the On-Device Model Can't Handle the Task

Design the fallback explicitly rather than letting the system decide. If the task requires capability beyond what runs locally, give the user a clear choice: process with enhanced capability (which may use cloud) or process locally with reduced capability. Don't hide that decision inside the framework.

This matters especially in regulated verticals. Implicit routing to cloud infrastructure — even privacy-preserving infrastructure — creates compliance exposure if your privacy policy says otherwise.
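One way to make the fallback explicit is to model it as a plan the app computes before any inference runs. This is an illustrative sketch only: `ProcessingChoice`, `ProcessingPlan`, `TaskContext`, and `plan(for:)` are hypothetical names, and how you determine `localModelCanHandle` is left to your own evaluation.

```swift
/// What the user picked when asked how to process a task.
enum ProcessingChoice {
    case localOnly
    case enhancedMayUseCloud
}

/// What the app should do next.
enum ProcessingPlan: Equatable {
    case runLocally(reducedCapability: Bool)
    case runEnhanced   // may leave the device
    case askUser       // present the choice before doing anything
}

struct TaskContext {
    var localModelCanHandle: Bool
    var cloudAllowedByPolicy: Bool       // false in strict-compliance builds
    var userChoice: ProcessingChoice?    // nil until the user has been asked
}

func plan(for task: TaskContext) -> ProcessingPlan {
    // Tasks the local model handles well never raise the question.
    if task.localModelCanHandle {
        return .runLocally(reducedCapability: false)
    }
    // If policy forbids cloud processing, degrade locally and never offer the cloud path.
    if !task.cloudAllowedByPolicy {
        return .runLocally(reducedCapability: true)
    }
    // Otherwise the decision belongs to the user, not the framework.
    switch task.userChoice {
    case .none:
        return .askUser
    case .some(.localOnly):
        return .runLocally(reducedCapability: true)
    case .some(.enhancedMayUseCloud):
        return .runEnhanced
    }
}
```

Because the cloud path is only reachable through an explicit `userChoice`, the behavior the app ships can be checked against the wording of its privacy policy.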

Structured Output as a Reliability Layer

The Foundation Models API supports guided decoding with GenerationSchema. Use it. Free-form generation from an on-device model produces inconsistent structure. Constrained generation with a defined schema produces reliable, parseable output — which is what your app's data layer actually needs.

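```swift
import FoundationModels

// @Generable derives the generation schema from this type,
// so the model's output is constrained to these fields.
@Generable
struct DocumentClassification {
    let category: String       // document class label
    let confidence: Double     // model-reported confidence score
    let requiresReview: Bool   // flag for human review
}
```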

This pattern works for document triage, form extraction, and any task where model output needs to map directly to a Swift type.
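A schema constrains the shape of the output, not its values, so the data layer should still validate what it receives. The sketch below shows one possible triage step. `ClassifiedDocument` is a plain stand-in with the same fields as the generated type so that the example runs without the framework; `TriageDecision`, `triage`, and the 0.8 threshold are assumptions for illustration.

```swift
/// Plain stand-in mirroring the fields of the generated classification type.
struct ClassifiedDocument {
    let category: String
    let confidence: Double
    let requiresReview: Bool
}

enum TriageDecision: Equatable {
    case autoFile(category: String)
    case humanReview
    case reject   // output was well-formed but not usable
}

func triage(_ result: ClassifiedDocument,
            allowedCategories: Set<String>,
            minimumConfidence: Double = 0.8) -> TriageDecision {
    // Reject values outside the expected range or vocabulary.
    guard (0.0...1.0).contains(result.confidence),
          allowedCategories.contains(result.category) else {
        return .reject
    }
    // Send anything flagged or low-confidence to a person.
    if result.requiresReview || result.confidence < minimumConfidence {
        return .humanReview
    }
    return .autoFile(category: result.category)
}
```

Keeping this check outside the model call means the same rules apply whichever backend produced the classification.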

The Hardware Requirement Problem

Apple Foundation Models require an A17 Pro or newer A-series chip, or an M-series chip. That excludes iPhone 15 standard, iPhone 14, and everything older, which is still a meaningful share of active devices depending on your target market.

Core ML has no such restriction. A quantized classification model runs on A12 Bionic (iPhone XS, 2018) and later. If your addressable market includes older hardware, Core ML with your own model is the only path to consistent on-device AI behavior.

Segment your analytics by device capability before committing to a Foundation Models-first architecture. If 30% of your target audience is on pre-A17 hardware, you need a fallback that isn't "no AI feature."
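The segmentation itself is simple arithmetic over whatever device breakdown your analytics export provides. A minimal sketch, assuming you can label each segment as supported or not; `DeviceSegment`, `unsupportedShare`, `needsFallback`, and the 5% default threshold are hypothetical:

```swift
/// One row of a device breakdown exported from analytics.
struct DeviceSegment {
    var chipName: String
    var supportsFoundationModels: Bool
    var activeUsers: Int
}

/// Fraction of active users on hardware that cannot run Foundation Models.
func unsupportedShare(of segments: [DeviceSegment]) -> Double {
    let total = segments.reduce(0) { $0 + $1.activeUsers }
    guard total > 0 else { return 0 }
    let unsupported = segments
        .filter { !$0.supportsFoundationModels }
        .reduce(0) { $0 + $1.activeUsers }
    return Double(unsupported) / Double(total)
}

/// True when enough users are excluded that a Core ML fallback is worth building.
func needsFallback(_ segments: [DeviceSegment], threshold: Double = 0.05) -> Bool {
    return unsupportedShare(of: segments) > threshold
}
```

With 300 users on unsupported chips and 700 on supported ones, `unsupportedShare` comes to 0.3, the 30% case described above. The threshold is a product decision, not a technical one.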

Privacy Architecture That Holds Under Scrutiny

The phrase "on-device AI" appears in a lot of product copy. It doesn't always mean what it implies.

A privacy guarantee that holds under scrutiny requires three things: the model weights are local, inference runs locally, and no derivative of the input — embeddings, logs, telemetry — is transmitted. Meeting two of three is not the same as meeting all three.
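The three conditions can be encoded as a checklist that fails unless all of them hold. This is a sketch for design reviews or release checks, not a runtime enforcement mechanism; `PrivacyAudit` and its fields are hypothetical names.

```swift
/// Answers gathered from a review of one inference feature.
struct PrivacyAudit {
    var weightsStoredLocally: Bool
    var inferenceRunsLocally: Bool
    /// True if embeddings, logs, or telemetry derived from the input are sent anywhere.
    var transmitsDerivedData: Bool

    /// Human-readable list of the conditions this feature fails.
    var unmetConditions: [String] {
        var unmet: [String] = []
        if !weightsStoredLocally { unmet.append("model weights are not local") }
        if !inferenceRunsLocally { unmet.append("inference does not run locally") }
        if transmitsDerivedData { unmet.append("derived data is transmitted") }
        return unmet
    }

    /// The guarantee holds only when every condition is met.
    var holdsUnderScrutiny: Bool {
        return unmetConditions.isEmpty
    }
}
```

A feature with local weights and local inference that still uploads embeddings reports one unmet condition, so `holdsUnderScrutiny` is false.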

For the apps 3Nsofts builds — health tracking, financial analysis, field operations — the guarantee is zero bytes of user data sent to any server. That requires Core ML with locally stored weights, not a cloud LLM with a privacy policy. The comparison between Core ML and cloud API approaches breaks down the latency, cost, and compliance differences in concrete terms.

The off-grid AI case study shows what this looks like in a production app where connectivity is intermittent and the privacy requirement is non-negotiable.

What This Means for Your Build Decision

Apple Foundation Models give you a capable, privacy-respecting language model with no infrastructure cost and no API key management. The constraints are real — hardware minimums, no fine-tuning, no guaranteed local execution for every request.

Core ML gives you full control, verifiable privacy, and hardware reach back to 2018. The constraint is that you own the model selection, quantization, and evaluation work.

Most production apps in regulated verticals end up using both: Core ML for the privacy-critical inference path, Foundation Models for auxiliary features where the hardware requirement is acceptable and the task fits within its capability.

The 2026 overview of on-device AI across Core ML, Foundation Models, and the Neural Engine goes deeper on the architecture patterns for combining both.

If you're building in health, fintech, legal, or field-ops and need this architecture implemented without the research overhead, 3Nsofts builds exactly this. Fixed scope, published prices, no cloud dependency. More at 3nsofts.com.


FAQs

What is an Apple on-device LLM? An Apple on-device LLM is a language model that runs inference directly on Apple Silicon — the Neural Engine, GPU, or CPU — without sending data to a remote server. In 2026, this includes models compiled with Core ML Tools and the system-level Foundation Models introduced with Apple Intelligence.

Does Apple Intelligence run entirely on the device? Not always. Apple Intelligence uses an on-device model for most tasks on A17 Pro and M-series hardware. For tasks that exceed the on-device model's capability, it routes to Private Cloud Compute — Apple's server infrastructure. Requests processed there leave the device, though Apple's architecture is designed to prevent Apple itself from reading the content.

What is the difference between Core ML and Apple Foundation Models? Core ML is a framework for running your own compiled models on-device. You control the weights, the quantization, and the inference pipeline. Apple Foundation Models is an API for Apple's system-level language model. You don't control the weights and cannot fine-tune it, but you get structured output generation and tool calling without shipping your own model.

Which iPhones support Apple Foundation Models in 2026? Apple Foundation Models require an A17 Pro or newer A-series chip, or an M-series chip: iPhone 15 Pro and Pro Max, iPhone 16 and later (all models), and all M-series iPads and Macs. iPhone 15 standard and all iPhone 14 models do not support Foundation Models inference.

Can I guarantee that no user data leaves the device? Yes, but only with Core ML using locally stored model weights. The Apple Foundation Models API can route requests to Private Cloud Compute for complex tasks, which means data leaves the device. If your compliance requirement is absolute — zero bytes off the device — Core ML with your own model is the only architecture that provides a verifiable guarantee.

How fast is on-device LLM inference on Apple Silicon? Quantized classification and embedding models running via Core ML complete inference in under 10 ms on the Apple Neural Engine. Generative tasks using Apple Foundation Models typically take 200–800 ms depending on output length and model load. For real-time UI interactions, that difference is significant: classification is fast enough to be invisible, generation is not.

What model formats does Core ML support? Core ML accepts models converted from PyTorch, TensorFlow, and scikit-learn using Core ML Tools. Recent Core ML Tools releases no longer include a direct ONNX converter, so ONNX models are usually converted from their source framework instead. The current package format is .mlpackage. Models can be quantized to INT8, INT4, or lower precision to reduce size and improve Neural Engine throughput.
