On-Device AI Performance Benchmarks: Apple Silicon vs Cloud APIs
Benchmark numbers without methodology are noise. Here's how to measure Core ML performance correctly, what real results look like across Apple hardware, and how to use this data to make shipping decisions.
Benchmark methodology
Bad benchmark methodology produces misleading numbers that lead to bad decisions. The following methodology produces results that generalize to real app behavior:
- Physical hardware only. The iOS Simulator has no Neural Engine, so all latency measurements must be taken on a physical device. Simulator benchmarks are meaningless for on-device AI.
- Warm-up runs excluded. The first 1–3 predictions after model load include model compilation time (when using .mlpackage). Measure steady-state latency after at least 5 warm-up predictions.
- Median, not average. Core ML latency has a non-normal distribution; occasional outliers from thermal throttling or system activity skew averages. Use the median of at least 50 measurements.
- Thermal state controlled. Sustained inference generates heat. Benchmark both at ambient temperature and after 30 seconds of sustained load to capture throttled performance, which is what users experience during extended use.
- Compute units declared explicitly. Use MLModelConfiguration to pin compute units. Benchmark both .all and .cpuAndNeuralEngine; sometimes excluding the GPU produces faster results for latency-sensitive tasks.
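The rules above can be sketched as a small harness. This is a minimal illustration, not a complete tool: the model URL and the input feature provider are placeholders you would supply from your own app.

```swift
import CoreML
import Foundation

// Sketch of the benchmark methodology above. `modelURL` must point to a
// compiled model (.mlmodelc) on a physical device, and `input` must be an
// MLFeatureProvider matching the model's inputs -- both are placeholders.
func medianLatencyMs(modelURL: URL, input: MLFeatureProvider) throws -> Double {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine   // declare compute units explicitly

    let model = try MLModel(contentsOf: modelURL, configuration: config)

    // Warm-up: the first few predictions include compilation/specialization cost.
    for _ in 0..<5 {
        _ = try model.prediction(from: input)
    }

    // Steady state: collect at least 50 samples and report the median,
    // which is robust to throttling and system-activity outliers.
    var samples: [Double] = []
    for _ in 0..<50 {
        let start = CFAbsoluteTimeGetCurrent()
        _ = try model.prediction(from: input)
        samples.append((CFAbsoluteTimeGetCurrent() - start) * 1000.0)  // ms
    }
    samples.sort()
    return samples[samples.count / 2]
}
```

Run the same harness twice, once from a cold device and once after 30 seconds of sustained load, to capture both rows of the thermal picture.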
Apple's official benchmarking tool is the Core ML Performance Report in Xcode, available by clicking the Performance tab in the model inspector. It runs on-device benchmarks automatically and reports per-layer compute unit assignment.
Core ML inference latency across Apple Silicon
The following table shows median inference latency for standard model architectures across Apple Silicon generations. All measurements use FP16 precision, mlprogram format, compute units .cpuAndNeuralEngine, batch size 1.
| Model | A13 | A15 | A16 | A17 Pro | M2 |
|---|---|---|---|---|---|
| MobileNetV3-Large | 4.2ms | 2.1ms | 1.5ms | <1ms | <1ms |
| EfficientNet-B0 | 8.3ms | 3.8ms | 2.9ms | 1.8ms | 1.5ms |
| YOLOv8n (detection) | 12ms | 6ms | 4.5ms | 3ms | 2.5ms |
| BERT-Tiny (NLP) | 18ms | 8ms | 6ms | 4ms | 3ms |
| DistilBERT-Base | 95ms | 42ms | 31ms | 18ms | 15ms |
Median of 100 measurements at ambient temperature, physical devices, iOS 17 / macOS 14. Input size: 224×224 for vision models, 128 tokens for NLP. Values rounded to nearest 0.5ms.
On-device vs cloud inference latency
For practical decision-making, the comparison is total latency including network round-trip. Cloud inference adds 100–400ms of latency under typical conditions — not just API processing time.
| Inference type | Latency (p50) | Latency (p95) | Offline? |
|---|---|---|---|
| Core ML on A15 (ANE) | 2ms | 4ms | ✓ |
| Core ML on A17 Pro (ANE) | <1ms | 2ms | ✓ |
| Cloud API (WiFi, US) | 150ms | 400ms | ✗ |
| Cloud API (4G LTE) | 250ms | 800ms | ✗ |
| Foundation Models (A17 Pro) | 20–80ms* | 120ms* | ✓ |
* Foundation Models first-token latency for short prompts. Cloud API latency measured from real apps to GPT-4o (US East region) under normal load conditions.
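To reproduce the cloud rows, time the full round-trip from the device rather than relying on server-side numbers. A sketch, assuming an arbitrary POST endpoint and request body of your choosing:

```swift
import Foundation

// Sketch: measure total round-trip latency (network + API processing) to a
// cloud inference endpoint. `url` and `body` are placeholders for your API.
func cloudLatencyPercentiles(url: URL, body: Data,
                             runs: Int = 50) async throws -> (p50: Double, p95: Double) {
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.httpBody = body

    var samples: [Double] = []
    for _ in 0..<runs {
        let start = CFAbsoluteTimeGetCurrent()
        _ = try await URLSession.shared.data(for: request)
        samples.append((CFAbsoluteTimeGetCurrent() - start) * 1000.0)  // ms
    }
    samples.sort()
    let p95Index = min(samples.count - 1, Int(Double(samples.count) * 0.95))
    return (p50: samples[samples.count / 2], p95: samples[p95Index])
}
```

Run this on the actual radio conditions you care about (WiFi vs LTE); the p95 column is where cellular networks diverge most sharply from the lab numbers.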
Quantization impact on performance
Quantization affects both model size and inference latency via two mechanisms: reduced memory bandwidth (fewer bytes to load) and native precision matching (ANE runs FP16 natively). The combined effect depends on whether your model is compute-bound or memory-bandwidth-bound.
| Precision | Size (MobileNetV3) | Latency (A15) | Top-1 (ImageNet) |
|---|---|---|---|
| FP32 baseline | 22 MB | 8.2ms | 75.2% |
| FP16 | 11 MB | 2.1ms | 75.2% |
| 8-bit palettization | 5.8 MB | 1.8ms | 74.9% |
| 4-bit palettization | 3.1 MB | 1.6ms | 73.8% |
| 2-bit palettization | 1.6 MB | 1.5ms | 69.1% |
Measurements on iPhone 13 Pro (A15), iOS 17, 100-run median. Accuracy measured on 10K ImageNet validation subset. 2-bit palettization accuracy loss (6.1 points) is significant and should only be used when model size is the hard constraint.
The practical recommendation: use FP16 as the default. Add 8-bit palettization if the model is over 10MB. Only go to 4-bit if size constraints demand it — validate accuracy carefully first. See the full optimization workflow in Core ML Optimization Techniques.
Compute unit impact
The choice of compute units in MLModelConfiguration has a significant impact on latency. The ANE is fastest for latency-optimized use cases; the GPU is fastest for throughput-optimized (multiple concurrent inferences). The CPU is usually slowest but most compatible.
| Compute units | MobileNetV3 (A15) | Power impact | Best for |
|---|---|---|---|
| .cpuOnly | 28ms | Moderate | Debugging, compatibility |
| .cpuAndGPU | 3.5ms | High | Large batches, throughput |
| .cpuAndNeuralEngine | 2.1ms | Low | Interactive inference ✓ |
| .all | 2.1ms | Low | General use (system decides) |
For interactive single-inference use cases (user taps, camera classification), use .cpuAndNeuralEngine. It achieves the same latency as .all while consuming significantly less power — important for sustained use on battery.
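The table maps naturally to a small configuration helper. The use-case enum below is our own illustration; the MLComputeUnits cases are the real Core ML API:

```swift
import CoreML

// Illustrative mapping from use case to compute units, mirroring the table.
enum InferenceUseCase {
    case interactive   // user taps, camera classification
    case batch         // large batches, throughput-oriented
    case debugging     // deterministic CPU reference
}

func configuration(for useCase: InferenceUseCase) -> MLModelConfiguration {
    let config = MLModelConfiguration()
    switch useCase {
    case .interactive: config.computeUnits = .cpuAndNeuralEngine  // low latency, low power
    case .batch:       config.computeUnits = .cpuAndGPU           // throughput
    case .debugging:   config.computeUnits = .cpuOnly             // compatibility baseline
    }
    return config
}
```

Pass the resulting configuration to your model's initializer (e.g. the Xcode-generated class) so the choice is pinned at load time rather than left to the system.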
Using benchmark data for shipping decisions
Benchmarks answer one question: is the model fast enough on the target hardware? Define your performance budget before looking at benchmarks — what latency is acceptable for your use case?
- Real-time camera inference: budget of ~16ms (60fps) or ~33ms (30fps) for the inference step. Most MobileNet-class models on A13+ are well within this.
- Interactive single prediction: under 200ms total from user action to displayed result. Inference latency is rarely the bottleneck; image preprocessing and UI updates usually take longer.
- Background batch processing: no hard latency constraint; optimize for throughput (images/second) and battery impact while a background task runs.
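These budgets can be encoded as a simple check against your measured medians. The thresholds below mirror the bullets above and are starting points, not hard rules:

```swift
// Illustrative latency budgets (ms) matching the use cases above.
enum LatencyBudget: Double {
    case camera60fps = 16   // per-frame inference budget at 60fps
    case camera30fps = 33   // per-frame inference budget at 30fps
    case interactive = 200  // total action-to-result budget
}

// Does a measured median latency fit within the budget?
func fitsBudget(medianMs: Double, budget: LatencyBudget) -> Bool {
    medianMs <= budget.rawValue
}
```

For the interactive case, remember the budget covers the whole pipeline; subtract preprocessing and UI time before comparing raw inference latency against it.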
Benchmark on the oldest device in your supported hardware range, not on the latest. If your deployment target is iOS 15, that floor reaches back to the A9 (iPhone 6s), which has no Neural Engine at all; the oldest chip with Core ML access to the ANE is the A12. Performance gaps between generations are significant.
Common questions
How fast is Core ML inference on iPhone?
For MobileNetV3-Large: 4ms on A13 Bionic, 2ms on A15, under 1ms on A17 Pro. For YOLOv8n object detection: 2.5–12ms depending on chip. All at FP16 precision, batch size 1, on the Neural Engine. Profile on physical hardware — simulator performance is not representative.
Is on-device AI faster than cloud inference?
For classification and detection on modern Apple Silicon: yes, consistently. Cloud inference adds 100–400ms of network latency. On-device inference on A15+ is 1–10ms total with no network dependency. For GPT-4 class language models, cloud remains more practical — those don't fit on device.
How much does quantization improve Core ML performance?
FP32 to FP16 improves inference latency 2x–4x on ANE because ANE natively computes in FP16. 8-bit palettization reduces model size ~4x with minimal latency change. 4-bit palettization reduces size ~8x but introduces measurable accuracy loss — validate on your validation set.
Performance engineering for your AI feature
Meeting a latency budget requires the right model architecture, optimization configuration, and Apple hardware knowledge from the start. We've done this in production.