
On-Device AI Performance Benchmarks: Apple Silicon vs Cloud APIs

Benchmark numbers without methodology are noise. Here's how to measure Core ML performance correctly, what real results look like across Apple hardware, and how to use this data to make shipping decisions.

By Ehsan Azish · 3NSOFTS · March 2026

Benchmark methodology

Bad benchmark methodology produces misleading numbers that lead to bad decisions. The following methodology produces results that generalize to real app behavior:

  • Physical hardware only. The iOS Simulator doesn't have a Neural Engine. All latency measurements must be taken on a physical device. Simulator benchmarks are meaningless for on-device AI.
  • Warm-up runs excluded. The first 1–3 predictions after model load include model compilation time (when using .mlpackage). Measure steady-state latency after at least 5 warm-up predictions.
  • Median, not average. Core ML latency has a non-normal distribution — occasional outliers from thermal throttling or system activity skew averages. Use median of at least 50 measurements.
  • Thermal state controlled. Sustained inference generates heat. Benchmark at both ambient temperature and after 30 seconds of sustained load to measure throttled performance — what users experience during extended use.
  • Compute units declared explicitly. Use MLModelConfiguration to pin compute units. Benchmark both .all and .cpuAndNeuralEngine — sometimes excluding the GPU produces faster results for latency-sensitive tasks.
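The methodology above can be condensed into a small harness. A minimal sketch, assuming a compiled model URL and a prepared MLFeatureProvider input (both placeholders for your own model) — it pins compute units, discards warm-up runs, and reports the median of 50 steady-state samples. This only runs on a physical device.

```swift
import CoreML
import Foundation

// Median of latency samples in ms (upper median for even counts).
func medianMs(_ samples: [Double]) -> Double {
    let sorted = samples.sorted()
    return sorted[sorted.count / 2]
}

// Benchmark harness following the methodology above. `modelURL` and
// `input` stand in for your own compiled model and feature provider.
func benchmarkMedianMs(modelURL: URL, input: MLFeatureProvider) throws -> Double {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine  // pin compute units explicitly

    let model = try MLModel(contentsOf: modelURL, configuration: config)

    // Warm-up: early predictions include compilation/specialization cost.
    for _ in 0..<5 { _ = try model.prediction(from: input) }

    // Steady state: at least 50 samples, median rather than mean.
    var samples: [Double] = []
    for _ in 0..<50 {
        let start = CFAbsoluteTimeGetCurrent()
        _ = try model.prediction(from: input)
        samples.append((CFAbsoluteTimeGetCurrent() - start) * 1000)
    }
    return medianMs(samples)
}
```

Run it twice — once at ambient temperature, once after sustained load — to capture both rows of the thermal picture described above.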

Apple's official benchmarking tool is the Core ML Performance Report in Xcode: open the model and select the Performance tab in the model inspector. It runs benchmarks on a connected physical device and reports per-layer compute unit assignment.

Core ML inference latency across Apple Silicon

The following table shows median inference latency for standard model architectures across Apple Silicon generations. All measurements use FP16 precision, mlprogram format, compute units .cpuAndNeuralEngine, batch size 1.

| Model | A13 | A15 | A16 | A17 Pro | M2 |
| --- | --- | --- | --- | --- | --- |
| MobileNetV3-Large | 4.2ms | 2.1ms | 1.5ms | <1ms | <1ms |
| EfficientNet-B0 | 8.3ms | 3.8ms | 2.9ms | 1.8ms | 1.5ms |
| YOLOv8n (detection) | 12ms | 6ms | 4.5ms | 3ms | 2.5ms |
| BERT-Tiny (NLP) | 18ms | 8ms | 6ms | 4ms | 3ms |
| DistilBERT-Base | 95ms | 42ms | 31ms | 18ms | 15ms |

Median of 100 measurements at ambient temperature, physical devices, iOS 17 / macOS 14. Input size: 224×224 for vision models, 128 tokens for NLP. Values rounded to nearest 0.5ms.

On-device vs cloud inference latency

For practical decision-making, the comparison is total latency including network round-trip. Cloud inference adds 100–400ms of latency under typical conditions — not just API processing time.

| Inference type | Latency (p50) | Latency (p95) | Offline? |
| --- | --- | --- | --- |
| Core ML on A15 (ANE) | 2ms | 4ms | Yes |
| Core ML on A17 Pro (ANE) | <1ms | 2ms | Yes |
| Cloud API (WiFi, US) | 150ms | 400ms | No |
| Cloud API (4G LTE) | 250ms | 800ms | No |
| Foundation Models (A17 Pro) | 20–80ms* | 120ms* | Yes |

* Foundation Models first-token latency for short prompts. Cloud API latency measured from real apps to GPT-4o (US East region) under normal load conditions.
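The p50/p95 columns above come straight from sample percentiles. A minimal nearest-rank sketch (the function name is mine, not from any benchmarking library):

```swift
// Nearest-rank percentile over latency samples in ms.
// p = 50 gives the median, p = 95 the tail latency.
func percentileMs(_ p: Double, of samples: [Double]) -> Double {
    precondition(!samples.isEmpty && p > 0 && p <= 100)
    let sorted = samples.sorted()
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up)) - 1
    return sorted[rank]
}
```

`percentileMs(95, of: samples)` over 100+ runs yields the p95 figures of the kind reported in the table above.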

Quantization impact on performance

Quantization affects both model size and inference latency via two mechanisms: reduced memory bandwidth (fewer bytes to load) and native precision matching (ANE runs FP16 natively). The combined effect depends on whether your model is compute-bound or memory-bandwidth-bound.
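The memory-bandwidth mechanism is easy to quantify: weight storage is parameter count times bits per weight. A back-of-envelope sketch (the ~5.4M-parameter figure for MobileNetV3-Large is an assumption from the literature; real .mlpackage files also carry metadata and, for palettized models, lookup tables):

```swift
// Approximate weight storage in MB at a given precision.
// Illustrative only — ignores metadata and palettization tables.
func approximateWeightMB(parameters: Int, bitsPerWeight: Int) -> Double {
    Double(parameters) * Double(bitsPerWeight) / 8 / 1_000_000
}
// For ~5.4M parameters: 32-bit ≈ 21.6 MB, 16-bit ≈ 10.8 MB —
// consistent with the 22 MB / 11 MB rows in the table below.
```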

| Precision | Size (MobileNetV3) | Latency (A15) | Top-1 (ImageNet) |
| --- | --- | --- | --- |
| FP32 baseline | 22 MB | 8.2ms | 75.2% |
| FP16 | 11 MB | 2.1ms | 75.2% |
| 8-bit palettization | 5.8 MB | 1.8ms | 74.9% |
| 4-bit palettization | 3.1 MB | 1.6ms | 73.8% |
| 2-bit palettization | 1.6 MB | 1.5ms | 69.1% |

Measurements on iPhone 13 Pro (A15), iOS 17, 100-run median. Accuracy measured on 10K ImageNet validation subset. 2-bit palettization accuracy loss (6.1 points) is significant and should only be used when model size is the hard constraint.

The practical recommendation: use FP16 as the default. Add 8-bit palettization if the model is over 10MB. Only go to 4-bit if size constraints demand it — validate accuracy carefully first. See the full optimization workflow in Core ML Optimization Techniques.
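That recommendation can be written down as a decision rule. A sketch using the thresholds above (the enum and function are mine, illustrative only — the accuracy validation step remains mandatory before shipping):

```swift
enum WeightPrecision { case fp16, palettized8Bit, palettized4Bit }

// Pick a starting precision from the FP16 model size and a size budget.
// Thresholds follow the text: FP16 by default, 8-bit above ~10 MB,
// 4-bit only when the budget forces it.
func startingPrecision(fp16SizeMB: Double, budgetMB: Double) -> WeightPrecision {
    if fp16SizeMB <= 10 && fp16SizeMB <= budgetMB { return .fp16 }
    if fp16SizeMB / 2 <= budgetMB { return .palettized8Bit }  // ~2x below FP16
    return .palettized4Bit  // ~4x below FP16; validate accuracy first
}
```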

Compute unit impact

The choice of compute units in MLModelConfiguration has a significant impact on latency. The ANE is fastest for latency-optimized use cases; the GPU is fastest for throughput-optimized (multiple concurrent inferences). The CPU is usually slowest but most compatible.

| Compute units | MobileNetV3 (A15) | Power impact | Best for |
| --- | --- | --- | --- |
| .cpuOnly | 28ms | Moderate | Debugging, compatibility |
| .cpuAndGPU | 3.5ms | High | Large batches, throughput |
| .cpuAndNeuralEngine | 2.1ms | Low | Interactive inference ✓ |
| .all | 2.1ms | Low | General use (system decides) |

For interactive single-inference use cases (user taps, camera classification), use .cpuAndNeuralEngine. It achieves the same latency as .all while consuming significantly less power — important for sustained use on battery.

Using benchmark data for shipping decisions

Benchmarks answer one question: is the model fast enough on the target hardware? Define your performance budget before looking at benchmarks — what latency is acceptable for your use case?

  • Real-time camera inference: budget of ~16ms (60fps) or ~33ms (30fps) for the inference step. Most MobileNet-class models on A13+ are well within this.
  • Interactive single prediction: under 200ms total from user action to displayed result. Inference latency is rarely the bottleneck — image preprocessing and UI updates usually take longer.
  • Background batch processing: no hard latency constraint — optimize for throughput (images/second) and battery impact while a background task runs.
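The real-time budget in the first bullet is simple arithmetic worth keeping at hand (function names are mine):

```swift
// Per-frame inference budget in ms for a target frame rate.
func frameBudgetMs(fps: Double) -> Double { 1000 / fps }

// Does a measured median latency fit the frame budget?
func fitsRealtime(medianMs: Double, fps: Double) -> Bool {
    medianMs <= frameBudgetMs(fps: fps)
}
// 60 fps → ~16.7 ms budget: MobileNetV3 at 2.1 ms on an A15 fits with
// headroom; DistilBERT at 42 ms does not, even at 30 fps.
```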

Benchmark on the oldest device in your supported hardware range, not on the latest. If your deployment target is iOS 15, that reaches back to the A9 (iPhone 6s) — a chip with no Neural Engine available to Core ML, so your model falls back to GPU/CPU there. Performance gaps between generations are significant.

Common questions

How fast is Core ML inference on iPhone?

For MobileNetV3-Large: 4ms on A13 Bionic, 2ms on A15, under 1ms on A17 Pro. For EfficientDet object detection: 12–40ms depending on chip. All at FP16 precision, batch size 1, on Neural Engine. Profile on physical hardware — simulator performance is not representative.

Is on-device AI faster than cloud inference?

For classification and detection on modern Apple Silicon: yes, consistently. Cloud inference adds 100–400ms of network latency. On-device inference on A15+ is 1–10ms total with no network dependency. For GPT-4 class language models, cloud remains more practical — those don't fit on device.

How much does quantization improve Core ML performance?

FP32 to FP16 improves inference latency 2x–4x on ANE because ANE natively computes in FP16. 8-bit palettization reduces model size ~4x with minimal latency change. 4-bit palettization reduces size ~8x but introduces measurable accuracy loss — validate on your validation set.
