Core ML Optimization Techniques for Production iOS Apps
Shipping a Core ML model is the easy part. Shipping one that's fast enough, small enough, and accurate enough for production is where most teams get stuck. This is the optimization path that works.
Why Core ML models need optimization
A PyTorch model converted to Core ML with default settings ships with FP32 weights — the same precision used during training. On a device with 6GB of RAM and tight battery constraints, an unoptimized 100MB image classifier is a problem. The Neural Engine doesn't execute FP32 natively; weights get down-cast at runtime, wasting memory bandwidth and leaving performance on the table.
Apple's coremltools Python package provides a full optimization pipeline accessible via the ct.optimize namespace. As documented in WWDC 2023 — Optimize your Core ML usage, applying these techniques together can achieve 4x smaller models with less than 2% accuracy loss.
The four optimization techniques
1. FP16 precision conversion
The lowest-risk optimization. Converting from FP32 to FP16 halves model size with negligible accuracy loss for most architectures. It's the baseline that should always be applied:
```python
import coremltools as ct

model = ct.convert(
    torch_model,
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,  # FP16 weights
    inputs=[ct.TensorType(shape=(1, 3, 224, 224))],
)
```

The `mlprogram` format (vs the older `neuralnetwork` format) is required for FP16 and all subsequent optimizations. Use it for any new model targeting iOS 15+.
2. Palettization (weight clustering)
Palettization groups all weight values in a layer into N clusters, storing each weight as an index into a shared lookup table. At 8-bit palettization (256 clusters), model size drops to roughly 25% of FP32. At 4-bit (16 clusters), it drops to ~12% — with accuracy loss that's task-dependent.
```python
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

config = OptimizationConfig(
    global_config=OpPalettizerConfig(nbits=8, mode="kmeans")
)
optimized = palettize_weights(model, config=config)
```

K-means palettization produces better accuracy than linear quantization at the same bit width, because cluster centroids are fit to the actual weight distribution. Use it for convolutional layers and attention projections, and validate accuracy on your test set after each reduction step.
3. Linear quantization (INT8)
Post-training quantization (PTQ) converts weights to INT8 using linear scaling. It's faster to apply than palettization but typically costs slightly more accuracy. Use ct.optimize.coreml.linear_quantize_weights for weight quantization; quantizing activations as well additionally requires a calibration dataset — without one, only weight quantization applies.
For quantization-sensitive architectures (transformers, models with batch normalization), prefer palettization at 8-bit to linear INT8. The accuracy-size tradeoffs differ significantly by architecture.
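To make the scaling step concrete, here is a minimal pure-Python sketch of symmetric linear weight quantization (the function names are illustrative, not coremltools APIs; the real entry point is `linear_quantize_weights` above):

```python
def quantize_linear_symmetric(weights, nbits=8):
    """Map float weights to signed integers with one per-tensor scale."""
    qmax = 2 ** (nbits - 1) - 1  # 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from integers plus scale."""
    return [q * scale for q in quantized]

weights = [-1.0, 0.5, 0.25, -0.125]
q, scale = quantize_linear_symmetric(weights)
recovered = dequantize(q, scale)
# each recovered weight lands within half a quantization step (scale / 2) of the original
```

That per-weight rounding error, bounded by half a step, is exactly where PTQ accuracy loss comes from: fewer bits means a coarser step.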
4. Pruning
Pruning sets weights below a magnitude threshold to zero. At high sparsity (70%+), this enables significant compression because zero values compress extremely well. Apple's coremltools supports structured and unstructured pruning via ct.optimize.torch.pruning (applied before conversion) and ct.optimize.coreml.prune_weights (post-conversion).
Pruning alone doesn't reduce inference time unless the compute graph can skip zero operations — which requires hardware support for sparse computation. On Apple Neural Engine, structured sparsity does accelerate inference. Unstructured sparsity primarily reduces model size and memory bandwidth.
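The magnitude rule itself fits in a few lines. An illustrative pure-Python helper (not a coremltools API; the real entry points are `prune_weights` and `ct.optimize.torch.pruning` above) that zeroes the smallest weights until a target sparsity is reached:

```python
def prune_by_magnitude(weights, target_sparsity):
    """Zero the smallest-magnitude weights until `target_sparsity` is reached."""
    k = int(len(weights) * target_sparsity)
    # indices ordered by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned

w = [0.9, -0.01, 0.4, 0.002, -0.7, 0.05, 0.3, -0.08]
sparse = prune_by_magnitude(w, target_sparsity=0.5)
# four of the eight weights are now exactly zero; the large-magnitude ones survive
```

The runs of zeros this produces are what compresses so well on disk; whether they also speed up inference depends on the sparsity support described above.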
Targeting the Apple Neural Engine
The ANE delivers the best performance-per-watt of any compute path, but not all operations are ANE-compatible. Core ML will automatically fall back to GPU or CPU for unsupported ops. Falling back for a few ops is fine; falling back for the entire model because of one unsupported layer is a common performance bug.
Use Xcode's Core ML Performance Report (via the Core ML model inspector) to see per-layer compute unit assignment. The key rules:
- Use the `mlprogram` format, not `neuralnetwork` — ANE support is significantly better in `mlprogram`.
- Avoid custom PyTorch extensions and exotic activation functions — replace them with ANE-compatible equivalents (ReLU, GELU, and Sigmoid are all fine).
- Keep batch size at 1 for interactive inference — the ANE is optimized for small-batch, high-frequency requests.
- Profile on physical hardware, not the Simulator — the Simulator has no ANE, and its performance characteristics are completely different.
Set compute units explicitly in MLModelConfiguration for predictable behavior:
```swift
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine  // exclude GPU for latency-optimized tasks
let model = try MyModel(configuration: config)
```
Measured impact: before and after
Typical results from applying the full pipeline (FP16 → 8-bit palettization → ANE targeting) to a MobileNetV3-Large model:
| Configuration | Model size | Latency (A15) | Top-1 accuracy |
|---|---|---|---|
| FP32 (baseline) | 22 MB | 8.2 ms | 75.2% |
| FP16 | 11 MB | 2.1 ms | 75.2% |
| FP16 + 8-bit palette | 5.8 MB | 1.8 ms | 74.9% |
| FP16 + 4-bit palette | 3.1 MB | 1.6 ms | 73.8% |
Results on iPhone 13 Pro (A15 Bionic) running iOS 17, ImageNet validation set subset. Latency measured as median of 100 predictions. Accuracy loss at 8-bit palette is within measurement noise for most production use cases.
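When reproducing numbers like these, the harness matters as much as the model. A minimal sketch of a median-latency harness in Python, where `predict` stands in for any zero-argument callable (for example `lambda: mlmodel.predict(inputs)`); the warm-up and run counts here are choices, not requirements:

```python
import statistics
import time

def median_latency_ms(predict, warmup=10, runs=100):
    """Median wall-clock latency of `predict`, in milliseconds."""
    for _ in range(warmup):
        predict()  # discard warm-up: early predictions can include load/compile cost
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

The median, unlike the mean, keeps a single thermal-throttle or scheduler hiccup from skewing the reported number.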
The optimization workflow
The recommended sequence for any new Core ML model going to production:
1. Convert with FP16 and the `mlprogram` format. Run accuracy validation. This is your baseline.
2. Profile with Xcode's Core ML Performance Report. Identify layers on CPU that shouldn't be. Fix ANE-incompatible ops.
3. Apply 8-bit palettization. Validate accuracy again. If it's within tolerance, this is your production artifact.
4. If the 8-bit model is still too large, try 4-bit palettization. Measure accuracy loss carefully — it varies significantly by architecture and task.
5. Benchmark on the oldest supported device in your target hardware profile, not the newest. Ship the model that meets spec on the worst-case device.
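One practical detail for the size checks in steps 3 and 4: an `.mlpackage` is a directory, not a single file, so comparing the artifact against a size budget means summing the files inside it. A small illustrative helper using only the standard library:

```python
import os

def package_size_mb(path):
    """Total on-disk size of an .mlpackage directory, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)
```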
Common questions
What is Core ML model quantization?
Quantization reduces the numerical precision of model weights from 32-bit floats (FP32) to lower precision formats — typically 16-bit floats (FP16) or 8-bit integers (INT8). This reduces model file size by 2x–4x and improves inference speed on Apple Neural Engine. The coremltools Python package provides palettization and linear quantization via ct.optimize APIs.
How do I target the Apple Neural Engine with Core ML?
Core ML automatically routes compatible operations to the Neural Engine. To maximize ANE usage: use FP16 precision and mlprogram format, profile with Xcode's Core ML Performance Report to see which operations land on each compute unit, and fix any custom layers that force CPU fallback.
What is the difference between pruning and palettization in Core ML?
Pruning sets low-magnitude weights to zero, creating sparse weight matrices. Palettization groups weights into shared values stored as indices plus a lookup table. Both reduce model size; palettization often achieves better accuracy-size tradeoffs at 4-bit to 8-bit precision and is the recommended approach for most production models.
Need Core ML expertise on your project?
We optimize Core ML models for production — from size reduction to ANE targeting to architecture decisions that prevent performance issues before they happen.