
Core ML Quantization Trade-offs: FP32 to INT4 in Production

What we measured as we moved classification and text models from FP32 to FP16 to INT8 to INT4 precision on Core ML — and the accuracy cliff that determined where we stopped.

Lab Finding · 3NSOFTS · January 2026 · Status: Research applied to production

Why quantization matters for iOS

On-device ML on iOS has two primary constraints: model size (storage and RAM) and inference speed (battery life, latency). Quantization addresses both by reducing the numerical precision of model weights:

  • FP32 (32-bit float): Full precision, largest size, highest accuracy
  • FP16 (16-bit float): Half the size of FP32, negligible accuracy drop for most models
  • INT8 (8-bit integer): Quarter of FP32 size, small accuracy drop, but must be applied carefully
  • INT4 (4-bit integer): Eighth of FP32 size, significant accuracy drop on complex tasks — see below
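
The error mechanics behind these precision levels can be sketched in a few lines of plain Python: symmetric linear quantization maps weights onto a signed integer grid, and the reconstruction error scales with the grid spacing. (A toy weight vector for illustration, not a real model; Core ML applies this per layer or per channel.)

```python
def quantize_int(values, bits):
    # Symmetric linear quantization: map [-max|w|, +max|w|] onto a
    # signed integer grid, then dequantize back to floats
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

weights = [0.82, -0.31, 0.057, -0.94, 0.008]   # toy weight vector
for bits in (8, 4):
    err = max(abs(a - b) for a, b in zip(weights, quantize_int(weights, bits)))
    print(f"INT{bits} max reconstruction error: {err:.4f}")
```

Halving the bit width from 8 to 4 does not double the error; the grid spacing (and with it the worst-case error) grows by roughly 16x, which is the arithmetic behind the cliff discussed below.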

Apple's coremltools supports quantization via the ct.compression_utils API (superseded by coremltools.optimize.coreml in coremltools 7 and later). The conversion itself is straightforward; evaluating the quality trade-off at each level requires domain-specific testing.

FP32 → FP16: take this one unconditionally

FP16 reduces model size by 50% with negligible accuracy impact on classification tasks, text embedding, and most vision models. The Apple Neural Engine natively executes FP16 models — there is no accuracy penalty from hardware execution. FP32 models may be down-cast at inference anyway on Neural Engine hardware; shipping FP16 gives you the size benefit without the runtime cast.

The only case where FP32 matters over FP16 is a model whose weights or activations approach the limits of float16's numerical range (roughly ±65,504) or need finer resolution than float16 provides. For standard classification and embedding models trained on common frameworks, this is not a practical concern.
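
Python's standard library can demonstrate both failure modes directly, since struct supports the IEEE 754 half-precision format ("e"):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Typical normalized weights survive the round trip almost exactly:
print(to_fp16(0.0317))        # error on the order of 1e-5

# Resolution degrades as magnitude grows (the step size at 1000 is 0.5):
print(to_fp16(1000.3))        # stored as 1000.5

# And the format tops out at 65504 entirely:
try:
    struct.pack("e", 70000.0)
except OverflowError:
    print("70000.0 exceeds float16 range (max 65504)")
```

Models whose activations stay in the usual normalized range never hit either limit, which is why the FP16 recommendation below is unconditional for typical iOS workloads.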

Recommendation: Always ship FP16. There is no downside for iOS targets.

FP16 → INT8: apply per model class

INT8 is where quantization becomes a domain-specific decision. Results we measured:

  • Image classification models: top-1 accuracy drop of 0.3–0.8% for EfficientNet-Lite class models. Acceptable for most production use cases. Model size drops from ~20 MB to ~5 MB in our tests.
  • Object detection (YOLO-class): mAP drop of 1–2%. Acceptable for non-safety applications. The inference latency improvement is significant: detection operations that took 90ms at FP16 ran at 35–45ms at INT8 on the Neural Engine.
  • Text embedding models: Cosine similarity between embeddings degraded perceptibly at INT8 in our tests. Semantic search quality dropped measurably on boundary cases. Evaluate carefully before applying INT8 to embedding models.
  • Small models (<5 MB): INT8 quantization often hurts more than it helps here. The model is already small, and the accuracy degradation is proportionally larger because fewer weights carry the full representation.
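
For the embedding case above, a cheap pre-flight check is to compare embeddings against their quantized counterparts with cosine similarity and watch how far it drifts from 1.0. A self-contained simulation (quantizing the embedding values directly, as a stand-in for the effect of weight quantization on model outputs; the vector here is synthetic):

```python
import math
import random

def quantize(vec, bits):
    # Symmetric linear quantization of one vector (illustrative; Core ML
    # quantizes weights per layer/channel rather than output embeddings)
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vec) / qmax
    return [round(v / scale) * scale for v in vec]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

random.seed(7)
emb = [random.uniform(-1, 1) for _ in range(384)]   # stand-in embedding
for bits in (8, 4):
    print(f"INT{bits} self-similarity: {cosine(emb, quantize(emb, bits)):.5f}")
```

In a real evaluation, run the FP16 and quantized models over a held-out query set and compare retrieval rankings, not just raw similarity.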

FP16 → INT4: the accuracy cliff

INT4 quantization is where we encountered what we are calling the accuracy cliff: a non-linear degradation in output quality beyond INT8 that is not predicted by the quantization ratio alone.

For a food classification model used in a dietary tracking app (custom-trained, 80 categories), accuracy results:

Precision   Top-1 Accuracy   Model Size   Inference (iPhone 15)
FP32        91.2%            48 MB        22 ms
FP16        91.1%            24 MB        12 ms
INT8        90.4%            12 MB        8 ms
INT4        81.7%            6 MB         6 ms

INT4 cost 9.5 points of accuracy relative to FP32, versus a cumulative 0.8-point drop from FP32 to INT8. The INT8→INT4 step saved only 6 MB, and the accuracy cost was wildly out of proportion to that saving: for the food classification use case, falling from 90.4% to 81.7% meant visibly wrong predictions in normal use. We shipped INT8.
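
The non-proportionality is easy to see by computing the marginal accuracy cost per megabyte saved at each step, using the numbers from the table above:

```python
# (precision, top-1 accuracy %, model size MB) from the table above
results = [("FP32", 91.2, 48), ("FP16", 91.1, 24),
           ("INT8", 90.4, 12), ("INT4", 81.7, 6)]

for (p1, acc1, mb1), (p2, acc2, mb2) in zip(results, results[1:]):
    cost = (acc1 - acc2) / (mb1 - mb2)   # accuracy points lost per MB saved
    print(f"{p1} -> {p2}: {cost:.3f} pts/MB")
```

Each step down to INT8 costs well under 0.1 accuracy points per MB saved; the INT8→INT4 step costs over 1.4 points per MB, more than 20x worse. That discontinuity is the cliff.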

Neural Engine vs CPU: which to target

The Apple Neural Engine (ANE) executes FP16 and INT8 models natively and provides significant speed and power efficiency improvements over CPU execution. To target ANE:

  • Use FP16 or INT8 precision — FP32 models may fall back to CPU on some architectures
  • Avoid operations not supported by ANE: custom layers, unsupported activation functions. Verify ANE execution with Xcode's Core ML performance report or the coremlc compiler output
  • ANE is not always the fastest path for small models (<1MB). The overhead of ANE scheduling can exceed the saving for very small models running infrequently

Battery-aware inference: we check ProcessInfo.processInfo.isLowPowerModeEnabled before initiating inference, and queue requests or throttle their frequency when the device is in Low Power Mode. ANE is efficient enough that this only matters for continuous inference (real-time video analysis); for batch or on-demand inference, the power draw is negligible.
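
The gating policy itself is simple enough to sketch. The class below is plain Python for illustration (in the app this lives in Swift, with the flag read from ProcessInfo.processInfo.isLowPowerModeEnabled and the run step calling into Core ML; the names here are hypothetical, not a shipped API):

```python
from collections import deque

class InferenceGate:
    """Defer inference while the device reports Low Power Mode.
    Policy sketch only; class and method names are illustrative."""

    def __init__(self):
        self.pending = deque()

    def submit(self, request, low_power_mode):
        if low_power_mode:
            self.pending.append(request)     # queue instead of running now
            return None
        return self.run(request)

    def drain(self):
        # Call when Low Power Mode ends: flush everything that was queued
        results = [self.run(r) for r in self.pending]
        self.pending.clear()
        return results

    def run(self, request):
        return f"result:{request}"           # placeholder for the Core ML call
```

With this shape, submit returns immediately with None under Low Power Mode and the queued work replays via drain once the device recovers; a throttling variant would instead rate-limit submit rather than queue.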

Production recommendation

Start with FP16. Measure. If model size is still a constraint, apply INT8 to vision and detection models. Do not apply INT4 to custom-trained task-specific models without measuring the accuracy impact on your specific task — the cliff is real and not predictable from the quantization ratio alone.

INT4 may be appropriate for general-purpose text generation models (where quality degrades gracefully) — this is the tradeoff llama.cpp exploits for offgrid:AI. For task-specific classification where correctness per prediction matters, stay at INT8.