Performance Optimization for On-Device Inference
Performance work starts with budgets. This chapter walks through profiling methodology, model-level optimization, runtime controls, and concurrency-aware scheduling for stable user-facing latency.
Define SLO budgets per feature
- Interactive flows: p95 under 250 ms to the first partial response.
- Background enrichment: bounded by the energy and thermal envelope, not by immediate latency.
- Startup: model warm-up under 400 ms on the primary path.
- Memory: no sustained growth across repeated inference loops.
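The budgets above can be expressed directly in code so they are checked rather than remembered. A minimal sketch, assuming hypothetical names (`LatencyBudget`, `withinBudget`) that are not part of any framework:

```swift
// Hypothetical per-feature budget declaration; names are illustrative.
struct LatencyBudget {
    let p95Ms: Int     // tail-latency target for the feature
    let warmUpMs: Int  // model warm-up ceiling on the primary path
}

let interactive = LatencyBudget(p95Ms: 250, warmUpMs: 400)

// Compare a measured p95 against the declared budget.
func withinBudget(measuredP95Ms: Int, budget: LatencyBudget) -> Bool {
    measuredP95Ms <= budget.p95Ms
}
```

Declaring budgets as data also lets dashboards and CI gates consume the same source of truth as the runtime.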
Quantization and model variant strategy
Most production systems need at least two variants: a default quality profile and a fast fallback, with a balanced middle tier often worth adding. Route between them based on latency budget and thermal state rather than hard-coding one global model.
```swift
import Foundation

enum ModelProfile {
    case quality
    case balanced
    case fast
}

// Degrade under thermal pressure first; fall back to the balanced
// profile only when the latency budget is tight.
func selectProfile(thermal: ProcessInfo.ThermalState, budgetMs: Int) -> ModelProfile {
    if thermal == .serious || thermal == .critical { return .fast }
    if budgetMs < 200 { return .balanced }
    return .quality
}
```

Thermal-aware scheduling in Swift 6
Use actor-owned scheduler state to avoid overlapping expensive calls. When thermal pressure rises, reduce the concurrency level before downgrading the quality profile; degrading concurrency first preserves output consistency.
- Observe thermal state changes and adjust queue depth.
- Enforce one heavy request per actor lane for large models.
- Cancel stale tasks when new user intent supersedes old predictions.
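The points above can be sketched as a small actor that owns the admission state. This is a minimal illustration, not a production scheduler; the type name `InferenceScheduler` and the polling backoff are assumptions:

```swift
import Foundation

// Hypothetical actor-owned scheduler: concurrency degrades first under
// thermal pressure, before any quality-profile downgrade.
actor InferenceScheduler {
    private var maxConcurrent = 2
    private var running = 0

    // Shrink queue depth when the system reports thermal pressure.
    func updateThermalState(_ state: ProcessInfo.ThermalState) {
        maxConcurrent = (state == .serious || state == .critical) ? 1 : 2
    }

    // Admit work only while a lane is free; sleep briefly otherwise.
    func run<T: Sendable>(_ work: @Sendable () async throws -> T) async throws -> T {
        while running >= maxConcurrent {
            try await Task.sleep(nanoseconds: 10_000_000) // simple backoff
        }
        running += 1
        defer { running -= 1 }
        return try await work()
    }
}
```

A production version would replace the polling loop with a continuation queue and propagate `Task` cancellation into `work`, but the ownership shape, one actor guarding the admission counter, stays the same.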
Measure what matters
Track startup time, per-token (or per-prediction) latency, memory peaks, cancellation rates, and fallback frequency. Without these signals, teams optimize isolated benchmarks while user-facing tail latencies keep degrading.
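Tail tracking needs nothing exotic: record raw samples and compute percentiles on demand. A minimal sketch, with the `LatencyTracker` name and nearest-rank percentile method as assumptions:

```swift
// Hypothetical latency tracker; keeps raw samples so any percentile
// can be computed later (p50, p95, p99).
struct LatencyTracker {
    private var samplesMs: [Double] = []

    mutating func record(_ ms: Double) {
        samplesMs.append(ms)
    }

    // Nearest-rank percentile over recorded samples; nil until data exists.
    func percentile(_ p: Double) -> Double? {
        guard !samplesMs.isEmpty else { return nil }
        let sorted = samplesMs.sorted()
        let idx = Int((p / 100.0) * Double(sorted.count - 1))
        return sorted[idx]
    }
}
```

Keeping raw samples (or a sketch such as a t-digest for high volume) matters because averaging hides exactly the tail behavior the budgets in this chapter are written against.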