Performance Optimization for On-Device Inference
Performance work starts with budgets. This chapter walks through profiling methodology, model-level optimization, runtime controls, and concurrency-aware scheduling for stable user-facing latency.
Define SLO budgets per feature
- Interactive flows: p95 under 250 ms to the first partial response.
- Background enrichment: bounded by the energy and thermal envelope, not by immediate latency.
- Startup: model warm-up under 400 ms on the primary path.
- Memory: no sustained growth across repeated inference loops.
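The budgets above can be expressed directly in code so they are checked rather than remembered. A minimal sketch, assuming hypothetical names (`LatencyBudget`, `withinBudget`) that are not part of any framework:

```swift
// Hypothetical per-feature budget declaration; names are illustrative.
struct LatencyBudget {
    let p95Ms: Int     // tail-latency target for the feature
    let warmUpMs: Int  // model warm-up ceiling on the primary path
}

let interactive = LatencyBudget(p95Ms: 250, warmUpMs: 400)

// Compare a measured p95 against the declared budget.
func withinBudget(measuredP95Ms: Int, budget: LatencyBudget) -> Bool {
    measuredP95Ms <= budget.p95Ms
}
```

Declaring budgets as data also lets dashboards and CI gates consume the same source of truth as the runtime.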
Quantization and model variant strategy
Most production systems need at least two variants: a default quality profile and a fast fallback, with a balanced middle tier often worth adding. Route between them based on latency budget and thermal state rather than hard-coding one global model.
```swift
import Foundation

enum ModelProfile {
    case quality
    case balanced
    case fast
}

// Degrade under thermal pressure first; fall back to the balanced
// profile only when the latency budget is tight.
func selectProfile(thermal: ProcessInfo.ThermalState, budgetMs: Int) -> ModelProfile {
    if thermal == .serious || thermal == .critical { return .fast }
    if budgetMs < 200 { return .balanced }
    return .quality
}
```

Thermal-aware scheduling in Swift 6
Use actor-owned scheduler state to avoid overlapping expensive calls. When thermal pressure rises, reduce the concurrency level before downgrading the quality profile; degrading concurrency first preserves output consistency.
- Observe thermal state changes and adjust queue depth.
- Enforce one heavy request per actor lane for large models.
- Cancel stale tasks when new user intent supersedes old predictions.
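The points above can be sketched as a small actor that owns the admission state. This is a minimal illustration, not a production scheduler; the type name `InferenceScheduler` and the polling backoff are assumptions:

```swift
import Foundation

// Hypothetical actor-owned scheduler: concurrency degrades first under
// thermal pressure, before any quality-profile downgrade.
actor InferenceScheduler {
    private var maxConcurrent = 2
    private var running = 0

    // Shrink queue depth when the system reports thermal pressure.
    func updateThermalState(_ state: ProcessInfo.ThermalState) {
        maxConcurrent = (state == .serious || state == .critical) ? 1 : 2
    }

    // Admit work only while a lane is free; sleep briefly otherwise.
    func run<T: Sendable>(_ work: @Sendable () async throws -> T) async throws -> T {
        while running >= maxConcurrent {
            try await Task.sleep(nanoseconds: 10_000_000) // simple backoff
        }
        running += 1
        defer { running -= 1 }
        return try await work()
    }
}
```

A production version would replace the polling loop with a continuation queue and propagate `Task` cancellation into `work`, but the ownership shape, one actor guarding the admission counter, stays the same.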
Measure what matters
Track startup time, per-token (or per-prediction) latency, memory peaks, cancellation rates, and fallback frequency. Without these signals, teams optimize isolated benchmarks while user-facing tail latencies keep degrading.
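Tail tracking needs nothing exotic: record raw samples and compute percentiles on demand. A minimal sketch, with the `LatencyTracker` name and nearest-rank percentile method as assumptions:

```swift
// Hypothetical latency tracker; keeps raw samples so any percentile
// can be computed later (p50, p95, p99).
struct LatencyTracker {
    private var samplesMs: [Double] = []

    mutating func record(_ ms: Double) {
        samplesMs.append(ms)
    }

    // Nearest-rank percentile over recorded samples; nil until data exists.
    func percentile(_ p: Double) -> Double? {
        guard !samplesMs.isEmpty else { return nil }
        let sorted = samplesMs.sorted()
        let idx = Int((p / 100.0) * Double(sorted.count - 1))
        return sorted[idx]
    }
}
```

Keeping raw samples (or a sketch such as a t-digest for high volume) matters because averaging hides exactly the tail behavior the budgets in this chapter are written against.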