Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Abstract and Introduction | HackerNoon
Briefly

Apparate is a system that automatically applies and manages early exits (EEs) in ML models, allowing certain inputs to exit with results at intermediate layers.
By repurposing exits to provide continual feedback, Apparate enables several novel runtime monitoring and adaptation strategies that are crucial for optimizing ML inference.
Our evaluation shows that Apparate can lower median response latencies by 40.5-91.5% for computer vision workloads and 10.0-24.2% for natural language processing tasks.
Despite these latency reductions, Apparate maintains throughput and adheres to strict accuracy constraints, easing the fundamental tension between latency and throughput that ML inference platforms face.
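The early exits described above can be illustrated with a minimal sketch of a generic confidence-threshold exit rule. It assumes that each exit head produces class logits and that an input leaves at the first exit whose top softmax probability reaches a fixed threshold. This is not Apparate's implementation. The function, parameter, and toy-model names (`early_exit_infer`, `exit_heads`, `threshold`, `TOY_LAYERS`) are hypothetical. The summary describes Apparate as monitoring and adapting exits at runtime, and this fixed-threshold sketch does not attempt that.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Logits = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_infer(
    x: float,
    layers: Sequence[Callable[[float], float]],
    exit_heads: Dict[int, Callable[[float], Logits]],
    final_head: Callable[[float], Logits],
    threshold: float,
) -> Tuple[int, int]:
    """Hypothetical generic early-exit loop (not Apparate's code).

    Returns (predicted_class, exit_point), where exit_point is the index of
    the layer whose exit head fired, or len(layers) if no exit fired.
    """
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        head = exit_heads.get(i)
        if head is None:
            continue  # no exit attached after this layer
        probs = softmax(head(h))
        confidence = max(probs)
        if confidence >= threshold:
            # Confident enough: return now and skip all remaining layers.
            return probs.index(confidence), i
    # No exit fired; fall back to the model's original output head.
    probs = softmax(final_head(h))
    return probs.index(max(probs)), len(layers)


# Hypothetical toy "model": scalar hidden state, two-class logits [h, 0].
TOY_LAYERS = [lambda h: h + 1.0, lambda h: h * 3.0, lambda h: h + 0.5]
TOY_EXITS = {0: lambda h: [h, 0.0], 1: lambda h: [h, 0.0]}


def TOY_FINAL(h: float) -> Logits:
    return [h, 0.0]
```

With `threshold=0.9`, an input of `3.0` exits after the first layer, `0.1` exits after the second, and `-0.9` runs through all three layers to the final head. Raising the threshold sends more inputs through the full model, which trades latency savings for closer agreement with the original model's outputs.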