Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Overall Results | HackerNoon
Briefly

Apparate significantly lowers latency compared to vanilla model serving, achieving median speedups ranging from 40.5% to 91.5% across CV workloads while maintaining a strict 1% accuracy constraint.
Apparate's benefits grow with model size: latency savings are most notable for larger models such as GPT-2 and BERT-large.