The throughput of LLM serving systems is bottlenecked by GPU memory, most of which is consumed by the KV caches of in-flight requests; as the number of concurrent requests grows, so does the aggregate KV cache footprint. For instance, the KV cache of a single token in the 13B-parameter OPT model occupies 800 KB, and since OPT can generate sequences of up to 2048 tokens, a single request may require up to 1.6 GB of memory for its KV cache.
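The arithmetic behind these figures follows directly from the model configuration. A minimal sketch, assuming OPT-13B's published shape (40 layers, hidden size 5120) and FP16 storage; the function name and constants are illustrative, not part of any serving system's API:

```python
# Back-of-the-envelope KV cache sizing for OPT-13B (40 layers,
# hidden size 5120, FP16). A sketch of the arithmetic only.

NUM_LAYERS = 40        # OPT-13B
HIDDEN_SIZE = 5120     # OPT-13B
BYTES_PER_VALUE = 2    # FP16

def kv_cache_bytes_per_token(num_layers: int = NUM_LAYERS,
                             hidden_size: int = HIDDEN_SIZE,
                             bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Each layer stores one key vector and one value vector per token,
    # hence the factor of 2.
    return 2 * num_layers * hidden_size * bytes_per_value

per_token = kv_cache_bytes_per_token()
print(f"per token:   {per_token / 1024:.0f} KB")                    # 800 KB
max_seq_len = 2048  # OPT's maximum sequence length
print(f"per request: {per_token * max_seq_len / 1024**3:.1f} GB")   # 1.6 GB
```

Multiplying the 800 KB per-token cost by the 2048-token maximum sequence length yields the 1.6 GB worst-case footprint of a single request.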