High-throughput serving of large language models requires efficient management of memory, particularly the key-value (KV) cache; fragmentation and redundant duplication of this memory limit the batch sizes a system can serve.
PagedAttention is a novel attention algorithm inspired by virtual memory and paging techniques in operating systems; it enables near-zero waste in KV cache memory and flexible sharing of the cache within and across requests.
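The core bookkeeping mirrors OS paging: the KV cache is carved into fixed-size blocks, and a per-request block table maps logical token positions to physical blocks, so memory is allocated only as tokens are generated and blocks can be shared via reference counting. The following is a minimal sketch of that idea; the class and method names (`BlockAllocator`, `SequenceKVCache`) are hypothetical illustrations, not vLLM's actual API.

```python
from typing import Dict, List


class BlockAllocator:
    """Hands out fixed-size physical KV cache blocks, analogous to page frames."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size          # tokens stored per block
        self.free_blocks: List[int] = list(range(num_blocks))
        self.ref_counts: Dict[int, int] = {}  # reference counts enable block sharing

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free_blocks.append(block)


class SequenceKVCache:
    """Per-request block table mapping logical block indices to physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: List[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the last one is full,
        # so at most one block per sequence is partially used (near-zero waste).
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
```

Because allocation happens one block at a time rather than reserving a maximum-length contiguous region per request, internal fragmentation is bounded by a single block per sequence, and the same physical block can back multiple sequences (for example, a shared prompt) until one of them needs to write to it.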
Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× at comparable latency, with the largest gains for longer sequences and larger models.
vLLM, our serving system built on PagedAttention, addresses these memory challenges and offers a competitive alternative to existing systems such as FasterTransformer and Orca.