High-throughput serving of large language models requires efficient management of memory, particularly the key-value (KV) cache; fragmentation and redundant duplication of this memory limit the batch sizes a system can serve.
PagedAttention is a novel attention algorithm inspired by virtual memory and paging techniques in operating systems; it enables near-zero waste in KV cache memory and flexible sharing of the cache within and across requests.
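The core bookkeeping mirrors OS paging: the KV cache is carved into fixed-size blocks, and a per-request block table maps logical token positions to physical blocks, so memory is allocated only as tokens are generated and blocks can be shared via reference counting. The following is a minimal sketch of that idea; the class and method names (`BlockAllocator`, `SequenceKVCache`) are hypothetical illustrations, not vLLM's actual API.

```python
from typing import Dict, List


class BlockAllocator:
    """Hands out fixed-size physical KV cache blocks, analogous to page frames."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size          # tokens stored per block
        self.free_blocks: List[int] = list(range(num_blocks))
        self.ref_counts: Dict[int, int] = {}  # reference counts enable block sharing

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free_blocks.append(block)


class SequenceKVCache:
    """Per-request block table mapping logical block indices to physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: List[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the last one is full,
        # so at most one block per sequence is partially used (near-zero waste).
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
```

Because allocation happens one block at a time rather than reserving a maximum-length contiguous region per request, internal fragmentation is bounded by a single block per sequence, and the same physical block can back multiple sequences (for example, a shared prompt) until one of them needs to write to it.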
Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× at comparable latency, with the largest gains for longer sequences and larger models.
vLLM, our serving system built on PagedAttention, addresses these memory challenges and offers a competitive alternative to existing systems such as FasterTransformer and Orca.