
"KV blocks are like pages. Instead of contiguous memory, PagedAttention divides the KV cache of each sequence into small, fixed-size KV blocks. Each block holds the keys and values for a set number of tokens. Tokens are like bytes. Individual tokens within the KV cache are like the bytes within a page. Requests are like processes. Each LLM request is managed like a process, with its "logical" KV blocks mapped to "physical" KV blocks in GPU memory."
"Since KV blocks are not required to be contiguous in physical memory, PagedAttention can dynamically allocate blocks on demand. This virtually eliminates internal fragmentation because memory is only allocated when needed, and external fragmentation is removed because all blocks are the same size. PagedAttention enables sharing of KV blocks between different sequences, even across different requests. For example, in parallel sampling or beam search, multiple outputs can share the initial prompt's KV cache, saving significant memory."
"It even uses a copy-on-write mechanism (another OS concept) for blocks that need to be modified by different sequences, ensuring efficient sharing without unnecessary duplication. Built on top of PagedAttention, vLLM is an LLM serving system designed for high throughput. It uses block-level memory management and a sophisticated scheduler that works hand-in-hand with PagedAttention."
PagedAttention splits each sequence's KV cache into small, fixed-size KV blocks, treating tokens like bytes and requests like processes, each with a logical-to-physical block mapping in GPU memory. Blocks are allocated noncontiguously on demand, which confines internal fragmentation to a sequence's last block and, because blocks are uniform in size, eliminates external fragmentation. KV blocks can be shared across sequences and requests, letting parallel sampling and beam search reuse a prompt's KV cache, while a copy-on-write mechanism handles modifications to shared blocks without unnecessary duplication. vLLM builds on this block-level memory management, pairing it with an integrated scheduler to deliver high-throughput LLM serving.
Read at InfoWorld