#pagedattention

from InfoWorld · 3 days ago

Unlocking LLM superpowers: How PagedAttention helps the memory maze

KV blocks are like pages: instead of requiring contiguous memory, PagedAttention divides the KV cache of each sequence into small, fixed-size KV blocks, each holding the keys and values for a set number of tokens. Tokens are like bytes: individual tokens within the KV cache are like the bytes within a page. Requests are like processes: each LLM request is managed like a process, with its "logical" KV blocks mapped to "physical" KV blocks in GPU memory.
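
The analogy maps onto a per-request block table. The Python sketch below is purely illustrative and is not vLLM's actual implementation; `BLOCK_SIZE`, `BlockManager`, and `Request` are made-up names used to show how logical KV blocks could be mapped onto physical blocks drawn from a shared pool.

```python
# Illustrative sketch only: names and block size are assumptions, not vLLM's API.

BLOCK_SIZE = 16  # tokens per fixed-size KV block (value chosen for illustration)


class BlockManager:
    """Hands out fixed-size physical KV blocks from a shared free pool."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("out of free KV cache blocks")
        return self.free_blocks.pop()

    def release(self, block_ids: list[int]) -> None:
        self.free_blocks.extend(block_ids)


class Request:
    """One LLM request: logical block i maps to physical block block_table[i]."""

    def __init__(self, manager: BlockManager):
        self.manager = manager
        self.block_table: list[int] = []  # logical -> physical mapping
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the current one fills up,
        # so memory grows on demand instead of being reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.manager.allocate())
        self.num_tokens += 1

    def free(self) -> None:
        self.manager.release(self.block_table)
        self.block_table = []


manager = BlockManager(num_physical_blocks=8)
req = Request(manager)
for _ in range(40):          # generate 40 tokens for this request
    req.append_token()
print(req.block_table)       # e.g. [7, 6, 5]: three blocks for 40 tokens, no contiguity required
req.free()                   # blocks return to the pool for other requests
```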

from HackerNoon · 3 months ago

Issues with PagedAttention: Kernel Rewrites and Complexity in LLM Serving | HackerNoon
