Evaluating vLLM With Basic Sampling
vLLM outperforms other serving systems at higher request rates while maintaining low latency, thanks to its efficient memory management.

How vLLM Implements Decoding Algorithms
vLLM optimizes large language model serving through innovative memory management and GPU techniques.

PagedAttention: An Attention Algorithm Inspired By the Classical Virtual Memory in Operating Systems
PagedAttention optimizes memory usage in language model serving, significantly improving throughput while minimizing KV cache waste.

How Good Is PagedAttention at Memory Sharing?
Memory sharing in PagedAttention enhances efficiency in LLM serving, significantly reducing memory usage during sampling and decoding.

Our Method for Developing PagedAttention
PagedAttention optimizes memory usage in LLM serving by storing key-value cache blocks in non-contiguous memory.

Evaluating vLLM's Design Choices With Ablation Experiments
PagedAttention significantly enhances vLLM's performance despite adding some overhead, illustrating the trade-offs in optimizing GPU operations for large language models.
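Several of these summaries refer to the same core idea: a sequence's KV cache is split into fixed-size blocks whose physical locations need not be contiguous. The sketch below illustrates that block-table mapping in miniature; the names (BlockTable, BLOCK_SIZE, the free-block pool) are illustrative assumptions for this example, not vLLM's actual API or implementation.

```python
# Illustrative sketch (not vLLM code) of a block table: logical KV-cache
# blocks of a sequence are mapped to physical blocks drawn from a shared
# pool, so the physical layout ends up non-contiguous and interleaved
# across sequences.

BLOCK_SIZE = 16  # tokens whose keys/values share one physical block


class BlockTable:
    """Maps a sequence's logical block index -> physical block id."""

    def __init__(self, free_blocks: list[int]):
        self.free_blocks = free_blocks          # pool shared by all sequences
        self.logical_to_physical: list[int] = []

    def append_token(self, num_tokens_so_far: int) -> int:
        """Return the physical block holding the new token's KV entry,
        allocating a fresh block only when the last one is full."""
        if num_tokens_so_far % BLOCK_SIZE == 0:
            # last block is full (or the sequence is empty): take any free block
            self.logical_to_physical.append(self.free_blocks.pop())
        return self.logical_to_physical[-1]


# Two sequences draw from one shared pool, so their physical blocks are
# interleaved rather than pre-reserved as a single contiguous region.
pool = list(range(8))
seq_a, seq_b = BlockTable(pool), BlockTable(pool)
for t in range(20):
    seq_a.append_token(t)
    if t < 5:
        seq_b.append_token(t)
print(seq_a.logical_to_physical, seq_b.logical_to_physical)
```

Under this toy scheme, blocks are only allocated as tokens actually arrive, which is the behavior the summaries above credit with minimizing KV cache waste.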