PagedAttention: Memory Management in Existing Systems | HackerNoon
Briefly

In existing LLM serving systems, each request's KV cache is stored in contiguous memory that is statically pre-allocated for the request's maximum sequence length. This wastes memory through both internal and external fragmentation.
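Here is a minimal sketch of how external fragmentation arises under static, contiguous allocation. It assumes a pool measured in token slots and a first-fit policy. `ContiguousKVPool`, `allocate`, `free`, and `free_slots` are hypothetical names chosen for illustration and do not come from any real serving system's API.

```python
from dataclasses import dataclass, field


@dataclass
class ContiguousKVPool:
    """Hypothetical KV-cache pool: each request reserves one contiguous chunk of token slots."""
    capacity: int
    # request id -> (start slot, number of slots reserved)
    chunks: dict = field(default_factory=dict)

    def _free_runs(self):
        """Return (start, length) for every free gap between reserved chunks, in address order."""
        runs, cursor = [], 0
        for start, length in sorted(self.chunks.values()):
            if start > cursor:
                runs.append((cursor, start - cursor))
            cursor = start + length
        if cursor < self.capacity:
            runs.append((cursor, self.capacity - cursor))
        return runs

    def allocate(self, req_id, max_len):
        """First-fit: reserve max_len contiguous slots up front, or return False if no gap fits."""
        for start, length in self._free_runs():
            if length >= max_len:
                self.chunks[req_id] = (start, max_len)
                return True
        return False

    def free(self, req_id):
        """Release a finished request's whole chunk."""
        del self.chunks[req_id]

    def free_slots(self):
        """Total free slots, whether or not they are contiguous."""
        return sum(length for _, length in self._free_runs())


pool = ContiguousKVPool(capacity=10)
pool.allocate("A", 3)   # slots 0-2
pool.allocate("B", 4)   # slots 3-6
pool.allocate("C", 3)   # slots 7-9
pool.free("A")
pool.free("C")
print(pool.free_slots())        # 6 slots are free in total...
print(pool.allocate("D", 5))    # ...but no contiguous run of 5 exists
```

After `A` and `C` finish, six slots are free, but they are split into two gaps of three. A new request that needs five contiguous slots is therefore rejected. That is external fragmentation.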
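This approach wastes memory in two forms: slots reserved for tokens that have not yet been generated, and slots that are never used because the request finishes before reaching its maximum length. The output length is not known in advance, so the amount of unused space only becomes apparent after the request completes.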
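One way to make the distinction concrete is a small accounting function. It is a sketch under stated assumptions: slot counts include prompt tokens, `max_len` slots were pre-allocated up front, and `classify_slots` is a hypothetical helper, not part of any real system. Because it needs `final_len`, the breakdown can only be computed in hindsight, once the request has finished.

```python
def classify_slots(max_len: int, tokens_so_far: int, final_len: int) -> dict[str, int]:
    """Split one request's max_len pre-allocated KV slots at a given decoding step.

    final_len (the request's eventual length) is only known after the request
    completes, so this breakdown cannot be computed while it is still running.
    """
    if not 0 <= tokens_so_far <= final_len <= max_len:
        raise ValueError("expected 0 <= tokens_so_far <= final_len <= max_len")
    return {
        "used": tokens_so_far,                          # slots already holding KV entries
        "reserved": final_len - tokens_so_far,          # slots future tokens will fill
        "internal_fragmentation": max_len - final_len,  # slots that are never filled
    }


# Illustrative numbers: 2048 slots reserved, 100 tokens so far, request ends at 300.
print(classify_slots(2048, 100, 300))
```

With these illustrative numbers, only 100 of the 2048 slots hold data at this step. Another 200 are reserved for tokens still to come, and 1748 will never be used.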
Read at Hackernoon