#vllm

Artificial intelligence
from InfoWorld
4 days ago

Evolving Kubernetes for generative AI inference

Kubernetes now includes native AI inference features: vLLM support, inference benchmarking, LLM-aware routing, inference gateway extensions, and accelerator scheduling.
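The "LLM-aware routing" idea is easier to picture with a toy model. The sketch below is illustrative only, not the Kubernetes inference gateway's actual API; Replica, score, and route are assumed names. The point it shows: instead of round-robin, pick the vLLM replica using inference-specific signals such as queue depth and KV-cache utilization.

```python
# A toy sketch (not the actual Kubernetes inference gateway) of
# LLM-aware routing: score each vLLM replica by signals that
# predict latency for a new request. All names are illustrative.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queue_depth: int      # requests waiting to be scheduled
    kv_cache_util: float  # fraction of KV-cache blocks in use, 0.0-1.0

def score(r: Replica) -> float:
    # Lower is better: penalize long queues and nearly full caches,
    # since both predict higher latency for an incoming request.
    return r.queue_depth + 10.0 * r.kv_cache_util

def route(replicas: list[Replica]) -> Replica:
    # Send the request to the least-loaded replica by this score.
    return min(replicas, key=score)

replicas = [
    Replica("vllm-0", queue_depth=4, kv_cache_util=0.9),
    Replica("vllm-1", queue_depth=6, kv_cache_util=0.2),
    Replica("vllm-2", queue_depth=1, kv_cache_util=0.5),
]
print(route(replicas).name)  # vllm-2: short queue, moderate cache pressure
```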
from HackerNoon
2 months ago

KV-Cache Fragmentation in LLM Serving & PagedAttention Solution

Pre-reserving contiguous KV-cache memory wastes space even when context lengths are known in advance, exposing the internal fragmentation of current allocation strategies in production serving systems; PagedAttention addresses this by allocating the cache in fixed-size blocks on demand.
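The fragmentation point can be made concrete with a toy model. Below is a minimal Python sketch of the block-table idea behind PagedAttention; it is not vLLM's actual implementation, and BlockAllocator, Sequence, and BLOCK_SIZE are assumed names. Sequences map onto fixed-size blocks on demand, so waste is bounded by one partially filled block per sequence instead of the full pre-reserved context.

```python
# Toy sketch (not vLLM's real API) contrasting contiguous pre-reservation
# with block-based KV-cache allocation in the spirit of PagedAttention.

BLOCK_SIZE = 16  # tokens per KV-cache block; vLLM uses a similar fixed size

class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared free pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV-cache pool exhausted")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """Maps a growing token sequence onto non-contiguous blocks (a block table)."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a new block only when the last one is full, so waste is
        # bounded by one partially filled block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

# Contiguous pre-reservation for a 2048-token max context wastes
# 2048 - 100 = 1948 slots on a 100-token sequence; block-based
# allocation wastes at most BLOCK_SIZE - 1 slots.
alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(100):
    seq.append_token()
print(len(seq.block_table), "blocks used")  # 7 blocks = 112 slots for 100 tokens
```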