Many AI projects fail. The reason is often simple. Teams try to rebuild last decade's applications but add AI on top: A CRM system with AI. A chatbot with AI. A search engine with AI. The pattern is the same: "X, but now with AI." These projects usually look fine in a demo, but they rarely work in production. The problem is that AI doesn't just extend old systems. It changes what applications are and how they behave.
Titled "Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market", the paper opens by pointing out that model-mart Hugging Face lists over a million AI models, although customers mostly run just a few of them. Alibaba Cloud nonetheless offers many models but found it had to dedicate 17.7 percent of its GPU fleet to serving just 1.35 percent of customer requests.
In addition to expanding the capabilities of raw linear memories, support was added for a new form of storage that is managed by the Wasm runtime automatically. WasmGC is low-level; a compiler targeting Wasm can declare the memory layout of its runtime data structures in terms of struct and array types, plus unboxed tagged integers, whose allocation and lifetime are then handled by Wasm.
PagedAttention manages the KV cache the way an operating system manages virtual memory, and three analogies capture the idea. KV blocks are like pages: instead of storing each sequence's KV cache in one contiguous region, PagedAttention divides it into small, fixed-size KV blocks, each holding the keys and values for a set number of tokens. Tokens are like bytes: the individual tokens within a KV block correspond to the bytes within a page. Requests are like processes: each LLM request is managed like a process, with its "logical" KV blocks mapped to "physical" KV blocks in GPU memory.
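The logical-to-physical mapping can be sketched in a few lines of Python. This is a toy illustration of the bookkeeping only, not vLLM's actual implementation: the names `BlockAllocator`, `Request`, `append_token` and the block size of 4 tokens are all assumptions made for the example, and no real keys or values are stored.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (illustrative value, not from the text)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool, like a page-frame allocator."""

    def __init__(self, num_blocks: int) -> None:
        # Reversed so that pop() returns ids in the order 0, 1, 2, ...
        self.free_blocks = list(range(num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Request:
    """One LLM request; block_table[i] is the physical block backing logical block i."""

    allocator: BlockAllocator
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self) -> tuple[int, int]:
        """Reserve a slot for one more token and return (physical block, offset)."""
        logical_block, offset = divmod(self.num_tokens, BLOCK_SIZE)
        if logical_block == len(self.block_table):
            # The last block is full (or none exists yet): map a new physical block.
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
        return self.block_table[logical_block], offset

    def free(self) -> None:
        """Return every physical block to the pool, as when a process exits."""
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


if __name__ == "__main__":
    pool = BlockAllocator(num_blocks=8)
    a, b = Request(pool), Request(pool)
    for _ in range(5):
        a.append_token()
    for _ in range(3):
        b.append_token()
    for _ in range(4):
        a.append_token()
    print("request a block table:", a.block_table)
    print("request b block table:", b.block_table)
```

Because the two requests grow in an interleaved order, request `a` ends up with physical blocks that are not adjacent to each other, which is the point of the page analogy: a sequence's logical blocks stay in order while the physical blocks behind them can sit anywhere in the pool.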
The JVM has two primary functions: to allow Java programs to run on any device or operating system (known as the "write once, run anywhere" principle), and to manage and optimize program memory. When Java was released in 1995, most programs were written for a specific operating system, and program memory was typically managed by hand by the software developer. It's not hard to see why the JVM was a revelation in that era.
When you're writing your first Rust programs, the complexities of ownership and borrowing can be dizzying. If all you want to do is write a simple program that doesn't need to be performant, Rust's memory management might seem intrusive.