Titled "Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market", the paper [PDF] opens by noting that model-mart Hugging Face lists over a million AI models, even though customers mostly run just a handful of them. Alibaba Cloud nonetheless offers a broad catalog of models, yet found it had to dedicate 17.7 percent of its GPU fleet to serving just 1.35 percent of customer requests.
On Thursday, the AI platform Clarifai announced a new reasoning engine that it claims will make running AI models twice as fast and 40 percent less expensive. Designed to be adaptable to a variety of models and cloud hosts, the system employs a range of optimizations to get more inference power out of the same hardware. "It's a variety of different types of optimizations, all the way down to CUDA kernels to advanced speculative decoding techniques," said CEO Matthew Zeiler. "You can get more out of the same cards, basically." The results were verified in a string of benchmark tests by the third-party firm Artificial Analysis, which recorded industry-best figures for both throughput and latency.
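For readers unfamiliar with the speculative decoding Zeiler mentions, here is a minimal, hypothetical sketch of the idea: a cheap "draft" model proposes several tokens ahead, and the expensive "target" model verifies them together, accepting the longest agreeing prefix. The two toy models below (next token computed by simple arithmetic) are stand-ins invented for illustration, not Clarifai's actual system.

```python
def target_next(tokens):
    # Toy "large" model: next token is the sum of the last two, mod 100.
    return (tokens[-1] + tokens[-2]) % 100

def draft_next(tokens):
    # Toy "small" model: usually agrees with the target, but drifts
    # whenever the last token is divisible by 7.
    guess = (tokens[-1] + tokens[-2]) % 100
    return (guess + 1) % 100 if tokens[-1] % 7 == 0 else guess

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens, verifying k draft tokens per target pass."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) < len(prompt) + n_new:
        # 1. Draft model speculates k tokens cheaply.
        draft = list(tokens)
        for _ in range(k):
            draft.append(draft_next(draft))
        # 2. Target model checks the whole window in one "pass"
        #    (sequential calls here; on a GPU this check is batched).
        target_passes += 1
        accepted = list(tokens)
        for tok in draft[len(tokens):]:
            if tok == target_next(accepted):
                accepted.append(tok)  # draft token agrees: accept it
            else:
                accepted.append(target_next(accepted))  # correct and stop
                break
        else:
            accepted.append(target_next(accepted))  # all agreed: bonus token
        tokens = accepted[:len(prompt) + n_new]
    return tokens, target_passes

out, passes = speculative_decode([1, 2], n_new=12, k=4)
print(out)     # identical to plain greedy decoding with the target model
print(passes)  # fewer verification passes than 12 sequential target steps
```

With greedy verification like this, the output is provably identical to decoding with the target model alone; the speedup comes from replacing many sequential target-model steps with fewer, wider verification passes.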