Briefly

Alibaba reveals 82 percent GPU resource savings
"Titled "Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market", the paper [PDF] opens by pointing out that model-mart Hugging Face lists over a million AI models, although customers mostly run just a few of them. Alibaba Cloud nonetheless offers many models but found it had to dedicate 17.7 percent of its GPU fleet to serving just 1.35 percent of customer requests."
"The reason for that discrepancy is that service providers typically configure their GPUs to run only two or three models, which is all that GPUs can handle because they don't have enough memory to run more. That approach means that an outfit like Alibaba Cloud could have thousands of idle GPUs dedicated to seldom-used models. That's obviously untenable given the cost of GPUs and, for Alibaba, the difficulty of acquiring kit from Nvidia and AMD due to US sanctions."
Many of the available models are rarely used, yet providers dedicate GPUs to them anyway, wasting capacity: a GPU is typically configured to host only two or three models because its memory cannot hold more (a 70-billion-parameter model stored at 16 bits per weight needs roughly 140 GB for weights alone, more than an entire 80 GB accelerator). Aegaeon, Alibaba's answer, combines GPU pooling with memory management: pack more models onto each GPU and offload inactive weights into host memory or other storage. In beta testing it cut the GPUs required from 1,192 to 213, an 82 percent saving, and let some GPUs serve tens of models. That significantly boosts utilization, though it is operational optimization rather than a market-disrupting breakthrough.
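The offloading idea described above maps onto a familiar serving pattern: keep a few hot models resident in GPU memory and page cold ones out to host RAM on demand. Below is a minimal PyTorch sketch of that pattern; the ModelPool class, its max_resident parameter, the LRU eviction policy, and the toy nn.Linear stand-ins are all illustrative assumptions for this sketch, not Aegaeon's actual design.

```python
# Minimal sketch (assumptions: PyTorch, one accelerator, LRU eviction).
# Many models are registered, but only `max_resident` live in GPU memory;
# the rest are parked in host RAM and paged in on demand.
from collections import OrderedDict

import torch
import torch.nn as nn

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


class ModelPool:
    """Serve many models from one GPU by swapping weights on demand."""

    def __init__(self, max_resident: int = 3):
        self.max_resident = max_resident          # models kept on the GPU at once
        self.registry: dict[str, nn.Module] = {}  # every model, starting in host RAM
        self.resident: OrderedDict[str, nn.Module] = OrderedDict()  # LRU order, oldest first

    def register(self, name: str, model: nn.Module) -> None:
        self.registry[name] = model.to("cpu")     # park weights in host memory

    def _acquire(self, name: str) -> nn.Module:
        if name in self.resident:
            self.resident.move_to_end(name)       # hit: mark most recently used
            return self.resident[name]
        if len(self.resident) >= self.max_resident:
            _, coldest = self.resident.popitem(last=False)  # miss: evict LRU model
            coldest.to("cpu")                     # offload its weights back to host RAM
        model = self.registry[name].to(DEVICE)    # page requested weights onto the GPU
        self.resident[name] = model
        return model

    @torch.no_grad()
    def infer(self, name: str, x: torch.Tensor) -> torch.Tensor:
        return self._acquire(name)(x.to(DEVICE))


# Ten toy models share one device, but at most three are resident at once.
pool = ModelPool(max_resident=3)
for i in range(10):
    pool.register(f"model-{i}", nn.Linear(16, 16))

x = torch.randn(1, 16)
for i in range(10):
    y = pool.infer(f"model-{i}", x)
print(y.shape)  # torch.Size([1, 16])
```

The trade-off is the host-to-GPU transfer cost on a miss; for seldom-used models that latency is far cheaper than a dedicated idle GPU, which is the utilization argument the paper makes.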
Read at The Register