Titled "Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market", the paper [PDF] opens by noting that model-mart Hugging Face lists over a million AI models, even though customers mostly run just a handful of them. Alibaba Cloud nonetheless offers a broad catalog of models, yet found it had to dedicate 17.7 percent of its GPU fleet to serving just 1.35 percent of customer requests.
On Thursday, the AI platform Clarifai announced a new reasoning engine that it claims will make running AI models twice as fast and 40 percent less expensive. Designed to be adaptable to a variety of models and cloud hosts, the system employs a range of optimizations to get more inference power out of the same hardware. "It's a variety of different types of optimizations, all the way down to CUDA kernels to advanced speculative decoding techniques," said CEO Matthew Zeiler. "You can get more out of the same cards, basically." The results were verified in a string of benchmark tests by the third-party firm Artificial Analysis, which recorded industry-best figures for both throughput and latency.
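For readers unfamiliar with the speculative decoding Zeiler mentions, here is a minimal, hypothetical sketch of the idea: a cheap "draft" model proposes several tokens ahead, and the expensive "target" model verifies them together, accepting the longest agreeing prefix. The two toy models below (next token computed by simple arithmetic) are stand-ins invented for illustration, not Clarifai's actual system.

```python
def target_next(tokens):
    # Toy "large" model: next token is the sum of the last two, mod 100.
    return (tokens[-1] + tokens[-2]) % 100

def draft_next(tokens):
    # Toy "small" model: usually agrees with the target, but drifts
    # whenever the last token is divisible by 7.
    guess = (tokens[-1] + tokens[-2]) % 100
    return (guess + 1) % 100 if tokens[-1] % 7 == 0 else guess

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens, verifying k draft tokens per target pass."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) < len(prompt) + n_new:
        # 1. Draft model speculates k tokens cheaply.
        draft = list(tokens)
        for _ in range(k):
            draft.append(draft_next(draft))
        # 2. Target model checks the whole window in one "pass"
        #    (sequential calls here; on a GPU this check is batched).
        target_passes += 1
        accepted = list(tokens)
        for tok in draft[len(tokens):]:
            if tok == target_next(accepted):
                accepted.append(tok)  # draft token agrees: accept it
            else:
                accepted.append(target_next(accepted))  # correct and stop
                break
        else:
            accepted.append(target_next(accepted))  # all agreed: bonus token
        tokens = accepted[:len(prompt) + n_new]
    return tokens, target_passes

out, passes = speculative_decode([1, 2], n_new=12, k=4)
print(out)     # identical to plain greedy decoding with the target model
print(passes)  # fewer verification passes than 12 sequential target steps
```

With greedy verification like this, the output is provably identical to decoding with the target model alone; the speedup comes from replacing many sequential target-model steps with fewer, wider verification passes.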