How Perplexity optimized 1T parameter AI models for AWS EFA
Briefly

"AI search provider Perplexity's research wing has developed a new set of software optimizations that allows for trillion parameter or large models to run efficiently across older, cheaper hardware using a variety of existing network technologies, including Amazon's proprietary Elastic Fabric Adapter. These innovations, detailed in a paper published this week and released on GitHub for further scrutiny, present a novel approach to addressing one of the biggest challenges in serving large-scale mixture of experts models (MoE) at scale: memory and network latency."
"MoE models, like DeepSeek V3 and R1 or Moonshot AI's Kimi K2, are big, ranging from 671 billion to 1 trillion parameters. This means they're too large to run on eight-GPU systems using older H100 or H200 GPUs at scale. Sure, in some cases you might be able to fit the model weights, but you won't have enough memory left over for the key value caches (the model's short-term memory) required to serve it at any reasonable scale."
"The easy answer would be to deploy these models on Nvidia's GB200 or GB300 NVL72 rack systems, which essentially function as one great big server with 72 192GB or 288GB of GPUs on board, more than enough for even larger multi-trillion parameter LLMs. Unfortunately, these systems are expensive, in extremely high demand, and may not be available in every geography - cough, cough, China."
Software optimizations reduce memory pressure and network latency, enabling trillion-parameter mixture-of-experts (MoE) models to run across older, lower-cost hardware and existing networking stacks. MoE models commonly span hundreds of billions to a trillion parameters and require significant key-value cache memory for serving, which eight-GPU H100/H200 systems often cannot accommodate at scale. Large NVL72 rack systems can host such models but are costly and scarce. Distributing models across many smaller nodes on older hardware introduces steep performance penalties because sparse MoE routing sends tokens to subsets of experts, increasing network demand and latency.
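The routing-induced network traffic can be sketched with a toy model. The expert counts, top-k value, and placement below are hypothetical (a stand-in gating function replaces the learned router) and are not Perplexity's implementation; the point is only that each decoding step fans tokens out to experts on several nodes, producing small, latency-sensitive all-to-all transfers over the fabric.

```python
# Toy sketch of top-k MoE routing and the cross-node traffic it creates.
# All constants are hypothetical; a real router uses a learned gating network.
import random
from collections import defaultdict

NUM_EXPERTS = 256      # experts spread across the cluster
EXPERTS_PER_NODE = 32  # hypothetical placement: 8 nodes, 32 experts each
TOP_K = 8              # each token is dispatched to its top-k experts

def route(num_tokens: int) -> dict[int, int]:
    """Count how many tokens each node must receive over the network."""
    per_node = defaultdict(int)
    for _ in range(num_tokens):
        experts = random.sample(range(NUM_EXPERTS), TOP_K)  # stand-in gate
        for node in set(e // EXPERTS_PER_NODE for e in experts):
            per_node[node] += 1
    return per_node

# With top-k = 8 over 8 nodes, most tokens touch several nodes per step,
# so decode latency is dominated by many small network exchanges.
print(sorted(route(1024).items()))
```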
Read at The Register