xAI's 100,000 H100 Colossus is glued together using Ethernet
Briefly

Unlike most AI training clusters, xAI's Colossus uses Nvidia's Spectrum-X Ethernet fabric rather than the industry-standard InfiniBand, an unusual networking choice at this scale.
Colossus delivers 98.9 exaFLOPS of dense FP16/BF16 compute, a figure that roughly doubles when model sparsity is exploited during training.
The system comprises 100,000 Nvidia Hopper GPUs, far exceeding the capacity of the United States' leading supercomputer, Frontier, and was deployed in just 122 days.
As of early 2024, roughly 90% of AI clusters still favored InfiniBand, which makes an Ethernet fabric like Spectrum-X at this scale a significant test of Ethernet's scalability.
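The 98.9 exaFLOPS figure can be sanity-checked with simple arithmetic, assuming Nvidia's published peak of 989 TFLOPS dense BF16 per H100 (doubling with 2:4 structured sparsity); the per-GPU numbers here are an assumption taken from the H100 datasheet, not from the article itself:

```python
# Back-of-envelope check of the quoted cluster throughput.
GPUS = 100_000
DENSE_BF16_TFLOPS = 989  # assumed per-GPU peak (H100 SXM, dense BF16)

dense_exaflops = GPUS * DENSE_BF16_TFLOPS / 1e6   # TFLOPS -> exaFLOPS
sparse_exaflops = dense_exaflops * 2              # 2:4 structured sparsity

print(f"dense:  {dense_exaflops:.1f} exaFLOPS")   # 98.9
print(f"sparse: {sparse_exaflops:.1f} exaFLOPS")  # 197.8
```

The dense figure matches the 98.9 exaFLOPS quoted above, and the sparsity-doubled number matches the "potential to double" claim.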
Read at The Register