xAI's 100,000 H100 Colossus is glued together using Ethernet
Briefly

Unlike most AI training clusters, xAI's Colossus uses Nvidia's Spectrum-X Ethernet fabric rather than the industry-standard InfiniBand, an unusual networking choice at this scale.
Colossus delivers 98.9 exaFLOPS of dense FP16/BF16 compute, a figure that roughly doubles when model sparsity is exploited during training.
The system comprises 100,000 Nvidia Hopper GPUs, far exceeding the capacity of the United States' leading supercomputer, Frontier, and was deployed in just 122 days.
As of early 2024, roughly 90% of AI clusters still favored InfiniBand, which makes an Ethernet fabric like Spectrum-X at this scale a significant test of Ethernet's scalability.
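The 98.9 exaFLOPS figure can be sanity-checked with simple arithmetic, assuming Nvidia's published peak of 989 TFLOPS dense BF16 per H100 (doubling with 2:4 structured sparsity); the per-GPU numbers here are an assumption taken from the H100 datasheet, not from the article itself:

```python
# Back-of-envelope check of the quoted cluster throughput.
GPUS = 100_000
DENSE_BF16_TFLOPS = 989  # assumed per-GPU peak (H100 SXM, dense BF16)

dense_exaflops = GPUS * DENSE_BF16_TFLOPS / 1e6   # TFLOPS -> exaFLOPS
sparse_exaflops = dense_exaflops * 2              # 2:4 structured sparsity

print(f"dense:  {dense_exaflops:.1f} exaFLOPS")   # 98.9
print(f"sparse: {sparse_exaflops:.1f} exaFLOPS")  # 197.8
```

The dense figure matches the 98.9 exaFLOPS quoted above, and the sparsity-doubled number matches the "potential to double" claim.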
Read at The Register