#distributed-training

from InfoQ
1 day ago

How Discord Scaled Its ML Platform from Single-GPU Workflows to a Shared Ray Cluster

Discord has detailed how it rebuilt its machine learning platform after hitting the limits of single-GPU training. By standardising on Ray and Kubernetes, introducing a one-command cluster CLI, and automating workflows through Dagster and KubeRay, the company turned distributed training into a routine operation. The changes enabled daily retrains for large models and contributed to a 200% uplift in a key ads ranking metric. Similar engineering reports are emerging from companies such as Uber, Pinterest, and Spotify as bespoke models grow larger and are retrained more often.
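
The report centres on making a multi-GPU job as easy to launch as a local run. As a rough illustration of that workflow, and not Discord's actual code, the sketch below uses Ray Train's public TorchTrainer API to run a standard PyTorch training loop across several GPU workers; the model, data, hyperparameters, and worker count are placeholder assumptions.

```python
import torch
import torch.nn as nn
import ray.train.torch as ray_torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Each Ray worker executes this loop; Ray sets up torch.distributed underneath.
    device = ray_torch.get_device()
    model = ray_torch.prepare_model(nn.Linear(128, 1))  # placeholder model, wrapped for DDP
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()

    for _ in range(config["epochs"]):
        # Placeholder synthetic batch; a real job would iterate a prepared DataLoader.
        x = torch.randn(256, 128, device=device)
        y = torch.randn(256, 1, device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 3},
    # On a KubeRay-managed cluster, these workers land on GPU nodes scheduled by Kubernetes.
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```

Wrapping the submission of a script like this against a shared cluster is the kind of step a one-command CLI can hide, which is the ergonomic gap the article describes closing.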
Artificial intelligence
from InfoQ
1 month ago

PyTorch Monarch Simplifies Distributed AI Workflows with a Single-Controller Model

Meta's PyTorch team has unveiled Monarch, an open-source framework designed to simplify distributed AI workflows across multiple GPUs and machines. It introduces a single-controller model in which one script coordinates computation across an entire cluster, replacing the traditional multi-controller approach where multiple copies of the same script run independently on each machine. The change reduces the complexity of large-scale training and reinforcement learning tasks without changing how developers write standard PyTorch code.
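
The summary does not show Monarch's own API, so the sketch below is not Monarch code; it only illustrates the single-controller pattern using Ray's actor API, where one driver script creates remote workers and drives them step by step, in contrast to the SPMD style in which every machine runs its own copy of the script. The Worker class, worker count, and returned values are illustrative placeholders.

```python
import ray

ray.init()  # the single controller: this driver script owns the whole run


@ray.remote  # in a real job you would request GPUs, e.g. @ray.remote(num_gpus=1)
class Worker:
    """One long-lived worker process; under a multi-controller design each of
    these would instead be an independent copy of the full training script."""

    def __init__(self, rank: int):
        self.rank = rank

    def train_step(self, step: int) -> float:
        # Placeholder for a real forward/backward pass on this worker.
        return float(self.rank * 100 + step)


# The driver creates every worker and decides what runs where on each step.
workers = [Worker.remote(rank) for rank in range(4)]

for step in range(3):
    losses = ray.get([w.train_step.remote(step) for w in workers])
    print(f"step {step}: {losses}")
```

Keeping orchestration in one script makes control flow such as checkpointing, retries, or alternating generation and training in RL look like ordinary Python on the driver, which is the kind of complexity reduction the single-controller model is aimed at.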
Artificial intelligence
from WIRED
1 month ago

This Startup Wants to Spark a US DeepSeek Moment

Distributed reinforcement learning enables decentralized training of competitive open-source LLMs across diverse global hardware without reliance on major tech companies.