Discord has detailed how it rebuilt its machine learning platform after hitting the limits of single-GPU training. By standardising on Ray and Kubernetes, introducing a one-command cluster CLI, and automating workflows through Dagster and KubeRay, the company turned distributed training into a routine operation. The changes enabled daily retrains for large models and contributed to a 200% uplift in a key ads ranking metric. Similar engineering reports are emerging from companies such as Uber, Pinterest, and Spotify as bespoke models grow larger and retraining becomes more frequent.
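Discord's report does not include code, but the core workflow it describes, submitting a distributed PyTorch training job to a Ray cluster running on Kubernetes, can be sketched with Ray Train. The example below is a minimal, hypothetical illustration rather than Discord's implementation; the model, data, learning rate, and worker counts are placeholders.

```python
# Minimal sketch of a distributed training job on a Ray cluster
# (e.g. one provisioned on Kubernetes via KubeRay). Not Discord's code;
# the model, data, and scaling values are illustrative placeholders.
import torch
import torch.nn as nn
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model


def train_loop_per_worker(config):
    # Each Ray worker runs this loop; Ray sets up torch.distributed
    # and wraps the model in DistributedDataParallel.
    model = prepare_model(nn.Linear(128, 1))
    device = get_device()
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()

    for epoch in range(config["epochs"]):
        # Placeholder batch; a real job would stream actual training data.
        x = torch.randn(256, 128, device=device)
        y = torch.randn(256, 1, device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Report metrics back to the Ray Train driver.
        train.report({"epoch": epoch, "loss": loss.item()})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 3},
    # Scale out across GPU workers scheduled on the Ray/Kubernetes cluster.
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```

A job like this could then be launched on a schedule by an orchestrator such as Dagster, which is roughly the role the automation layer plays in the setup Discord describes.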
Meta's PyTorch team has unveiled Monarch, an open-source framework designed to simplify distributed AI workflows across multiple GPUs and machines. It replaces the traditional multi-controller approach, in which multiple copies of the same script run independently across machines, with a single-controller model: one script coordinates computation across the entire cluster, reducing the complexity of large-scale training and reinforcement learning tasks without changing how developers write standard PyTorch code.
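Monarch's own API is not shown here, so the sketch below only illustrates the multi-controller (SPMD) pattern it replaces: with torch.distributed, the same script is launched once per process (for example via torchrun), and each copy independently determines its rank and coordinates through collectives. Under a single-controller model, that per-rank choreography would instead be driven from one coordinating script.

```python
# Traditional multi-controller (SPMD) pattern that Monarch replaces:
# the SAME script is launched once per process, e.g. with
#   torchrun --nproc_per_node=8 spmd_example.py
# and every copy coordinates through torch.distributed collectives.
import torch
import torch.distributed as dist


def main():
    # Each copy of the script initialises its own view of the job from
    # environment variables set by the launcher (torchrun).
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Every rank computes its own shard of work...
    local = torch.ones(4) * (rank + 1)

    # ...and the ranks synchronise explicitly with a collective.
    dist.all_reduce(local, op=dist.ReduceOp.SUM)

    if rank == 0:
        print(f"world_size={world_size}, reduced={local.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```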