
"The move to building and training AI models at scale has had interesting second-order effects; one of the most important is improving how we run and manage massively distributed computing applications. AI training and inferencing both require huge distributed applications to build and refine the models that are at the heart of modern machine learning systems."
"As Brendan Burns, Microsoft CVP Azure OSS Cloud Native, notes in a recent Azure blog post, "Scaling from a laptop experiment to a production-grade workload still feels like reinventing the wheel." Understanding how to break down and orchestrate these workloads takes time and requires significant work in configuring and deploying the resulting systems, even when building on top of existing platforms like Kubernetes."
"Design decisions made in the early days of cloud-native development focused on orchestrating large amounts of data rather than managing significant amounts of compute, both CPU and GPU. Our tools make it hard to orchestrate these new workloads that need modern batch computing techniques. Burns' blog post announced a new Azure partnership to help resolve these issues, working with Anyscale to use a managed version of Ray, its open source, Python-based tool, on Azure Kubernetes Service."
Microsoft has partnered with Anyscale to provide a managed Ray runtime on Azure Kubernetes Service for building, training, and running PyTorch-based ML models at scale. Ray offers scheduling for both CPU and GPU workloads, along with native Python libraries that let existing code run in a distributed fashion with minimal changes. The managed offering reduces infrastructure management overhead and brings batch computing capabilities to cloud-native platforms that were originally optimized for data orchestration rather than heavy compute. The collaboration aims to simplify deploying, orchestrating, and scaling large distributed AI training and inference workloads on Kubernetes.
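To illustrate why "minimal changes" is plausible, here is a minimal sketch using Ray's core task API; the function names and resource requests are illustrative assumptions, not taken from the article or the Azure offering.

```python
import ray

# With no address argument, ray.init() starts a local cluster; pointed
# at a running cluster (for example, one managed on AKS), the same code
# runs distributed without modification.
ray.init()

# A decorator turns an ordinary Python function into a Ray task that
# the scheduler can run in parallel across the cluster's CPUs.
@ray.remote
def square(x: int) -> int:
    return x * x

# Resource requests are declarative: this hypothetical variant asks the
# scheduler to place each call on a node with a free GPU.
@ray.remote(num_gpus=1)
def square_on_gpu(x: int) -> int:
    # A real workload would do the computation on the GPU here, e.g.
    # with PyTorch tensors; Ray handles placement and scheduling.
    return x * x

# .remote() returns futures immediately; ray.get() blocks for results.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The same decorator-driven pattern carries over to Ray's higher-level training and serving libraries, which is what keeps the cost of porting existing Python code comparatively low.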
Read at InfoWorld