
"This challenge is sparking innovations in the inference stack. That's where Dynamo comes in. Dynamo is an open-source framework for distributed inference. It manages execution across GPUs and nodes. It breaks inference into phases, like prefill and decode. It also separates memory-bound and compute-bound tasks. Plus, it dynamically manages GPU resources to boost usage and keep latency low. Dynamo allows infrastructure teams to scale inference capacity responsively, handling demand spikes without permanently overprovisioning expensive GPU resources."
"The recent report details how the authors used Dynamo on a Kubernetes cluster (AKS). This setup runs on special rack-scale VM instances, the ND GB200-v6, featuring 72 tightly integrated NVIDIA Blackwell GPUs. They used this setup to run the open-source 120B-parameter model GPT-OSS 120B, using a tested 'InferenceMAX' recipe. This setup achieved 1.2 million tokens per second, showing that Dynamo can handle enterprise-level inference tasks on regular clusters."
Serving large language models (LLMs) at scale exceeds the capacity of a single GPU or even a single node, forcing multi-node distributed GPU deployments for 70B+ and 120B+ parameter models and long-context pipelines. Dynamo is an open-source framework that manages distributed inference across GPUs and nodes by breaking execution into phases such as prefill and decode, separating memory-bound from compute-bound tasks, and dynamically managing GPU resources to increase utilization and reduce latency. Dynamo integrates with multiple inference engines, including TensorRT-LLM, vLLM, and SGLang. Azure and NVIDIA demonstrated Dynamo on Kubernetes running GPT-OSS 120B on ND GB200-v6 instances, achieving 1.2 million tokens per second.
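The engine integrations mentioned above (TensorRT-LLM, vLLM, SGLang) imply an engine-agnostic serving layer. The sketch below is a hedged illustration of that design idea in Python, not Dynamo's real interface; EngineBackend and EchoBackend are hypothetical names.

# Illustrative sketch of an engine-agnostic serving interface; these names are
# hypothetical and are not part of Dynamo, vLLM, TensorRT-LLM, or SGLang.
from typing import Protocol

class EngineBackend(Protocol):
    name: str
    def generate(self, prompt: str, max_new_tokens: int) -> str: ...

class EchoBackend:
    """Toy stand-in for a real engine such as vLLM, TensorRT-LLM, or SGLang."""
    name = "echo"
    def generate(self, prompt: str, max_new_tokens: int) -> str:
        return (prompt + " ")[:max_new_tokens]

def serve(backend: EngineBackend, prompt: str) -> str:
    # Scheduling/routing code depends only on the protocol, so backends can be
    # swapped without touching the distributed-serving logic.
    return backend.generate(prompt, max_new_tokens=32)

print(serve(EchoBackend(), "Hello, Dynamo"))

Programming the scheduling and routing logic against a small interface like this is what lets a single serving framework swap inference engines without changes to the distributed layer.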
Read at InfoQ