Introduction to NVIDIA Dynamo
NVIDIA has launched Dynamo, open-source inference software designed to accelerate and scale reasoning models within AI factories. Efficiently managing and coordinating AI inference requests across a fleet of GPUs is crucial for AI factories to operate cost-effectively and maximize token revenue.
The Importance of AI Inference
As AI reasoning becomes increasingly prevalent, each AI model is expected to generate tens of thousands of tokens per prompt to represent its "thinking" process. Enhancing inference performance while reducing its cost is crucial for accelerating growth and boosting revenue opportunities for service providers.
A New Generation of AI Inference Software
NVIDIA Dynamo represents a new generation of AI inference software specifically engineered to maximize token revenue for AI factories deploying reasoning AI models. Dynamo orchestrates and accelerates inference communication across potentially thousands of GPUs, employing disaggregated serving, a technique that separates the context-processing (prefill) and token-generation (decode) phases of large language models (LLMs) onto distinct GPUs.
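The split can be pictured with a minimal Python sketch. Everything here is illustrative: the worker classes, the toy KV-cache object, and the dummy token arithmetic are invented for this example and are not Dynamo's actual APIs.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    # Stand-in for the attention key/value tensors produced during prefill.
    tokens: list[int]

class PrefillWorker:
    """Runs on GPUs sized for the compute-heavy prompt-processing phase."""
    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        # A real system would run the model's forward pass over the prompt
        # and capture per-layer key/value tensors; here we just copy tokens.
        return KVCache(tokens=list(prompt_tokens))

class DecodeWorker:
    """Runs on GPUs sized for the memory-bandwidth-bound generation phase."""
    def decode(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        output = []
        for _ in range(max_new_tokens):
            # A real system would sample the next token from the model,
            # extending the transferred KV cache one step at a time.
            next_token = (cache.tokens[-1] + 1) if cache.tokens else 0
            cache.tokens.append(next_token)
            output.append(next_token)
        return output

prompt = [101, 2054, 2003]                 # toy token IDs
cache = PrefillWorker().prefill(prompt)    # phase 1: prefill GPU
print(DecodeWorker().decode(cache, 4))     # phase 2: decode GPU
```

The motivation for the split is that prefill is compute-bound while decode is memory-bandwidth-bound, so each phase can run on hardware and parallelism settings tuned for it rather than compromising on one configuration.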
Key Features of NVIDIA Dynamo
Dynamo can dynamically add, remove, and reallocate GPUs in real time to adapt to fluctuating request volumes and types. It can also pinpoint the specific GPUs within large clusters that are best suited to serve each query with minimal recomputation, and route requests accordingly. Additionally, Dynamo can offload inference data to more cost-effective memory and storage devices and retrieve it rapidly when required, minimizing overall inference cost.
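As a hedged sketch of the offloading idea, the following tiered store keeps recent KV entries in a fast tier and spills the least recently used to a cheaper one. The class, the two tiers, and the LRU policy are assumptions made for illustration, not Dynamo's Memory Manager interface.

```python
from collections import OrderedDict

class TieredKVStore:
    """Keeps hot KV-cache entries in a fast tier (standing in for GPU
    memory) and evicts the least recently used to a cheaper tier
    (standing in for host RAM or SSD)."""

    def __init__(self, fast_capacity: int):
        self.fast = OrderedDict()   # hypothetical GPU-resident tier
        self.slow = {}              # hypothetical host/storage tier
        self.fast_capacity = fast_capacity

    def put(self, request_id: str, kv_blob: bytes) -> None:
        self.fast[request_id] = kv_blob
        self.fast.move_to_end(request_id)
        while len(self.fast) > self.fast_capacity:
            victim, blob = self.fast.popitem(last=False)  # LRU eviction
            self.slow[victim] = blob                      # offload, not discard

    def get(self, request_id: str) -> bytes | None:
        if request_id in self.fast:
            self.fast.move_to_end(request_id)             # refresh recency
            return self.fast[request_id]
        if request_id in self.slow:
            self.put(request_id, self.slow.pop(request_id))  # reload on demand
            return self.fast[request_id]
        return None

store = TieredKVStore(fast_capacity=2)
store.put("req-a", b"kv-a"); store.put("req-b", b"kv-b"); store.put("req-c", b"kv-c")
print(sorted(store.slow))   # ['req-a']: the oldest entry was offloaded
print(store.get("req-a"))   # b'kv-a': transparently reloaded into the fast tier
```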
Benefits of NVIDIA Dynamo
Using the same number of GPUs, Dynamo has demonstrated the ability to double the performance and revenue of AI factories serving Llama models on NVIDIA's current Hopper platform. Furthermore, when running the DeepSeek-R1 model on a large cluster of GB200 NVL72 racks, NVIDIA Dynamo's intelligent inference optimizations have been shown to boost the number of tokens generated by over 30 times per GPU.
Open-Source and Compatibility
NVIDIA Dynamo is being released as a fully open-source project, offering broad compatibility with popular frameworks such as PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM. This open approach supports enterprises, startups, and researchers in developing and optimizing novel methods for serving AI models across disaggregated inference infrastructures.
Supercharging Inference and Agentic AI
A key innovation of NVIDIA Dynamo lies in its ability to map the knowledge that inference systems hold in memory from serving previous requests, known as the KV cache, across potentially thousands of GPUs. The software then intelligently routes new inference requests to the GPUs that possess the best knowledge match, effectively avoiding costly recomputations and freeing up other GPUs to handle new incoming requests.
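A simplified picture of such KV-cache-aware routing follows, with invented worker IDs and a plain longest-shared-prefix heuristic standing in for whatever matching Dynamo actually performs:

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the common token prefix between two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: list[int], worker_caches: dict[str, list[list[int]]]) -> str:
    """worker_caches maps a worker ID to the prompts whose KV caches it holds.
    Pick the worker that can reuse the most already-computed prefill work."""
    best_worker, best_overlap = None, -1
    for worker, cached_prompts in worker_caches.items():
        overlap = max((shared_prefix_len(prompt, c) for c in cached_prompts),
                      default=0)
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker

caches = {
    "gpu-0": [[1, 2, 3, 4]],   # holds the KV cache of a similar earlier prompt
    "gpu-1": [[9, 9]],
}
print(route([1, 2, 3, 7], caches))  # -> gpu-0: longest shared prefix wins
```

A production router would also weigh current load and memory pressure alongside cache overlap; this sketch isolates the overlap signal only.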
Industry Support
AI platform Cohere is already planning to leverage NVIDIA Dynamo to enhance the agentic AI capabilities within its Command series of models. "Scaling advanced AI models requires sophisticated multi-GPU scheduling, seamless coordination, and low-latency communication libraries that transfer reasoning contexts seamlessly across memory and storage," explained Saurabh Baji, SVP of engineering at Cohere.
Disaggregated Serving
The NVIDIA Dynamo inference platform features robust support for disaggregated serving, assigning different computational phases of LLMs to different GPUs within the infrastructure. Disaggregated serving is particularly well-suited for reasoning models, such as the new NVIDIA Llama Nemotron model family, which employs advanced inference techniques for improved contextual understanding and response generation.
Four Key Innovations of NVIDIA Dynamo
NVIDIA has highlighted four key innovations within Dynamo that contribute to reducing inference serving costs and enhancing the overall user experience:
- GPU Planner: A sophisticated planning engine that dynamically adds and removes GPUs based on fluctuating user demand (see the sketch after this list).
- Smart Router: An intelligent, LLM-aware router that directs inference requests across large fleets of GPUs.
- Low-Latency Communication Library: An inference-optimized library designed to support state-of-the-art GPU-to-GPU communication.
- Memory Manager: An intelligent engine that manages the offloading and reloading of inference data to and from lower-cost memory and storage devices.
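To ground the GPU Planner bullet above, here is a toy scale-up/scale-down decision driven by queue depth. The threshold, bounds, and one-step damping are invented for illustration and say nothing about Dynamo's real planning logic.

```python
def plan_gpu_count(current_gpus: int,
                   queued_requests: int,
                   target_queue_per_gpu: int = 8,
                   min_gpus: int = 1,
                   max_gpus: int = 64) -> int:
    """Move the fleet toward a size that keeps per-GPU queue depth near
    the target, one GPU at a time to avoid thrashing on noisy demand."""
    desired = -(-queued_requests // target_queue_per_gpu)   # ceiling division
    desired = max(min_gpus, min(max_gpus, desired))
    if desired > current_gpus:
        return current_gpus + 1    # scale up
    if desired < current_gpus:
        return current_gpus - 1    # scale down
    return current_gpus

print(plan_gpu_count(current_gpus=4, queued_requests=100))  # 5: scaling up
print(plan_gpu_count(current_gpus=4, queued_requests=8))    # 3: scaling down
```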
Conclusion
NVIDIA Dynamo is groundbreaking open-source inference software designed to accelerate and scale reasoning models within AI factories. With its dynamic GPU management, disaggregated serving, and KV-cache-aware routing, Dynamo is poised to reshape how AI inference is served. As demand for AI inference continues to grow, NVIDIA Dynamo is well positioned to play a critical role in shaping the future of AI.
FAQs
- What is NVIDIA Dynamo? NVIDIA Dynamo is an open-source inference software designed to accelerate and scale reasoning models within AI factories.
- What is disaggregated serving? Disaggregated serving is a technique that assigns different computational phases of large language models to different GPUs within the infrastructure.
- What are the key innovations of NVIDIA Dynamo? The four key innovations of NVIDIA Dynamo are GPU Planner, Smart Router, Low-Latency Communication Library, and Memory Manager.
- Is NVIDIA Dynamo compatible with popular frameworks? Yes, NVIDIA Dynamo is compatible with popular frameworks such as PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM.