
Training Data: NVIDIA CTO Michael Kagan on Scaling Beyond Moore's Law to Million-GPU Clusters
Oct 28, 2025

Michael Kagan, CTO of NVIDIA and co-founder of Mellanox, discusses Mellanox's transformative impact on NVIDIA's AI infrastructure. He walks through the technical challenges of scaling GPU clusters to million-GPU data centers, emphasizing that network performance, not just raw compute power, determines efficiency. Kagan envisions AI as a 'spaceship of the mind' that could unlock new laws of physics. He also contrasts training and inference workloads and examines the critical role of high-performance networking in data center operations.
AI Snips
Compute Growth Outpaced Moore's Law
- AI-driven compute demand exploded beyond Moore's Law, requiring system-level innovation instead of only denser chips.
- Michael Kagan argues networking becomes as critical as silicon to sustain exponential model growth.
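
A quick back-of-the-envelope makes the gap concrete. The doubling periods below are commonly cited rough figures, not numbers from the episode: transistor density doubling roughly every two years under Moore's Law versus frontier AI training compute doubling roughly every six months.

```python
# Illustrative arithmetic only; doubling periods are rough, commonly cited
# figures (Moore's Law ~24 months, AI training compute ~6 months), not
# numbers from the episode.

def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` given a doubling period."""
    return 2 ** (years / doubling_period_years)

horizon = 5  # years
moore = growth_factor(horizon, 2.0)  # transistor density: ~5.7x
ai = growth_factor(horizon, 0.5)     # AI training compute: ~1024x

print(f"Over {horizon} years: Moore's Law ~{moore:.1f}x, AI compute ~{ai:.0f}x")
print(f"Gap that system-level innovation must close: ~{ai / moore:.0f}x")
```

Under these assumed rates, chips alone cover less than a percent of the demand curve over five years; the rest has to come from interconnect and system design.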
GPU As A Rack-Scale Building Block
- NVIDIA treats the GPU as a rack-scale building block that combines hardware, interconnect, and CUDA software.
- Kagan says scaling seamlessly from a single GPU to 72 GPUs preserves the same software interface for developers.
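
A minimal sketch of that "same interface at any scale" idea, using PyTorch DistributedDataParallel over NCCL as a stand-in; this is a generic illustration, not NVIDIA's internal stack. The same script runs unchanged whether torchrun launches one process or 72 (e.g., one per GPU in an NVL72 rack).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK; the script is identical at any process count
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[local_rank])  # gradient sync is transparent

    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).square().mean()
    loss.backward()                              # NCCL all-reduce happens here
    opt.step()

    if dist.get_rank() == 0:
        print(f"step complete across {dist.get_world_size()} GPU process(es)")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as, e.g., `torchrun --nproc_per_node=8 train.py`; nothing in the training loop changes as the GPU count grows, which is the developer-facing property Kagan describes.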
Network Quality Limits Parallelism
- Network bandwidth, low latency, and narrow latency distribution determine cluster efficiency for massively parallel jobs.
- High jitter forces jobs down to smaller degrees of parallelism, so network quality directly limits how many GPUs can be used effectively.
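
The straggler effect behind this can be sketched with a toy model (my construction, not from the episode): in a synchronous step, every worker waits on a collective, so step time is compute plus the maximum of N latency draws, and that maximum grows with N when the latency distribution is wide. All constants below are hypothetical.

```python
import numpy as np

# Toy synchronous-step model: the step cannot finish until the SLOWEST
# worker's communication completes, so step time = compute + max of N draws.
rng = np.random.default_rng(0)
compute_ms = 100.0   # hypothetical per-step compute time
mean_comm_ms = 10.0  # hypothetical mean network latency

for jitter_ms in (0.1, 5.0):             # narrow vs wide latency distribution
    for n_gpus in (8, 72, 1024, 65536):
        # 100 simulated steps; latencies clipped at zero
        lat = rng.normal(mean_comm_ms, jitter_ms, size=(100, n_gpus)).clip(0.0)
        step_ms = compute_ms + lat.max(axis=1).mean()
        eff = compute_ms / step_ms
        print(f"jitter {jitter_ms:4.1f} ms, {n_gpus:>6} GPUs -> "
              f"efficiency {eff:.1%}")
```

Under this toy model the mean latency is identical in every row, yet efficiency degrades steadily with cluster size only when the distribution is wide, which is why the snip stresses a narrow latency distribution and not just bandwidth.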

