Training Data

Nvidia CTO Michael Kagan: Scaling Beyond Moore's Law to Million-GPU Clusters

Oct 28, 2025
Michael Kagan, CTO of NVIDIA and co-founder of Mellanox, discusses the transformative impact of Mellanox on NVIDIA's AI infrastructure. He delves into the technical challenges of scaling GPU clusters to million-GPU data centers and emphasizes that network performance is key to efficiency, not just raw compute power. Kagan envisions AI as a 'spaceship of the mind' that could unlock new physics laws. He also explores the differences in training versus inference workloads and the critical role of high-performance networking in enhancing data center operations.
INSIGHT

Compute Growth Outpaced Moore's Law

  • AI-driven compute demand exploded beyond Moore's Law, requiring system-level innovation instead of only denser chips.
  • Michael Kagan argues networking becomes as critical as silicon to sustain exponential model growth.
INSIGHT

GPU As A Rack-Scale Building Block

  • NVIDIA treats a GPU as a rack-scale building block combining hardware, interconnect, and CUDA software.
  • Kagan says scaling seamlessly from single GPU to 72 GPUs preserves the same software interface for developers.
INSIGHT

Network Quality Limits Parallelism

  • Network bandwidth, low latency, and narrow latency distribution determine cluster efficiency for massively parallel jobs.
  • High jitter forces smaller parallelism, so network quality directly limits how many GPUs you can use effectively.
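The claim that jitter caps usable parallelism can be illustrated with a toy simulation (an assumption-laden sketch, not anything from the episode): in a synchronous training step, every worker must wait for the slowest one, so step time is the maximum of N latency samples, and that maximum drifts upward as N grows whenever latencies vary.

```python
import random

def expected_step_time(n_workers, mean_ms=1.0, jitter_ms=0.0,
                       trials=2000, seed=0):
    """Average time of a synchronous step: each step waits for the
    slowest of n_workers, whose latency is mean +/- uniform jitter.
    (Hypothetical model; uniform jitter chosen purely for illustration.)"""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(mean_ms + rng.uniform(-jitter_ms, jitter_ms)
                     for _ in range(n_workers))
    return total / trials

# With zero jitter the step time stays flat as the cluster grows;
# with jitter the expected slowest worker creeps toward mean + jitter,
# so a wider latency distribution wastes more of every GPU's time.
for n in (1, 8, 64, 512):
    print(n, round(expected_step_time(n, jitter_ms=0.2), 3))
```

Under this model, tightening the latency distribution (not just raising peak bandwidth) is what lets more GPUs be used efficiently, which matches Kagan's point that network quality, not raw compute, limits parallelism.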