
Training Data: NVIDIA CTO Michael Kagan on Scaling Beyond Moore's Law to Million-GPU Clusters
Oct 28, 2025

Michael Kagan, CTO of NVIDIA and co-founder of Mellanox, discusses Mellanox's transformative impact on NVIDIA's AI infrastructure. He walks through the technical challenges of scaling GPU clusters to million-GPU data centers, emphasizing that network performance, not just raw compute power, determines efficiency. Kagan envisions AI as a 'spaceship of the mind' that could unlock new laws of physics. He also contrasts training and inference workloads and examines the critical role of high-performance networking in data center operations.
AI Snips
Compute Growth Outpaced Moore's Law
- AI-driven compute demand exploded beyond Moore's Law, requiring system-level innovation instead of only denser chips.
- Michael Kagan argues networking becomes as critical as silicon to sustain exponential model growth.
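
A quick back-of-the-envelope makes the gap concrete. The doubling periods below are commonly cited rough figures, not numbers from the episode: transistor density doubling roughly every two years under Moore's Law versus frontier AI training compute doubling roughly every six months.

```python
# Illustrative arithmetic only; doubling periods are rough, commonly cited
# figures (Moore's Law ~24 months, AI training compute ~6 months), not
# numbers from the episode.

def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` given a doubling period."""
    return 2 ** (years / doubling_period_years)

horizon = 5  # years
moore = growth_factor(horizon, 2.0)  # transistor density: ~5.7x
ai = growth_factor(horizon, 0.5)     # AI training compute: ~1024x

print(f"Over {horizon} years: Moore's Law ~{moore:.1f}x, AI compute ~{ai:.0f}x")
print(f"Gap that system-level innovation must close: ~{ai / moore:.0f}x")
```

Under these assumed rates, chips alone cover less than a percent of the demand curve over five years; the rest has to come from interconnect and system design.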
GPU As A Rack-Scale Building Block
- NVIDIA treats the GPU as a rack-scale building block that combines hardware, interconnect, and CUDA software.
- Kagan says scaling seamlessly from a single GPU to 72 GPUs preserves the same software interface for developers.
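
A minimal sketch of that "same interface at any scale" idea, using PyTorch DistributedDataParallel over NCCL as a stand-in; this is a generic illustration, not NVIDIA's internal stack. The same script runs unchanged whether torchrun launches one process or 72 (e.g., one per GPU in an NVL72 rack).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK; the script is identical at any process count
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[local_rank])  # gradient sync is transparent

    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).square().mean()
    loss.backward()                              # NCCL all-reduce happens here
    opt.step()

    if dist.get_rank() == 0:
        print(f"step complete across {dist.get_world_size()} GPU process(es)")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as, e.g., `torchrun --nproc_per_node=8 train.py`; nothing in the training loop changes as the GPU count grows, which is the developer-facing property Kagan describes.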
Network Quality Limits Parallelism
- Network bandwidth, low latency, and narrow latency distribution determine cluster efficiency for massively parallel jobs.
- High jitter forces jobs down to smaller degrees of parallelism, so network quality directly limits how many GPUs can be used effectively.
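
The straggler effect behind this can be sketched with a toy model (my construction, not from the episode): in a synchronous step, every worker waits on a collective, so step time is compute plus the maximum of N latency draws, and that maximum grows with N when the latency distribution is wide. All constants below are hypothetical.

```python
import numpy as np

# Toy synchronous-step model: the step cannot finish until the SLOWEST
# worker's communication completes, so step time = compute + max of N draws.
rng = np.random.default_rng(0)
compute_ms = 100.0   # hypothetical per-step compute time
mean_comm_ms = 10.0  # hypothetical mean network latency

for jitter_ms in (0.1, 5.0):             # narrow vs wide latency distribution
    for n_gpus in (8, 72, 1024, 65536):
        # 100 simulated steps; latencies clipped at zero
        lat = rng.normal(mean_comm_ms, jitter_ms, size=(100, n_gpus)).clip(0.0)
        step_ms = compute_ms + lat.max(axis=1).mean()
        eff = compute_ms / step_ms
        print(f"jitter {jitter_ms:4.1f} ms, {n_gpus:>6} GPUs -> "
              f"efficiency {eff:.1%}")
```

Under this toy model the mean latency is identical in every row, yet efficiency degrades steadily with cluster size only when the distribution is wide, which is why the snip stresses a narrow latency distribution and not just bandwidth.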

