65k nodes on GKE, with Maciej Rozacki and Wojciech Tyczyński
Nov 13, 2024
In this engaging discussion, Maciej Rozacki, a Product Manager for AI training at GKE, and Wojciech Tyczyński, a Software Engineer focused on Kubernetes scalability, discuss GKE's milestone of supporting 65,000 nodes per cluster. They share insights on the innovations that enabled this leap, the complexities of managing large clusters, and how these advancements serve AI workloads. The duo also emphasizes the importance of open-source contributions and community engagement in shaping the future of Kubernetes scalability.
GKE's increase to 65,000 supported nodes addresses the extensive resource demands of AI and ML workloads, enhancing computational scalability.
The integration of training and serving workloads within the same GKE cluster optimizes resource usage, improving efficiency for AI applications.
Deep dives
Scaling Kubernetes to 65,000 Nodes
GKE has announced a significant increase in supported cluster size, moving from 15,000 to 65,000 nodes. This advancement is driven by the growing demand for large-scale computing power for artificial intelligence (AI) and machine learning (ML) models, which often require extensive resources for training and serving. The new cluster capacity is aimed at accommodating models with potentially trillions of parameters, highlighting the need for scalable infrastructure to address evolving AI workloads. By narrowing the use case to predominantly AI-related applications, GKE aims to enhance the ability to manage such massive clusters efficiently.
Innovations in AI Workload Management
In addition to increasing scale, GKE is focusing on the integration of training and serving workloads within the same cluster, rather than requiring separate environments. This distinction is important as training workloads often demand rapid resource allocation, while serving workloads require stability and availability. By allowing these workloads to coexist, customers can optimize resource usage and reduce overhead, addressing both AI training and inference needs in a unified manner. The shift toward supporting mixed workloads signifies a pivotal change in how Kubernetes can be leveraged for complex AI platforms.
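One common mechanism for letting serving and training workloads coexist in a single cluster is priority-based preemption: serving pods carry a higher priority, and the scheduler evicts lower-priority training pods when capacity runs short. The sketch below is a minimal, illustrative simulation of that idea; the pod names, priority values, and `schedule` function are hypothetical, not GKE's actual scheduler or configuration.

```python
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    cpu: int        # requested CPU cores
    priority: int   # higher values preempt lower ones

@dataclass
class Node:
    capacity: int
    pods: list = field(default_factory=list)

    def used(self):
        return sum(p.cpu for p in self.pods)

def schedule(node, pod):
    """Place a pod on the node, preempting lower-priority pods if needed.

    Returns the list of evicted pods, or None if the pod cannot fit
    even after preempting everything eligible.
    """
    victims = []
    # Consider evicting the lowest-priority pods first.
    for running in sorted(node.pods, key=lambda p: p.priority):
        if node.used() - sum(v.cpu for v in victims) + pod.cpu <= node.capacity:
            break
        if running.priority < pod.priority:
            victims.append(running)
    if node.used() - sum(v.cpu for v in victims) + pod.cpu > node.capacity:
        return None  # cannot fit even after preemption
    for v in victims:
        node.pods.remove(v)
    node.pods.append(pod)
    return victims

node = Node(capacity=8)
schedule(node, Pod("train-worker-0", cpu=6, priority=100))            # training: low priority
evicted = schedule(node, Pod("serve-frontend", cpu=4, priority=1000)) # serving preempts
print([p.name for p in evicted])  # the training pod is evicted to make room
```

The asymmetry in the episode's framing maps directly onto the priority values: serving workloads get stability (they are never evicted by training), while training workloads absorb the churn and can be restarted from a checkpoint.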
Technical Challenges and Engineering Solutions
The engineering team faced numerous challenges in enabling support for 65,000-node clusters, particularly in the underlying architecture. Key improvements included replacing the legacy storage backend with a more scalable, Spanner-based one, improving the control plane's performance and resilience. Significant work also went into the Kubernetes API, enabling dynamic resource allocation and more efficient job scheduling, both essential for managing large-scale deployments. These advancements reflect a commitment to improving not just scalability but also the overall usability and performance of Kubernetes clusters.
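One reason job scheduling matters so much at this scale is that large training jobs are typically gang-scheduled: every worker must be placed at once, or the job should not start at all. The following is a toy feasibility check for that all-or-nothing property; the function name, the greedy packing strategy, and the GPU counts are illustrative assumptions, and real gang schedulers (for example, the open source Kueue project) do far more.

```python
def can_gang_schedule(free_gpus_per_node, workers, gpus_per_worker):
    """All-or-nothing check: can every worker of a job be placed at once?

    Greedily packs workers onto the nodes with the most free GPUs first.
    Illustrative only -- real schedulers also handle topology, quotas,
    and preemption.
    """
    placed = 0
    for free in sorted(free_gpus_per_node, reverse=True):
        placed += free // gpus_per_worker
        if placed >= workers:
            return True
    return placed >= workers

# A job with 16 workers, each needing 8 GPUs, against mixed node headroom:
nodes = [8] * 12 + [4] * 10   # twelve fully free nodes, ten half-free nodes
print(can_gang_schedule(nodes, workers=16, gpus_per_worker=8))  # False: only 12 workers fit
print(can_gang_schedule(nodes, workers=12, gpus_per_worker=8))  # True
```

The second call illustrates why fragmented free capacity across a huge cluster is not the same as usable capacity: the ten half-free nodes contribute 40 idle GPUs, yet none of them can host an 8-GPU worker.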
Community Contributions and Open Source Development
A notable aspect of GKE's developments is the strong emphasis on open source contributions, which have played a vital role in facilitating the new capabilities. Many of the enhancements made for GKE are implemented in the open source Kubernetes project, ensuring that the benefits extend beyond Google's offerings. This collaborative approach allows for ongoing improvements in various domains, including networking, scheduling, and resource management, positively impacting users with smaller clusters as well. The ongoing contributions to the community showcase the importance of shared progress in the Kubernetes ecosystem and highlight the interconnectedness of user needs and technical advancements.
Guests are Maciej Rozacki, Product Manager on GKE for AI Training, and Wojciech Tyczyński, Software Engineer on the GKE team at Google. We explore what it means for GKE to support 65k nodes, and the open source contributions that made this possible.
Do you have something cool to share? Some questions? Let us know: