

65k nodes on GKE, with Maciej Rozacki and Wojciech Tyczyński
Nov 13, 2024
Maciej Rozacki, a Product Manager for AI training at GKE, and Wojciech Tyczyński, a Software Engineer focused on Kubernetes scalability, discuss GKE's support for 65,000-node clusters. They cover the innovations that made this scale possible, the complexities of managing large clusters, and how these advancements serve AI workloads. They also stress the importance of open-source contributions and community engagement in shaping the future of Kubernetes scalability.
AI Snips
GKE's Massive Scale Increase
- GKE now supports 65,000-node clusters, up from 15,000, a more than fourfold increase.
- This expansion is driven by the increasing demand for large-scale AI training.
AI's Impact on Infrastructure
- AI workloads require tightly coupled computation across numerous machines.
- Kubernetes facilitates this by enabling colocation and dynamic resource allocation.
Spanner-Based Storage
- GKE replaced etcd with its own Spanner-based storage for improved scalability and flexibility.
- This multi-tenant solution makes control planes stateless and speeds up operations.