AI/ML in Kubernetes, with Maciej Szulik, Clayton Coleman, and Dawn Chen
Jun 25, 2024
Three Kubernetes leaders discuss the evolution of Kubernetes, focusing on AI/ML workloads. Topics include enhancements in batch job controllers, intersection of Kubernetes with HPC and AI, customizable enterprise platforms, managing various workloads, and the evolution of AI and ML workloads in Kubernetes.
Kubernetes evolves to support AI/ML, emphasizing stateful workload management.
Establishment of the 'serving' and 'batch' working groups enhances Kubernetes scalability.
Borg's influence expands Kubernetes to manage AI/ML workloads and adjust infrastructure.
Focus on rapid infrastructure evolution to meet demands of emerging AI technologies.
Deep dives
High-Performance Computing and AI/ML in Kubernetes
Kubernetes celebrates its 10-year anniversary with a focus on AI/ML in the final episode of a special series. The discussion delves into infrastructure considerations that apply across various workloads and features notable contributors Clayton Coleman, Dawn Chen, and Maciej Szulik. The evolution of Kubernetes' role in supporting AI/ML workloads over two years highlights significant advancements like the batch API and job controllers, which help jobs run efficiently and make better use of resources.
Transition from Stateless to Stateful Workloads
Debuting during the 10-year celebrations, Maciej Szulik shares his Kubernetes journey, highlighting his contributions to the Steering Committee and special interest groups. Maciej details the challenges faced during the initial days of Kubernetes development, emphasizing the shift beyond stateless workloads, including the Job and CronJob controllers, and the effort to make them reliable and efficient.
Exploring AI, ML, and HPC Workloads in Kubernetes
The discussion navigates through the evolving terminology and usage of terms like inference, serving, and high-performance computing. The distinctions between training and inference workloads shed light on serving, batch, and stateful workload needs within Kubernetes as it adapts to the demands of AI/ML applications. The establishment of working groups like serving and batch aims to enhance Kubernetes capabilities in managing diverse workload types.
Role of Working Groups in Kubernetes Evolution
Working groups like 'serving' and 'batch' signify Kubernetes' scalability and specialization for various workload requirements. 'Serving' focuses on real-time inference needs, bridging the gap between traditional serving and AI/ML demands, while 'batch' addresses complexities in scheduling and queuing batch jobs optimally. These specialized groups contribute to the continuous evolution and refinement of Kubernetes functionalities.
Evolution of Kubernetes from Supporting Stateless Workloads to AI and ML Workloads
Kubernetes, initially focused on stateless workloads, has evolved to support AI and ML workloads. The foundation for this evolution stems from Borg, which originally supported batch workloads, leading to Kubernetes inheriting similar capabilities. The focus now includes AI and ML workloads, pushing Kubernetes to new limits and requiring expertise in infrastructure adjustments.
The Role of Working Groups in Advancing Kubernetes Capabilities for AI Workloads
Kubernetes leverages working groups like 'Serving' and 'Devices' to enhance support for AI workloads. These groups address requirements for inference workloads, accelerators like GPUs and TPUs, and batch processing. The formation of specialized groups demonstrates Kubernetes' commitment to evolving and optimizing infrastructure for diverse workload needs.
Shift towards Supporting Stateful Workloads and AI Intensive Applications
Kubernetes originated with a focus on stateless workloads, but has diversified to accommodate stateful workloads and AI applications. Enhancements include GPU and TPU support, tailored resource allocations for different workload types, and evolving job controllers to manage advanced AI and ML tasks. The discussion emphasizes the need for infrastructure to evolve rapidly to meet the demands of emerging technologies.
In this episode, we talk to three active leaders who have been around since the very beginning of Kubernetes. We explore how Kubernetes has changed since its inception, with a particular focus on current efforts in open source Kubernetes to support AI/ML style workloads.
Maciej Szulik currently holds a seat on the Kubernetes Steering Committee. He also leads the Special Interest Groups responsible for kubectl, workload, and batch controllers. Maciej has been contributing to Kubernetes since the early days, jumping from one area to another where help was needed. He authored the first version of audit and helped shape its current one, and has touched multiple other places in apimachinery. He was also responsible for designing and implementing the Job and CronJob controllers. In kubectl he was responsible for the plugin mechanism and several major refactors to simplify the code. In May 2024 he joined the ranks of Production Readiness Review (PRR) approvers, helping ensure high production standards for future Kubernetes releases.
Clayton Coleman is a long-time Kubernetes contributor, having helped launch Kubernetes as open source, being on the bootstrap steering committee, and working across a number of SIGs to make Kubernetes a reliable and powerful foundation for workloads. At Red Hat he led OpenShift’s pivot onto Kubernetes and its growth across on-premise, edge, and into cloud. At Google he is now focused on enabling the next generation of key workloads, especially AI/ML in Kubernetes and on GKE.
Dawn Chen has been a Principal Software Engineer at Google Cloud since May 2007. Dawn has worked on the open source project called Kubernetes since before the project was founded. She has been one of the tech leads in both Kubernetes and GKE, and founded SIG Node from scratch. She also led the Anthos platform team for the last 4 years, and mainly focuses on the core infrastructure. Prior to Kubernetes, she was one of the tech leads for Google's internal container infrastructure, Borg, for about 7 years. Outside of work, she is a wife, a mother of a 16-year-old boy, and a good friend. She enjoys reading, cooking, hiking, and traveling.
Do you have something cool to share? Some questions? Let us know: