AI/ML in Kubernetes, with Maciej Szulik, Clayton Coleman, and Dawn Chen


Jun 25, 2024

Guest

Dawn Chen

Three Kubernetes leaders discuss how Kubernetes has evolved, focusing on AI/ML workloads. Topics include enhancements to the batch job controllers, the intersection of Kubernetes with HPC and AI, customizable enterprise platforms, and managing varied workloads.

ANECDOTE

From Tester to Developer

Clayton Coleman started his career as a software tester at IBM.
He transitioned to development after consistently finding and reporting bugs.

INSIGHT

Complexity over Simplicity

OpenShift's experience showed that users desire complex solutions, not simple ones.
Kubernetes’s success stems from its ability to accommodate this complexity.

INSIGHT

AI's Current Superpower

AI's current strength is automating repetitive tasks.
This frees up human time for more valuable work.

In this episode, we talk to three active leaders who have been around since the very beginning of Kubernetes. We explore how Kubernetes has changed since its inception, with a particular focus on current efforts in open source Kubernetes to support AI/ML-style workloads.

Maciej Szulik currently holds a seat on the Kubernetes Steering Committee. He also leads the Special Interest Groups responsible for kubectl and for the workload and batch controllers. Maciej has been contributing to Kubernetes since the early days, jumping from one area to another wherever help was needed. He authored the first version of audit and helped shape its current one, and has touched multiple other places in apimachinery. He was also responsible for designing and implementing the Job and CronJob controllers. In kubectl he was responsible for the plugin mechanism and several major refactors to simplify the code. In May 2024 he joined the ranks of Production Readiness Review (PRR) approvers, helping ensure high production standards for future Kubernetes releases.

Clayton Coleman is a long-time Kubernetes contributor, having helped launch Kubernetes as open source, being on the bootstrap steering committee, and working across a number of SIGs to make Kubernetes a reliable and powerful foundation for workloads. At Red Hat he led OpenShift’s pivot onto Kubernetes and its growth across on-premise, edge, and into cloud. At Google he is now focused on enabling the next generation of key workloads, especially AI/ML in Kubernetes and on GKE.

Dawn Chen has been a Principal Software Engineer at Google Cloud since May 2007. Dawn has worked on the open source project Kubernetes since before the project was founded. She has been one of the tech leads in both Kubernetes and GKE, and founded SIG Node from scratch. She also led the Anthos platform team for the last 4 years, and mainly focuses on the core infrastructure. Prior to Kubernetes, she was one of the tech leads for Google's internal container infrastructure, Borg, for about 7 years. Outside of work, she is a wife, a mother of a 16-year-old boy, and a good friend. She enjoys reading, cooking, hiking and traveling.

Do you have something cool to share? Some questions? Let us know:

- web: kubernetespodcast.com

- mail: kubernetespodcast@google.com

- twitter: @kubernetespod

News of the week

Kubernetes 1.31 Code Freeze is on July 9th

Links from the interview

Kubernetes Working Group Batch
Kubernetes Working Group Serving
Blog: Introducing Indexed Jobs (2021)
Docs: Kubernetes Jobs
KEP: Elastic Indexed Jobs
Docs: Kubernetes CronJobs
KubeCon EU 2021: The Long, Winding and Bumpy Road to CronJob’s GA - Maciej Szulik, Red Hat & Alay Patel, Red Hat
KubeCon EU 2018: Writing Kube Controllers for Everyone - Maciej Szulik, Red Hat (Beginner Skill Level)
Kubernetes Working Group Device Management
Kubernetes Enhancement Proposal process README
DockerCon 2014: The announcement of Kubernetes at DockerCon
Blog: AI & Kubernetes (by Kaslin)
Kueue - “Kueue is a cloud-native job queueing system for batch, HPC, AI/ML, and similar applications in a Kubernetes cluster.”
Whitepaper: Large-scale cluster management at Google with Borg
Email: “Containers: Introduction” - An email introducing the concept of Linux containers to the Linux community

Links from the post-interview chat