Training Machine Learning (ML) models on Kubernetes
May 31, 2024
Bernie Wu from MemVerge discusses training ML models on Kubernetes, including saving costs with spot instances, efficient model checkpointing, hot restarts, and reclaiming GPU resources. They delve into topics like the phases of a training DAG, transparent checkpointing, and GPU snapshotting for AI workloads.
Efficient model checkpointing enhances AI training reliability and recovery, optimizing resource management in Kubernetes.
Transitioning to stateful workloads on Kubernetes requires efficient data management and the ability to handle more complex operations for AI models.
Deep dives
Transition from Stateless to Stateful Workloads on Kubernetes
The discussion highlights how workloads on Kubernetes have evolved from stateless to stateful, reflecting a shift toward managing more data and more complex operations efficiently on the platform.
Importance of Model Checkpointing for AI Workloads
Model checkpointing preserves model versions, supports hyperparameter tuning, and enables recovery from failures, making it essential for efficient and reliable AI model training.
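As an illustrative sketch (not the specific mechanism discussed in the episode), application-level checkpointing in a training loop saves the model weights, optimizer state, and progress so a failed job can resume instead of restarting from scratch. The PyTorch-style example below uses placeholder names and paths.

```python
import os
import torch

CKPT_PATH = "/mnt/checkpoints/latest.pt"  # placeholder path, e.g. a PVC mount

def save_checkpoint(model, optimizer, epoch, path=CKPT_PATH):
    # Persist everything needed for a hot restart: weights, optimizer state, progress.
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path=CKPT_PATH):
    # Resume from the last saved state if a checkpoint exists; otherwise start at epoch 0.
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

# In the training loop, checkpoint periodically so a failure only loses
# the work done since the last save:
#   start_epoch = load_checkpoint(model, optimizer)
#   for epoch in range(start_epoch, num_epochs):
#       train_one_epoch(model, optimizer)
#       if epoch % 5 == 0:
#           save_checkpoint(model, optimizer, epoch)
```

Writing the checkpoint to a persistent volume rather than node-local disk is what allows the job to resume on a different node.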
Efficient Utilization and Resilience with Transparent Checkpointing
Transparent checkpointing technologies such as MemVerge's WaveRider and SpotSurfer enable efficient GPU utilization, resource optimization, and resilience for AI workloads, delivering cost savings and automating how compute resources are managed.
Practical Implementation with Kubernetes Operators
Kubernetes operators with built-in checkpointing capabilities simplify adoption: checkpointing is integrated and automated within Kubernetes environments, giving users a straightforward way to manage AI workloads efficiently.
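As a hedged illustration of the operator pattern (the annotation names below are hypothetical and not MemVerge's actual API), a user might opt a training Job into checkpointing simply by annotating it and letting the operator handle the rest. The sketch uses the official Kubernetes Python client.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

# Hypothetical annotations an operator could watch for; names are illustrative only.
annotations = {
    "example.com/checkpoint": "enabled",
    "example.com/checkpoint-interval": "300s",
}

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-llm", annotations=annotations),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(annotations=annotations),
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="my-registry/trainer:latest",  # placeholder image
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"}
                        ),
                    )
                ],
            ),
        )
    ),
)

# The operator, not the training code, would react to the annotations
# and manage checkpoint/restore for the Job.
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```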
In this episode of the Kubernetes Bytes podcast, Bhavin sits down with Bernie Wu, VP of Strategic Partnerships and AI/CXL/Kubernetes Initiatives at MemVerge. They discuss how Kubernetes has become the most popular platform for running AI model training and model inferencing jobs. The conversation dives into model training, covering the different phases of a DAG, and then into how MemVerge can help users with efficient and cost-effective model checkpoints. They also cover saving costs by using spot instances, hot restarts of training jobs, and reclaiming unused GPU resources.
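To make the spot-instance scenario concrete, here is a minimal, hedged sketch of what a training pod would otherwise have to do by hand: poll the standard EC2 spot "instance-action" metadata endpoint and flush a final checkpoint when an interruption notice appears. Transparent-checkpointing tooling like SpotSurfer aims to handle this without application changes; the save_checkpoint callback below is a placeholder, and the example assumes the instance metadata service (IMDSv1) is reachable.

```python
import time
import urllib.error
import urllib.request

# Standard EC2 instance metadata endpoint for spot interruption notices.
# It returns 404 until an interruption is scheduled, then a JSON document
# with the action and time. (IMDSv2 would additionally require a session token.)
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def spot_interruption_pending() -> bool:
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=1) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False

def watch_for_interruption(save_checkpoint, poll_seconds=5):
    # Run in a sidecar thread or process; on notice, flush a final checkpoint
    # so the job can hot-restart on another node.
    while True:
        if spot_interruption_pending():
            save_checkpoint()
            break
        time.sleep(poll_seconds)
```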