

Training Machine Learning (ML) models on Kubernetes
May 31, 2024
Bernie Wu from MemVerge discusses training ML models on Kubernetes, including cost-saving tips with spot instances, efficient model checkpoints, hot restarts, and reclaiming GPU resources. The conversation also covers DAG phases, transparent checkpointing, and GPU snapshotting for AI workloads.
Chapters
Introduction
00:00 • 2min
Transition to External Plugin Architecture in Kubernetes and AKS Automatic Solution
02:19 • 2min
Trends in Containerized Applications for Kubernetes Clusters and Recent Acquisition News
04:40 • 4min
Evolution of Kubernetes for AI Workloads
08:47 • 13min
Optimizing Machine Learning Model Training on Kubernetes with Checkpoints
22:08 • 21min
AI Technologies, Project CRIU, and Kubernetes Operator Deployment
43:19 • 6min
Exploring Kubernetes Security and Developer Shift Left Strategies
49:48 • 2min
Evolution of Kubernetes towards Stateful Data Management and GPU Resource Optimization
51:21 • 4min