Kubernetes Bytes

Training Machine Learning (ML) models on Kubernetes

May 31, 2024
Bernie Wu from MemVerge discusses training ML models on Kubernetes, including cost-saving tips with spot instances, efficient model checkpoints, hot restarts, and reclaiming GPU resources. The conversation also covers DAG phases, transparent checkpointing, and GPU snapshotting for AI workloads.