MLOps.community

Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228

Apr 30, 2024
Simon Karasik, an experienced ML Engineer, discusses handling multi-terabyte LLM checkpoints. Topics include managing massive models, cloud storage options, comparing Slurm and Kubernetes, navigating data processing challenges, monitoring Kubernetes nodes with faulty GPUs, and simplifying model training processes.