

Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228
Apr 30, 2024
Simon Karasik, an experienced ML Engineer, discusses handling multi-terabyte LLM checkpoints. Topics include managing massive models, cloud storage options, comparing Slurm and Kubernetes, navigating data processing challenges, monitoring Kubernetes nodes with faulty GPUs, and simplifying model training processes.
Chapters
Introduction
00:00 • 5min
Transition to Training Large Language Models
04:53 • 11min
Navigating the Checkpoint Conundrum
16:17 • 13min
Comparing Slurm and Kubernetes for Workload Management
29:05 • 2min
Navigating Cloud Storage and Data Processing Challenges
30:42 • 9min
Challenges in Monitoring Kubernetes Nodes with Faulty GPUs
39:49 • 12min
Embracing Simplicity in Model Training Processes and Avoiding Unnecessary Complexities
52:06 • 3min