
MLOps.community
Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228
Apr 30, 2024
Simon Karasik, an experienced ML Engineer, discusses handling multi-terabyte LLM checkpoints. Topics include managing massive models, cloud storage options, comparing Slurm and Kubernetes, navigating data processing challenges, monitoring Kubernetes nodes with faulty GPUs, and simplifying model training processes.
55:36
Quick takeaways
- Managing terabyte-sized LLM checkpoints requires a solid grasp of how checkpoint size scales with model size and strategic planning of checkpoint frequency (see the sketch after this list).
- Utilizing Nebius AI's cloud resources for LLM training can offer tailored tools, GPU availability, and user-friendly interfaces for engineers.
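The episode doesn't walk through a formula, but a classic starting point for checkpoint-frequency planning is Young's approximation, which trades off the time spent writing a checkpoint against the work lost when a node fails. A minimal sketch, with the write time and failure rate as purely hypothetical inputs:

```python
import math

def optimal_checkpoint_interval(write_time_s: float, mtbf_s: float) -> float:
    """Young's approximation: checkpoint every sqrt(2 * write_time * MTBF)."""
    return math.sqrt(2 * write_time_s * mtbf_s)

# Hypothetical inputs: a 30-minute checkpoint write on a cluster that fails,
# on average, once a day. Real numbers depend on storage bandwidth and
# fleet reliability.
interval_s = optimal_checkpoint_interval(write_time_s=30 * 60, mtbf_s=24 * 3600)
print(f"Checkpoint roughly every {interval_s / 3600:.1f} hours")  # ~4.9 hours
```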
Deep dives
Training Large Language Models at Nebius AI
Nebius AI, a cloud company known for AI-specific cloud services, is currently focused on training large language models (LLMs). Simon, an ML engineer at Nebius AI, discusses his work on a 300-billion-parameter model, highlighting the challenges of managing checkpoints in training at that scale. Key points include how checkpoint size scales into the terabytes with model size and the need for a deliberate checkpoint-frequency strategy.
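To see where the terabytes come from, consider a common mixed-precision Adam layout: bf16 weights plus an fp32 master copy and two fp32 optimizer moments, about 14 bytes per parameter. A back-of-envelope sketch under that assumed layout (the episode doesn't spell out the exact optimizer state):

```python
# Rough checkpoint size for a 300B-parameter model. The per-parameter byte
# counts assume a typical mixed-precision Adam layout, not a breakdown
# given in the episode.
PARAMS = 300e9

BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

total_bytes = PARAMS * sum(BYTES_PER_PARAM.values())
print(f"Checkpoint size: ~{total_bytes / 1e12:.1f} TB")  # ~4.2 TB
```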