MLOps.community

Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228

Apr 30, 2024
Simon Karasik, an experienced ML Engineer, discusses handling multi-terabyte LLM checkpoints. Topics include managing massive models, cloud storage options, comparing Slurm and Kubernetes, navigating data processing challenges, monitoring Kubernetes nodes with faulty GPUs, and simplifying model training processes.
55:36

Podcast summary created with Snipd AI

Quick takeaways

  • Managing terabyte-sized LLM checkpoints requires a deep understanding of scaling laws and strategic checkpoint frequency planning.
  • Utilizing Nebius AI's cloud resources for LLM training can offer tailored tools, GPU availability, and user-friendly interfaces for engineers.

Deep dives

Training Large Language Models at Nebius AI

Nebius AI, a cloud company known for its AI-specific cloud services, currently focuses on training large language models (LLMs). Simon, an ML engineer at Nebius AI, discusses his work on a 300-billion-parameter model and the challenges of managing checkpoints at that scale. Key points are the scaling considerations behind terabyte-sized checkpoints and the need for a deliberate checkpointing-frequency strategy.
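To make "terabyte-sized checkpoints" concrete, here is a back-of-envelope sketch. The episode does not state the exact training setup, so this assumes a common configuration: mixed-precision training with Adam, where each parameter carries bf16 weights (2 bytes) plus fp32 master weights, momentum, and variance (4 bytes each), about 14 bytes per parameter in a full-state checkpoint.

```python
# Rough full-state checkpoint size estimate for a large LLM.
# Assumption (not stated in the episode): bf16 weights + fp32 Adam state,
# i.e. roughly 2 + 4 + 4 + 4 = 14 bytes per parameter.

def checkpoint_size_tb(n_params: float, bytes_per_param: int = 14) -> float:
    """Estimated checkpoint size in terabytes (1 TB = 1e12 bytes)."""
    return n_params * bytes_per_param / 1e12

# A 300B-parameter model, as discussed in the episode:
print(f"{checkpoint_size_tb(300e9):.1f} TB")  # ~4.2 TB per checkpoint
```

At roughly 4 TB per save, checkpoint frequency becomes a real trade-off: saving often limits lost work after a failure but consumes storage and write bandwidth, which is why the episode treats frequency planning as a first-class concern.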
