MLOps.community

Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228

Apr 30, 2024
55:36
Simon Karasik, an experienced ML Engineer, discusses handling multi-terabyte LLM checkpoints. Topics include managing massive models, cloud storage options, comparing Slurm and Kubernetes, navigating data processing challenges, monitoring Kubernetes nodes with faulty GPUs, and simplifying model training processes.
Podcast summary created with Snipd AI

Quick takeaways

  • Managing terabyte-sized LLM checkpoints requires a deep understanding of scaling laws and strategic checkpoint frequency planning.
  • Utilizing Nebius AI's cloud resources for LLM training can offer tailored tools, GPU availability, and user-friendly interfaces for engineers.

Deep dives

Training Large Language Models at Nebius AI

Nebius AI, a cloud company known for AI-specific cloud services, is currently focused on training large language models (LLMs). Simon, an ML engineer at Nebius AI, discusses his work on a 300-billion-parameter model, highlighting the challenges of managing checkpoints in training at that scale. Key points include understanding how checkpoint size grows with model size and strategically planning checkpoint frequency.
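To see why checkpoints for a 300-billion-parameter model run into multiple terabytes, a rough back-of-envelope calculation helps. The sketch below assumes a typical mixed-precision Adam setup (bf16 weights, an fp32 master copy, and two fp32 optimizer moments per parameter); the episode does not confirm these exact details of Nebius AI's training stack, so treat the byte counts as illustrative assumptions.

```python
def checkpoint_size_bytes(n_params: int,
                          weight_bytes: int = 2,   # bf16 model weights
                          master_bytes: int = 4,   # fp32 master copy of weights
                          optim_bytes: int = 8) -> int:  # Adam m + v, fp32 each
    """Estimate the total bytes in a full training checkpoint
    (weights + master weights + optimizer state)."""
    return n_params * (weight_bytes + master_bytes + optim_bytes)

n = 300_000_000_000  # the 300B-parameter model discussed in the episode
size = checkpoint_size_bytes(n)
print(f"~{size / 1e12:.1f} TB per checkpoint")  # ~4.2 TB
```

Under these assumptions a single checkpoint is around 4 TB, which is why checkpoint frequency becomes a genuine cost and throughput trade-off: saving too often saturates storage bandwidth, saving too rarely risks losing hours of expensive GPU time on a failure.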
