State of the Art: Training >70B LLMs on 10,000 H100 clusters

Latent Space: The AI Engineer Podcast

Optimizing AI Training Infrastructure

This chapter explores the complexities of monitoring hardware performance during AI model training, addressing metrics such as memory fragmentation and CPU throttling. It shares insights into open-source training frameworks, including NVIDIA's Megatron-LM and Microsoft's DeepSpeed, and highlights the value of adapting such tools to the quirks of one's own infrastructure. The discussion also covers cost-aware hyperparameter tuning and emerging metrics for evaluating large language models.
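Cost-aware hyperparameter tuning as discussed here can be sketched as a search loop that tracks projected compute spend and stops before exceeding a dollar budget. The sketch below is a hypothetical illustration, not a method from the episode: the loss function is a synthetic stand-in for a real training run, and all cost figures are made-up parameters.

```python
import math
import random

def cost_aware_random_search(budget_usd, cost_per_gpu_hour, trial_gpu_hours,
                             n_candidates=100, seed=0):
    """Randomly sample learning rates, stopping once the next trial's
    projected cost would exceed the remaining budget (illustrative only)."""
    rng = random.Random(seed)
    trial_cost = cost_per_gpu_hour * trial_gpu_hours
    spent = 0.0
    best = None  # (loss, learning_rate)
    for _ in range(n_candidates):
        if spent + trial_cost > budget_usd:
            break  # budget exhausted: skip further trials
        lr = 10 ** rng.uniform(-5, -2)  # log-uniform sample
        # Stand-in for an actual training run: a synthetic loss that is
        # minimized near lr = 10**-3.5, plus a little noise.
        loss = abs(math.log10(lr) + 3.5) + rng.uniform(0, 0.1)
        spent += trial_cost
        if best is None or loss < best[0]:
            best = (loss, lr)
    return best, spent

best, spent = cost_aware_random_search(
    budget_usd=100.0, cost_per_gpu_hour=2.0, trial_gpu_hours=8.0)
```

With these assumed numbers each trial costs $16, so the loop runs six trials ($96) and stops rather than start a seventh that would overshoot the $100 budget.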

