State of the Art: Training >70B LLMs on 10,000 H100 clusters

Latent Space: The AI Engineer Podcast

Optimizing AI Training Infrastructure

This chapter explores the complexities of monitoring hardware performance during AI model training, focusing on metrics such as memory fragmentation and CPU throttling. It shares insights into the open-source tools used for large-scale training, such as NVIDIA's Megatron-LM and Microsoft's DeepSpeed, and highlights the need to adapt these off-the-shelf solutions to infrastructure challenges. The discussion also touches on the importance of cost-aware hyperparameter tuning and novel metrics for evaluating large language models.
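
As a concrete illustration (not taken from the episode itself), here is a minimal Python sketch of the kind of host-side monitoring loop this discussion refers to: it samples PyTorch's caching allocator for a rough fragmentation proxy and reads CPU clocks via psutil as a throttling signal. The function name, metric names, and sampling interval are illustrative assumptions, not the team's actual tooling.

```python
import time

import psutil
import torch


def sample_hardware_metrics(device: int = 0) -> dict:
    """Collect cheap health signals worth logging alongside each training step."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(device)
        reserved = torch.cuda.memory_reserved(device)
    else:
        allocated = reserved = 0

    # Memory the caching allocator has reserved but not handed out is a rough
    # proxy for fragmentation; a persistently high ratio often precedes
    # out-of-memory failures even when "allocated" looks safe.
    frag_ratio = (reserved - allocated) / reserved if reserved else 0.0

    # Current vs. maximum CPU clock is a simple throttling signal; psutil may
    # return None for frequencies on some platforms, hence the guard.
    freq = psutil.cpu_freq()
    cpu_throttle_ratio = 1.0 - (freq.current / freq.max) if freq and freq.max else 0.0

    return {
        "gpu_mem_allocated_gb": allocated / 2**30,
        "gpu_mem_reserved_gb": reserved / 2**30,
        "gpu_frag_ratio": frag_ratio,
        "cpu_percent": psutil.cpu_percent(interval=None),
        "cpu_throttle_ratio": cpu_throttle_ratio,
    }


if __name__ == "__main__":
    # In a real cluster these samples would be shipped to a metrics backend;
    # printing every 10 seconds is just for demonstration.
    while True:
        print(sample_hardware_metrics())
        time.sleep(10)
```

Tracking the gap between reserved and allocated memory is far cheaper than full profiling, which is why signals like this are practical to sample continuously across a 10,000-GPU cluster.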
