State of the Art: Training >70B LLMs on 10,000 H100 clusters

Latent Space: The AI Engineer Podcast

Building High-Performance AI Clusters

This chapter explores the complexities of building large-scale machine learning infrastructure, including getting high-performance machines to communicate and function reliably together. It highlights challenges in installation and system maintenance, and the collaboration with hardware manufacturers needed to keep AI training environments reliable and efficient.

Play episode from 14:12