State of the Art: Training >70B LLMs on 10,000 H100 clusters

Latent Space: The AI Engineer Podcast

Building High-Performance AI Clusters

This chapter explores the complexities of building large-scale machine learning infrastructure, including getting high-performance machines to communicate and function reliably together. It highlights challenges in installation and system maintenance, and the collaboration with hardware manufacturers needed to keep AI training environments reliable and efficient.
