Latent Space: The AI Engineer Podcast

State of the Art: Training >70B LLMs on 10,000 H100 clusters

Jun 25, 2024
Jonathan Frankle, Chief AI Scientist at Databricks, and Josh Albrecht, CTO of Imbue, discuss recent advances in training large language models. They cover Imbue 70B, a model described as outperforming GPT-4o while using significantly less data, along with the complexities of scaling GPU clusters and the importance of high-performance infrastructure. They also address evaluating language models, introduce tools for hyperparameter tuning, and explore the future of AI in coding and reasoning tasks.
ANECDOTE

Small Team, Big Impact

  • Imbue's infrastructure team accomplished a great deal despite its small size (3-6 people).
  • Their direct communication with Dell and NVIDIA allowed for faster bug fixes and firmware updates.
ADVICE

Direct Communication with Vendors

  • Prioritize direct communication with hardware vendors like Dell and NVIDIA for faster troubleshooting.
  • This is more effective than relying on cloud providers when dealing with complex infrastructure.
ANECDOTE

Restarting Makes It Worse

  • Imbue hit a frustrating issue: restarting machines made performance worse, not better.
  • Their solution involved a health check examining boot logs for anomalies, ensuring clean restarts.
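
A boot-log health check of the kind described above might look roughly like the following Python sketch. The episode summary does not say which anomalies Imbue looked for or how the logs were collected, so the patterns and the function names `find_boot_anomalies` and `is_boot_clean` are illustrative assumptions, not Imbue's actual tooling.

```python
import re

# Hypothetical anomaly patterns; the source does not list the real ones.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\berror\b", re.IGNORECASE),
    re.compile(r"\bfail(ed|ure)?\b", re.IGNORECASE),
    re.compile(r"\btimed? ?out\b", re.IGNORECASE),
]


def find_boot_anomalies(log_lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    anomalies = []
    for number, line in enumerate(log_lines, start=1):
        if any(pattern.search(line) for pattern in SUSPICIOUS_PATTERNS):
            anomalies.append((number, line.rstrip()))
    return anomalies


def is_boot_clean(log_lines):
    """A machine passes the check only if no suspicious line was found."""
    return not find_boot_anomalies(log_lines)
```

In a setup like this, a machine would rejoin the cluster after a restart only when `is_boot_clean` returns `True` for its boot log; otherwise the flagged lines give an operator a starting point for diagnosis.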