Practical AI

Multi-GPU training is hard (without PyTorch Lightning)

Jun 15, 2021
William Falcon, the creator of PyTorch Lightning and CEO of Grid AI, advocates for efficient AI development. He discusses how PyTorch Lightning simplifies multi-GPU training and lets the same model code scale across hardware without changes. Falcon highlights the platform's benefits for collaboration and scalability, and how it automates resource allocation in corporate settings. He also addresses the financial side, comparing the costs of cloud versus on-premise training, and explains how Grid AI lets practitioners train over 100 machine learning models directly from their laptops.
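
To make the "no code changes" claim concrete, here is a minimal, hedged PyTorch Lightning sketch (the model, data, and hyperparameters are invented for illustration): the LightningModule holds the research code, while the Trainer owns the training loop and hardware placement. The later snips build on this same toy module.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyClassifier(pl.LightningModule):
    """Toy model: all research logic lives here, with no .cuda()/.to(device) calls."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Random toy data just to make the sketch runnable end to end.
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32)

# The Trainer owns the loop, logging, and device placement.
trainer = pl.Trainer(max_epochs=1, accelerator="auto")
trainer.fit(TinyClassifier(), loader)
```
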
AI Snips
ANECDOTE

Origin of PyTorch Lightning

  • William Falcon's journey with PyTorch Lightning started in neuroscience research, where code duplication was a major hurdle.
  • This led him to abstract away the training code, inspired by scikit-learn's fit method, to improve research agility (see the small comparison below).
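
The scikit-learn parallel is the single fit entry point that hides the optimization loop; a small, purely illustrative comparison using the toy module sketched earlier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# scikit-learn: one call, and the estimator hides the fitting loop.
X, y = np.random.randn(100, 4), np.random.randint(0, 2, 100)
LogisticRegression().fit(X, y)

# PyTorch Lightning mirrors that shape: one call, and the Trainer hides the training loop.
# (TinyClassifier and loader are the toy objects from the sketch above.)
# pl.Trainer(max_epochs=1).fit(TinyClassifier(), loader)
```
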
INSIGHT

Decoupling Model and Hardware

  • Decoupling model code from hardware considerations is crucial for code sharing and interoperability.
  • This separation allows different teams with varying hardware constraints to use the same model code, as sketched below.
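
As a rough illustration of the decoupling (the team setups and settings are invented), the identical TinyClassifier from the earlier sketch pairs with whatever Trainer configuration a team's hardware allows; only the Trainer arguments change, never the model code.

```python
import pytorch_lightning as pl

# A laptop-bound team trains on CPU (or a single local GPU):
laptop_trainer = pl.Trainer(max_epochs=3, accelerator="cpu")

# A cluster team spreads the exact same model across a multi-GPU node:
cluster_trainer = pl.Trainer(max_epochs=3, accelerator="gpu", devices=8, strategy="ddp")

# Both call the same entry point with the same, unmodified model code:
# laptop_trainer.fit(TinyClassifier(), loader)
# cluster_trainer.fit(TinyClassifier(), loader)
```
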
ANECDOTE

Scaling with Lightning

  • PyTorch Lightning has been used by thousands of companies and labs in diverse fields.
  • A notable example is training a 45-billion parameter GPT model on just eight A100 GPUs through its DeepSpeed integration, sketched below.
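
The episode does not give the exact configuration, but PyTorch Lightning exposes DeepSpeed as a Trainer strategy, so a run like the one described might look roughly like the sketch below. The GPT-style LightningModule `MyGPT` is hypothetical and assumed to be defined elsewhere; the settings are illustrative only.

```python
import pytorch_lightning as pl

# Hedged sketch of a large-model run; the exact settings from the episode are not known.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,                     # e.g. the eight A100s mentioned in the snip
    strategy="deepspeed_stage_3",  # ZeRO stage 3: shard params, grads, and optimizer state
    precision=16,                  # mixed precision is typical at this scale
)
# trainer.fit(MyGPT(), train_loader)  # MyGPT is a hypothetical GPT-style LightningModule
```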