

Multi-GPU training is hard (without PyTorch Lightning)
Jun 15, 2021
William Falcon, the creator of PyTorch Lightning and CEO of Grid AI, advocates for efficient AI development. He discusses how PyTorch Lightning simplifies multi-GPU training and lets the same model code run across different hardware without changes. Falcon highlights the platform's benefits for collaboration and scalability, along with its automation of resource allocation in corporate settings. He also addresses the financial side, comparing cloud versus on-premise training costs, and explains how Grid AI lets teams train over 100 machine learning models directly from their laptops.
Origin of PyTorch Lightning
- William Falcon's journey with PyTorch Lightning started in neuroscience research, where code duplication was a major hurdle.
- This led him to abstract away the boilerplate training loop, inspired by scikit-learn's fit method, to improve research agility.
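As a rough sketch of that fit-style abstraction (not code from the episode), the research logic lives in a LightningModule while the Trainer owns the loop. LitClassifier, the layer sizes, and the random dataset below are hypothetical placeholders chosen only to make the example self-contained.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

# Hypothetical toy model: only the research logic is written here;
# no device handling, no manual loop over epochs and batches.
class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.net(x.view(x.size(0), -1)), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Fake data so the sketch runs end to end.
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,))),
    batch_size=32,
)

# The Trainer supplies the training loop, analogous to scikit-learn's estimator.fit().
trainer = pl.Trainer(max_epochs=3)
trainer.fit(LitClassifier(), train_loader)
```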
Decoupling Model and Hardware
- Decoupling model code from hardware considerations is crucial for code sharing and interoperability.
- This separation allows different teams with varying hardware constraints to use the same model code.
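A minimal sketch of what that separation looks like in practice, reusing the hypothetical LitClassifier and train_loader from the previous sketch: hardware choices live entirely in Trainer flags (the flag names follow recent PyTorch Lightning releases; older versions used gpus= instead of devices=), so the model code never mentions devices.

```python
import pytorch_lightning as pl

model = LitClassifier()  # same module as above, completely unchanged

# A researcher debugging on a laptop CPU.
cpu_trainer = pl.Trainer(accelerator="cpu", max_epochs=1)

# A team running distributed data parallel on an 8-GPU server:
# only the Trainer flags differ, the model code is shared as-is.
gpu_trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp", max_epochs=1)

cpu_trainer.fit(model, train_loader)
```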
Scaling with Lightning
- PyTorch Lightning has been used by thousands of companies and labs in diverse fields.
- A notable example is training a 45-billion parameter GPT model on just eight A100 GPUs with DeepSpeed integration.
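For runs at that scale, Lightning exposes DeepSpeed through a Trainer strategy. The sketch below shows how such a job might be configured, not the configuration used in the episode: the strategy alias and precision flag vary by Lightning version, and big_gpt_model / train_loader are assumed placeholders.

```python
import pytorch_lightning as pl

# DeepSpeed ZeRO stage 3 with CPU offload: parameters, gradients, and optimizer
# states are sharded across the GPUs (and offloaded to CPU memory), which is what
# lets a multi-billion-parameter model fit on a single 8 x A100 node.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="deepspeed_stage_3_offload",  # built-in alias in recent Lightning releases
    precision=16,                          # mixed precision to reduce memory per parameter
)
trainer.fit(big_gpt_model, train_loader)   # big_gpt_model: a LightningModule wrapping the GPT
```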