The Art and Science of Training LLMs // Bandish Shah and Davis Blalock // #219
Mar 22, 2024
Exploring the challenges of training large language models, including debugging issues and evaluating machine learning models effectively. The discussion covers the importance of data quality, efficient computation techniques, and optimizing machine learning model training and deployment for successful outcomes.
Training large language models faces challenges of hardware failures, software instability, and complex debugging processes.
Ensuring model quality demands diverse evaluation metrics and meticulous debugging to address unexpected challenges.
AI development requires adaptive strategies for problem definition, data quality, and consistent progress tracking to navigate evolving environments.
Deep dives
The Challenges of Training Large Language Models
Training large language models involves numerous challenges, including hardware failures, software instability due to frequent breaking changes in libraries like PyTorch, and complex debugging processes. With thousands of GPUs crunching numbers, hardware failures can occur at any time, leading to issues like NCCL timeouts. Additionally, software stacks are not yet mature, so breaking changes cause disruptions and constant maintenance is needed to keep up with evolving environments.
Ensuring Model Quality and Performance Evaluation
Ensuring the quality and performance of trained models is critical, yet challenging due to the lack of consistent evaluation methods and specific use case measurements. Customer use cases often require unique evaluations, making it essential to have diverse evaluation metrics beyond standard leaderboard results. Despite aiming for consistency, factors like breaking changes, version conflicts, and unanticipated hardware or software challenges can impact the final model performance, necessitating meticulous debugging and root cause analysis.
Addressing the Complexity and Uncertainty in AI Development
The continuous evolution and uncertainty in AI development demand a comprehensive approach to problem definition, data quality, and result evaluation. The dynamic nature of the field requires adaptive strategies for scaling laws, predictive analysis, and consistent progress tracking. Amidst the intricacies of training models at scale, the focus extends beyond GPU capacity to encompass robust infrastructure, software stability, and rigorous debugging processes to ensure successful AI development initiatives.
Value of Reliable Configuration in Model Training
Creating a reliable configuration for training models can significantly reduce risks and costs for organizations. A complete configuration that includes images, hyperparameters, and more lets users avoid the complexities of setting up training runs. This approach acts as a form of insurance for training runs, ensuring that models reliably work without extensive troubleshooting, which is especially crucial for startups that invest significant resources in model training.
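As a rough illustration of what a "complete configuration" could look like, here is a minimal Python sketch. It is not MosaicML's or Databricks' actual format: the `TrainConfig` class, its field names, the image tag, and every value are hypothetical. The idea it shows is simply to pin everything a run depends on in one immutable object and fingerprint it, so two runs can be checked for identical settings.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    # All field names and values below are hypothetical placeholders.
    image: str              # container image, pinned to an exact tag
    model_name: str
    learning_rate: float
    global_batch_size: int
    seed: int


def config_fingerprint(cfg: TrainConfig) -> str:
    """Return a short, stable hash of the config for comparing runs."""
    payload = json.dumps(asdict(cfg), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


baseline = TrainConfig(
    image="example.com/train:1.0.0",
    model_name="demo-1b",
    learning_rate=3e-4,
    global_batch_size=512,
    seed=17,
)

# Changing any single field yields a different fingerprint.
changed = replace(baseline, seed=18)
print(config_fingerprint(baseline) != config_fingerprint(changed))
```

Logging the fingerprint next to each run's results makes it easier to tell whether a regression came from a configuration change or from something else in the stack.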
Scaling Data Operations and Ensuring Data Quality
Dealing with data quality at scale requires a multi-faceted approach that involves thorough examination of the data. From tokenization pitfalls to addressing issues with data loading and resumption, ensuring data quality demands meticulous attention. Automated processes face challenges due to the unique nature of each dataset, making manual examination essential. Despite automation attempts, true data quality often surfaces at larger scales, requiring substantial compute resources and experimentation to achieve reliable insights.
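To make the data loading and resumption issue concrete, here is a minimal Python sketch under stated assumptions. The function names `epoch_order` and `resume_iter` are hypothetical, and real streaming data loaders are far more involved. It shows one common idea: derive the shuffle order deterministically from a seed and the epoch number, so a restarted job can skip exactly the samples it already consumed instead of repeating or dropping data.

```python
import random
from typing import Iterator, List


def epoch_order(num_samples: int, seed: int, epoch: int) -> List[int]:
    """Deterministic shuffle: the same seed and epoch give the same order."""
    order = list(range(num_samples))
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def resume_iter(
    num_samples: int, seed: int, epoch: int, samples_seen: int
) -> Iterator[int]:
    """Yield sample indices for this epoch, skipping those already consumed."""
    order = epoch_order(num_samples, seed, epoch)
    yield from order[samples_seen:]


# An uninterrupted epoch versus one that stopped after 4 samples and resumed.
full = list(resume_iter(10, seed=0, epoch=0, samples_seen=0))
resumed = full[:4] + list(resume_iter(10, seed=0, epoch=0, samples_seen=4))
print(full == resumed)
```

The only state the checkpoint has to store in this sketch is the seed, the epoch, and the number of samples seen.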
Huge thank you to Databricks AI for sponsoring this episode. Databricks - http://databricks.com/
Bandish Shah is an Engineering Manager at MosaicML/Databricks, where he focuses on making generative AI training and inference efficient, fast, and accessible by bridging the gap between deep learning, large-scale distributed systems, and performance computing.
Davis Blalock is a Research Scientist and the first employee of MosaicML, a GenAI startup acquired for $1.3 billion by Databricks.
MLOps podcast #219 with Databricks' Engineering Manager Bandish Shah and Research Scientist Davis Blalock: The Art and Science of Training Large Language Models.
// Abstract
What's hard about language models at scale? Turns out...everything. MosaicML's Davis and Bandish share war stories and lessons learned from pushing the limits of LLM training and helping dozens of customers get LLMs into production. They cover what can go wrong at every level of the stack, how to make sure you're building the right solution, and some contrarian takes on the future of efficient models.
// Bio
Bandish Shah
Bandish Shah is an Engineering Manager at MosaicML/Databricks, where he focuses on making generative AI training and inference efficient, fast, and accessible by bridging the gap between deep learning, large-scale distributed systems, and performance computing. Bandish has over a decade of experience building systems for machine learning and enterprise applications. Prior to MosaicML, Bandish held engineering and development roles at SambaNova Systems where he helped develop and ship the first RDU systems from the ground up, and Oracle where he worked as an ASIC engineer for SPARC-based enterprise servers.
Davis Blalock
Davis Blalock is a research scientist at MosaicML. He completed his PhD at MIT, advised by Professor John Guttag. His primary work is designing high-performance machine learning algorithms. He received his M.S. from MIT and his B.S. from the University of Virginia. He is a Qualcomm Innovation Fellow, NSF Graduate Research Fellow, and Barry M. Goldwater Scholar.
// MLOps Jobs board
https://mlops.pallet.xyz/jobs
// MLOps Swag/Merch
https://mlops-community.myshopify.com/
// Related Links
AI Quality In-person Conference: https://www.aiqualityconference.com/
Website: http://databricks.com/
Davis Summarizes Papers Newsletter signup link
Davis' Newsletters:
Learning to recognize spoken words from five unlabeled examples in under two seconds: https://arxiv.org/abs/1609.09196
Training on data at 5GB/s in a single thread: https://arxiv.org/abs/1808.02515
Nearest-neighbor searching through billions of images per second in one thread with no indexing: https://arxiv.org/abs/1706.10283
Multiplying matrices 10-100x faster than a matrix multiply (with some approximation error): https://arxiv.org/abs/2106.10860
Hidden Technical Debt in Machine Learning Systems: https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
--------------- ✌️Connect With Us ✌️ -------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Davis on LinkedIn: https://www.linkedin.com/in/dblalock/
Connect with Bandish on LinkedIn: https://www.linkedin.com/in/bandish-shah/