Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more - #10
May 7, 2021
Todd Underwood, Sr Director of Engineering at Google, shares his extensive experience in Site Reliability Engineering for Machine Learning. He discusses how ML systems often fail due to issues unrelated to ML itself, the unique challenges of engineering reliable ML systems, and the crucial skills needed for hiring ML SREs. Todd also emphasizes the importance of empathy in tech during high-pressure scenarios and reflects on the balance between traditional software practices and the demands of ML pipelines, making the case for robust collaboration among teams.
Collaboration between ML developers and SRE teams is essential to effectively address challenges in maintaining reliable machine learning systems.
Feature engineering plays a critical role in model performance, requiring attention to detail to prevent future data-related issues in production.
Monitoring ML models post-deployment necessitates clear metrics and robust feedback loops between SREs and developers to maintain quality standards.
Deep dives
The Distinction Between ML and Distributed Computing
The discussion emphasizes that while machine learning (ML) itself often gets the spotlight, much of the work is fundamentally modern distributed computing: running software reliably on medium-sized collections of computers. Software engineers, systems engineers, and site reliability engineers (SREs) will find ample opportunity in keeping ML systems running smoothly. Todd encourages newcomers to data science to pursue their interest in model building, but notes that there will be significant demand for the foundational work of making ML systems function effectively, a scope of responsibility that extends well beyond model development.
The Importance of Collaboration and a Growth Mindset
A key takeaway from the episode is the need for collaboration between ML developers and SRE teams to address the challenges of maintaining ML systems. The discussion covers the balance between ownership and accountability: model developers are accountable for their models' performance, even as everyone acknowledges that failures can also arise from platform-related issues. Todd emphasizes a culture of empathy and support, arguing that fostering teamwork ultimately leads to better problem-solving within the complex ML landscape. This collaborative spirit is seen as essential for advancing both individual and organizational goals.
Feature Engineering and Its Impact on Model Performance
The conversation highlights the crucial role of feature engineering in model development, since it strongly influences a model's performance and reliability. Effective feature selection not only helps build accurate models but also prevents later problems with data compatibility and context changes. Todd notes the importance of consistency in how features are defined and used across systems, since inconsistencies can lead to faulty predictions; attention to detail during feature engineering mitigates downstream problems in production.
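The episode discusses this at the level of practice rather than code, but one common way to enforce that consistency is to define each feature transformation exactly once and reuse it in both the training pipeline and the serving path. A minimal Python sketch of that idea (the feature names and functions here are illustrative assumptions, not from the episode):

```python
# Minimal sketch: define each feature transformation exactly once and
# reuse it in both the training pipeline and the serving path, so the
# two cannot silently drift apart.

def normalize_age(raw_age: float) -> float:
    """Clamp and scale age identically everywhere it is used."""
    clamped = min(max(raw_age, 0.0), 120.0)
    return clamped / 120.0

def build_features(record: dict) -> list[float]:
    """Single source of truth for the feature vector layout."""
    return [
        normalize_age(record.get("age", 0.0)),
        1.0 if record.get("country") == "US" else 0.0,
    ]

def training_examples(records: list[dict]) -> list[list[float]]:
    """Training pipeline builds features through the shared function."""
    return [build_features(r) for r in records]

def serve_one(model, record: dict) -> float:
    """Serving path builds features through the same shared function.
    `model` is any object exposing a predict() method."""
    return model.predict(build_features(record))
```

Because both paths call the same build_features function, a change in how a feature is computed cannot reach training without also reaching serving.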
Monitoring and Improving Model Quality
Monitoring ML model performance post-deployment is crucial, yet it presents challenges that differ from monitoring traditional software services. The podcast discusses the need to establish clear metrics and benchmarks for model quality, and the significance of a feedback loop between SREs and model developers: ongoing dialogue ensures both sides understand performance expectations and can promptly address quality issues. Todd notes that while current tools provide some monitoring capabilities, further advances are needed to better support model evaluation and improvement.
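The episode stays at the level of principles rather than specific tooling; as a rough illustration of what a metric-plus-benchmark check might look like, here is a minimal Python sketch (the window size, benchmark value, and alert hook are assumptions for illustration, not from the episode):

```python
# Rough sketch: track a rolling quality metric for served predictions and
# flag regressions against an agreed benchmark. The window size, the 0.92
# benchmark, and the alert hook are illustrative assumptions.

from collections import deque

WINDOW = 1000        # number of recent labeled predictions to keep
BENCHMARK = 0.92     # quality bar agreed between SREs and model developers

recent = deque(maxlen=WINDOW)

def record_outcome(predicted_label, true_label) -> None:
    """Record whether a served prediction matched the eventual ground truth."""
    recent.append(predicted_label == true_label)

def check_quality(alert) -> float:
    """Compute rolling accuracy and call alert() if it falls below the benchmark."""
    if not recent:
        return 1.0
    accuracy = sum(recent) / len(recent)
    if accuracy < BENCHMARK:
        alert(f"model accuracy {accuracy:.3f} is below benchmark {BENCHMARK}")
    return accuracy
```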
The Future of ML Systems and Platform Development
Looking ahead, the discussion signals a need for the development of more robust platforms that facilitate easier experimentation and deployment of ML models. The speaker points out that these platforms should support a wide range of use cases while still allowing for custom solutions where necessary. By improving infrastructure, engineers and developers can innovate more freely, leading to a more vibrant ML ecosystem. This evolution is vital for accommodating the fast-paced advancements in ML technology and ensuring that engineers can focus on solving impactful problems.
Todd is a Sr Director of Engineering at Google, where he leads Site Reliability Engineering teams for Machine Learning. Having recently presented on how ML breaks in production, drawing on more than a decade of outage postmortems at Google, Todd joins the show to chat about why many of the ways ML systems break in production have nothing to do with ML, what's different about engineering reliable systems for ML vs. traditional software (and the many ways they are similar), what he looks for when hiring ML SREs, and more.