Software Misadventures

Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more - #10

May 7, 2021
Todd Underwood, Sr Director of Engineering at Google, shares his extensive experience in Site Reliability Engineering for Machine Learning. He discusses how ML systems often fail due to issues unrelated to ML itself, the unique challenges of engineering reliable ML systems, and the crucial skills needed for hiring ML SREs. Todd also emphasizes the importance of empathy in tech during high-pressure scenarios and reflects on the balance between traditional software practices and the demands of ML pipelines, making the case for robust collaboration among teams.
01:07:34

Podcast summary created with Snipd AI

Quick takeaways

  • Collaboration between ML developers and SRE teams is essential to effectively address challenges in maintaining reliable machine learning systems.
  • Feature engineering plays a critical role in model performance, requiring attention to detail to prevent future data-related issues in production.

Deep dives

The Distinction Between ML and Distributed Computing

The discussion emphasizes that although machine learning (ML) gets most of the attention, much of the work is fundamentally modern distributed computing: managing software effectively on medium-sized collections of computers. Software engineers, systems engineers, and site reliability engineers (SREs) will therefore find ample opportunity in keeping ML systems running smoothly. The speaker encourages newcomers to data science to pursue their interest in model building, but notes that there will be significant demand for the foundational work of making ML systems function reliably. Responsibilities within the ML ecosystem thus extend well beyond model development.
