Software Misadventures

Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more - #10

May 7, 2021
Todd Underwood, Sr Director of Engineering at Google, shares his extensive experience in Site Reliability Engineering for Machine Learning. He discusses how ML systems often fail due to issues unrelated to ML itself, the unique challenges of engineering reliable ML systems, and the crucial skills needed for hiring ML SREs. Todd also emphasizes the importance of empathy in tech during high-pressure scenarios and reflects on the balance between traditional software practices and the demands of ML pipelines, making the case for robust collaboration among teams.
ANECDOTE

LinkedIn Skills Protest

  • Todd Underwood humorously protests LinkedIn skills endorsements by listing nonsensical skills like "nuclear proliferation" and "brunch".
  • He even received a job offer due to his entertaining profile, highlighting the absurdity of the system.
INSIGHT

The Role of a Site Lead

  • Tech companies often overlook the human element, assuming employees relocate easily.
  • Site leads address this by focusing on employee well-being and local needs.
ANECDOTE

Becoming a Site Lead

  • Todd Underwood became a site lead despite being advised against it.
  • He was persuaded when told the role required genuine care for employee well-being.