Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more - #10
May 7, 2021
Todd Underwood, Sr Director of Engineering at Google, shares his extensive experience in Site Reliability Engineering for Machine Learning. He discusses how ML systems often fail due to issues unrelated to ML itself, the unique challenges of engineering reliable ML systems, and the crucial skills needed for hiring ML SREs. Todd also emphasizes the importance of empathy in tech during high-pressure scenarios and reflects on the balance between traditional software practices and the demands of ML pipelines, making the case for robust collaboration among teams.
Collaboration between ML developers and SRE teams is essential to effectively address challenges in maintaining reliable machine learning systems.
Feature engineering plays a critical role in model performance, requiring attention to detail to prevent future data-related issues in production.
Monitoring ML models post-deployment necessitates clear metrics and robust feedback loops between SREs and developers to maintain quality standards.
Deep dives
The Distinction Between ML and Distributed Computing
The discussion emphasizes that while machine learning (ML) itself often gets the spotlight, much of the work is fundamentally modern distributed computing: running software reliably on medium-sized collections of computers. Software engineers, systems engineers, and site reliability engineers (SREs) will find ample opportunity in keeping ML systems running smoothly. Todd encourages newcomers to data science to pursue their interest in model building, but notes that there will be significant demand for the foundational work of making ML systems function effectively, a scope of responsibility that extends well beyond model development.
The Importance of Collaboration and a Growth Mindset
A key takeaway from the episode is the need for collaboration between ML developers and SRE teams to address the challenges of maintaining ML systems. The discussion covers the balance between ownership and accountability: model developers are accountable for their models' performance, even as everyone acknowledges that failures can also arise from platform-related issues. Todd emphasizes a culture of empathy and support, arguing that fostering teamwork ultimately leads to better problem-solving within the complex ML landscape. This collaborative spirit is seen as essential for advancing both individual and organizational goals.
Feature Engineering and Its Impact on Model Performance
The conversation highlights the crucial role of feature engineering in model development, since it strongly influences a model's performance and reliability. Effective feature selection not only helps build accurate models but also prevents later problems with data compatibility and context changes. Todd notes the importance of consistency in how features are defined and used across systems, since inconsistencies can lead to faulty predictions; attention to detail during feature engineering mitigates downstream problems in production.
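The episode discusses this at the level of practice rather than code, but one common way to enforce that consistency is to define each feature transformation exactly once and reuse it in both the training pipeline and the serving path. A minimal Python sketch of that idea (the feature names and functions here are illustrative assumptions, not from the episode):

```python
# Minimal sketch: define each feature transformation exactly once and
# reuse it in both the training pipeline and the serving path, so the
# two cannot silently drift apart.

def normalize_age(raw_age: float) -> float:
    """Clamp and scale age identically everywhere it is used."""
    clamped = min(max(raw_age, 0.0), 120.0)
    return clamped / 120.0

def build_features(record: dict) -> list[float]:
    """Single source of truth for the feature vector layout."""
    return [
        normalize_age(record.get("age", 0.0)),
        1.0 if record.get("country") == "US" else 0.0,
    ]

def training_examples(records: list[dict]) -> list[list[float]]:
    """Training pipeline builds features through the shared function."""
    return [build_features(r) for r in records]

def serve_one(model, record: dict) -> float:
    """Serving path builds features through the same shared function.
    `model` is any object exposing a predict() method."""
    return model.predict(build_features(record))
```

Because both paths call the same build_features function, a change in how a feature is computed cannot reach training without also reaching serving.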
Monitoring and Improving Model Quality
Monitoring ML model performance post-deployment is crucial, yet it presents challenges that differ from monitoring traditional software services. The podcast discusses the need to establish clear metrics and benchmarks for model quality, and the significance of a feedback loop between SREs and model developers: ongoing dialogue ensures both sides understand performance expectations and can promptly address quality issues. Todd notes that while current tools provide some monitoring capabilities, further advances are needed to better support model evaluation and improvement.
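The episode stays at the level of principles rather than specific tooling; as a rough illustration of what a metric-plus-benchmark check might look like, here is a minimal Python sketch (the window size, benchmark value, and alert hook are assumptions for illustration, not from the episode):

```python
# Rough sketch: track a rolling quality metric for served predictions and
# flag regressions against an agreed benchmark. The window size, the 0.92
# benchmark, and the alert hook are illustrative assumptions.

from collections import deque

WINDOW = 1000        # number of recent labeled predictions to keep
BENCHMARK = 0.92     # quality bar agreed between SREs and model developers

recent = deque(maxlen=WINDOW)

def record_outcome(predicted_label, true_label) -> None:
    """Record whether a served prediction matched the eventual ground truth."""
    recent.append(predicted_label == true_label)

def check_quality(alert) -> float:
    """Compute rolling accuracy and call alert() if it falls below the benchmark."""
    if not recent:
        return 1.0
    accuracy = sum(recent) / len(recent)
    if accuracy < BENCHMARK:
        alert(f"model accuracy {accuracy:.3f} is below benchmark {BENCHMARK}")
    return accuracy
```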
The Future of ML Systems and Platform Development
Looking ahead, the discussion signals a need for the development of more robust platforms that facilitate easier experimentation and deployment of ML models. The speaker points out that these platforms should support a wide range of use cases while still allowing for custom solutions where necessary. By improving infrastructure, engineers and developers can innovate more freely, leading to a more vibrant ML ecosystem. This evolution is vital for accommodating the fast-paced advancements in ML technology and ensuring that engineers can focus on solving impactful problems.
Todd is a Sr Director of Engineering at Google, where he leads Site Reliability Engineering teams for Machine Learning. Having recently presented on how ML breaks in production, drawing on more than a decade of outage postmortems at Google, Todd joins the show to chat about why many of the ways ML systems break in production have nothing to do with ML, what's different about engineering reliable systems for ML vs. traditional software (and the many ways they are similar), what he looks for when hiring ML SREs, and more.