
Google SRE Prodcast

Latest episodes

Dec 4, 2024 • 41min

Human Factors in Complex Systems with Casey Rosenthal and John Allspaw

Casey Rosenthal, Founder of Cirrusly.ai, and John Allspaw, Principal of Adaptive Capacity Labs, delve into the complexities of resilience in software engineering. They emphasize the crucial human factors that influence system reliability and adaptability during failures. The discussion reveals the pitfalls of traditional incident metrics, advocating for an understanding of qualitative impacts on users. Additionally, they tackle the cultural challenges organizations face in incident management, highlighting the need for transparency and better communication.
Nov 20, 2024 • 34min

Embracing Complexity with Christina Schulman & Dr. Laura Maguire

Joining the conversation are Christina Schulman, Staff SRE at Google, who focuses on reliability in Google Cloud, and Dr. Laura Maguire, Principal Engineer at Trace Cognitive Engineering, an expert in cognitive systems. They delve into the human side of site reliability engineering, discussing how collaboration and diverse perspectives enhance incident response. Insights include the importance of transparency in learning from failures, managing dependency cycles in complex systems, and the need to embrace complexity to foster resilience in tech environments.
Nov 13, 2024 • 33min

Maglev: load balancing at Google with Cody Smith and Trisha Weir

Cody Smith, CTO and co-founder of Camus Energy, spent over 14 years at Google and contributed to Maglev. Trisha Weir, with 21 years at Google, is an SRE Department Lead. They trace the evolution of Maglev, a network load balancer essential for traffic management in data centers. Their discussion highlights the significance of psychological safety and collaboration in tech innovation. They also delve into challenges faced during system rollouts, debugging practices, and the shift from manual to automated network provisioning, combining technical and teamwork insights.
Oct 30, 2024 • 42min

Profiling data with Pat Somaru and Narayan Desai

Narayan Desai, a Principal SRE at Google, and Pat Somaru, a Senior Production Engineer at Meta, delve into the complexities of observability in site reliability engineering. They discuss the challenges of noise reduction and the importance of actionable insights from high-cardinality data. The pair critiques the reliance on superficial metrics, emphasizing the need for deeper analysis to accurately reflect business outcomes. They also explore data profiling's role in enhancing system performance and optimizing resource management for greater efficiency.
Oct 23, 2024 • 32min

Google Public DNS (8.8.8.8) with Wilmer van der Gaast and Andy Sykes

This episode features Google engineers Wilmer van der Gaast (Production on-tall) and Andy Sykes (Senior Staff Systems Engineer, SRE), who join hosts Steve McGhee and Jordan Greenberg to discuss the development and maintenance of Google Public DNS (8.8.8.8). They highlight the initial motivations for creating the service, technical challenges like cache poisoning and load balancing, and the collaborative effort between SRE and SWE teams to address these issues. They also reflect on the evolving nature of SRE and offer advice for aspiring SREs.
Oct 16, 2024 • 34min

SRE in the Retail and Gaming Worlds with Jordan Chernev & Scott Bowers

Guests Jordan Chernev (Senior Technology Executive) and Scott Bowers (SRE, Gearbox Software), who hail from the retail and gaming industries, respectively, join hosts Steve McGhee and Jordan Greenberg to discuss the unique challenges of Site Reliability Engineering in their industries. They discuss the importance of aligning SLOs with user experience, strategies for handling spikes in traffic, communicating with users during outages, and investing in reliability.
Oct 9, 2024 • 44min

Incident Response with Sarah Butt and Vrai Stacey

Sarah Butt (Principal Engineer, Centralized Incident Response, Salesforce) and Vrai Stacey (Staff Software Engineer, Google) join hosts Steve McGhee and Jordan Greenberg to dive into incident response—particularly tooling and software for reliability incidents. Tune in for an in-depth discussion on topics such as the importance of communication and collaboration during incidents, and the role of tooling in supporting incident response processes. Sarah and Vrai also share personal takeaways from incidents they have experienced.
Oct 2, 2024 • 42min

Building Reliable Systems with Silvia Botros and Niall Murphy

Silvia Botros (SRE Architect, Twilio | Author of "High Performance MySQL, 4th edition") and Niall Murphy (Co-founder & CEO, Stanza) join hosts Steve McGhee and Jordan Greenberg to discuss cultural shifts in database engineering, rate limiting, load shedding, holistic approaches to reliability, proactive measures to build customer trust, and much more!
Sep 25, 2024 • 29min

Creating Systems that are Safe with Liz Fong-Jones

Liz Fong-Jones, a former Google SRE and current Field CTO at honeycomb.io, dives into the fascinating world of observability. She shares insights on how observability has evolved from traditional monitoring, likening it to medical diagnostics. Liz emphasizes its critical role in enhancing user satisfaction through Service Level Objectives (SLOs) and discusses the balance between human insight and machine learning in system analysis. Additionally, she highlights the transformation of Site Reliability Engineering, advocating for collaboration and hands-on experience in modern software development.
Sep 18, 2024 • 31min

Production Problems Are For All! with Ben Treynor Sloss

Ben Treynor Sloss (VP of Engineering, Google) joins hosts Steve McGhee and Dr. Jennifer Petoff (Director of Technical Infrastructure Education, Google) to share the evolution of SRE and its impact on software development, how AI and ML significantly impact SRE practices, and the future of SRE. Ben coined the term "Site Reliability Engineering" for his team of (now) 4,000 software engineers, engaged in what were traditionally operations functions. Under Ben's leadership, Google SRE wrote two best-selling books on SRE. Since then, the rest of the SaaS industry has come to adopt the SRE name, mission, and practices.
