
Google SRE Prodcast
SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!
Latest episodes

Dec 4, 2024 • 41min
Human Factors in Complex Systems with Casey Rosenthal and John Allspaw
Casey Rosenthal, Founder of Cirrusly.ai, and John Allspaw, Principal of Adaptive Capacity Labs, delve into the complexities of resilience in software engineering. They emphasize the crucial human factors that influence system reliability and adaptability during failures. The discussion reveals the pitfalls of traditional incident metrics, advocating for an understanding of qualitative impacts on users. Additionally, they tackle the cultural challenges organizations face in incident management, highlighting the need for transparency and better communication.

Nov 20, 2024 • 34min
Embracing Complexity with Christina Schulman & Dr. Laura Maguire
Joining the conversation are Christina Schulman, Staff SRE at Google, who focuses on reliability in Google Cloud, and Dr. Laura Maguire, Principal Engineer at Trace Cognitive Engineering, an expert in cognitive systems. They delve into the human side of site reliability engineering, discussing how collaboration and diverse perspectives enhance incident response. Insights include the importance of transparency in learning from failures, managing dependency cycles in complex systems, and the need to embrace complexity to foster resilience in tech environments.

Nov 13, 2024 • 33min
Maglev: load balancing at Google with Cody Smith and Trisha Weir
Cody Smith, CTO and co-founder of Camus Energy, spent over 14 years at Google and contributed to Maglev. Trisha Weir, with 21 years at Google, is an SRE Department Lead. They uncover the evolution of Maglev, a network load balancer essential for traffic management in data centers. Their discussion highlights the significance of psychological safety and collaboration in tech innovation. They also delve into challenges faced during system rollouts, debugging practices, and the shift from manual to automated network provisioning, showcasing a unique blend of technical and teamwork insights.

Oct 30, 2024 • 42min
Profiling data with Pat Somaru and Narayan Desai
Narayan Desai, a Principal SRE at Google, and Pat Somaru, a Senior Production Engineer at Meta, delve into the complexities of observability in site reliability engineering. They discuss the challenges of noise reduction and the importance of actionable insights from high-cardinality data. The pair critiques the reliance on superficial metrics, emphasizing the need for deeper analysis to accurately reflect business outcomes. They also explore data profiling's role in enhancing system performance and optimizing resource management for greater efficiency.

Oct 23, 2024 • 32min
Google Public DNS (8.8.8.8) with Wilmer van der Gaast and Andy Sykes
This episode features Google engineers Wilmer van der Gaast (Production on-call) and Andy Sykes (Senior Staff Systems Engineer, SRE), joining hosts Steve McGhee and Jordan Greenberg to discuss the development and maintenance of Google Public DNS (8.8.8.8). They highlight the initial motivations for creating the service, technical challenges like cache poisoning and load balancing, as well as the collaborative effort between SRE and SWE teams to address these issues. They also reflect on the evolving nature of SRE and offer advice for aspiring SREs.

Oct 16, 2024 • 34min
SRE in the Retail and Gaming Worlds with Jordan Chernev & Scott Bowers
Guests Jordan Chernev (Senior Technology Executive) and Scott Bowers (SRE, Gearbox Software), who hail from the retail and gaming industries, respectively, join hosts Steve McGhee and Jordan Greenberg to discuss the unique challenges of Site Reliability Engineering in their industries. They share the importance of aligning SLOs with user experience, strategies for handling spikes in traffic, communicating with users during outages, and investing in reliability.

Oct 9, 2024 • 44min
Incident Response with Sarah Butt and Vrai Stacey
Sarah Butt (Principal Engineer, Centralized Incident Response, Salesforce) and Vrai Stacey (Staff Software Engineer, Google) join hosts Steve McGhee and Jordan Greenberg to dive into incident response—particularly tooling and software for reliability incidents. Tune in for an in-depth discussion on topics such as the importance of communication and collaboration during incidents, and the role of tooling in supporting incident response processes. Sarah and Vrai also share personal takeaways from incidents they have experienced.

Oct 2, 2024 • 42min
Building Reliable Systems with Silvia Botros and Niall Murphy
Silvia Botros (SRE Architect, Twilio | Author of "High Performance MySQL, 4th edition") and Niall Murphy (Co-founder & CEO, Stanza) join hosts Steve McGhee and Jordan Greenberg to discuss cultural shifts in database engineering, rate limiting, load shedding, holistic approaches to reliability, proactive measures to build customer trust, and much more!

Sep 25, 2024 • 29min
Creating Systems that are Safe with Liz Fong-Jones
Liz Fong-Jones, a former Google SRE and current Field CTO at honeycomb.io, dives into the fascinating world of observability. She shares insights on how observability has evolved from traditional monitoring, likening it to medical diagnostics. Liz emphasizes its critical role in enhancing user satisfaction through Service Level Objectives (SLOs) and discusses the balance between human insight and machine learning in system analysis. Additionally, she highlights the transformation of Site Reliability Engineering, advocating for collaboration and hands-on experience in modern software development.

Sep 18, 2024 • 31min
Production Problems Are For All! with Ben Treynor Sloss
Ben Treynor Sloss (VP of Engineering, Google) joins hosts Steve McGhee and Dr. Jennifer Petoff (Director of Technical Infrastructure Education, Google) to share the evolution of SRE and its impact on software development, how AI and ML significantly impact SRE practices, and the future of SRE. Ben coined the term "Site Reliability Engineering" for his team of (now) 4,000 software engineers engaged in what were traditionally operations functions. Under Ben's leadership, Google SRE wrote two best-selling books on SRE. Since then, the rest of the SaaS industry has come to adopt the SRE name, mission, and practices.