
Google SRE Prodcast
SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!
Latest episodes

Apr 16, 2025 • 15min
We’re back with Season 4!
In this "bumpisode", hosts and producers of Prodcast (including our new co-host, Matt Siegler!) reflect on the previous season and introduce the new season's focus on upcoming trends in Site Reliability Engineering (SRE) and AI, and the friends we make along the way. They also introduce new elements we are bringing in with Season 4, such as a video format and a feedback form.

Jan 29, 2025 • 16min
Special Episode: You Missed a Page from Telebot
Join Javi Beltran, a talented Google engineer based in Zurich, as he reminisces about creating the playful Telebot theme song to ease the stress of on-call engineers. He delves into Telebot's evolution, enhancing communication for engineers with its unique paging system. Discover the emotional rollercoaster of the Telebot ringtone and the creative remix journey that brings a modern twist while preserving its charm. This collaboration highlights the fusion of tech culture and music, proving that innovation can also be fun!

Dec 11, 2024 • 36min
Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano
Dominic Hutton, Staff SRE at HashiCorp with a rich background in engineering, teams up with Niccolo' Cascarano, Senior Staff SRE at Google and a pro in continuous delivery systems. They dive into the intriguing world of configuration management, comparing imperative and declarative workflows. Listeners will learn how declarative methods simplify complexity while imperative approaches can cater to quick tasks. The importance of managing scripts, navigating synchronization pitfalls, and fostering collaboration between development and operations also takes center stage.

Dec 4, 2024 • 41min
Human Factors in Complex Systems with Casey Rosenthal and John Allspaw
Casey Rosenthal, Founder of Cirrusly.ai, and John Allspaw, Principal of Adaptive Capacity Labs, delve into the complexities of resilience in software engineering. They emphasize the crucial human factors that influence system reliability and adaptability during failures. The discussion reveals the pitfalls of traditional incident metrics, advocating for an understanding of qualitative impacts on users. Additionally, they tackle the cultural challenges organizations face in incident management, highlighting the need for transparency and better communication.

Nov 20, 2024 • 34min
Embracing Complexity with Christina Schulman & Dr. Laura Maguire
Joining the conversation are Christina Schulman, Staff SRE at Google, who focuses on reliability in Google Cloud, and Dr. Laura Maguire, Principal Engineer at Trace Cognitive Engineering, an expert in cognitive systems. They delve into the human side of site reliability engineering, discussing how collaboration and diverse perspectives enhance incident response. Insights include the importance of transparency in learning from failures, managing dependency cycles in complex systems, and the need to embrace complexity to foster resilience in tech environments.

Nov 13, 2024 • 33min
Maglev: load balancing at Google with Cody Smith and Trisha Weir
Cody Smith, CTO and co-founder of Camus Energy, spent over 14 years at Google and contributed to Maglev. Trisha Weir, with 21 years at Google, is an SRE Department Lead. They uncover the evolution of Maglev, a network load balancer essential for traffic management in data centers. Their discussion highlights the significance of psychological safety and collaboration in tech innovation. They also delve into challenges faced during system rollouts, debugging practices, and the shift from manual to automated network provisioning, showcasing a unique blend of technical and teamwork insights.

Oct 30, 2024 • 42min
Profiling data with Pat Somaru and Narayan Desai
Narayan Desai, a Principal SRE at Google, and Pat Somaru, a Senior Production Engineer at Meta, delve into the complexities of observability in site reliability engineering. They discuss the challenges of noise reduction and the importance of actionable insights from high-cardinality data. The pair critiques the reliance on superficial metrics, emphasizing the need for deeper analysis to accurately reflect business outcomes. They also explore data profiling's role in enhancing system performance and optimizing resource management for greater efficiency.

Oct 23, 2024 • 32min
Google Public DNS (8.8.8.8) with Wilmer van der Gaast and Andy Sykes
This episode features Google engineers Wilmer van der Gaast (Production on-tall) and Andy Sykes (Senior Staff Systems Engineer, SRE), joining hosts Steve McGhee and Jordan Greenberg, to discuss the development and maintenance of Google Public DNS (8.8.8.8). They highlight the initial motivations for creating the service, technical challenges like cache poisoning and load balancing, as well as the collaborative effort between SRE and SWE teams to address these issues. They also reflect on the evolving nature of SRE and advice for aspiring SREs.

Oct 16, 2024 • 34min
SRE in the Retail and Gaming Worlds with Jordan Chernev & Scott Bowers
Guests Jordan Chernev (Senior Technology Executive) and Scott Bowers (SRE, Gearbox Software), who hail from the retail and gaming industries, respectively, join hosts Steve McGhee and Jordan Greenberg to discuss the unique challenges of Site Reliability Engineering in their industries. They discuss the importance of aligning SLOs with user experience, strategies for handling spikes in traffic, communicating with users during outages, and investing in reliability.

Oct 9, 2024 • 44min
Incident Response with Sarah Butt and Vrai Stacey
Sarah Butt (Principal Engineer, Centralized Incident Response, Salesforce) and Vrai Stacey (Staff Software Engineer, Google) join hosts Steve McGhee and Jordan Greenberg to dive into incident response—particularly tooling and software for reliability incidents. Tune in for an in-depth discussion on topics such as the importance of communication and collaboration during incidents, and the role of tooling in supporting incident response processes. Sarah and Vrai also share personal takeaways from incidents they have experienced.