
Google SRE Prodcast
SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!
Latest episodes

Jul 2, 2025 • 37min
The One with STPA, Jeffrey Snover, and Theo Klein
In this engaging conversation, Theo Klein, a Site Reliability Engineer at Google with a passion for STPA, and Jeffrey Snover, a Distinguished Engineer at Google and Microsoft veteran, dive into Systems Theoretic Process Analysis (STPA). They discuss how STPA shifts the focus from component failures to understanding system control failures. The duo emphasizes the importance of human involvement in system design, revealing how early STPA implementation can identify potential risks before coding begins, ultimately leading to safer and more robust systems.

Jun 25, 2025 • 41min
The One with Startups and Adam Fletcher
In this episode, hosts Steve McGhee and Matt Siegler are joined by guest Adam Fletcher, CEO and Co-Founder of MarketStreet. They discuss the current state of web development with LLMs, managing technical debt in startups, the evolution of infrastructure and reliability engineering, the role of community in technology, and the future of software engineering with AI.

Jun 18, 2025 • 44min
The One with SLOs and Sal Furino
Sal Furino, a Customer Reliability Engineer at Bloomberg, dives into the world of Service Level Objectives (SLOs) and their crucial role in enhancing software reliability. He discusses how SLOs should focus on user-centric metrics rather than technical ones. The conversation highlights the importance of effective communication and collaboration across teams to meet user expectations. Sal also explores the impact of artificial intelligence on setting SLOs, emphasizing proactive decision-making and innovative approaches like digital twins for improved service interactions.

Jun 11, 2025 • 27min
The One With the Future of SRE and Matt Zelesko
Matt Zelesko, Head of Site Reliability Engineering at Google, shares insights about the evolution of SRE and its critical role in today's AI-driven landscape. He discusses the shift from traditional operations to a dynamic model that promotes both speed and reliability. Zelesko envisions AI as a game-changer for SREs, enhancing incident management and enabling teams to tackle complex challenges earlier in the development process. He stresses the importance of continuous improvement and cultural shifts within organizations to better address the intricacies of modern infrastructure.

Jun 4, 2025 • 43min
The One with AI and Todd Underwood
In this Google Prodcast episode, Todd Underwood, a reliability expert from Anthropic with experience at Google and OpenAI, joins the hosts to discuss the current state and future of AI and ML in production, particularly for SREs. Topics include the challenges of AI-Ops, the limitations of current anomaly detection, the potential for AI in config authoring and troubleshooting, trade-offs between product velocity and reliability, the evolving role of SREs in an AI-driven world, and the optimal timing of book publication.

May 28, 2025 • 36min
The One With Data Centers and Peter Pellerzi
This episode features guest Peter Pellerzi (Distinguished Engineer, Google). Peter and the hosts, Matt Siegler and Steve McGhee, focus on the physical infrastructure side of SRE, discussing topics such as the scale of Google's data centers, handling incidents like power outages, testing and preparedness strategies, and the use of AI for optimizing cooling plants. Peter also emphasizes the importance of community support, proactive planning, and learning from real-world testing and incidents to ensure high availability and resilience in data center operations.

May 21, 2025 • 20min
The One With Security and Jessica Theodat
Jessica Theodat (Senior SRE & Security Tech Lead, Google) joins hosts Jordan Greenberg and Steve McGhee to discuss the intersection of security and site reliability engineering at Google. Jessica touches on risk management, the unique nature of security incident responses, and the shared goals between security and SRE. The crew also delves into the balance between security and SRE, acknowledging the tension and the need for collaboration between teams to achieve business goals and user trust.

Apr 16, 2025 • 15min
We’re back with Season 4!
The hosts kick off the new season by discussing emerging trends in Site Reliability Engineering and machine learning infrastructure. They warmly welcome a new co-host and reflect on the friendships developed along the way. Anticipation grows for upcoming challenges, new guests, and video content. The conversation dives into navigating ambiguity and engaging with listeners, inviting feedback for future discussions. With excitement for evolving content and a commitment to enhancing user experiences, the stage is set for an enlightening journey ahead!

Jan 29, 2025 • 16min
Special Episode: You Missed a Page from Telebot
Join Javi Beltran, a talented Google engineer based in Zurich, as he reminisces about creating the playful Telebot theme song to ease the stress of on-call engineers. He delves into Telebot's evolution, enhancing communication for engineers with its unique paging system. Discover the emotional rollercoaster of the Telebot ringtone and the creative remix journey that brings a modern twist while preserving its charm. This collaboration highlights the fusion of tech culture and music, proving that innovation can also be fun!

Dec 11, 2024 • 36min
Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano
Dominic Hutton, Staff SRE at HashiCorp with a rich background in engineering, teams up with Niccolo' Cascarano, Senior Staff SRE at Google and a pro in continuous delivery systems. They dive into the intriguing world of configuration management, comparing imperative and declarative workflows. Listeners will learn how declarative methods simplify complexity while imperative approaches can cater to quick tasks. The importance of managing scripts, navigating synchronization pitfalls, and fostering collaboration between development and operations also takes center stage.