Google SRE Prodcast

Latest episodes

Jul 2, 2025 • 37min

The One with STPA, Jeffrey Snover, and Theo Klein

In this conversation, Theo Klein, a Site Reliability Engineer at Google with a passion for STPA, and Jeffrey Snover, a Distinguished Engineer at Google and Microsoft veteran, dive into Systems Theoretic Process Analysis (STPA). They discuss how STPA shifts the focus from component failures to system control failures. The two emphasize the importance of human involvement in system design and explain how applying STPA early can identify potential risks before coding begins, leading to safer and more robust systems.
Jun 25, 2025 • 41min

The One with Startups and Adam Fletcher

In this episode, hosts Steve McGhee and Matt Siegler are joined by Adam Fletcher, CEO and Co-Founder of MarketStreet. They discuss the current state of web development with LLMs, managing technical debt in startups, the evolution of infrastructure and reliability engineering, the role of community in technology, and the future of software engineering with AI.
Jun 18, 2025 • 44min

The One with SLOs and Sal Furino

Sal Furino, a Customer Reliability Engineer at Bloomberg, dives into the world of Service Level Objectives (SLOs) and their crucial role in enhancing software reliability. He discusses how SLOs should focus on user-centric metrics rather than technical ones. The conversation highlights the importance of effective communication and collaboration across teams to meet user expectations. Sal also explores the impact of artificial intelligence on setting SLOs, emphasizing proactive decision-making and innovative approaches like digital twins for improved service interactions.
Jun 11, 2025 • 27min

The One With the Future of SRE and Matt Zelesko

Matt Zelesko, Head of Site Reliability Engineering at Google, shares insights about the evolution of SRE and its critical role in today's AI-driven landscape. He discusses the shift from traditional operations to a dynamic model that promotes both speed and reliability. Zelesko envisions AI as a game-changer for SREs, enhancing incident management and enabling teams to tackle complex challenges earlier in the development process. He stresses the importance of continuous improvement and cultural shifts within organizations to better address the intricacies of modern infrastructure.
Jun 4, 2025 • 43min

The One with AI and Todd Underwood

In this episode, Todd Underwood, a reliability expert from Anthropic with experience at Google and OpenAI, joins the hosts to discuss the current state and future of AI and ML in production, particularly for SREs. Topics include the challenges of AI-Ops, the limitations of current anomaly detection, the potential for AI in config authoring and troubleshooting, trade-offs between product velocity and reliability, the evolving role of SREs in an AI-driven world, and the optimal timing of book publication.
May 28, 2025 • 36min

The One With Data Centers and Peter Pellerzi

This episode features guest Peter Pellerzi (Distinguished Engineer, Google). Peter and the hosts, Matt Siegler and Steve McGhee, focus on the physical infrastructure side of SRE, discussing the scale of Google's data centers, handling incidents like power outages, testing and preparedness strategies, the use of AI for optimizing cooling plants, and more. Peter also emphasizes the importance of community support, proactive planning, and learning from real-world testing and incidents to ensure high availability and resilience in data center operations.
May 21, 2025 • 20min

The One With Security and Jessica Theodat

Jessica Theodat (Senior SRE & Security Tech Lead, Google) joins hosts Jordan Greenberg and Steve McGhee to discuss the intersection of security and site reliability engineering at Google. Jessica touches on risk management, the unique nature of security incident responses, and the shared goals between security and SRE. The crew also delves into the balance between security and SRE, acknowledging the tension and the need for collaboration between teams to achieve business goals and user trust.
Apr 16, 2025 • 15min

We’re back with Season 4!

The hosts kick off the new season by discussing emerging trends in Site Reliability Engineering and machine learning infrastructure. They warmly welcome a new co-host and reflect on the friendships developed along the way. Anticipation grows for upcoming challenges, new guests, and video content. The conversation dives into navigating ambiguity and engaging with listeners, inviting feedback for future discussions. With excitement for evolving content and a commitment to enhancing user experiences, the stage is set for an enlightening journey ahead!
Jan 29, 2025 • 16min

Special Episode: You Missed a Page from Telebot

Join Javi Beltran, a talented Google engineer based in Zurich, as he reminisces about creating the playful Telebot theme song to ease the stress of on-call engineers. He delves into Telebot's evolution, enhancing communication for engineers with its unique paging system. Discover the emotional rollercoaster of the Telebot ringtone and the creative remix journey that brings a modern twist while preserving its charm. This collaboration highlights the fusion of tech culture and music, proving that innovation can also be fun!
Dec 11, 2024 • 36min

Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano

Dominic Hutton, Staff SRE at HashiCorp with a rich background in engineering, teams up with Niccolo' Cascarano, Senior Staff SRE at Google and a pro in continuous delivery systems. They dive into the intriguing world of configuration management, comparing imperative and declarative workflows. Listeners will learn how declarative methods simplify complexity while imperative approaches can cater to quick tasks. The importance of managing scripts, navigating synchronization pitfalls, and fostering collaboration between development and operations also takes center stage.
