Google SRE Prodcast

Salim Virji

SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!

Episodes

Jan 14, 2026 • 25min

The One With Heather Adkins and Security (and AI)

Heather Adkins, leader of Google's Office of Cybersecurity Resilience and a seasoned expert with over two decades at Google, dives into the future of digital defenses. She discusses the rise of 'Agentic AI hackers' and polymorphic malware, urging a shift in how we approach cybersecurity. From the 'Secure by Design' philosophy to innovative defense strategies, Heather emphasizes the importance of layered security. Her insights on utilizing AI for incident analysis and the need to harden critical nodes reveal a crucial evolution in tackling emerging threats.

Jan 7, 2026 • 39min

The One With SLOs

Join Alex Hidalgo, an author and SRE expert, and Brian Singer, co-founder at Nobl9, as they dive into the world of Service Level Objectives (SLOs). They discuss how SLOs create a common language for teams and explore the varying degrees of adoption across organizations. The conversation highlights crafting user-specific SLOs, the importance of ownership, and the pitfalls of central governance. They also touch on AI's potential in SLO design and the necessity of human oversight. Tune in for actionable insights on SLOs and their cultural impact!

Dec 16, 2025 • 34min

The One With Steph Hippo and Observability

Steph Hippo, Platform Engineering Director at Honeycomb, shares her expertise in AI-driven observability during a fascinating conversation. She explains how observability is key for understanding complex systems, creating a symbiotic relationship with AI. The discussion highlights how AI can enhance incident response, lead to self-healing systems, and significantly improve junior SRE onboarding. Steph encourages small teams to learn from others' mistakes and emphasizes the importance of structured growth conversations and experimentation.

Jul 30, 2025 • 32min

The One with Ben Good and Our Kubernetes Friends

Ben Good, a Google Cloud Solutions Architect skilled in platform engineering, joins Kaslin Fields, co-host of the Kubernetes podcast. They dive into the powerful role of Kubernetes in platform engineering, discussing how to create user-friendly 'golden paths' for developers. The conversation highlights the significance of observability, adapting to evolving customer needs, and improving deployment archetypes. They explore the importance of DORA metrics for assessing team success, all while emphasizing a tailored approach to platform design and user experience.

Jul 23, 2025 • 42min

The One With AI Agents, Ramón Llamas, and Swapnil Haria

In this installment, Swapnil Haria, a Google Software Engineer specializing in AI agents, and Ramón Llamas, a seasoned Staff Site Reliability Engineer, delve into the transformative impact of AI on production management. They discuss how these agents can summarize alerts, detect hidden errors, and even prevent outages. The duo highlights the balance between human expertise and AI capabilities, the complexities of evaluating non-deterministic systems, and the importance of structured postmortems in enhancing incident response.

Jul 16, 2025 • 28min

The One with Technical Program Managers and Karanveer Anand

This episode features Google Technical Program Manager (TPM) Karanveer Anand, who joins our hosts to discuss the unique role of TPMs in Site Reliability Engineering (SRE). The conversation highlights how SRE TPMs bridge the gap between technical details and business impact, managing complex projects with inter-team dependencies and ensuring system reliability, particularly in the rapidly evolving AI landscape.

Jul 2, 2025 • 37min

The One with STPA, Jeffrey Snover, and Theo Klein

In this engaging conversation, Theo Klein, a Site Reliability Engineer at Google with a passion for STPA, and Jeffrey Snover, a Distinguished Engineer at Google and Microsoft veteran, dive into System-Theoretic Process Analysis (STPA). They discuss how STPA shifts the focus from component failures to understanding system control failures. The duo emphasizes the importance of human involvement in system design, revealing how early STPA implementation can identify potential risks before coding begins, ultimately leading to safer and more robust systems.

Jun 25, 2025 • 41min

The One with Startups and Adam Fletcher

In this episode, hosts Steve McGhee and Matt Siegler are joined by Adam Fletcher, CEO and co-founder of MarketStreet. They discuss the current state of web development with LLMs, managing technical debt in startups, the evolution of infrastructure and reliability engineering, the role of community in technology, and the future of software engineering with AI.

Jun 18, 2025 • 44min

The One with SLOs and Sal Furino

Sal Furino, a Customer Reliability Engineer at Bloomberg, dives into the world of Service Level Objectives (SLOs) and their crucial role in enhancing software reliability. He discusses how SLOs should focus on user-centric metrics rather than technical ones. The conversation highlights the importance of effective communication and collaboration across teams to meet user expectations. Sal also explores the impact of artificial intelligence on setting SLOs, emphasizing proactive decision-making and innovative approaches like digital twins for improved service interactions.

Jun 11, 2025 • 27min

The One With the Future of SRE and Matt Zelesko

Matt Zelesko, Head of Site Reliability Engineering at Google, shares insights about the evolution of SRE and its critical role in today's AI-driven landscape. He discusses the shift from traditional operations to a dynamic model that promotes both speed and reliability. Zelesko envisions AI as a game-changer for SREs, enhancing incident management and enabling teams to tackle complex challenges earlier in the development process. He stresses the importance of continuous improvement and cultural shifts within organizations to better address the intricacies of modern infrastructure.
