Google SRE Prodcast

Latest episodes

Apr 16, 2025 • 15min

We’re back with Season 4!

In this "bumpisode", hosts and producers of Prodcast (including our new co-host, Matt Siegler!) reflect on the previous season and introduce the new season's focus on upcoming trends in Site Reliability Engineering (SRE) and AI, and the friends we make along the way. They also introduce new elements we are bringing in with Season 4, such as a video format and a feedback form.
Jan 29, 2025 • 16min

Special Episode: You Missed a Page from Telebot

Join Javi Beltran, a talented Google engineer based in Zurich, as he reminisces about creating the playful Telebot theme song to ease the stress of on-call engineers. He delves into Telebot's evolution, enhancing communication for engineers with its unique paging system. Discover the emotional rollercoaster of the Telebot ringtone and the creative remix journey that brings a modern twist while preserving its charm. This collaboration highlights the fusion of tech culture and music, proving that innovation can also be fun!
Dec 11, 2024 • 36min

Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano

Dominic Hutton, Staff SRE at HashiCorp with a rich background in engineering, teams up with Niccolo' Cascarano, Senior Staff SRE at Google and a pro in continuous delivery systems. They dive into the intriguing world of configuration management, comparing imperative and declarative workflows. Listeners will learn how declarative methods simplify complexity while imperative approaches can cater to quick tasks. The importance of managing scripts, navigating synchronization pitfalls, and fostering collaboration between development and operations also takes center stage.
Dec 4, 2024 • 41min

Human Factors in Complex Systems with Casey Rosenthal and John Allspaw

Casey Rosenthal, Founder of Cirrusly.ai, and John Allspaw, Principal of Adaptive Capacity Labs, delve into the complexities of resilience in software engineering. They emphasize the crucial human factors that influence system reliability and adaptability during failures. The discussion reveals the pitfalls of traditional incident metrics, advocating for an understanding of qualitative impacts on users. Additionally, they tackle the cultural challenges organizations face in incident management, highlighting the need for transparency and better communication.
Nov 20, 2024 • 34min

Embracing Complexity with Christina Schulman & Dr. Laura Maguire

Joining the conversation are Christina Schulman, Staff SRE at Google, who focuses on reliability in Google Cloud, and Dr. Laura Maguire, Principal Engineer at Trace Cognitive Engineering, an expert in cognitive systems. They delve into the human side of site reliability engineering, discussing how collaboration and diverse perspectives enhance incident response. Insights include the importance of transparency in learning from failures, managing dependency cycles in complex systems, and the need to embrace complexity to foster resilience in tech environments.
Nov 13, 2024 • 33min

Maglev: load balancing at Google with Cody Smith and Trisha Weir

Cody Smith, CTO and co-founder of Camus Energy, spent over 14 years at Google and contributed to Maglev. Trisha Weir, with 21 years at Google, is an SRE Department Lead. They uncover the evolution of Maglev, a network load balancer essential for traffic management in data centers. Their discussion highlights the significance of psychological safety and collaboration in tech innovation. They also delve into challenges faced during system rollouts, debugging practices, and the shift from manual to automated network provisioning, showcasing a unique blend of technical and teamwork insights.
Oct 30, 2024 • 42min

Profiling data with Pat Somaru and Narayan Desai

Narayan Desai, a Principal SRE at Google, and Pat Somaru, a Senior Production Engineer at Meta, delve into the complexities of observability in site reliability engineering. They discuss the challenges of noise reduction and the importance of actionable insights from high-cardinality data. The pair critiques the reliance on superficial metrics, emphasizing the need for deeper analysis to accurately reflect business outcomes. They also explore data profiling's role in enhancing system performance and optimizing resource management for greater efficiency.
Oct 23, 2024 • 32min

Google Public DNS (8.8.8.8) with Wilmer van der Gaast and Andy Sykes

This episode features Google engineers Wilmer van der Gaast (Production on-tall) and Andy Sykes (Senior Staff Systems Engineer, SRE), who join hosts Steve McGhee and Jordan Greenberg to discuss the development and maintenance of Google Public DNS (8.8.8.8). They highlight the initial motivations for creating the service, technical challenges like cache poisoning and load balancing, and the collaborative effort between SRE and SWE teams to address these issues. They also reflect on the evolving nature of SRE and share advice for aspiring SREs.
Oct 16, 2024 • 34min

SRE in the Retail and Gaming Worlds with Jordan Chernev & Scott Bowers

Guests Jordan Chernev (Senior Technology Executive) and Scott Bowers (SRE, Gearbox Software), who hail from the retail and gaming industries, respectively, join hosts Steve McGhee and Jordan Greenberg to discuss the unique challenges of Site Reliability Engineering in their industries. They share the importance of aligning SLOs with user experience, strategies for handling spikes in traffic, communicating with users during outages, and investing in reliability.
Oct 9, 2024 • 44min

Incident Response with Sarah Butt and Vrai Stacey

Sarah Butt (Principal Engineer, Centralized Incident Response, Salesforce) and Vrai Stacey (Staff Software Engineer, Google) join hosts Steve McGhee and Jordan Greenberg to dive into incident response—particularly tooling and software for reliability incidents. Tune in for an in-depth discussion on topics such as the importance of communication and collaboration during incidents, and the role of tooling in supporting incident response processes. Sarah and Vrai also share personal takeaways from incidents they have experienced.
