Reliability Enablers

Ash Patel & Sebastian Vietz

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more. read.srepath.com

Episodes

Mentioned books

Jul 15, 2025 • 31min

#67 Why the SRE Book Fails Most Orgs — Lessons from a Google Veteran

In a candid discussion, Dave O’Connor, a seasoned Google SRE veteran with 16 years under his belt, sheds light on the pitfalls many organizations face while trying to implement site reliability engineering. He critiques the adoption trap, explaining why merely following the SRE book can be misleading. O’Connor highlights the cost of engineers' burnout and questions the effectiveness of incident command roles. Delving into the challenges of balancing reliability with business needs, he offers insights on evolving organizational practices in the tech landscape.

Jul 1, 2025 • 30min

#66 - Unpacking 2025 SRE Report’s Damning Findings

I know it’s already six months into 2025, but we recorded this almost three months ago. I’ve been busy with my foray into the world of tech consulting and training —and, well, editing these podcast episodes takes time and care.This episode was prompted by the 2025 Catchpoint SRE Report, which dropped some damning but all-too-familiar findings:* 53% of orgs still define reliability as uptime only, ignoring degraded experience and hidden toil* Manual effort is creeping back in, reversing five years of automation gains* 41% of engineers feel pressure to ship fast, even when it undermines long-term stabilityTo unpack what this actually means inside organizations, I sat down with Sebastian Vietz, Director of Reliability Engineering at Compass Digital and co-host of the Reliability Enablers podcast. Sebastian doesn’t just talk about technical fixes — he focuses on the organizational frictions that stall change, burn out engineers, and leave “reliability” as a slide deck instead of a lived practice.We dig into:* How SREs get stuck as messengers of inconvenient truths* What it really takes to move from advocacy to adoption — without turning your whole org into a cost center* Why tech is more like milk than wine (Sebastian explains)* And how SREs can strengthen—not compete with—security, risk, and compliance teamsThis one’s for anyone tired of reliability theatrics. No kumbaya around K8s here. Just an exploration of the messy, human work behind making systems and teams more resilient. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Jun 17, 2025 • 28min

#65 - In Critical Systems, 99.9% Isn’t Reliable — It’s a Liability

Most teams talk about reliability with a margin for error. “What’s our SLO? What’s our budget for failure?” But in the energy sector? There is no acceptable downtime. Not even a little.In this episode, I talk with Wade Harris, Director of FAST Engineering in Australia, who’s spent 15+ years designing and rolling out monitoring and control systems for critical energy infrastructure like power stations, solar farms, SCADA networks, you name it.What makes this episode different is that Wade isn’t a reliability engineer by title, but it’s baked into everything his team touches. And that matters more than ever as software creeps deeper into operational technology (OT), and the cloud tries to stake its claim in critical systems.We cover:* Why 100% uptime is the minimum bar, not a stretch goal* How the rise of renewables has increased system complexity — and what that means for monitoring* Why bespoke integration and SCADA spaghetti are still normal (and here to stay)* The reality of cloud risk in critical infrastructure (“the cloud is just someone else’s computer”)* What software engineers need to understand if they want their products used in serious environmentsThis isn’t about observability dashboards or DevOps rituals. This is reliability when the lights go out and people risk getting hurt if you get it wrong.And it’s a reminder: not every system lives in a feature-driven world. Some systems just have to work. Always. No matter what. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Jan 28, 2025 • 21min

#64 - Using AI to Reduce Observability Costs

In this discussion, Ruchir Jha, a lead engineer at Netflix’s observability group and founder of Cardinal, shares insights on reducing observability costs. He addresses the challenge of tool sprawl, recommending strategies to streamline the use of 5-15 tools effectively. Ruchir also highlights how AI can simplify data analysis, empowering non-technical team members to make informed decisions. Additionally, he explores the evolving role of OpenTelemetry in modern observability, emphasizing its flexibility and risk mitigation benefits.

Nov 12, 2024 • 29min

#63 - Does "Big Observability" Neglect Mobile?

Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability. Using his experience from AWS and New Relic, he’s vocal about the need for a more user-focused observability, especially in mobile, where traditional practices fall short. * Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered.* Specialization in Mobile Observability: At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows.* Gap in Mobile Observability: Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed.* Motivation for User-Centric Tools: Leaving “big observability” to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end-users.* Mobile's Role as a Brand Touchpoint: He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on user experience in the app, not just backend performance.* Challenges in Measuring Mobile Reliability: Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability.* Observability Over-Focused on Backend Systems: Andrew points out that “big observability” has largely catered to backend engineers due to the immense complexity of backend systems with microservices and Kubernetes. Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions.* Lack of Mobile Engineering Leadership in Observability: Reflecting on a former Meta product manager’s observations, Andrew highlights the lack of VPs from mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams.* OpenTelemetry’s Limitations in Mobile: While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where app-specific factors like memory consumption don’t align with traditional time-based observability.* SREs as Connective Tissue: Andrew views Site Reliability Engineers (SREs) as essential in bridging backend observability practices with frontend user experience needs. Whether through service level objectives (SLOs) or similar metrics, SREs help ensure that backend metrics translate into positive end-user experiences—a critical factor in retaining app users.* Amazon’s Operational Readiness Review: Drawing from his experience at AWS, Andrew values Amazon’s practice of operational readiness reviews before launching new services. These reviews encourage teams to anticipate possible failures or user experience issues, weighing risks carefully to maintain reliability while allowing innovation.* Shifting Focus to “Answerability” in Observability: For Andrew, the goal of observability should evolve toward “answerability,” where systems provide engineers with actionable answers rather than mere data. He envisions a future where automation or AI could handle repetitive tasks, allowing engineers to focus on enhancing user experiences instead of troubleshooting. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Nov 5, 2024 • 36min

#62 - Early Youtube SRE shares Modern Reliability Strategy

Andrew Fong, Co-founder and CEO of Prodvana and former VP of Infrastructure at Dropbox, dives into the evolution of Site Reliability Engineering (SRE) amidst changing tech landscapes. He advocates for addressing problems over rigid roles, emphasizing reliability and efficiency. Andrew explores how AI is reshaping SRE, the balance between innovation and operational management, and the importance of a strong organizational culture. His insights provide a values-first approach to tackle engineering challenges, fostering collaboration and a proactive reliability mindset.

Oct 22, 2024 • 38min

#61 Scott Moore on SRE, Performance Engineering, and More

Scott Moore, a performance engineer with decades of experience and a knack for educational content, shares his insights on software performance. He discusses how parody music videos make performance engineering engaging and accessible. The conversation delves into the importance of redefining operational requirements and how performance metrics should not be overlooked. Scott highlights the relationship between performance engineering and reliability, and how collaboration can reduce team burnout. He also reveals how a performance-centric culture can optimize cloud costs and improve development processes.

Oct 1, 2024 • 31min

#60 How to NOT fail in Platform Engineering

Ankit, who started programming at age 11 and naturally gravitated towards platform engineering, shares his insights on this evolving field. He discusses how platform engineering aids team efficiency through self-service capabilities. Ankit highlights the challenges of turf wars among DevOps, SRE, and platform engineering roles, as well as the dysfunctions caused by rigid ticketing systems. He emphasizes the need for autonomy and reducing cognitive load to foster creativity and effective teamwork, drawing from his rich experiences across various sectors.

Sep 24, 2024 • 8min

#59 Who handles monitoring in your team and how?

Why many copy Google’s monitoring team setup* Google’s Influence. Google played a key role in defining the concept of software reliability.* Success in Reliability. Few can dispute Google’s ability to ensure high levels of reliability and its ability to share useful ways to improve it in other settingsBUT there’s a problem:* It’s not always replicable. While Google's practices are admired, they may not be a perfect fit for every team.What is Google’s monitoring approach within teams?Here’s the thing that Google does:* Google assigns one or two people per team to manage monitoring.* Even with centralized infrastructure, a dedicated person handles monitoring.* Many organizations use a separate observability team, unlike Google's integrated approachIf your org is large enough and prioritizes reliability highly enough, you might find it feasible to follow Google’s model to the tee. Otherwise, a centralized team with occasional “embedded x engineer” secondments might be more effective.Can your team mimic Google’s model?Here are a few things you should factor in:Size mattersGoogle's model works because of its scale and technical complexity. Many organizations don’t have the size, resources, or technology to replicate this.What are the options for your team?Dedicated monitoring team (very popular but $$$)If you have the resources, you might create a dedicated observability team. This might call for a ~$500k+ personnel budget so it’s not something that a startup or SME can easily justify. Dedicate SREs to monitoring work (effective but difficult to manage)You might do this on rotation or make an SRE permanently “responsible for all monitoring matters”. Putting SREs on permanent tasks might lead to burnout as it might not suit their goals, and rotation work requires effective planning.Internal monitoring experts (useful but hard capability)One or more engineers within teams could take on monitoring/observability responsibilities as needed and support the team’s needs. This should be how we get monitoring work done, but it’s hard to get volunteers across a majority of teams. Transitioning monitoring from project work to maintenance2 distinct phasesInitial Setup (the “project”) SREs may help set up the monitoring/observability infrastructure. Since they have breadth of knowledge across systems, they can help connect disparate services and instrument applications effectively.Post-project phase (“keep the lights on”)Once the system is up, the focus shifts from project mode to ongoing operational tasks. But who will do that?Who will maintain the monitoring system?Answer: usually not the same teamAfter the project phase, a new set of people—often different from the original team—typically handles maintenance.Options to consider (once again)* Spin up a monitoring/observability team. Create a dedicated team for observability infrastructure.* Take a decentralized approach. Engineers across various teams take on observability roles as part of their regular duties.* Internal monitoring/observability experts. They can take responsibility for monitoring and ensure best practices are followed.The key thing to remember here is…Adapt to Your Organizational ContextOne size doesn’t fit allGoogle's model may not work for everyone. Tailor your approach based on your organization’s specific needs.The core principle to keep in mindAs long as people understand why monitoring/observability matters and pay attention to it, you're on the right track.Work according to engineer awarenessIf engineers within product and other non-operations teams are aware of monitoring: You can attempt to **decentralize the effort** and involve more team members.If awareness or interest is low: consider **dedicated observability roles** or an SRE team to ensure monitoring gets the attention it needs.In conclusionThere’s no universal solution. Whether you centralize or decentralize monitoring depends on your team’s structure, size, and expertise. The important part is ensuring that observability practices are understood and implemented in a way that works best for your organization.PS. Rather than spend an hour on writing, I decided to write in the style I normally use in a work setting i.e. “executive short-hand”. Tell me what you think. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Sep 17, 2024 • 8min

#58 Fixing Monitoring's Bad Signal-to-Noise Ratio

Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It’s a challenge that’s been around since the beginning of software development and will persist for years to come. The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts. This interrupts workflows, affects personal time, and even disrupts sleep.Sebastian dove into this problem, highlighting that the issue isn't just about having meaningless pages but also the struggle to find valuable information amidst the noise. When legitimate alerts get lost in a sea of irrelevant data, pinpointing the root cause becomes exceptionally hard.Sebastian proposes a fundamental fix for this data overload: be deliberate with the data you emit. When instrumenting your systems, be intentional about what data you collect and transport. Overloading with irrelevant information makes it tough to isolate critical alerts and find the one piece of data that indicates a problem.To combat this, focus on:* Being Deliberate with Data. Make sure that every piece of telemetry data serves a clear purpose and aligns with your observability goals.* Filtering Data Effectively. Improve how you filter incoming data to eliminate less relevant information and retain what's crucial.* Refining Alerts. Optimize alert rules such as creating tiered alerts to distinguish between critical issues and minor warnings.Dan Ravenstone, who leads platform at Top Hat, discussed “triaging alerts” recently. He shared that managing millions of alerts, often filled with noise, is a significant issue. His advice: scrutinize alerts for value, ensuring they meet the criteria of a good alert, and discard those that don’t impact the user journey.According to Dan, the anatomy of a good alert includes:* A run book* A defined priority level* A corresponding dashboard* Consistent labels and tags* Clear escalation paths and ownershipTo elevate your approach, consider using aggregation and correlation techniques to link otherwise disconnected data, making it easier to uncover patterns and root causes.The learning point is simple: aim for quality over quantity. By refining your data practices and focusing on what's truly valuable, you can enhance the signal-to-noise ratio, ultimately allowing more time for deep work rather than constantly managing incidents. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

App store banner

Play store banner