
Reliability Enablers
Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more. read.srepath.com
Latest episodes

Apr 2, 2024 • 35min
#35 Boosting Your Observability Data's Usability
The observability (o11y) data revolution is well underway, but are we getting the most from the data that is being collected? Richard Benwell thinks we have room for improvement, especially at the usage stage where we query and visualize the o11y data. He is the founder and CEO of SquaredUp, a dashboard software company based out of Maidenhead, UK with over 10 years of experience in the monitoring space. Richard highlighted the importance of converging human intuition with technical o11y implementations and moving from a narrow focus on collecting data to leveraging it for actionable insights. You can connect with Richard via LinkedIn. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Mar 26, 2024 • 23min
#34 From Cloud to Concrete: Should You Return to On-Prem?
This episode continues our coverage of Chapter 2 of the Site Reliability Engineering book (2016). We talk about the age-old debate of cloud vs on-prem, which is analogous to that other long-running technology debate, build vs buy. Here are key takeaways from our conversation:
Adapt your storage solutions to business needs: Understand the diverse storage options available and tailor them to specific business needs, considering factors like data type, access patterns, and scalability requirements.
Optimize your load balancing: Implement global load balancing strategies that direct traffic to the nearest data center, minimizing latency for users and maximizing resource utilization.
Don't hesitate to continuously evaluate your cloud: Assess the suitability of cloud solutions against your organization's needs, considering factors like cost, control, scalability, and security, and be open to reevaluating decisions based on evolving requirements.
Make strategic decisions for your operations footprint: Lean on decisions based on thorough analysis.
Encourage objective evaluation and formal planning processes in decision-making: Avoid emotional reactions or being swayed by external influences, so that decisions are based on sound analysis and truly aligned with organizational goals.

Mar 19, 2024 • 23min
#33 Inside Google's Data Center Design
This episode covers Chapter 2 of the Site Reliability Engineering book (2016). In this first part, we talk about the intricacies of data center design outlined in the book. One thing is for sure: building a data center for your own needs is HARD work with many considerations you must make. Here are key takeaways from our conversation:
Importance of understanding data center fundamentals: Even if you're not operating at the scale of companies like Google, understanding the fundamentals behind data center infrastructure can help. This knowledge can inform decisions on cloud services, high availability strategies, and the architectural design of systems to ensure resilience and scalability.
The impetus to leverage cloud infrastructure: The transition from traditional on-premises infrastructure to cloud-based solutions is a critical trend. Organizations can learn from how tech giants manage resources efficiently at scale to improve their own resource allocation.
Cyclical trends in technology adoption: Trends in technology are cyclical, and knowing that can inform strategic decisions. With the current discussion around moving from cloud-centric models back to more traditional data center approaches, understanding the history and evolution of tech infrastructure can prepare organizations to anticipate and adapt to future shifts in the technological landscape.

Mar 14, 2024 • 17min
#32 Clarifying Platform Engineering's Role (with Ajay Chankramath) BONUS EP
Will Platform Engineering replace DevOps or SRE or both? I don’t think this is the case at all. Neither does Ajay Chankramath. He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. I’d take his word for it since he’s held senior leadership roles in release engineering and more since 2002. In this bonus episode of the SREpath podcast, Ajay shared his perspective on the debate about SRE vs DevOps vs Platform Engineering.

Mar 12, 2024 • 27min
#31 Introduction to FinOps (with Ajay Chankramath)
FinOps is on the tip of many tongues in the software space right now, as we try to curb our cloud costs. Ajay Chankramath has given talks on FinOps at conferences like the DevOps Enterprise Summit (DOES) among others. He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. His peers like Martin Fowler and Neal Ford have popularized ideas like refactoring, microservices, and more. He shared practical advice for avoiding a harsh, restrictive cost control approach and instead taking a holistic financial view of your software operations.

Mar 7, 2024 • 37min
#30 Clearing Delusions in Observability (with David Caudill)
Observability is going through interesting times. David Caudill believes that delusions are getting in the way of our success in this area. He's a senior engineering manager at Capital One, a US-based bank.

Feb 27, 2024 • 31min
#29 - Reacting to Google's SRE book 2016 (Chapter 1 Part 2)
Sebastian and I continue our breakdown of notable passages from Chapter 1 of Google's Site Reliability Engineering (2016) book by Betsy Beyer, Jennifer Petoff, Niall Murphy, et al. We covered passages like:
Monitoring is one of the primary means by which service owners keep track of a system's health and availability.
Efficient use of resources is important anytime a service cares about money.
Humans add latency. Even if a given system experiences more actual failures, a system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands-on intervention.
SRE has found that roughly 70 percent of outages are due to changes in a live system. Best practices in this domain use automation to implement progressive rollouts.
Demand forecasting and capacity planning can be viewed as ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability.

Feb 20, 2024 • 26min
#28 - Reacting to Google's SRE Book 2016 (Chapter 1 Part 1)
Sebastian and I got together to react to and discuss 5 passages from Chapter 1 of Google's Site Reliability Engineering book (2016) by Betsy Beyer, Jennifer Petoff, Niall Murphy, et al. We covered passages like:
The sysadmin approach and the accompanying development/ops split have a number of disadvantages and pitfalls.
Google has chosen to run our systems with a different approach. Our Site Reliability Engineering teams focus on hiring software engineers to run our products.
The term DevOps emerged in industry. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.
Google caps operational work for SREs at 50 percent of their time. Their remaining time should be spent using their coding skills on project work.
Product development and SRE teams can enjoy a productive working relationship by eliminating the structural conflict in their respective goals.

Feb 13, 2024 • 16min
#27 - Growing as a Site Reliability Engineer (Part 3)
The third and final instalment of the Growing as an SRE series, covering practical ideas for planning your career progression.

Feb 8, 2024 • 19min
#26 - Growing as a Site Reliability Engineer (Part 2)
In part 1, we covered the first truth: you don't grow in your career merely through tenure. That was a simple one. Let's explore 2 more truths that are somewhat trickier... Background music credit: Luna by KaizanBlue