Reliability Enablers

Ash Patel & Sebastian Vietz

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more. read.srepath.com

Episodes

Jun 25, 2024 • 44min

#48 Cutting Down "Toil" aka Manual Work in Software

Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil.

We hit the jackpot with concepts like:
* what toil is, according to a 5-point set of criteria
* why you should even care about toil
* where you can find toil in your software system
* Google’s goal for how much work (%) should be toil
* the fact that toil isn’t always all that bad

Don’t have time to listen to what we learned or added to the concepts? Check out the takeaways toward the end of this email.

But first…

Before we jump into the takeaways, here’s a new segment I’m trying out for newsletters. I’ll highlight a new reliability tool that I think could help you.

Do you struggle to visualize your Kubernetes workloads? In that case, have you heard of kube-ops-view? It helps you visualize your complex K8s clusters and everything inside them. For a deeper rundown, visit the LinkedIn post I made about kube-ops-view, which shares a few more details.

Back to our original programming…

Here are the key takeaways from our chat:
* Define and Identify Toil: Regularly evaluate your tasks. Identify work that is manual, repetitive, and potentially automatable. Recognize it as toil and prioritize its reduction.
* Prioritize Automation: Look for repetitive tasks in your workflow and automate them using tools and scripts to reduce manual interventions and increase efficiency.
* Embrace the Role of an SRE: Realize that the role of an SRE is to improve system reliability proactively. Focus on long-term improvements rather than just responding to immediate issues.
* Address Common Sources of Toil: Identify frequent sources of toil like context switching, on-call duties, and release processes. Implement solutions to automate and streamline these areas.
* Adopt a Toil Elimination Mindset: Cultivate a mindset focused on eliminating toil. Regularly discuss and explore automation opportunities with your team to improve processes.
* Develop a Culture of Continuous Improvement: Encourage a culture that values reducing manual, repetitive work. Advocate for proactive problem-solving and continuous process enhancement within teams.

Until next time, happy toil hunting!

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Jun 18, 2024 • 29min

#47 How to Grow Team Impact Through Learning Culture

The common refrain after an incident is “We could and should learn from this”. To me, that alludes to the need for a robust learning culture.

We might think we already have a good learning culture because we talk about problems and deep-dive into them in retrospectives. But how often do we explore the nuances of how we are learning?

Sorrel Harriet is an expert in supporting software engineering teams to develop a stronger learning culture. She was a “Continuous Learning Lead” at Armakuni (a software consultancy) and now does the same work under her own banner.

Her work ties in well with the ideas shared by Manuel Pais in episode #45 about how enabling teams can support a continuous learning culture. We tackled issues like the value of certifications, comparing technical with non-technical skills, and more.

You can connect with Sorrel via LinkedIn.
Learn more about what Sorrel does via LaaS.consulting.

Here’s a bonus section because you read all this way. It covers 5 public outages and how the affected teams could improve their learning culture:

1. Slack Outage (February 2023): Slack experienced a global outage disrupting communication for hours due to backend infrastructure issues. Perhaps the team could focus their learning on more robust infrastructure management and resilience improvement.
2. Twitter Algorithm Glitch (April 2023): A glitch in Twitter's algorithm caused timeline issues, stemming from a problematic software update. Perhaps the team could focus their learning on thorough testing and game days to rectify critical system errors swiftly.
3. Microsoft Azure AD Outage (March 2023): Azure Active Directory faced a significant outage due to an internal configuration change. Perhaps the team could focus their learning on the importance of rigorous change management and how to address misconfigurations quickly.
4. Google Cloud Platform Networking Issue (May 2023): Google Cloud Platform experienced widespread service disruptions from a software bug in its networking infrastructure. Perhaps the team could focus their learning on the need for comprehensive testing and preventing disruptions.
5. GitHub Outage (June 2023): GitHub suffered a major outage caused by a cascading failure in its storage infrastructure. Perhaps the team could focus their learning on robust fault-tolerance mechanisms and ways to address the root causes of failures.

Jun 11, 2024 • 24min

#46 Platform Team Design According to Team Topologies

I continue my conversation with Manuel Pais, co-author of the seminal Team Topologies book, about team topologies suitable for reliability teams. In this second part, we will talk about platform teams.

A quick refresher on what platform teams do

In the team topologies context: Platform teams provide a curated set of self-service capabilities to enable stream-aligned teams (product or feature teams) to deliver work with greater speed and reduced complexity.

They achieve this by abstracting away common infrastructure and operational concerns. By doing this, they aim to allow stream-aligned teams to focus on delivering business value.

Here are the key takeaways from our conversation

For those who don’t have time to listen to this episode (but you’re missing out on a great conversation):
* Focus on User-Centric Design: Prioritize the user experience in platform development. Regularly collaborate with internal teams to ensure the platform meets their needs and reduces their pain points.
* Build and Maintain Trust: Establish and nurture trust with your platform’s users. Trust is crucial for platform adoption and can prevent resistance, which helps sustain use.
* Justify Platform Value: Continuously demonstrate the value of your platform to management and stakeholders, especially during economic downturns. Highlight its contributions to avoid cuts and maintain support.
* Understand Adoption Lifecycle: Recognize that platforms go through different stages of adoption. Identify and support early adopters, and gradually bring in late adopters by showcasing successful use cases.
* Enhance Collaboration: Foster open communication between platform teams and other teams. Avoid rigid roadmaps and be adaptable to changing needs to prevent barriers and build stronger internal relationships.
* Manage Cognitive Load: Be mindful of the cognitive load on your teams. Simplify processes and reduce unnecessary complexities to enhance productivity and efficiency.
* Use Tools to Measure Cognitive Load: Implement tools like Teamperature to assess the cognitive load on your teams regularly. Use the insights to identify and mitigate factors contributing to cognitive overload.
* Leverage Experienced Product Managers: Ensure experienced product managers are part of your platform team. They can balance long-term goals with the flexibility needed to adapt to the evolving needs of internal users.

I think the uncommon takeaway here is the last one, in that platform teams should treat their platform as a product. Product Managers like Paweł Huryn and Marty Cagan are doing great work in laying out the roadmap for product management.

Did you end up checking out the reliability workstreams map I published last week? It’s free and can help you stay focused on the right priorities at work. Check it out via this link.

Jun 4, 2024 • 25min

#45 How Team Topologies Can Guide Enabling Teams

I got the inside word from Manuel Pais, co-author of the seminal Team Topologies book, who explains 2 of the most relevant team topologies for reliability work in a 2-part series. In this first part, we will talk about enabling teams.

A quick refresher on what enabling teams do

In the team topologies context: Enabling teams help stream-aligned teams (product or feature teams) to overcome obstacles and improve their capabilities in specific areas.

This kind of team is available to provide expertise, guidance, and support to other teams working to adopt new technologies, practices, or skills.

In other news…

This podcast has a new name

What more fitting moment to announce renaming the SREpath podcast to “The Reliability Enablers” podcast? This name change reinforces our quest to demystify and enable reliability efforts so that more organizations successfully implement SRE principles and beyond.

Before we get to the 8 takeaways

Here’s something relevant to enabling reliability work: a reliability workstreams map I’ve had in my private notes for years, now going public.

What is a workstream? 🤔

You might have heard of “value streams”. They show the end-to-end journey of creating and delivering value to a customer. Workstreams support your value streams. They cover the activities carried out to do so. In summary: value streams are the goals and workstreams are the activities you do to achieve those goals.

Okay, now time for the erudite takeaways that Manuel gave me from our talk.

Takeaways from the episode

Here are the key takeaways from our conversation for those who don’t have time to listen (but you’re missing out on a great audio conversation):
* Create Enabling Teams: Form SRE-focused enabling teams to facilitate technical training, optimize cloud architecture, improve documentation, and overall help other teams build their capabilities.
* Work to Minimize Cognitive Load: Minimize the cognitive load on engineers by centralizing complex and repetitive tasks, allowing engineers to concentrate on innovation and high-value work. You can measure cognitive load and manage it through the Teamperature tool.
* Facilitate Learning and Adoption of Best Practices: Use SRE enabling teams to educate product teams on critical practices like error budgets and service level objectives, making the learning process gradual and manageable.
* Collaborate among Topologies for Effective Tooling: Enabling teams should work with platform teams to inform their plans to develop and co-evolve tools and services that support reliability and observability practices, like automated dashboards and alerting systems.
* Adapt Approaches Based on Organizational Capacity: Tailor the mix of enabling and platform support based on the organization’s resources and constraints, ensuring flexibility and efficiency.
* Avoid Traditional Ops Work for SRE Teams: Ensure SRE teams focus on empowering product teams rather than performing traditional operations tasks, promoting a culture of shared responsibility.
* Build an Effective Learning Culture: Foster a culture of continuous learning and improvement, integrating learning opportunities into the daily workflow rather than relying solely on formal training programs.
* Scale Capabilities Across the Organization: When needed, scale enabling efforts to build organization-wide capabilities, ensuring that expertise is distributed and not bottlenecked within specialized departments.

May 30, 2024 • 20min

#44 - Making SLOs Matter to Stakeholders

Dive into the critical role of Service Level Objectives (SLOs) in establishing trust with stakeholders. Discover how mastering SLOs can enhance system reliability and align with customer expectations. The importance of transparency in healthcare software is explored, focusing on communication between vendors and clinicians. Learn about the dangers of unrealistic sales pitches and the necessity of aligning quality standards with SLOs. Finally, hear expert insights on assessing organizational readiness for implementing effective SLOs.

May 28, 2024 • 32min

#43 - SLOs: a Deeper Dive into its Mechanics

Dive deep into the mechanics of Service Level Objectives (SLOs). Discover the importance of starting small and iterating based on real-world feedback. Learn how to defend and enforce SLOs for meaningful impact. Explore the need for continuous improvement and alignment with user expectations. Enhance communication skills to bridge gaps between tech and non-tech teams. Ultimately, it’s all about crafting SLOs that evolve with your system and truly reflect user needs.

May 21, 2024 • 29min

#42 - Hitting Software SLA Targets through SLOs and SLIs

In this first part of a 2-part series, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016).

Here are 7 takeaways from the show:
* Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service.
* Differentiate Between SLAs and SLOs: Understand the distinction between SLAs, which are legal contracts, and SLOs, which are based on customer expectations. Avoid using SLAs as a substitute for meaningful service level objectives.
* Prioritize Meaningful Metrics: Focus on a select few service level indicators (SLIs) that truly reflect what users want from the system. Avoid the temptation to monitor everything and instead choose indicators that provide valuable insights into service performance.
* Align with Customer Expectations: Start by understanding and prioritizing the expectations of your customers. Use their feedback to define service level objectives (SLOs) that align with their needs and preferences.
* Avoid Alert Fatigue: Be mindful of the number of metrics being monitored and the associated alerts. Too many indicators can lead to alert fatigue and make it difficult to prioritize and respond to issues effectively. Focus on a few key indicators that matter most.
* Start Top-Down with SLIs: Take a top-down approach to defining SLIs, starting with customer expectations and working downwards. This ensures that the selected metrics are meaningful and relevant to users' needs.
* Prepare for Deep Dives: Anticipate the need for deeper exploration of specific topics, such as SLOs, and allocate time and resources to thoroughly understand and implement them in your work.
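
The relationship between an SLI and an SLO in the takeaways above comes down to a little arithmetic. Here is a minimal Python sketch of it, not something taken from the episode or the book. The class name AvailabilitySLO, the 99.9% target, and the request counts are all hypothetical, chosen only for illustration. It treats the SLI as the measured share of successful requests and the error budget as the share of failures the SLO still tolerates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical availability SLO for one measurement window."""

    target: float  # assumed example: 0.999 means 99.9% of requests should succeed

    def sli(self, good: int, total: int) -> float:
        """SLI: the measured fraction of requests that succeeded."""
        if total == 0:
            return 1.0  # no traffic in the window, so nothing failed
        return good / total

    def error_budget_remaining(self, good: int, total: int) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed_bad = (1 - self.target) * total  # failures the SLO tolerates
        actual_bad = total - good
        if allowed_bad == 0:
            # A 100% target (or an empty window) leaves no budget to spend.
            return 0.0 if actual_bad == 0 else float("-inf")
        return 1 - actual_bad / allowed_bad


# Made-up numbers: 1,000,000 requests in the window, 600 of them failed.
slo = AvailabilitySLO(target=0.999)
good, total = 999_400, 1_000_000

print(f"SLI: {slo.sli(good, total):.4%}")
print(f"SLO met: {slo.sli(good, total) >= slo.target}")
print(f"Error budget remaining: {slo.error_budget_remaining(good, total):.0%}")
```

With these example numbers the service tolerates about 1,000 failed requests and has used 600 of them, so the SLO is met with roughly 40% of the budget left. A single user-facing indicator like this is in the spirit of the "few meaningful metrics" takeaway. An SLA, being a contract, would typically sit at a looser target than the internal SLO.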

May 14, 2024 • 25min

#41 Curbing High Observability Costs

Sofia Fosdick, a Senior Account Executive at Honeycomb.io with experience in observability, shares her expertise on managing observability costs. She delves into the challenges organizations face with rising costs and the often underutilized data. Sofia emphasizes aligning observability expenses with business value and advocates for transitioning from time-based to event-driven data strategies to optimize user experience. Her practical insights aim to help avoid astronomical bills while ensuring effective data management.

May 7, 2024 • 28min

#40 How to Enable Observability for Success

Observability is more than a set of technologies. It’s a practice. Timothy Mahoney is no stranger to this practice, having enabled many developer teams to adopt better observability practices. He’s a senior systems engineer at IKEA and is part of its observability enabling team.

Tim highlighted the importance of developing and driving frameworks for observability. He also covered the antipattern of teams having a tool-driven mindset and the challenges of moving them away from it.

You can connect with Timothy via LinkedIn.

Apr 30, 2024 • 25min

#39 How Chaos Engineering Helps Reduce Incident Risk

Chaos Engineering is no longer a nice-to-have, as Ananth Movva explains in this episode of the SREpath podcast. His experience with it led to fewer and less severe serious incidents and outages.

He’s been at the helm of reliability-focused decision-making at one of Canada’s largest banks, BMO, since 2020. Having completed 12 years at the bank, Ananth has seen the evolution of banking technology from archaic to user-centric, where incidents are taken seriously.

Ananth highlighted the use of chaos principles and tooling to identify future points of failure well ahead of time. He also talked about the challenges of getting developers to integrate chaos practices into the SDLC. You will not want to miss this conversation!

You can connect with Ananth via LinkedIn.
