Reliability Enablers

Latest episodes

Aug 15, 2024 • 10min

#53 What's Missing in Incident Response Processes?

Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. However, issues remain because the processes supporting incident response are not robust.

Incident response software alone isn't going to fix bad incident processes. It will help, for sure: you need incident management tools to manage the data and communications within an incident. But you also need effective processes and human-technology integration.

Dr Ukis wrote in his Establishing SRE Foundations book about complex incident coordination and priority setting. According to Vladislav, the beginning of your SRE journey is not going to be focused on setting up an incident response process, but on core SRE artifacts like SLIs, availability measurement, and SLOs. These core concepts are what let you say that you can now safely invest more in customer-facing features. Only later, once these things are more or less established in the organization, does a formal incident response process come into focus.

Understanding and Leveraging SLOs

Once your Service Level Objectives (SLOs) are well-defined and refined over time, they should accurately reflect user and customer experiences. Your SLOs are no longer just initial metrics; they’ve been validated through production. Product managers should now be able to use this data to make informed decisions about feature prioritization. This foundational work is crucial because it sets the stage for integrating a formal incident response process effectively.

Implementing a Formal Incident Response

Before you overlay a formal incident response process, ensure that you have the cultural and technical groundwork in place. Without this, the process might not be as effective.
When the foundational SLOs and organizational culture are strong, a well-structured incident response process can be significantly more effective.

Coordinating During Major Incidents

When a significant incident occurs, detecting it through SLO breaches is just the beginning. You need a system in place to coordinate responses across multiple teams. Consider appointing incident commanders and coordinators, as recommended in PagerDuty’s documentation, to manage this coordination. Develop a lightweight process to guide how incidents are handled.

Classifying Incidents

Establish an incident classification scheme to differentiate between types of incidents. This scheme should include priorities such as Priority One, Priority Two, and Priority Three. Due to the inherently fuzzy nature of incidents, your classification system should also include guidelines for handling ambiguous cases. For instance, if uncertain whether an incident is Priority One or Two, default to Priority One.

Deriving Actions from Incident Classification

Based on the incident classification, outline specific actions. For example, Priority One incidents might require immediate involvement from an incident commander, who might take the following actions:

* Create a communication channel, assemble relevant teams, and start coordination.
* Simultaneously inform stakeholders according to their priority group.
* Define stakeholder groups and establish protocols for notifying them as the situation evolves.

Keep Incident Response Processes Simple and Accessible

Ensure that your incident response process is concise and easily understandable. Ideally, it should fit on a single sheet of paper. Complexity can lead to confusion and inefficiencies, so aim for simplicity and clarity in your process diagram.
This approach ensures that the process is practical and can be followed effectively during an incident.

Preparing Your Organization

An effective incident response process relies on an organization’s readiness for such rigor. Attempting to implement this process in an organization not yet mature enough may result in poor adherence during critical times. Make sure your organization is prepared to follow the established procedures.

For a deeper dive into these concepts, consider reading "Establishing SRE Foundations," available on Amazon and other book retailers. For further inquiries, you can also connect with the author, Vlad, on LinkedIn.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
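The classification scheme and the "default to the higher priority when unsure" rule from this episode can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Priority` enum, the `ACTIONS` table, and the action strings for Priority Two and Three are hypothetical and not taken from the episode or the book. Only the Priority One actions loosely follow the list in the notes above.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical three-level scheme; a lower number means more urgent."""
    P1 = 1
    P2 = 2
    P3 = 3


# Hypothetical mapping from priority to first actions; adapt to your own process.
ACTIONS = {
    Priority.P1: [
        "involve incident commander",
        "create communication channel",
        "assemble relevant teams",
        "inform stakeholder groups",
    ],
    Priority.P2: ["create communication channel", "notify owning team"],
    Priority.P3: ["file ticket for owning team"],
}


def classify(candidates):
    """Resolve an ambiguous classification by picking the most urgent candidate.

    If responders cannot decide between Priority One and Two, pass both
    and the incident is treated as Priority One.
    """
    if not candidates:
        raise ValueError("at least one candidate priority is required")
    return min(candidates)


def first_actions(candidates):
    """Return the resolved priority together with its first actions."""
    priority = classify(candidates)
    return priority, ACTIONS[priority]


if __name__ == "__main__":
    priority, actions = first_actions([Priority.P2, Priority.P1])
    print(priority.name, actions)
```

Keeping the whole scheme in one small table mirrors the advice that the process should fit on a single sheet of paper.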
Aug 13, 2024 • 29min

Can ITIL Benefit from Site Reliability Engineering?

According to Vlad Ukis, there are a lot of enterprises whose IT functions are organized around ITIL. What you use SRE for is something completely different. SRE is not for setting up the IT function. It is for enabling the product organization to operate online services reliably at scale.

However, the problem is that many in the industry are NOT using SRE principles but instead handing over complex services to a more traditional IT function.

Dr. Vladislav Ukis is well qualified to talk about reliability: at Siemens Healthineers he leads 250 people globally to offer their cloud platform, which runs on Microsoft Azure.

We discussed key concepts from his book, Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations.

Unlike other technical books in this field, Dr Ukis’ book is aimed at technology professionals who are beginners on the reliability journey. This is different from the Site Reliability Engineering (2016) book by Google, which covers all the bells and whistles that SRE encompasses. That book requires a degree of prior knowledge and prior experience in the field. Vlad wanted to make it more accessible:

"What I did with my book is to say, ‘Okay, so now you've never done operations, but you now are thrown in the world of online services where you have to operate them. How do you get started?’ So this is what the book is for. So for people who want to learn how to get started in the world of operating online services."

ITIL was originally developed by the UK government in the 1980s to improve IT governance. It relates to SRE most closely through its service management and incident management components. But it’s for managing systems that are more predictable and can be handled through strict process control.

Modern product delivery doesn’t have the luxury of the bureaucratic levels of predictability that older IT services have.
It requires a more engineering-oriented approach to solving problems and incidents and to providing services. So how was Vlad’s experience bringing SRE into an organization that previously had run solely on the ITIL model?

Siemens Healthineers for many years operated like a traditional software development organization. In other words, they were developing on-prem software, not cloud software. The company would ship the physical software product to its hospital customers and then those hospitals would have the software operated and supported by their IT departments.

The change came about when Siemens Healthineers began to work on a new digital health platform, which would be cloud-based from the beginning. So they would no longer ship physical software on discs to customers, but provide online services in the cloud centrally for the customers to use.

The early days were haphazard: the software was deployed to the cloud with no major issues. Not many customers were on the cloud platform, so the team could get away with “handcrafted operating procedures”.

But as traffic and service count started to rise rapidly, the Healthineers team learned that they needed a more professional approach. They began to understand that their initial approach to operations could not continue as-is.

This is when Vladislav began to drive SRE practices in the organization.

This was a sub-30-minute conversation that covered a lot of ground relevant to organizations looking to transition to product delivery of online services at scale. Have a listen.
Aug 6, 2024 • 37min

#52 Navigating Complexity within Incidents

Sonja Blignaut is a complexity expert. That might not sound relevant to incident response in reliability engineering. But it is!

Our systems are becoming more complex and so are the resulting incidents.

Learning about complexity can help reliability folk go into an incident with less anxiety, which we’ll explore in this episode.

We'll explore the causes of complexity in incidents and how the Cynefin framework classifies incidents.

We'll also deep dive into the concept of complexity itself and clear up a common mix-up between complexity and complicatedness.

About Sonja

Sonja is a co-founder of Complexity Fit and founder of More Beyond, focusing on helping teams build capacity for sensemaking, collaboration, and wayfinding.

She has a background in programming from her early career as a meteorologist, having worked in C and Fortran, and then progressing to working as a web developer.

You can connect with Sonja to learn more about complexity via LinkedIn.
Jul 30, 2024 • 10min

#51 Whitebox vs Blackbox Monitoring

Have you got complete monitoring of your software in effect? Are you sure? Google's SREs break monitoring down into whitebox versus blackbox monitoring.

It's not the same as internal versus external monitoring, which we'll explore further.

We'll cover topics like:

- (quickly) What is monitoring?
- What is whitebox monitoring?
- What is blackbox monitoring?
- The rising importance of blackbox monitoring

This is a concept from Chapter 6 (Monitoring Distributed Systems) of the Google SRE (2016) book. Chapter written by Rob Ewaschuk and edited by Betsy Beyer.
Jul 9, 2024 • 25min

#50 Making Better Sense of Observability Data

Jack Neely is a DevOps observability architect at Palo Alto Networks and has a few interesting ways of extracting value from o11y data.

We crammed ideas like these 7 takeaways into just under 25 minutes:

* Reasserting the Need to Monitor Four Golden Signals: Focus on latency, traffic, errors, and saturation for effective system monitoring and management.
* Prioritize Customer Health: In Jack’s words, the 5th golden signal. Go beyond traditional metrics to monitor the health of your customers for a more comprehensive view of your system's impact.
* Apply Mathematical Techniques: Incorporate mathematical concepts, like the Nyquist-Shannon sampling theorem and the t-digest algorithm, to improve the accuracy of your observability metrics.
* Build Accurate Percentiles: Implement techniques to accurately reproduce percentiles from raw data to ensure reliable performance metrics.
* Manage High Cardinality Data: Develop strategies to handle high cardinality data without overwhelming your resources, ensuring you extract valuable insights.
* Standardize Log Records: Use readily available frameworks to emit standardized log records, which makes data easier to process and visualize.
* Handle High-Velocity Data Efficiently: Develop methods for collecting and processing high-velocity data without incurring prohibitive costs.

Watch Jack’s Monitorama talk via this link: https://vimeo.com/843996971
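To make the "Build Accurate Percentiles" takeaway concrete, here is a minimal Python sketch of why percentiles should be rebuilt from raw data rather than averaged across hosts. It is not Jack's method: the host names and latency values are invented for illustration, and the helper uses the simple nearest-rank definition instead of a streaming structure such as t-digest.

```python
def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it.

    q must be an integer between 1 and 100; integer arithmetic avoids float rounding.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    ordered = sorted(values)
    rank = -(-q * len(ordered) // 100)  # ceiling of q * n / 100
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds from two hosts with very different traffic.
host_a = [10] * 99 + [20]  # 100 mostly fast requests
host_b = [500] * 10        # 10 slow requests

p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)

# Averaging per-host percentiles ignores how many requests each host served.
averaged_p99 = (p99_a + p99_b) / 2

# Pooling the raw samples gives the percentile users actually experienced.
pooled_p99 = percentile(host_a + host_b, 99)

if __name__ == "__main__":
    print("average of per-host p99:", averaged_p99)
    print("p99 of pooled raw data:", pooled_p99)
```

In this made-up data the averaged figure lands well below the pooled one, which is the kind of distortion that mergeable sketches like t-digest are designed to avoid when keeping all raw samples is too expensive.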
Jul 2, 2024 • 30min

#49 Alert Fatigue is Still an Issue - Here's How We Fix it

Dan Ravenstone, a Staff Engineer at Top Hat and a platform engineering expert, shares insights on tackling alert fatigue, a pressing issue in monitoring systems. He emphasizes the need for regular updates to monitoring systems and crafting alerts that truly resonate with user experience. By reducing unnecessary noise and focusing on actionable alerts, organizations can enhance incident management. Ravenstone also mentions the importance of leadership support and understanding the user journey to ensure alerts are meaningful and enhance employee well-being.
Jun 25, 2024 • 44min

#48 Cutting Down "Toil" aka Manual Work in Software

Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil.

We hit the jackpot with concepts like:

* what toil is, according to a 5-point set of criteria
* why even care about toil?
* where you can find toil in your software system
* Google’s goal for how much work (%) should be toil
* the fact that toil isn’t always all that bad

Don’t have time to listen to what we learned or added to the concepts? Check out the takeaways toward the end of this email.

But first…

Before we jump into the takeaways, here’s a new segment I’m trying out for newsletters. I’ll highlight a new reliability tool that I think could help you.

Do you struggle to visualize your Kubernetes workloads? In that case, have you heard of kube-ops-view? It helps you visualize your complex K8s clusters and everything inside them.

For a deeper rundown, visit the LinkedIn post I made about kube-ops-view, which shares a few more details.

Back to our original programming…

Here are key takeaways from our chat

* Define and Identify Toil: Regularly evaluate your tasks. Identify work that is manual, repetitive, and potentially automatable. Recognize it as toil and prioritize its reduction.
* Prioritize Automation: Look for repetitive tasks in your workflow and automate them using tools and scripts to reduce manual interventions and increase efficiency.
* Embrace the Role of an SRE: Realize that the role of an SRE is to improve system reliability proactively. Focus on long-term improvements rather than just responding to immediate issues.
* Address Common Sources of Toil: Identify frequent sources of toil like context switching, on-call duties, and release processes. Implement solutions to automate and streamline these areas.
* Adopt a Toil Elimination Mindset: Cultivate a mindset focused on eliminating toil. Regularly discuss and explore automation opportunities with your team to improve processes.
* Develop a Culture of Continuous Improvement: Encourage a culture that values reducing manual, repetitive work. Advocate for proactive problem-solving and continuous process enhancement within teams.

Until next time, happy toil hunting!
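One way to act on the "Define and Identify Toil" takeaway is to track what share of a week went to toil. The sketch below is an assumption-laden illustration: the task names and hours are invented, and the 50% ceiling reflects the goal stated in Chapter 5 of the SRE book of keeping toil below half of each SRE's time.

```python
# Ceiling from the SRE book's stated goal: toil below 50% of each SRE's time.
TOIL_CEILING = 0.50


def toil_fraction(hours_by_task, toil_tasks):
    """Return the share of total logged hours spent on tasks labeled as toil."""
    total = sum(hours_by_task.values())
    if total <= 0:
        raise ValueError("total logged hours must be positive")
    toil_hours = sum(h for task, h in hours_by_task.items() if task in toil_tasks)
    return toil_hours / total


# Hypothetical week of work for one engineer (hours).
week = {
    "manual deploys": 10,
    "ticket triage": 8,
    "automation project": 16,
    "design review": 6,
}
# Which tasks count as toil is a team judgment call; this labeling is made up.
toil_tasks = {"manual deploys", "ticket triage"}

if __name__ == "__main__":
    fraction = toil_fraction(week, toil_tasks)
    status = "within" if fraction < TOIL_CEILING else "over"
    print(f"toil share: {fraction:.0%} ({status} the ceiling)")
```

A plain dictionary of hours is enough to start the conversation; the harder part is agreeing as a team on which tasks are toil.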
Jun 18, 2024 • 29min

#47 How to Grow Team Impact Through Learning Culture

The common refrain after an incident is “We could and should learn from this”. To me, that alludes to the need for a robust learning culture.

We might think we already have a good learning culture because we talk about problems and deep-dive into them in retrospectives.

But how often do we explore the nuances of how we are learning?

Sorrel Harriet is an expert in supporting software engineering teams to develop a stronger learning culture. She was a “Continuous Learning Lead” at Armakuni (software consultancy) and now does the same work under her own banner.

Her work ties in well with the ideas shared by Manuel Pais in episode #45 about how enabling teams can support a continuous learning culture. We tackled issues like the value of certifications, comparing technical with non-technical skills, and more.

You can connect with Sorrel via LinkedIn

Learn more about what Sorrel does via LaaS.consulting

Here’s a bonus section because you read all this way. It covers 5 public outages and how the affected teams could improve their learning culture:

1. Slack Outage (February 2023)
Slack experienced a global outage disrupting communication for hours due to backend infrastructure issues. Perhaps the team could focus their learning on more robust infrastructure management and resilience improvement.

2. Twitter Algorithm Glitch (April 2023)
A glitch in Twitter's algorithm caused timeline issues, stemming from a problematic software update. Perhaps the team could focus their learning on thorough testing and game days to rectify critical system errors swiftly.

3. Microsoft Azure AD Outage (March 2023)
Azure Active Directory faced a significant outage due to an internal configuration change. Perhaps the team could focus their learning on the importance of rigorous change management and how to address misconfigurations quickly.

4. Google Cloud Platform Networking Issue (May 2023)
Google Cloud Platform experienced widespread service disruptions from a software bug in its networking infrastructure. Perhaps the team could focus their learning on the need for comprehensive testing and preventing disruptions.

5. GitHub Outage (June 2023)
GitHub suffered a major outage caused by a cascading failure in its storage infrastructure. Perhaps the team could focus their learning on robust fault-tolerance mechanisms and ways to address the root causes of failures.
Jun 11, 2024 • 24min

#46 Platform Team Design According to Team Topologies

I continue my conversation with Manuel Pais, co-author of the seminal Team Topologies book, about team topologies suitable for reliability teams.

In this second part, we will talk about platform teams.

A quick refresher on what platform teams do

In the team topologies context:

Platform teams provide a curated set of self-service capabilities to enable stream-aligned teams (product or feature teams) to deliver work with greater speed and reduced complexity.

They achieve this by abstracting away common infrastructure and operational concerns. By doing this, they aim to allow stream-aligned teams to focus on delivering business value.

Here are the key takeaways from our conversation

For those who don’t have time to listen to this episode (but you’re missing out on a great conversation):

* Focus on User-Centric Design: Prioritize the user experience in platform development. Regularly collaborate with internal teams to ensure the platform meets their needs and reduces their pain points.
* Build and Maintain Trust: Establish and nurture trust with your platform’s users. Trust is crucial for platform adoption and can prevent resistance, which helps sustain use.
* Justify Platform Value: Continuously demonstrate the value of your platform to management and stakeholders, especially during economic downturns. Highlight its contributions to avoid cuts and maintain support.
* Understand Adoption Lifecycle: Recognize that platforms go through different stages of adoption. Identify and support early adopters, and gradually bring in late adopters by showcasing successful use cases.
* Enhance Collaboration: Foster open communication between platform teams and other teams. Avoid rigid roadmaps and be adaptable to changing needs to prevent barriers and build stronger internal relationships.
* Manage Cognitive Load: Be mindful of the cognitive load on your teams. Simplify processes and reduce unnecessary complexities to enhance productivity and efficiency.
* Use Tools to Measure Cognitive Load: Implement tools like Teamperature to assess the cognitive load on your teams regularly. Use the insights to identify and mitigate factors contributing to cognitive overload.
* Leverage Experienced Product Managers: Ensure experienced product managers are part of your platform team. They can balance long-term goals with the flexibility needed to adapt to the evolving needs of internal users.

I think the uncommon takeaway here is the last one, in that platform teams should treat their platform as a product. Product Managers like Paweł Huryn and Marty Cagan are doing great work in laying out the roadmap for product management.

Did you end up checking out the reliability workstreams map I published last week?

It’s free and can help you stay focused on the right priorities at work.

Check it out via this link
Jun 4, 2024 • 25min

#45 How Team Topologies Can Guide Enabling Teams

I got the inside word from Manuel Pais, co-author of the seminal Team Topologies book, who explains in a 2-part series 2 of the most relevant team topologies for reliability work.

In this first part, we will talk about enabling teams.

A quick refresher on what enabling teams do

In the team topologies context:

Enabling teams help stream-aligned teams (product or feature teams) to overcome obstacles and improve their capabilities in specific areas.

This kind of team is available to provide expertise, guidance, and support to other teams working to adopt new technologies, practices, or skills.

In other news…

This podcast has a new name

What more fitting moment to announce renaming the SREpath podcast to “The Reliability Enablers” podcast?

This name change reinforces our quest to demystify and enable reliability efforts so that more organizations successfully implement SRE principles and beyond.

Before we get to the 8 takeaways

Here’s something relevant to enabling reliability work: a reliability workstreams map I’ve had in my private notes for years, now going public.

What is a workstream? 🤔

You might have heard of “value streams”. They show the end-to-end journey of creating and delivering value to a customer.

Workstreams support your value streams. They cover the activities carried out to do so.

In summary: Value streams are the goals and workstreams are the activities you do to achieve those goals.

Okay, now time for the erudite takeaways that Manuel gave me from our talk.

Takeaways from the episode

Here are the key takeaways from our conversation for those who don’t have time to listen (but you’re missing out on a great audio conversation):

* Create Enabling Teams: Form SRE-focused enabling teams to facilitate technical training, optimize cloud architecture, improve documentation, and overall help other teams build their capabilities.
* Work to Minimize Cognitive Load: Minimize the cognitive load on engineers by centralizing complex and repetitive tasks, allowing engineers to concentrate on innovation and high-value work. You can measure cognitive load and manage it through the Teamperature tool.
* Facilitate Learning and Adoption of Best Practices: Use SRE enabling teams to educate product teams on critical practices like error budgets and service level objectives, making the learning process gradual and manageable.
* Collaborate among Topologies for Effective Tooling: Enabling teams should work with platform teams to inform their plans to develop and co-evolve tools and services that support reliability and observability practices, like automated dashboards and alerting systems.
* Adapt Approaches Based on Organizational Capacity: Tailor the mix of enabling and platform support based on the organization’s resources and constraints, ensuring flexibility and efficiency.
* Avoid Traditional Ops Work for SRE Teams: Ensure SRE teams focus on empowering product teams rather than performing traditional operations tasks, promoting a culture of shared responsibility.
* Build an Effective Learning Culture: Foster a culture of continuous learning and improvement, integrating learning opportunities into the daily workflow rather than relying solely on formal training programs.
* Scale Capabilities Across the Organization: When needed, scale enabling efforts to build organization-wide capabilities, ensuring that expertise is distributed and not bottlenecked within specialized departments.