Reliability Enablers

Latest episodes

May 30, 2024 • 20min

#44 - Making SLOs Matter to Stakeholders

Dive into the critical role of Service Level Objectives (SLOs) in establishing trust with stakeholders. Discover how mastering SLOs can enhance system reliability and align with customer expectations. The importance of transparency in healthcare software is explored, focusing on communication between vendors and clinicians. Learn about the dangers of unrealistic sales pitches and the necessity of aligning quality standards with SLOs. Finally, hear expert insights on assessing organizational readiness for implementing effective SLOs.
May 28, 2024 • 32min

#43 - SLOs: a Deeper Dive into its Mechanics

Dive deep into the mechanics of Service Level Objectives (SLOs). Discover the importance of starting small and iterating based on real-world feedback. Learn how to defend and enforce SLOs for meaningful impact. Explore the need for continuous improvement and alignment with user expectations. Enhance communication skills to bridge gaps between tech and non-tech teams. Ultimately, it’s all about crafting SLOs that evolve with your system and truly reflect user needs.
May 21, 2024 • 29min

#42 - Hitting Software SLA Targets through SLOs and SLIs

In this first part of a two-part series, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016). Here are 7 takeaways from the show:
* Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service.
* Differentiate Between SLAs and SLOs: Understand the distinction between SLAs, which are legal contracts, and SLOs, which are based on customer expectations. Avoid using SLAs as a substitute for meaningful service level objectives.
* Prioritize Meaningful Metrics: Focus on a select few service level indicators (SLIs) that truly reflect what users want from the system. Avoid the temptation to monitor everything and instead choose indicators that provide valuable insights into service performance.
* Align with Customer Expectations: Start by understanding and prioritizing the expectations of your customers. Use their feedback to define service level objectives (SLOs) that align with their needs and preferences.
* Avoid Alert Fatigue: Be mindful of the number of metrics being monitored and the associated alerts. Too many indicators can lead to alert fatigue and make it difficult to prioritize and respond to issues effectively. Focus on a few key indicators that matter most.
* Start Top-Down with SLIs: Take a top-down approach to defining SLIs, starting with customer expectations and working downwards. This ensures that the selected metrics are meaningful and relevant to users' needs.
* Prepare for Deep Dives: Anticipate the need for deeper exploration of specific topics, such as SLOs, and allocate time and resources to thoroughly understand and implement them in your work.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
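
To make the SLI, SLO and SLA distinction from the takeaways concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than something stated in the episode: the function names, the request counts, and the choice of an internal SLO target (99.5%) that is stricter than the contractual SLA (99.0%).

```python
def sli_good_ratio(good_events: int, total_events: int) -> float:
    """SLI: the measured fraction of events that were 'good' for the user."""
    if total_events == 0:
        # Assumption: a window with no traffic is treated as fully good.
        return 1.0
    return good_events / total_events


def meets_target(sli: float, target: float) -> bool:
    """Compare a measured SLI against a target (an SLO or an SLA threshold)."""
    return sli >= target


# Hypothetical numbers for one measurement window.
SLO_TARGET = 0.995  # internal objective (assumed)
SLA_TARGET = 0.990  # contractual commitment (assumed)

sli = sli_good_ratio(good_events=99_300, total_events=100_000)
slo_ok = meets_target(sli, SLO_TARGET)
sla_ok = meets_target(sli, SLA_TARGET)

print(f"SLI={sli:.4f} SLO met={slo_ok} SLA met={sla_ok}")
```

With these made-up numbers the service misses its internal objective while still honoring the contract, which is one way an SLO can act as an early warning ahead of an SLA breach.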
May 14, 2024 • 25min

#41 Curbing High Observability Costs

Sofia Fosdick, a Senior Account Executive at Honeycomb.io with experience in observability, shares her expertise on managing observability costs. She delves into the challenges organizations face with rising costs and the often underutilized data. Sofia emphasizes aligning observability expenses with business value and advocates for transitioning from time-based to event-driven data strategies to optimize user experience. Her practical insights aim to help avoid astronomical bills while ensuring effective data management.
May 7, 2024 • 28min

#40 How to Enable Observability for Success

Observability is more than a set of technologies. It’s a practice. Timothy Mahoney is no stranger to this practice, having helped many developer teams adopt better observability practices. He’s a senior systems engineer at IKEA and is part of its observability enabling team. Tim highlighted the importance of developing and driving frameworks for observability. He also covered the antipattern of teams having a tool-driven mindset and the challenges of moving them away from it. You can connect with Timothy via LinkedIn.
Apr 30, 2024 • 25min

#39 How Chaos Engineering Helps Reduce Incident Risk

Chaos Engineering is no longer a nice-to-have, as Ananth Movva explains in this episode of the SREpath podcast. In his experience, it reduced both the number and the severity of serious incidents and outages.

He’s been at the helm of reliability-focused decision-making at one of Canada’s largest banks, BMO, since 2020. Having completed 12 years at the bank, Ananth has seen banking technology evolve from archaic to user-centric, where incidents are taken seriously.

Ananth highlighted the use of chaos principles and tooling to identify future points of failure well ahead of time. He also talked about the difficulty of getting developers to integrate chaos practices into the SDLC. You will not want to miss this conversation!

You can connect with Ananth via LinkedIn.
Apr 23, 2024 • 24min

#38 The Real Cost of Software Reliability & Downtime

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this second part, we talk about the costs of reliability and of choosing not to do it well, or at all. Here are key takeaways from our conversation:
* Prioritize Risk Mitigation: Recognize SRE as a discipline focused on mitigating risks within your organization, including technology, reputation, and financial risks. Allocate resources accordingly to address these risks proactively.
* Consider Cost-Effectiveness: When aiming to improve reliability, consider the cost-effectiveness of incremental improvements. Evaluate the balance between investment in reliability and the value it brings to your organization.
* Advocate Continuously: Continuously advocate for the importance of reliability engineering within your organization. Communicate transparently about the value SRE teams add and the impact of their work on the organization's success.
* Explore Alternative Metrics: Explore alternative availability metrics beyond traditional time-based measurements. Consider event-based metrics to gain a more nuanced understanding of service availability and performance.
* Embrace Regional Focus: Shift from relying solely on global availability metrics to more granular regional metrics. Understand the varying impacts on different customer audiences and prioritize improvements accordingly.
* Navigate Regulatory Challenges: Be mindful of regulatory challenges, such as GDPR, and understand their implications on service availability and reliability. Adapt strategies and solutions to comply with regulations while maintaining operational efficiency.
* Align Reliability with Revenue: Recognize the direct correlation between service availability and revenue generation, particularly for revenue-driven services like ad platforms. Invest in reliability engineering to ensure consistent revenue streams.
* Tier Services Strategically: Implement a tiered approach to prioritize reliability efforts, with revenue-generating services like ad platforms placed in the top tier. Allocate resources based on the criticality of services to the organization's objectives.
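
The difference between time-based, event-based, and regional availability mentioned in the takeaways can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Window` type, the function names, the region labels, and all numbers are hypothetical and do not come from the episode.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """One measurement window for a hypothetical service (all values assumed)."""
    minutes_total: int
    minutes_down: int
    requests_total: int
    requests_failed: int


def time_based_availability(w: Window) -> float:
    """Fraction of the window during which the service was up."""
    return (w.minutes_total - w.minutes_down) / w.minutes_total


def event_based_availability(w: Window) -> float:
    """Fraction of requests in the window that succeeded."""
    return (w.requests_total - w.requests_failed) / w.requests_total


def availability_by_region(counts: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Event-based availability per region from (total, failed) request counts."""
    return {region: (total - failed) / total for region, (total, failed) in counts.items()}


# Made-up 30-day window: 43 minutes of downtime that fell in a quiet period.
w = Window(minutes_total=43_200, minutes_down=43,
           requests_total=10_000_000, requests_failed=2_000)

# Made-up regional split of the same requests.
by_region = availability_by_region({
    "eu": (6_000_000, 300),
    "apac": (4_000_000, 1_700),
})

print(f"time-based:  {time_based_availability(w):.5f}")
print(f"event-based: {event_based_availability(w):.5f}")
for region, value in by_region.items():
    print(f"{region}: {value:.6f}")
```

In this invented scenario the same outage looks different depending on the metric, and the global event-based figure hides that one region absorbed most of the failed requests.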
Apr 16, 2024 • 30min

#37 An SRE Approach to Managing Technology Risk

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this first part, we talk about embracing risk from the SRE perspective. We'll cover how it's very different to the typical IT risk management mindset. Here are key takeaways from our conversation:
* Embrace Risk with Velocity: Rather than being hindered by traditional governance models and change approval boards, consider embracing risk while maintaining development velocity. Strive to find a balance between risk management and the speed of innovation.
* Reevaluate Risk Management Approaches: Challenge traditional approaches to risk management, especially in larger organizations with extensive governance procedures. Explore alternative methods that prioritize agility and efficiency without compromising reliability.
* Conceptualize Risk as a Continuum: View risk as a continuous spectrum and assess it based on various dimensions, such as the complexity of changes, the criticality of systems, and the impact on user experience. Continuously evaluate and adjust risk management strategies accordingly.
* Balance Stability and Innovation: Recognize that extreme reliability comes at a cost and may hinder the pace of innovation. Aim for an optimal balance between stability and innovation, prioritizing user satisfaction and efficient service operations.
* Implement Service-Level Objectives (SLOs): Deliver services with explicitly delineated levels of service, allowing clients to make informed risk and cost trade-offs when building their systems. Define SLOs based on the importance and criticality of services to enable better decision-making.
* Visualize Risk Assessment: Utilize visual representations, such as whiteboard diagrams, to assess and communicate different levels of risk within your software systems. Encourage collaborative discussions among team members to determine acceptable risk levels.
* Prioritize Customer Impact: Consider the impact of changes on customer experience and prioritize risk management efforts accordingly. Differentiate between critical user journeys and cosmetic changes to allocate scrutiny appropriately.
Apr 9, 2024 • 27min

#36 Avoiding Critical Platform Engineering Mistakes

Abby Bangser, a Principal Engineer at Syntasso and former SRE, shares her expertise on the evolving landscape of platform engineering. She emphasizes the importance of concrete definitions and maturity models in navigating this transition. Abby cautions against confusing developer portals with fully functional platforms and discusses the need for customizable, self-service solutions to enhance developer experiences. The conversation also highlights the socio-technical dynamics of platform engineering and the significance of aligning technology with organizational goals.
Apr 2, 2024 • 35min

#35 Boosting Your Observability Data's Usability

The observability (o11y) data revolution is well underway, but are we getting the most from the data that is being collected?

Richard Benwell thinks we have room for improvement, especially at the usage stage where we query and visualize the o11y data.

He is the founder and CEO of SquaredUp, a dashboard software company based in Maidenhead, UK, with over 10 years of experience in the monitoring space. Richard highlighted the importance of converging human intuition with technical o11y implementations and moving from a narrow focus on collecting data to leveraging it for actionable insights. You can connect with Richard via LinkedIn.
