

Slight Reliability
Stephen Townshend
Learning SRE, one day at a time.
Episodes

Feb 25, 2025 • 30min
Observability Maturity with Ádám Tóth (Episode 92)
This week Adam and I get philosophical about what constitutes maturity in the field of observability. We tackle questions such as...
💸 Does your org treat observability as a cost centre or a value add?
🔥 Are you using observability reactively to solve problems? Or proactively to build better products and services?
👤 Is your observability connected to your users and business in a meaningful way?
🌐 Is monitoring the social media sentiment of your product part of observability?
...and much more.
You can find Adam at:
LinkedIn: https://www.linkedin.com/in/adam-toth-innovateq/
InnovaTeQ website: https://innovateq.io/
I mentioned the 'This Is Fine!' podcast about resilience engineering. Find it on Spotify or at https://www.thisisfinepod.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Bluesky: https://bsky.app/profile/slightreliability.bsky.social
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Jan 21, 2025 • 16min
Head in the Clouds (Episode 91)
In this episode I explore the challenges of achieving unified observability when integrating with SaaS products and services. I cover:
🌊 The new wave of mega-complex SaaS
⚗️ Challenges integrating SaaS with our observability pipelines
👩‍🦯 How the lack of SaaS autonomy limits the effectiveness of OpenTelemetry
💰 Paying twice to ingest, store, and search telemetry
📈 Monitoring and predicting SaaS observability costs
...and much more.
Shout out to Mark Chiavaroli (and apologies for mispronouncing your surname multiple times), Damian Sharrock, and Reece Hewitt for bouncing ideas on this topic.
The 'Is it observable?' series can be found here: https://isitobservable.io/
...and you can find Henrik on LinkedIn: https://www.linkedin.com/in/hrexed/

Dec 10, 2024 • 18min
Non-Prod Reliability Engineering + 2024 Wrap (Episode 90)
This week I check in and give an update on work, life, and my attempts at bringing SRE practices to life in the world of non-production environment management.
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Twitter: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre
This episode was sponsored by SquaredUp. SquaredUp combines all your data with awesome dashboards, analytics, health rollup, and notifications, into a unified observability portal. Using a data mesh architecture, SquaredUp is a beautifully simple way to get instant access to the insights that matter, whenever you need them. If you want to know more head over to https://squaredup.com/ to sign up for your free account.

Sep 3, 2024 • 26min
Slight Reliability Episode 89 - Blameless Post-mortems with Karanveer Anand
This week I'm joined by Karanveer Anand, SRE Technical Program Manager at Google, to discuss blameless post-mortems. We cover:
🦅 The recent Crowdstrike outage and their public post-mortem
🚑 When do we do a blameless post-mortem?
😕 How do we do a blameless post-mortem?
✅ How do we make sure action items are followed through?
📰 The power of learning from post-mortems created by other teams and orgs
...and much more.
You can find Karanveer on LinkedIn: https://www.linkedin.com/in/karanveer/
You can find Crowdstrike's preliminary post incident report here: https://www.crowdstrike.com/blog/falcon-content-update-preliminary-post-incident-report/
This episode was sponsored by SquaredUp.

Aug 27, 2024 • 27min
Slight Reliability Episode 88 - OpenTelemetry Revisited with Zach Michel
This week Zach Michel from https://middleware.io/ and I discuss the state of OpenTelemetry and what it means to adopt it. We cover:
🌩️ Achieving observability in a SaaS world
🥫 Context propagation - the magic sauce of OTEL
🚪 The telemetry gateway concept and leveraging the OTEL collector
🪵 The state of OpenTelemetry logging
🫂 Making use of the OpenTelemetry community
...and much more.
You can find Zach on LinkedIn: https://www.linkedin.com/in/zamichel/
For a list of ways to interact with the OpenTelemetry community go to: https://opentelemetry.io/community/
This episode was sponsored by SquaredUp.

Jul 24, 2024 • 36min
Slight Reliability Episode 87 - Measuring the value of SRE with Artem Yakimenko
In Episode 80 Niall Murphy talked about the need for SREs to be better at articulating the value of our work. In this episode I'm joined by Artem Yakimenko, ex-Googler and Engineering Director (SRE) at Culture Amp, to discuss how we might achieve this.
We discuss both quantifiable and qualitative approaches including leveraging the untapped data in support tickets, customer sentiment and rankings, the relationship between finance and performance, the link between user design and performance, and so much more.
Books mentioned in the episode:
100 Things Every Designer Needs to Know About People
By Susan Weinschenk
https://www.amazon.com.au/Things-Every-Designer-Needs-People/dp/0321767535
You can find Artem on LinkedIn: https://www.linkedin.com/in/temikus/
This episode was sponsored by SquaredUp.

Jun 8, 2024 • 26min
Slight Reliability Episode 86 - Evolving SLOs with Dom Finn
In the world of SRE we constantly talk about defining SLOs, but what about evolving them over time? This week I chat with SRE Tech Lead Dom Finn about just that. We cover the relationship between reliability and user analytics, latency classes as a way to speak SLOs with business stakeholders, the role of NFRs and how the thresholds differ from SLOs, and much more.
Books mentioned in the episode:
The Beginning of Infinity: Explanations That Transform the World
By David Deutsch
https://www.amazon.com.au/Beginning-Infinity-Explanations-Transform-World/dp/0143121359
Turn The Ship Around!
By David Marquet
https://davidmarquet.com/turn-the-ship-around-book/
You can find Dom on LinkedIn: https://www.linkedin.com/in/dom-finn/
This episode was sponsored by SquaredUp.

May 2, 2024 • 11min
Slight Reliability Episode 85 - Feeling SaaSsy
This week I talk about the impact of SaaS-first technology strategies on the work of an SRE. I pose questions about observability, ownership, on-call, and how much control we have over reliability.
You can find the Bleeding Tech blog on Medium: https://medium.com/@stownshend

Mar 30, 2024 • 28min
Slight Reliability Episode 84 - Clinical Troubleshooting with Dan Slimmon
This week I chat with Dan Slimmon about applying the approach doctors use to treat patient symptoms during incident response.
You can find Dan's blog at https://blog.danslimmon.com/ or connect with him on LinkedIn here: https://www.linkedin.com/in/danslimmon/
This episode was sponsored by SquaredUp.

Mar 5, 2024 • 31min
Slight Reliability Episode 83 - An Unfulfilled Promise with Itiel Shwartz
This week I hear about all things Kubernetes from Komodor CTO and co-founder Itiel Shwartz. We chat about the promise that was made when Kubernetes first entered the industry, the challenge of getting developers engaged and capable of working in Kubernetes, my hate/hate relationship with Helm but its important contribution to the Kubernetes project, Kubernetes observability, and so much more.
You can find the Kubernetes for Humans podcast here: https://komodor.com/blog/the-kubernetes-for-humans-podcast/
Or find out more about Komodor here: https://komodor.com/
Or find Itiel on LinkedIn: https://www.linkedin.com/in/itiel-shwartz-18542853/
This episode was sponsored by SquaredUp.