

Slight Reliability
Stephen Townshend
Learning SRE, one day at a time.
Episodes

Oct 7, 2025 • 23min
Team Topologies with Luke McManus (Episode 107)
What are Team Topologies? How can they be used to deliver value more simply and effectively (and in a more humane way)?
This week I'm joined by Luke McManus to discuss...
⛰️ What are the four team topologies?
🏆 Can we have too much collaboration?
⌚ Team interaction models
🌏 Cognitive load
🏃♀️➡️ Value dynamics mapping
...and much more.
You can find Luke on:
LinkedIn: https://www.linkedin.com/in/luke-mcmanus-agile/
Check out the recently released second edition of the Team Topologies book by Matthew Skelton and Manuel Pais here: https://itrevolution.com/product/team-topologies-second-edition/
Or Unbundling the Enterprise by Stephen Fishman and Matt McLarty here: https://itrevolution.com/product/unbundling-the-enterprise/
You can find Stephen on:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Bluesky: https://bsky.app/profile/slightreliability.bsky.social
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Sep 23, 2025 • 44min
Contributing to Open Source with Wendy Ha (Episode 106)
How do you begin contributing to an open source project? What's it like? What do you get out of it?
This week I'm joined by Wendy Ha who shares her unique story of joining the Kubernetes project and becoming a contributor. We explore...
⛰️ What it's like working on one of the biggest open source projects in the world
🏆 The benefits of contributing to open source
⌚ How much time and effort does it take?
🌏 The unique challenges of contributing from APAC (and the need for more contributors in Australia and New Zealand)
🏃♀️➡️ How to get started
...and much more.
You can find Wendy on:
LinkedIn: https://www.linkedin.com/in/wendyha-sut/
Ways you can get started contributing to Open Source:
CNCF from Zero to Merge Program: https://project.linuxfoundation.org/cncf-zero-to-merge-application
LFX Mentorship Program: https://mentorship.lfx.linuxfoundation.org/#projects_all
Outreachy Mentorship Program: https://www.outreachy.org/mentor/
Google Summer of Code: https://summerofcode.withgoogle.com/
Kubernetes Release Team Shadowing: https://github.com/kubernetes/sig-release/blob/master/release-team/README.md#release-team-shadow

Sep 9, 2025 • 28min
Influencing Leadership with Nora Jones (Episode 105)
As an #SRE, how do you influence senior leadership to get support and priority for the things you care about?
To answer this question I'm joined by Nora Jones, founder of Jeli and now Head of Pricing, Product Strategy and Growth at PagerDuty. Our conversation touches on...
🤝 How understanding needs to flow both ways (between engineers and leaders)
🎨 Reliability is as much an art as a science
📝 Using napkin math to start conversations
🧠 Understand the system (your org) before trying to change it
💬 Using micro-interactions to gradually implement change
...and so much more.
You can find Nora on:
LinkedIn: https://www.linkedin.com/in/norajones1/
You can find more about PagerDuty here: https://www.pagerduty.com/nlp/trial-sign-up/

Aug 26, 2025 • 27min
Slight Reliability Podcast Retrospective (Episode 104)
This week I do a retrospective on the Slight Reliability podcast.
👂 How many people listen to it?
❤️ How do I feel about the show?
🎉 What's going well?
🪴 What could be better?
❔ What's next for the show?
If you want to check out the podcast that came before Slight Reliability, you can find Performance Time archived on YouTube here: https://www.youtube.com/@performance-time

Aug 12, 2025 • 39min
Burnout with Colette Alexander (Episode 103)
Have you burned out at work? What was your experience? How did you work through it?
This week I'm joined by the incredible Colette Alexander to discuss what burnout is and what it means, and we both share our personal experiences of burning out at work. We cover...
🔥 What is burnout?
❓ Why does it happen?
🫀 What are the symptoms?
🥊 Fight, flight, or freeze
🧑🚒 Advice on how to recover
...and much more.
Resources from the show:
Why you're so angry at work (and what to do about it) by Natalie Rothfels https://www.lennysnewsletter.com/p/why-youre-so-angry-at-work
Burnout (book) by Amelia and Emily Nagoski https://www.burnoutbook.net/
How to do nothing (book) by Jenny Odell https://www.penguinrandomhouse.com/books/600671/how-to-do-nothing-by-jenny-odell/
You can find Colette on:
LinkedIn: https://www.linkedin.com/in/colette-alexander-4168267/
You can find the This Is Fine! podcast here: https://www.thisisfinepod.com/

Jul 29, 2025 • 32min
Mobile Observability with Hanson Ho (Episode 102)
This week I'm joined by the wonderful Hanson Ho to discuss the unique challenges and opportunities in making our mobile apps observable! We cover...
📱 The mobile/backend observability divide
✍️ The challenge of distributed tracing on mobile apps
🌏 The entire device runtime environment matters for your app
👤 The quest for user-centric mobile observability
✅ Advice on how to get started with mobile observability
...and much more.
You can find Hanson on:
LinkedIn: https://www.linkedin.com/in/hanson-ho/
Bluesky: https://bsky.app/profile/bidetofevil.wtf
You can find out more about Embrace at https://embrace.io/

Jul 15, 2025 • 40min
Intro to Resilience Engineering with Michelle Casey (Episode 101)
This week I'm joined once more by SRE leader Michelle Casey, who gives a broad and shallow introduction to resilience engineering. We cover...
🏋️♀️ Reliability VS Robustness VS Resilience
🧩 What is a complex system?
🔢 Safety one/safety two
🧠 Mental models
😩 Human error
...and so much more.
Resources from this episode:
Four concepts for resilience (paper) by Dr. David Woods https://www.researchgate.net/publication/276139783_Four_concepts_for_resilience_and_the_implications_for_the_future_of_resilience_engineering
Building and revising adaptive capacity sharing for technical incident response (paper) by Dr Richard Cook and Dr Beth Long https://www.researchgate.net/publication/344259449_Building_and_revising_adaptive_capacity_sharing_for_technical_incident_response_A_case_of_resilience_engineering
Systems Thinking for Incident Analysis (talk) by Laura Nolan from LFI Conf 23 https://www.youtube.com/watch?v=-uXGg3g2yps
How Complex Systems Fail (website) by Dr. Richard Cook https://how.complexsystems.fail/
A Tale of Two Safeties (book) by Erik Hollnagel https://erikhollnagel.com/A Tale of Two Safeties.pdf
From Safety One to Safety Two (book) by Erik Hollnagel https://www.england.nhs.uk/signuptosafety/wp-content/uploads/sites/16/2015/10/safety-1-safety-2-whte-papr.pdf
Resilience: It's not you, it's the System (talk) by Dr Carl Horsley https://www.youtube.com/watch?v=ugC3GTKt23U
Above the line / Below the line (paper) by Dr Richard Cook (not original link) https://www.researchgate.net/figure/Above-the-Line-Below-the-Line-framework-adapted-with-permission-Cook-Woods-2016_fig3_333091997
How Your Systems Keep Running Day After Day (talk) by John Allspaw https://www.youtube.com/watch?v=xA5U85LSk0M
Behind Human Error (book) https://www.amazon.com.au/Behind-Human-Error-David-Woods/dp/0754678342
The Field Guide to Human Error Investigations (book) by Sidney Dekker https://www.humanfactors.lth.se/fileadmin/lusa/Sidney_Dekker/books/DekkersFieldGuide.pdf
The Howie Guide (paper) by Dr Laura Maguire, Nora Jones and Vanessa Granda https://howie-guide.pagerduty.com/
Resilience Engineering: Where do I start? (website) by Lorin Hochstein https://www.resilience-engineering-association.org/resources/where-do-i-start/
The STELLA report (paper) https://snafucatchers.github.io/
DORA Community Discussion - Resilience Engineering (discussion) https://www.youtube.com/watch?v=g3cEJ7njJbc
This Is Fine! (podcast) by Colette Alexander and Clint Byrum https://www.thisisfinepod.com/the-pod

Jun 24, 2025 • 48min
Learning with John Allspaw (Episode 100)
John Allspaw, co-founder of Adaptive Capacity Labs and former CDO at Etsy, dives into the essential art of learning from incidents. He challenges the notion of perfect handovers, revealing why traditional incentives fail to eliminate errors. The talk shifts to the importance of embracing organizational learning and understanding incidents as indicators of systemic issues. Allspaw also champions resilience engineering in software development, urging a community-focused approach to foster adaptability and insight in chaotic environments.

Jun 3, 2025 • 29min
Focusing on What Matters with Trent Hornibrook (Episode 99)
This week I'm joined by SRE leader Trent Hornibrook who shares a story about how he improved on-call early in his career, and then we explore the broader theme of focusing on the things that matter in observability, incident response, on-call, and beyond. We discuss...
🔌 Empowering engineers to implement change in your org
🧑🍼 Focusing on what matters (customer & business > technology)
👀 Not just adding more monitoring as the output of each PIR
😎 How autonomy can lead to accountability
🌳 How to influence change in an organisation
...and much more.
You can find Trent on:
LinkedIn: https://www.linkedin.com/in/trenthornibrook/

May 20, 2025 • 32min
The Root Cause Fallacy with Andrew Hatch (Episode 98)
This week I'm joined by SRE leader Andrew Hatch from Cisco ThousandEyes to talk about a dirty word in the resilience community... root cause. In this excellent conversation we explore...
🌌 Is the root cause of every incident the big bang?
🦖 How the value of root cause degrades as complexity increases
🫣 That if the culture is not blameless, people will hide things
🌳 Alternative approaches to root cause analysis such as branching timelines
🙋 Getting someone without skin in the game to facilitate your blameless post-mortems
...and much more.
You can find Andrew on:
LinkedIn: https://www.linkedin.com/in/hatchman76/
Check out Andrew's SREcon21 talk 'Learning from Complex Systems' which covers many of the topics introduced in this episode: https://www.youtube.com/watch?v=5pKGW61Ryvo