

Slight Reliability
Stephen Townshend
Learning SRE, one day at a time.
Episodes

Oct 24, 2023 • 32min
Slight Reliability Episode 73 - Enterprise SLOs with Brian Singer
This week we sit down with Brian Singer, CPO and co-founder of Nobl9, to talk about SLOs. We cover the importance of reviewing operational effectiveness, getting buy-in from leadership, using SLOs to reduce noise, how to implement SLOs within different cultures and structures, the parallels between security and reliability... and much more.
You can check out Nobl9's reliability and SLO platform here: https://www.nobl9.com/
You can find Brian on LinkedIn: https://www.linkedin.com/in/briantsinger/
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Twitter: https://twitter.com/the_kiwi_sre
Instagram: https://www.instagram.com/slight_reliability/

Oct 17, 2023 • 42min
Slight Reliability Episode 72 - Rapid Incident Response with Valeska Victoria
This week Stephen chats with Valeska Victoria about her time working as an SRE at eBay. Valeska shares her data-driven approach to SRE, having a voice as a less experienced engineer, handling incidents under high pressure, leveraging large language models to rapidly find the information you need during an incident, and much more.
You can check out PromptOps here: https://www.promptops.com/
You can find Valeska on LinkedIn: https://www.linkedin.com/in/valeska-victoria/

Oct 10, 2023 • 29min
Slight Reliability Episode 71 - Implementing SRE with Dr. Vlad Ukis
This week Stephen chats with Dr. Vlad Ukis about his journey discovering and then implementing SRE practices at Siemens Healthineers (which led to him writing a book). They discuss how the evolution of infrastructure necessitates a shift in how we operate, the power of selling SRE practices, the SRE infrastructure used to build SLOs and reliability capabilities, how he implemented SLOs, and much more.
You can find Vlad's book "Establishing SRE Foundations" here: https://www.amazon.com/Establishing-Foundations-Step-Step-Organizations/dp/0137424604
You can find Vlad on LinkedIn: https://www.linkedin.com/in/dr-vladyslav-ukis-5172ba32/

Oct 3, 2023 • 42min
Slight Reliability Episode 70 - Meta SRE with Amin Astaneh
Amin Astaneh (from Certo Modo) is back to discuss his experience working as a production engineer (SRE equivalent) at Meta. Stephen and Amin discuss what it's like interviewing for big tech, "you build it, you own it", different SRE engagement models, SRE at different sizes of organisation, socialising your SRE success as a way to get traction, and so much more.
You can find Amin at:
Company website: https://certomodo.io
LinkedIn: https://www.linkedin.com/in/aminastaneh/
Twitter: https://twitter.com/aastaneh
The books Amin mentions are:
The Practice of Cloud System Administration: https://www.oreilly.com/library/view/practice-of-cloud/9780133478549/
Leading Change: https://www.kotterinc.com/bookshelf/leading-change/

Sep 26, 2023 • 30min
Slight Reliability Episode 69 - Developer to SRE with Praveen Kasam
This week Stephen talks to Praveen Kasam from Diconium Digital Solutions about how he led SRE transformations. Praveen shares his experience transitioning from development to SRE and how leveraging automation and bringing application knowledge to the ops team provided quick wins. He also covers how he later applied SRE concepts to uplift the wider organisation. If you are out there looking for advice on how to implement SRE in your organisation, this is the episode for you.
You can find Praveen at:
LinkedIn: https://www.linkedin.com/in/kasampraveen/
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
X: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Sep 19, 2023 • 33min
Slight Reliability Episode 68 - Dashboards and Modern Observability with Eric Schabell
This week Stephen asks Eric Schabell (Director of Technical Marketing & Evangelism @ Chronosphere) about how dashboards fit into modern observability. They discuss how untamed observability can lead to unexpectedly high cloud bills, the similarities between dashboards and documentation, the "know > triage > understand" workflow, and much more.
You can find Eric at:
LinkedIn: https://www.linkedin.com/in/ericschabell/
X: https://twitter.com/ericschabell
And you can find Chronosphere at: https://www.linkedin.com/company/chronosphereio/

Sep 12, 2023 • 35min
Slight Reliability Episode 67 - Single Pane of Glass with Jamie Allen and Adam Kinniburgh
This week Stephen chats with Jamie Allen (Chief Technologist AWS & SRE @ EPAM Systems) and Adam Kinniburgh (VP Innovation @ SquaredUp) about the concept of a single pane of glass (SPOG) for SRE. Is it performance art or something actionable? Can alerting replace the need for dashboards? And are metrics drowning in the wake of distributed tracing?
You can find Jamie at:
LinkedIn: https://www.linkedin.com/in/jlallen/
And the Single Pain of Glass article he wrote here: https://medium.com/site-reliability-engineering-leadership/the-single-pain-of-glass-6e42930e966
You can find EPAM at https://www.epam.com/
And you can find the Google Dapper paper here: https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf
You can find Adam at:
LinkedIn: https://www.linkedin.com/in/adamkinniburgh/
X: https://twitter.com/adamkinniburgh

Sep 5, 2023 • 30min
Slight Reliability Episode 66 - Building Digital Assistants for SRE with Kyle Forster
This week Stephen brings back Kyle Forster from RunWhen to talk about the purple elephant in the room… “AI”. What makes it GenAI, LLM, Advanced Statistics, or ML? Kyle shares his experience building AI-powered search engines for SRE troubleshooting commands and how to incorporate a (paid) open source community of experts rather than trust AI by itself. They discuss what search looks like under the hood, why GenAI-powered chatbots will or won't take over the SaaS industry, how Digital Assistants can be utilised by SREs to increase productivity (hint: giving them to app developers!), how to make informed decisions when purchasing AI products, and *much* more.
You can find Kyle at:
LinkedIn: https://www.linkedin.com/in/kyforster/recent-activity/all/
And you can find out more about RunWhen at:
Website: https://www.runwhen.com/
Product videos: https://www.youtube.com/@whatdoirunwhen
RunWhen Local: https://github.com/runwhen-contrib/runwhen-local (RunWhen Local is an open source troubleshooting cheat sheet that suggests commands from the RunWhen community for all of the namespaces in your cluster, ready to copy & paste)

Aug 29, 2023 • 41min
Slight Reliability Episode 65 - The Truth About Incidents with Courtney Nash
This week Stephen chats with the internet incident librarian herself, Courtney Nash. They explore what Courtney has learned through meta-analysis of the more than ten thousand incidents in the Verica Open Incident Database (VOID). They cover why MTTR needs to go in the garbage, joint cognitive systems, the value of looking at near misses and *much* more.
You can check out the VOID here: https://www.thevoid.community/
The two papers mentioned are:
Ironies of Automation by Lisanne Bainbridge: https://ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf
Managing the Hidden Costs of Coordination by Laura Maguire: https://queue.acm.org/detail.cfm?id=3380779
You can find Courtney at:
LinkedIn: https://www.linkedin.com/in/nashcourtney/
Twitter: https://twitter.com/courtneynash

Aug 22, 2023 • 36min
Slight Reliability Episode 64 - Observability During Development with Martin Thwaites
This week Stephen chats with Martin Thwaites from Honeycomb about how developers can leverage observability to better understand what they're building, solve bugs quicker, and have more time for coding. They also discuss OpenTelemetry (the protocol and semantic conventions), manual versus automatic instrumentation, and how keeping every span of trace data is irresponsible.
You can find Martin at:
LinkedIn: https://www.linkedin.com/in/martin-thwaites-ab445120/
X: https://twitter.com/MartinDotNet
And Honeycomb at https://www.honeycomb.io/