

Slight Reliability
Stephen Townshend
Learning SRE, one day at a time.

Sep 19, 2023 • 33min
Slight Reliability Episode 68 - Dashboards and Modern Observability with Eric Schabell
This week Stephen asks Eric Schabell (Director of Technical Marketing & Evangelism @ Chronosphere) about how dashboards fit into modern observability. They discuss how untamed observability can lead to unexpectedly high cloud bills, the similarities between dashboards and documentation, the "know > triage > understand" workflow, and much more.
You can find Eric at:
LinkedIn: https://www.linkedin.com/in/ericschabell/
X: https://twitter.com/ericschabell
And you can find Chronosphere at: https://www.linkedin.com/company/chronosphereio/
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
X: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Sep 12, 2023 • 35min
Slight Reliability Episode 67 - Single Pane of Glass with Jamie Allen and Adam Kinniburgh
This week Stephen chats with Jamie Allen (Chief Technologist AWS & SRE @ EPAM Systems) and Adam Kinniburgh (VP Innovation @ SquaredUp) about the concept of a single pane of glass (SPOG) for SRE. Is it performance art or something actionable? Can alerting replace the need for dashboards? And are metrics drowning in the wake of distributed tracing?
You can find Jamie at:
LinkedIn: https://www.linkedin.com/in/jlallen/
And the Single Pain of Glass article he wrote is here: https://medium.com/site-reliability-engineering-leadership/the-single-pain-of-glass-6e42930e966
You can find EPAM at https://www.epam.com/
And you can find the Google Dapper paper here: https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf
You can find Adam at:
LinkedIn: https://www.linkedin.com/in/adamkinniburgh/
X: https://twitter.com/adamkinniburgh

Sep 5, 2023 • 30min
Slight Reliability Episode 66 - Building Digital Assistants for SRE with Kyle Forster
This week Stephen brings back Kyle Forster from RunWhen to talk about the purple elephant in the room… “AI”. What makes it GenAI, LLM, Advanced Statistics, or ML? Kyle shares his experience building AI-powered search engines for SRE troubleshooting commands and how to incorporate a (paid) open source community of experts rather than trust AI by itself. They discuss what search looks like under the hood, why GenAI-powered chatbots will or won't take over the SaaS industry, how Digital Assistants can be utilised by SREs to increase productivity (hint: giving them to app developers!), how to make informed decisions when purchasing AI products, and *much* more.
You can find Kyle at:
LinkedIn: https://www.linkedin.com/in/kyforster/recent-activity/all/
And you can find out more about RunWhen at:
Website: https://www.runwhen.com/
Product videos: https://www.youtube.com/@whatdoirunwhen
RunWhen Local: https://github.com/runwhen-contrib/runwhen-local (RunWhen Local is an open source troubleshooting cheat sheet that suggests commands from the RunWhen community for all of the namespaces in your cluster - ready to copy & paste)

Aug 29, 2023 • 41min
Slight Reliability Episode 65 - The Truth About Incidents with Courtney Nash
This week Stephen chats with the internet incident librarian herself, Courtney Nash. They explore what Courtney has learned through meta-analysis of the more than ten thousand incidents in the Verica Open Incident Database (VOID). They cover why MTTR needs to go in the garbage, joint cognitive systems, the value of looking at near misses and *much* more.
You can check out the VOID here: https://www.thevoid.community/
The two papers mentioned are:
Ironies of Automation by Lisanne Bainbridge: https://ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf
Managing the Hidden Costs of Coordination by Laura Maguire: https://queue.acm.org/detail.cfm?id=3380779
You can find Courtney at:
LinkedIn: https://www.linkedin.com/in/nashcourtney/
Twitter: https://twitter.com/courtneynash

Aug 22, 2023 • 36min
Slight Reliability Episode 64 - Observability During Development with Martin Thwaites
This week Stephen chats with Martin Thwaites from Honeycomb about how developers can leverage observability to better understand what they're building, solve bugs quicker, and have more time for coding. They also discuss OpenTelemetry (the protocol and semantic conventions), manual versus automatic instrumentation, and how keeping every span of trace data is irresponsible.
You can find Martin at:
LinkedIn: https://www.linkedin.com/in/martin-thwaites-ab445120/
X: https://twitter.com/MartinDotNet
And Honeycomb at https://www.honeycomb.io/

Aug 15, 2023 • 9min
Slight Reliability Episode 63 - The Power of Summary
Observability is a necessary adaptation to make sense of software systems in the Digital Age, but how can we unlock its power for non-engineer stakeholders (such as executives, product owners, etc.)? Perhaps we need a layer of abstraction sitting on top of our detailed observability to get the most out of it.

Aug 1, 2023 • 37min
Slight Reliability Episode 62 - On-Call with Matt Brown
This week Stephen chats with former Google SRE Matt Brown about being on-call. They cover how to uplift junior engineers so they can be on-call, what a fair on-call schedule looks like, runbooks, and much more.
As you heard, Matt believes flexibility is key to a healthy on-call rotation. Matt is exploring ideas for improvements to existing tooling and products in this space and would love to hear from as many listeners as possible with feedback on what they find useful or frustrating with the existing tools they use to support on-call in their teams. You can reach him at oncall-feedback@mkmba.nz or schedule a chat via https://zcal.co/mattb/oncall. Please don't be shy!
You can also find Matt at:
Website: https://www.mattb.nz/
LinkedIn: https://www.linkedin.com/in/mattbrown/
Mastodon: https://mastodon.nz/@mattb
Twitter: https://twitter.com/xleem

Jul 25, 2023 • 6min
Slight Reliability Episode 61 - SRE VS DevOps VS Platform Eng... (Yawn)
The internet is full of people who want to tell you about SRE, DevOps, and Platform Engineering and how different and similar they are... and will give you the impression that these things compete with each other. But do they? And is it a helpful question to ask in the first place?

Jul 11, 2023 • 43min
Slight Reliability Episode 60 - From Zero to SRE with Amin Astaneh
In this episode Amin Astaneh from Certo Modo discusses his experience undertaking an SRE transformation over several years. Stephen and Amin cover a lot of ground including making ops work visible, measuring toil, the power of calculating the $ value of work, getting developers on-call, the embedded model for SRE, SLOs, culture change, and a whole lot more.
You can find Amin on his company website https://certomodo.io, LinkedIn: https://www.linkedin.com/in/aminastaneh/ and Twitter: https://twitter.com/aastaneh
The books Amin mentions are:
The Practice of Cloud System Administration: https://www.oreilly.com/library/view/practice-of-cloud/9780133478549/
The Phoenix Project: https://www.oreilly.com/library/view/the-phoenix-project/9781457191350/

Jul 4, 2023 • 40min
Slight Reliability Episode 59 - Bad API Observability with Sonja Chevre
In this episode Stephen Townshend and Sonja Chevre from Tyk discuss making APIs observable, and some anti-patterns to avoid. They cover GraphQL, OpenTelemetry and semantic conventions, correlation IDs, observability pipelines, and much more.
You can find Sonja on LinkedIn: https://www.linkedin.com/in/sonjachevre/ and Twitter: https://twitter.com/SonjaChevre
You can listen to Sonja's KubeCon talk here: https://youtu.be/IkEUJjRBCbo
You can find Tyk's open source gateway here: https://github.com/TykTechnologies/tyk