PreAccident Investigation Podcast

PAPod 405 - Ryan Kitchens and Big System Resilience - Part One

Aug 20, 2022
Ryan Kitchens, a Senior Systems Engineer at Netflix, dives into the intricacies of incident response and learning from failures in software. He discusses the importance of keeping risk conversations alive even during calm periods and critiques the simplistic tools often used for problem-solving. Ryan highlights the challenges of onboarding new leaders to embrace deeper learning methods and warns against action items that can inadvertently create future incidents. He also addresses the pervasive issue of attrition and dark debt within tech teams, advocating for dynamic documentation to mitigate hidden risks.
INSIGHT

Incidents Are Inevitable In Complex Systems

  • Incidents in software are constant and often stem from a system's ongoing success rather than unique defects.
  • Treat failures as inevitable system behavior instead of one-off surprises to design tolerant systems.
ADVICE

Onboard Leaders Into Learning Practices

  • Onboard new leaders explicitly into your learning practices so they understand nontraditional incident approaches.
  • Bridge cultural gaps by teaching first principles instead of assuming shared familiarity.
ADVICE

Detect Weak Signals With Near-Miss Programs

  • Actively hunt weak signals and near-misses with programs like 'Oops' to surface operational surprises before they become outages.
  • Invest up-front in learning capacity; justify it as resilience, not immediate measurable output.