Tech Brew Ride Home

(BNS) The Crowdstrike Thing (With Overmind.tech)

Aug 24, 2024
Dylan Ratcliffe, founder of Overmind.tech, discusses how to prevent outages in complex systems. He breaks down the recent CrowdStrike incident, which was caused by an error in software running in kernel mode, and what it shows about the challenges of shipping software updates. Ratcliffe highlights the often-overlooked risks of configuration changes in cybersecurity, critiques vendor communication during crises, and shares strategies for risk analysis and dependency awareness in production.
ANECDOTE

CrowdStrike Outage Explained

  • CrowdStrike's outage involved a simple error: reading past an array's bounds.
  • This seemingly minor mistake had a catastrophic impact due to the software running in kernel mode.
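As a rough illustration of the class of bug described above, here is a minimal sketch of a bounds-checked read. It is not CrowdStrike's code: the function `read_field` and the sample data are hypothetical, and Python is used only for readability. In kernel-mode native code an unchecked read like this can crash the whole machine rather than raise a recoverable error.

```python
from typing import Optional, Sequence


def read_field(fields: Sequence[str], index: int) -> Optional[str]:
    """Return fields[index], or None when index falls outside the array."""
    # Validate the index before touching the array instead of trusting the caller.
    if 0 <= index < len(fields):
        return fields[index]
    return None


# Hypothetical data: a rule asks for one more field than the input supplies.
supplied_fields = ["a", "b", "c"]
requested_index = 3  # valid indices are 0..2, so this is one past the end

value = read_field(supplied_fields, requested_index)
print("rejected" if value is None else value)
```

The design choice is to treat an out-of-range index as an expected, handled case, so that malformed input is rejected rather than read.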
INSIGHT

Testing Discrepancy

  • CrowdStrike's testing process for standard updates was robust, but rapid response updates followed a different, less rigorous path.
  • This discrepancy in testing procedures contributed to the outage's severity.
INSIGHT

Misplaced Confidence

  • CrowdStrike's confidence in deploying the faulty update stemmed from prior testing and similar deployments.
  • However, the testing was not recent enough, and the deployments weren't identical, highlighting a gap in their risk assessment.