

(BNS) The Crowdstrike Thing (With Overmind.tech)
5 snips Aug 24, 2024
Dylan Ratcliffe, founder of Overmind.tech, discusses how to prevent outages in complex systems. He breaks down the recent CrowdStrike incident, which was caused by an error in software running in kernel mode, and uses it to show why software updates are hard to ship safely. Ratcliffe highlights the often-overlooked risks of configuration changes in cybersecurity. He also critiques vendor communication during crises and shares strategies for risk analysis and dependency awareness when making changes in production.
AI Snips
CrowdStrike Outage Explained
- CrowdStrike's outage involved a simple error: reading past an array's bounds.
- This seemingly minor mistake had a catastrophic impact because the software runs in kernel mode, where a fault takes down the whole operating system rather than a single process.
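To make the class of error concrete, here is a minimal Python sketch of a bounds-checked read. The function name `read_field` and the sample data are hypothetical and are not taken from CrowdStrike's code. In Python an out-of-range index raises an exception instead of reading invalid memory, so this only illustrates the check itself, not the kernel-mode failure.

```python
from typing import Optional, Sequence


def read_field(fields: Sequence[str], index: int) -> Optional[str]:
    """Return fields[index], or None when index falls outside the sequence.

    Hypothetical helper: the explicit range check is the point of the sketch.
    """
    if 0 <= index < len(fields):
        return fields[index]
    return None


fields = ["a", "b", "c"]          # three fields, valid indexes are 0..2
in_range = read_field(fields, 2)  # last valid index
out_of_range = read_field(fields, 3)  # one past the end: rejected, not read
```

The design choice is to treat "index not present" as an ordinary result the caller must handle, instead of trusting that the input always has the expected number of entries.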
Testing Discrepancy
- CrowdStrike's testing process for standard updates was robust, but rapid response updates followed a different, less rigorous path.
- This discrepancy in testing procedures contributed to the outage's severity.
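One way to picture closing that gap is a single deployment gate that every update type must pass. The sketch below is an assumption for illustration only: `UpdateKind`, `REQUIRED_CHECKS`, the check names, and `can_deploy` are all hypothetical and do not describe CrowdStrike's actual pipeline.

```python
from dataclasses import dataclass, field
from enum import Enum


class UpdateKind(Enum):
    STANDARD = "standard"
    RAPID_RESPONSE = "rapid_response"


# Hypothetical check names; the same set applies to every update kind.
REQUIRED_CHECKS = frozenset(
    {"unit_tests", "staged_rollout", "crash_test_on_real_host"}
)


@dataclass
class Update:
    kind: UpdateKind
    passed_checks: set = field(default_factory=set)


def missing_checks(update: Update) -> frozenset:
    """Return the required checks this update has not passed yet."""
    return REQUIRED_CHECKS - update.passed_checks


def can_deploy(update: Update) -> bool:
    """Allow deployment only when no required check is missing.

    The update kind is deliberately ignored: there is no faster, weaker path.
    """
    return not missing_checks(update)


rapid = Update(UpdateKind.RAPID_RESPONSE, {"unit_tests"})
standard = Update(UpdateKind.STANDARD, set(REQUIRED_CHECKS))
```

Here `can_deploy(rapid)` is false until the rapid response update has passed the same checks as the standard one.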
Misplaced Confidence
- CrowdStrike's confidence in deploying the faulty update stemmed from prior testing and similar deployments.
- However, the testing was not recent enough, and the deployments weren't identical, highlighting a gap in their risk assessment.
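The two gaps named above (stale testing, non-identical deployments) can be expressed as an explicit check before relying on earlier evidence. This is a rough sketch under stated assumptions: the `Evidence` record, the `evidence_covers` function, and the one-day `max_age` default are hypothetical choices, not anything described in the episode.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record of a past test run: what was tested, and when."""
    artifact_sha256: str
    tested_at: datetime


def digest(artifact: bytes) -> str:
    """SHA-256 hex digest used to decide whether two artifacts are identical."""
    return hashlib.sha256(artifact).hexdigest()


def evidence_covers(
    evidence: Evidence,
    artifact: bytes,
    now: datetime,
    max_age: timedelta = timedelta(days=1),  # assumed freshness window
) -> bool:
    """True only if the evidence is for this exact artifact and is recent."""
    same_artifact = evidence.artifact_sha256 == digest(artifact)
    recent = timedelta(0) <= now - evidence.tested_at <= max_age
    return same_artifact and recent


now = datetime(2024, 1, 2, 12, 0)
evidence = Evidence(digest(b"config-v1"), datetime(2024, 1, 2, 0, 0))
```

With these values, the evidence covers `b"config-v1"` at `now`, but not a different artifact such as `b"config-v2"`, and not the same artifact several days later.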