Fallthrough

The Fault In Our Clouds

Nov 4, 2025
The hosts dive into the surge of cloud outages from major providers like AWS and Azure, analyzing the causes and implications. They discuss the intricacies of DNS failures and the cascading effects on service performance. Debates ensue over the risks of centralized systems and the need for localized alternatives. Insightful critiques of cloud incident reports are shared, along with thoughts on the value of AI-assisted code reviews. With a mix of humor and technical depth, they explore whether traditional big tech careers are still worth pursuing.
INSIGHT

Planner/Executor Race Caused DNS Disaster

  • AWS's outage stemmed from a race condition between the DNS planner and the executors (enactors) that apply its plans; the race deleted DynamoDB's DNS records, and the failure cascaded to EC2.
  • Distributed design trade-offs (coordination vs performance) made this failure mode plausible at scale.
INSIGHT

TOCTOU And Garbage Collection Collided

  • Time-of-check/time-of-use (TOCTOU) issues and garbage collection combined to delete needed DNS state.
  • Stronger coordination or atomic operations would avoid this but add latency and complexity.
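As a rough illustration of the "atomic operations" option, here is a minimal sketch in Python. It assumes a toy in-memory store; `PlanStore`, `apply`, and `cleanup` are hypothetical names, not AWS's design or any provider's API. The version check and the write happen under one lock, so a delayed executor cannot overwrite a newer plan, and cleanup never deletes the plan that is currently live.

```python
import threading


class PlanStore:
    """Toy in-memory store of DNS plans (hypothetical, for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.plans = {}           # version -> records
        self.live_version = None  # version currently being served

    def add_plan(self, version, records):
        with self._lock:
            self.plans[version] = dict(records)

    def apply(self, version):
        """Make a plan live only if it exists and is newer than the live one."""
        with self._lock:  # check and use happen as one atomic step
            if version not in self.plans:
                return False
            if self.live_version is not None and version <= self.live_version:
                return False  # stale plan: refuse instead of overwriting
            self.live_version = version
            return True

    def cleanup(self, older_than):
        """Delete plans older than the given version, but never the live plan."""
        with self._lock:
            stale = [v for v in self.plans
                     if v < older_than and v != self.live_version]
            for v in stale:
                del self.plans[v]
            return sorted(stale)

    def live_records(self):
        with self._lock:
            return dict(self.plans.get(self.live_version, {}))


store = PlanStore()
store.add_plan(1, {"db.example.test": ["10.0.0.1"]})
store.add_plan(2, {"db.example.test": ["10.0.0.2"]})
applied_new = store.apply(2)    # the fresh executor wins
applied_old = store.apply(1)    # the delayed executor is rejected
removed = store.cleanup(older_than=2)
```

The cost is the one the snip names: every apply and cleanup serializes on the same lock, which in a real distributed system would mean a coordination round trip rather than a cheap in-process mutex.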
ADVICE

Expire Stale Plans With Timeouts

  • Add sensible timeouts or plan expiry so stale enactors don't apply old plans after long latency.
  • Prefer limiting the failure domain (for example, with timeouts) over risking global deletions when state is outdated.
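The plan-expiry advice could look something like the following sketch. It assumes each plan records when it was created and that an executor checks the plan's age immediately before applying it; `Plan`, `apply_if_fresh`, and the 30-second limit are illustrative assumptions, not values from the episode or from AWS.

```python
import time
from dataclasses import dataclass

MAX_PLAN_AGE_SECONDS = 30.0  # assumed limit, chosen only for illustration


@dataclass(frozen=True)
class Plan:
    version: int
    records: dict
    created_at: float  # seconds, on the same clock as time.monotonic()


def is_expired(plan, now, max_age=MAX_PLAN_AGE_SECONDS):
    """A plan is stale once it is older than max_age seconds."""
    return now - plan.created_at > max_age


def apply_if_fresh(plan, live_records, now=None, max_age=MAX_PLAN_AGE_SECONDS):
    """Return the records to serve: the plan's if fresh, else the current ones."""
    now = time.monotonic() if now is None else now
    if is_expired(plan, now, max_age):
        return live_records  # abandon the stale plan; leave live state untouched
    return dict(plan.records)


live = {"db.example.test": ["10.0.0.2"]}
old_plan = Plan(version=1, records={"db.example.test": ["10.0.0.1"]}, created_at=0.0)

on_time = apply_if_fresh(old_plan, live, now=10.0)   # 10 s old: applied
too_late = apply_if_fresh(old_plan, live, now=100.0)  # 100 s old: dropped
```

An expired plan here fails closed: the slow executor does nothing, so the worst case is a delayed update rather than a deletion of live state. The age check narrows the window for a stale apply but does not close it, so it complements rather than replaces an atomic version check.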