
Fallthrough The Fault In Our Clouds
Nov 4, 2025
The hosts dive into the surge of cloud outages from major providers like AWS and Azure, analyzing the causes and implications. They discuss the intricacies of DNS failures and the cascading effects on service performance. Debates ensue over the risks of centralized systems and the need for localized alternatives. Insightful critiques of cloud incident reports are shared, along with thoughts on the value of AI-assisted code reviews. With a mix of humor and technical depth, they explore whether traditional big tech careers are still worth pursuing.
AI Snips
Planner/Executor Race Caused DNS Disaster
- AWS's outage stemmed from a race condition between the DNS planner and executor that deleted DynamoDB's DNS records; the failure then cascaded to EC2.
- Distributed design trade-offs (coordination vs performance) made this failure mode plausible at scale.
TOCTOU And Garbage Collection Collided
- Time-of-check to time-of-use (TOCTOU) issues combined with garbage collection to delete DNS state that was still needed.
- Stronger coordination or atomic operations would avoid this but add latency and complexity.
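As a rough illustration of the "atomic operations" option, here is a minimal Python sketch in which the staleness check and the write happen under one lock, so nothing can change between check and use. Everything in it (`PlanStore`, `apply_plan`, `cleanup_older_than`, the generation numbers) is hypothetical and invented for this example; it is not AWS's design, and a real distributed system would need a transactional store or conditional writes rather than an in-process lock.

```python
import threading


class PlanStore:
    """Hypothetical in-memory stand-in for DNS plan state (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.plans = {}      # generation number -> {name: [addresses]}
        self.active = None   # generation currently being served

    def apply_plan(self, generation, records):
        """Apply a plan only if it is newer than the active one.

        The check and the write share one critical section, so there is
        no window in which another thread can invalidate the check.
        """
        with self._lock:
            if self.active is not None and generation <= self.active:
                return False  # stale plan: reject instead of overwriting
            self.plans[generation] = dict(records)
            self.active = generation
            return True

    def cleanup_older_than(self, generation):
        """Garbage-collect old plans, but never the one being served."""
        with self._lock:
            stale = [g for g in self.plans
                     if g < generation and g != self.active]
            for g in stale:
                del self.plans[g]
            return stale

    def resolve(self, name):
        """Look up a name in the active plan, or None if nothing is active."""
        with self._lock:
            if self.active is None:
                return None
            return self.plans[self.active].get(name)


store = PlanStore()
store.apply_plan(1, {"db.example": ["10.0.0.1"]})
store.apply_plan(3, {"db.example": ["10.0.0.3"]})
late = store.apply_plan(2, {"db.example": ["10.0.0.2"]})  # delayed executor
removed = store.cleanup_older_than(3)
```

In this toy model the delayed executor's plan 2 is rejected, and cleanup can only remove generations that are not being served. The cost is the one the snip names: every apply and every cleanup now serializes on the same lock.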
Expire Stale Plans With Timeouts
- Add sensible timeouts or plan expiry so that a delayed enactor cannot apply an old plan after a long stall.
- Prefer limiting the failure domain (for example, with timeouts) over risking global deletions when state is outdated.
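The plan-expiry idea can be sketched in a few lines of Python. The names here (`Plan`, `enact`, `StalePlanError`) and the 30-second limit are assumptions made up for this example, not values from the incident report. The clock is injected so the behavior is easy to test; across machines a real system would also have to account for clock skew.

```python
import time
from dataclasses import dataclass

MAX_PLAN_AGE_SECONDS = 30.0  # assumed limit, chosen only for illustration


@dataclass(frozen=True)
class Plan:
    generation: int
    created_at: float  # timestamp from the same clock passed to enact()


class StalePlanError(Exception):
    """Raised when a plan is too old to apply safely."""


def enact(plan, apply, now=time.monotonic, max_age=MAX_PLAN_AGE_SECONDS):
    """Apply a plan unless it has outlived max_age.

    Refusing a stale plan fails one enactor run instead of letting
    outdated state overwrite (or delete) what is currently live.
    """
    age = now() - plan.created_at
    if age > max_age:
        raise StalePlanError(
            f"plan {plan.generation} is {age:.0f}s old (limit {max_age:.0f}s)"
        )
    apply(plan)


applied = []
fresh = Plan(generation=1, created_at=100.0)
enact(fresh, applied.append, now=lambda: 110.0)    # 10s old: applied

old = Plan(generation=2, created_at=100.0)
try:
    enact(old, applied.append, now=lambda: 200.0)  # 100s old: refused
    rejected = False
except StalePlanError:
    rejected = True
```

Note that an expiry check alone still leaves a gap between the check and a slow `apply`, so it narrows the failure window rather than closing it; it complements, rather than replaces, an atomic check.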
