The Fault In Our Clouds

Nov 4, 2025

The hosts dive into the surge of cloud outages from major providers like AWS and Azure, analyzing the causes and implications. They discuss the intricacies of DNS failures and the cascading effects on service performance. Debates ensue over the risks of centralized systems and the need for localized alternatives. Insightful critiques of cloud incident reports are shared, along with thoughts on the value of AI-assisted code reviews. With a mix of humor and technical depth, they explore whether traditional big tech careers are still worth pursuing.

INSIGHT

Planner/Executor Race Caused DNS Disaster

AWS's outage stemmed from a race condition between the DNS planner and its executors (called "enactors" in AWS's summary), which deleted DynamoDB's DNS records and cascaded to EC2.
Distributed design trade-offs (coordination vs performance) made this failure mode plausible at scale.

INSIGHT

TOCTOU And Garbage Collection Collided

Time-of-check/time-of-use (TOCTOU) issues and garbage collection combined to delete needed DNS state.
Stronger coordination or atomic operations would avoid this but add latency and complexity.
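
The TOCTOU pattern described above can be sketched in a few lines of Python. This is a toy model, not AWS's design: `RecordStore`, `racy_cleanup`, `delete_if_plan`, and the `db.example` name are all hypothetical, and the `between` hook exists only to force the bad interleaving deterministically.

```python
import threading


class RecordStore:
    """Toy record store: endpoint name -> id of the plan that wrote it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}

    def apply(self, name, plan_id):
        with self._lock:
            self._records[name] = plan_id

    def plan_for(self, name):
        with self._lock:
            return self._records.get(name)

    def delete(self, name):
        with self._lock:
            self._records.pop(name, None)

    def delete_if_plan(self, name, expected_plan):
        """Check and delete under one lock, so no writer can slip in between."""
        with self._lock:
            if self._records.get(name) != expected_plan:
                return False
            del self._records[name]
            return True


def racy_cleanup(store, name, stale_plan, between=lambda: None):
    """Check-then-act cleanup: the check and the delete are separate steps."""
    if store.plan_for(name) == stale_plan:  # time of check
        between()                           # another writer may run here
        store.delete(name)                  # time of use: acts on an old answer


# Racy version: a newer plan lands between the check and the delete,
# and the cleanup removes the record the newer plan just wrote.
racy = RecordStore()
racy.apply("db.example", 1)
racy_cleanup(racy, "db.example", 1, between=lambda: racy.apply("db.example", 2))
racy_result = racy.plan_for("db.example")

# Atomic version: the same ordering of events leaves the newer record alone.
safe = RecordStore()
safe.apply("db.example", 1)
safe.apply("db.example", 2)
deleted = safe.delete_if_plan("db.example", 1)
safe_result = safe.plan_for("db.example")
```

In a single process the lock is cheap. Across machines the equivalent is a conditional write or a consensus round, which is where the added latency and complexity come from.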

ADVICE

Expire Stale Plans With Timeouts

Add sensible timeouts or plan expiry so that a delayed enactor cannot apply an old plan after a long stall.
Prefer limiting the failure domain (for example, with timeouts) over risking global deletions when state is outdated.
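
One way to picture plan expiry is the minimal Python sketch below. It assumes every plan carries a creation timestamp taken from a clock the enactor can compare against. The names (`Plan`, `apply_plan`, `StalePlanError`) and the 60-second limit are hypothetical and not taken from AWS's report.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    plan_id: int
    created_at: float  # seconds, on the same clock as `now` below
    records: dict      # endpoint name -> address


class StalePlanError(Exception):
    """Raised when an enactor is asked to apply a plan that is too old."""


MAX_PLAN_AGE_SECONDS = 60.0  # hypothetical limit, tune per system


def apply_plan(plan, live_records, *, now=None, max_age=MAX_PLAN_AGE_SECONDS):
    """Apply `plan` to `live_records` unless the plan has expired."""
    now = time.monotonic() if now is None else now
    age = now - plan.created_at
    if age > max_age:
        # Fail this one enactor instead of overwriting newer state.
        raise StalePlanError(f"plan {plan.plan_id} is {age:.0f}s old")
    live_records.clear()
    live_records.update(plan.records)


live = {"db.example": "10.0.0.1"}

# A 30-second-old plan is accepted.
fresh = Plan(plan_id=2, created_at=100.0, records={"db.example": "10.0.0.2"})
apply_plan(fresh, live, now=130.0)

# A 130-second-old plan is rejected and the live records stay untouched.
stale = Plan(plan_id=1, created_at=0.0, records={})
try:
    apply_plan(stale, live, now=130.0)
    rejected = False
except StalePlanError:
    rejected = True
```

`time.monotonic()` is only comparable within one process, so a real distributed system would need a shared clock or monotonically increasing plan generations instead. The expiry check also narrows the window rather than closing it, since an enactor can still stall between the check and the write.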

First it was GCP in June. Then it was AWS in October. Then it was Azure a week later. It seems that our cloud providers are having outages far more often, and for far longer, than any of us would like. In this episode, Kris, Ian, and Matthew discuss the two most recent outages along with some of their thoughts on the current state of the industry and the future of software.

We continue this discussion in this week's episode of Break! The panel talks about whether seeking a career with a FAANG company is worth it anymore, why building software for your local community is important, and their frustrations with point-of-sale systems. Watch it on YouTube or listen with your favorite podcasting app! Learn more by going to https://break.show.

EXTRA! EXTRA! There's lots of bonus content in this episode! And if you're a supporter you're getting all of it. In this week's extra chapters the panel talks about whether we all need to be on large cloud providers, frustrations with food delivery app PINs, whether timeouts and retries should be our go-to, and why it feels like software is constantly getting worse. Not a supporter yet? Fix that today by heading over to https://fallthrough.fm/subscribe where you'll get not only extra content but also higher-quality audio. Sign up today!

Thanks for tuning in and happy listening!

Show Notes:

AWS Outage Summary: https://aws.amazon.com/message/101925/
Azure Outage Summary: https://azure.status.microsoft/en-us/status/history/

Table of Contents:

Prologue (00:00:00)
Chapter 1: The AWS Outage (00:03:03)
Chapter 2: Overdependence on Timeouts and Retries [Extended] (00:27:15)
Chapter 3: Food Delivery app PINs should be Local First [Extended] (00:27:41)
Chapter 4: The Azure Outage (00:28:11)
Chapter 5: Do We Actually Need All These Cloud Services? [Extended] (00:39:37)
Chapter 6: We Are Trapped By Our Own Path Dependence [Extended] (00:40:07)
Chapter 7: What Is Popular Is Not Necessarily What Is Good (00:40:54)
Appendix UNPOP: Unpopular Opinions and Panic & Recover (00:42:42)
Epilogue (01:02:34)
