Chapters
Introduction
00:00 • 5min
The Incident: A Cognitive Interview to Help Laura Recreate What Happened
04:55 • 1min
The Impact of the Datadog Outage on the Industry
06:21 • 3min
The Importance of Public Incident Reviews
09:40 • 5min
The Impact of Out-of-Band Monitoring on Engineering
14:50 • 2min
The Importance of Alerting in an Engineering Incident
16:26 • 3min
The Nine Senior Leaders and 70 People Involved in This Incident
19:34 • 4min
Setting Priorities in an All-I-Go Situation
23:30 • 4min
The Impact of a Global Event on Kubernetes Infrastructure
27:27 • 3min
The Underlying Root Cause of the Problem in Ubuntu
30:55 • 3min
How to Roll Fleet Death With Ubuntu Default Security Updates
33:53 • 2min
How Winton Connected the Problem With Updates in Kubernetes
35:26 • 2min
The History of System Updates
37:52 • 2min
The Role of the Cloud Providers in the Recovery of a Critical Latency Cliff
40:09 • 4min
How to Determine a Safe Throughput for a Cluster Recovery
43:53 • 6min
How to Communicate With Customers in an Outage
49:27 • 3min
The Importance of Prioritizing Live Data
52:43 • 5min
The Future of Datadog
57:47 • 2min