
Ep. #7, The March 2023 Datadog Outage with Laura de Vesine
Heavybit Podcasts
The Role of the Cloud Providers in the Recovery of a Critical Latency Cliff
The outage affected about half of Datadog's Kubernetes fleet. The company uses a slightly different version of Cilium for different cloud providers. On GCP and Azure, when a host stops responding, they let it sit there dead. So we only had to restart all of the dead nodes, in a coordinated order because of how we specifically manage our Kubernetes fleet.