Ep. #7, The March 2023 Datadog Outage with Laura de Vesine

Heavybit Podcasts

The Role of the Cloud Providers in the Recovery of a Critical Latency Cliff

The outage affected about half of Datadog's Kubernetes fleet. The company uses a slightly different version of Cilium for different cloud providers. On GCP and Azure, when a host stops responding, the provider lets it sit there dead. And so we only had to restart all of the dead nodes in a coordinated order because of how we specifically manage our Kubernetes fleet.
