Ep. #7, The March 2023 Datadog Outage with Laura de Vesine

Heavybit Podcasts

How to Determine a Safe Throughput for a Cluster Recovery

In terms of restarting the actual clusters, we established a safe throughput that we then realized wasn't quite accurate. We definitely sped up as we went along. The first cluster that we recovered went pretty slowly. And then we wrote some scripts to improve that. Some of the tooling that we would normally use to do these kinds of operations at scale was also down because it runs on our infrastructure. So folks threw together some bash scripts and went as quickly as they could.

