
Ep. #7, The March 2023 Datadog Outage with Laura de Vesine
Heavybit Podcasts
How to Determine a Safe Throughput for a Cluster Recovery
In terms of restarting the actual clusters, we established a safe throughput that we then realized wasn't quite accurate. We definitely sped up as we went along. The first cluster that we recovered went pretty slowly. And then we wrote some scripts to improve that. Some of the tooling that we would normally use to do these kinds of operations at scale was also down because it runs on our infrastructure. So folks threw together some bash scripts and went as quickly as they could.