2.5 Admins

2.5 Admins 257: Outage365

Jul 24, 2025
This episode covers the recent 19-hour Outlook outage, what it shows about the fragility of cloud services, and how company culture affects accountability. It discusses Let's Encrypt's introduction of certificates for IP addresses and what they mean for security and HTTPS encryption. The episode also covers optimizing ZFS replication, the importance of diversity in backup systems, and the nuances of disaster recovery and hot spare setups as strategies for improving server resilience.
ANECDOTE

Microsoft's 19-Hour Outlook Outage

  • Microsoft had a major Outlook outage lasting 19 hours, caused by the accidental simultaneous deletion of certificates.
  • Transparency about the incident was poor: Microsoft gave minimal official details, leaving users to search for answers themselves.
INSIGHT

Cloudflare's Rapid Outage Recovery

  • Cloudflare's 1.1.1.1 outage lasted just over an hour despite a complex configuration error.
  • Their detailed public postmortem and swift response reflect a company culture focused on openness and accountability.
INSIGHT

BGP Hijack During Outage

  • A BGP hijack can cause traffic for an IP address to be routed incorrectly during an outage.
  • Such hijacks may be unintended side effects of configuration changes rather than causes of outages.