
alphalist.CTO Podcast - For CTOs and Technical Leaders
#113 - Faster Incident Response feat. Tim Armandpour // CTO @ PagerDuty
Plan and PRACTICE for better incident response with insights from Tim Armandpour, CTO of PagerDuty. Learn the secrets to resilience from the team that mitigated the impact of a major outage—handling a 250% traffic surge while delivering on their SLA.
Listen to find out:
- 🛠️ Why planning AND practice are both critical for incident response.
- 🚧 How to practice for incident response (e.g., Failure Fridays with Chaos Engineering).
- 🧑‍🤝‍🧑 Ownership: Why tech AND business teams must join post-mortems.
- ☁️ How to mitigate the impact of your cloud provider’s lower SLA.
- ⚓ Which architectural patterns are more resilient?
- ⚖️ WARNING: “bend” the CAP theorem at your own risk
Timestamps:
(00:00:00) Introduction to Alphalist Podcast
(00:01:00) Meet Tim Armandpour
(00:01:56) Tim's Early Career
(00:06:22) Handling Major Incidents at PagerDuty
(00:09:21) The Importance of Preparedness
(00:13:54) Practicing Failure Scenarios
(00:18:16) Resilient Infrastructure and Architectural Patterns
(00:22:44) Standardization and Data Management
(00:25:48) Exploring Infrastructure Resilience
(00:26:20) Achieving High Availability with Lower SLA Cloud Platforms
(00:29:38) Defining Meaningful SLIs
(00:32:15) Assessing Incident Readiness
(00:35:15) The Importance of Ownership
(00:41:30) Continuous Improvement
(00:43:53) Lessons from a Yogurt Business
(00:48:18) Final Thoughts and Takeaways