
Architecture Should Model the World as It Really Is: A Conversation with Randy Shoup
Nov 10, 2025
Randy Shoup, a distributed-systems architect with experience at eBay and Google, discusses why teams should learn from software failures. He advocates blameless postmortems as a way to build a healthier culture and more resilient systems, and shares a five-part postmortem framework for understanding outages. He also explains why modeling real-world asynchronous processes as workflows and events improves reliability, and how the shared trauma of a failure can strengthen team cohesion.
Use Five Questions After Every Incident
- After an incident, ask how to detect, diagnose, mitigate, remediate, and prevent the issue.
- Use that five-part framing to move beyond proximate causes to durable system improvements.
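
One way to keep the five questions in front of a team is to treat them as a checklist that a postmortem must complete. The sketch below is illustrative only: the `Postmortem` class, its fields, and the sample answers are assumptions made for this example, not something described in the episode.

```python
from dataclasses import dataclass, field

# The five questions from the snip above, in the order given.
QUESTIONS = ("detect", "diagnose", "mitigate", "remediate", "prevent")


@dataclass
class Postmortem:
    """Hypothetical record of one incident review (all names are illustrative)."""
    incident: str
    answers: dict[str, str] = field(default_factory=dict)

    def answer(self, question: str, text: str) -> None:
        # Reject anything outside the five-part framing.
        if question not in QUESTIONS:
            raise ValueError(f"unknown question: {question}")
        self.answers[question] = text

    def unanswered(self) -> list[str]:
        # Report gaps in the framework's own order.
        return [q for q in QUESTIONS if not self.answers.get(q, "").strip()]


# Invented sample data, for illustration only.
pm = Postmortem("example outage")
pm.answer("detect", "Alert on error rate, not only on host health.")
pm.answer("mitigate", "Add a documented rollback step.")
print(pm.unanswered())  # ['diagnose', 'remediate', 'prevent']
```

A review that stops at the proximate cause tends to leave the later questions empty, which is the gap this kind of checklist is meant to surface.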
Massive Outage Led To Systemwide Reliability Push
- In 2012 Randy Shoup led a six-month effort after an eight-hour Google App Engine outage to fix reliability broadly, not just the immediate bug.
- The team brainstormed hundreds of issues, prioritized them, and reduced reliability incidents tenfold.
Turn Brainstorms Into Prioritized Projects
- Brainstorm widely, then cluster the ideas into themes and assign each theme to a small team for focused synthesis.
- Convert the ideas into a prioritized list with rough effort estimates, then work through the list in priority order.
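
The clustering and ordering steps above can be sketched in a few lines. Everything here is an assumption made for illustration: the `Idea` fields, the 1-is-highest priority scale, the tie-break on effort, and the sample items are invented and do not come from the episode.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    """Hypothetical brainstorm item; the fields are illustrative assumptions."""
    title: str
    theme: str
    priority: int        # 1 = most important (assumed scale)
    effort_weeks: float  # rough estimate only


def group_by_theme(ideas):
    # Cluster ideas so each theme can be handed to a small team.
    themes = {}
    for idea in ideas:
        themes.setdefault(idea.theme, []).append(idea)
    return themes


def work_order(ideas):
    # Priority first; among equal priorities, cheaper items first (assumed tie-break).
    return sorted(ideas, key=lambda i: (i.priority, i.effort_weeks))


# Invented sample data, for illustration only.
ideas = [
    Idea("Add canary stage", "deploys", 1, 3.0),
    Idea("Alert on error rate", "monitoring", 1, 1.0),
    Idea("Document rollback", "deploys", 2, 0.5),
]
print(sorted(group_by_theme(ideas)))
# ['deploys', 'monitoring']
print([i.title for i in work_order(ideas)])
# ['Alert on error rate', 'Add canary stage', 'Document rollback']
```

The effort estimates only need to be rough; their job is to break ties and make the cost of each item visible, not to produce a schedule.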



