The InfoQ Podcast

Architecture Should Model the World as It Really Is: A Conversation with Randy Shoup

Nov 10, 2025
Randy Shoup, a distributed-systems architect with experience at eBay and Google, discusses the importance of learning from software failures. He advocates blameless postmortems as a way to foster a healthy culture and resilience, and shares practical strategies such as a five-step postmortem framework for understanding outages. He emphasizes modeling real-world asynchronous systems through workflows and events for better reliability, and discusses how shared trauma can strengthen team cohesion in the wake of failures.
ADVICE

Use Five Questions After Every Incident

  • After an incident, ask how to detect, diagnose, mitigate, remediate, and prevent the issue.
  • Use that five-part framing to move beyond proximate causes to durable system improvements.
ANECDOTE

Massive Outage Led To Systemwide Reliability Push

  • In 2012 Randy Shoup led a six-month effort after an eight-hour Google App Engine outage to fix reliability broadly, not just the immediate bug.
  • The team brainstormed hundreds of issues, prioritized them, and reduced reliability incidents tenfold.
ADVICE

Turn Brainstorms Into Prioritized Projects

  • Brainstorm widely, then cluster the ideas into themes and assign each theme to a small team for focused synthesis.
  • Convert ideas into a prioritized list with rough effort estimates and iterate in priority order.
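
The prioritization step above could be sketched roughly as follows. This is only an illustration: the episode does not say how ideas were scored, so the `Idea` fields, the 1-5 `impact` scale, the impact-per-week ranking, and the sample ideas are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    title: str
    theme: str           # cluster assigned during synthesis (hypothetical labels)
    impact: int          # assumed 1-5 score, higher means more reliability benefit
    effort_weeks: float  # rough effort estimate


def prioritize(ideas):
    """Order ideas by impact per week of effort, highest first (an assumed heuristic)."""
    return sorted(ideas, key=lambda i: i.impact / i.effort_weeks, reverse=True)


# Made-up example ideas, not taken from the episode.
ideas = [
    Idea("Add alerting on queue depth", "detect", impact=4, effort_weeks=1),
    Idea("Rewrite failover logic", "mitigate", impact=5, effort_weeks=8),
    Idea("Document rollback runbook", "mitigate", impact=3, effort_weeks=0.5),
]

backlog = prioritize(ideas)
for rank, idea in enumerate(backlog, start=1):
    print(f"{rank}. [{idea.theme}] {idea.title} (~{idea.effort_weeks} weeks)")
```

Any ranking rule would work here; the point is only that each idea carries a theme and a rough estimate, and that work proceeds down the sorted list.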