
Looking for Root Causes is a False Path: A Conversation with David Blank-Edelman
Dec 1, 2025
David Blank-Edelman, a leading figure in site reliability engineering (SRE) with nearly 40 years of operations experience, dives into the intricate relationship between software architecture and SRE. He challenges the conventional idea of seeking root causes for failures, emphasizing instead the importance of understanding what works in a system. The discussion highlights designing for reliability, embracing emergent properties, and learning from successes as pivotal to improving system resilience and collaboration between architects and SREs.
Reliability Is An Emergent System Property
- Reliability is an emergent property that architects and SREs must jointly consider across latency, throughput, durability and availability.
- Good architecture anticipates reliability trade-offs and is designed to keep evolving, rather than being handed over as a one-off deliverable.
Design For Observability And Feedback
- Build instrumentation and feedback loops so you can observe how the system behaves in production over time.
- Ask how you'll get signals that tell you when the system is succeeding or degrading before you need to debug an outage.
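One way to picture such a feedback signal is a rolling success rate over recent requests. The sketch below is illustrative only and is not taken from the episode: the `HealthSignal` class, the window size, and the `degraded_below` threshold are all assumptions chosen for the example.

```python
from collections import deque


class HealthSignal:
    """Hypothetical helper: tracks the outcome of the most recent requests."""

    def __init__(self, window=100, degraded_below=0.99):
        # Keep only the last `window` outcomes; older ones fall off automatically.
        self.outcomes = deque(maxlen=window)
        # Assumed threshold: below this success rate we call the system degrading.
        self.degraded_below = degraded_below

    def record(self, ok):
        """Record one request outcome (True for success, False for failure)."""
        self.outcomes.append(bool(ok))

    def success_rate(self):
        """Return the fraction of successful requests in the window, or None."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def status(self):
        """Summarize the window as 'no data', 'succeeding', or 'degrading'."""
        rate = self.success_rate()
        if rate is None:
            return "no data"
        return "degrading" if rate < self.degraded_below else "succeeding"


if __name__ == "__main__":
    signal = HealthSignal(window=10, degraded_below=0.9)
    for ok in [True] * 8 + [False] * 2:
        signal.record(ok)
    print(signal.success_rate(), signal.status())
```

The point of the sketch is that the signal reports "succeeding" as well as "degrading", so it can be watched during normal operation and not only consulted after an outage.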
Avoid Root Cause Hunting; Fix Contributing Factors
- Stop chasing a single 'root cause' and instead document triggers and all contributing factors, including sociotechnical ones.
- Fix contributing factors (bad docs, missing runbooks, tooling gaps) to reduce repeat outages rather than blaming individuals.
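As a rough illustration of recording a trigger plus several contributing factors instead of one root cause, a review could be captured in a small structure like the one below. The episode does not prescribe any format; `IncidentReview`, `ContributingFactor`, the `kind` labels, and `recurring_kinds` are all hypothetical names invented for this sketch.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ContributingFactor:
    """One condition that made the incident possible or worse (hypothetical)."""
    description: str
    kind: str            # assumed labels, e.g. "docs", "runbook", "tooling"
    follow_up: str = ""  # the fix planned for this factor, if any


@dataclass
class IncidentReview:
    """A review that names a trigger and many factors, not a single cause."""
    title: str
    trigger: str
    factors: list = field(default_factory=list)

    def add_factor(self, description, kind, follow_up=""):
        self.factors.append(ContributingFactor(description, kind, follow_up))

    def factors_without_follow_up(self):
        """Return factors that nobody has planned a fix for yet."""
        return [f for f in self.factors if not f.follow_up]


def recurring_kinds(reviews):
    """Count how often each kind of factor appears across reviews."""
    return Counter(f.kind for r in reviews for f in r.factors)


if __name__ == "__main__":
    review = IncidentReview(title="Checkout errors", trigger="Config deploy")
    review.add_factor("Rollback steps were out of date", "docs", "Rewrite steps")
    review.add_factor("No runbook for cache failures", "runbook")
    print(recurring_kinds([review]))
    print([f.description for f in review.factors_without_follow_up()])
```

Counting factor kinds across several reviews is one possible way to see which gaps (documentation, runbooks, tooling) keep recurring, which keeps attention on conditions to fix instead of on individuals.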




