The New Stack Podcast

How etcd Solved Its Knowledge Drain with Deterministic Testing

Dec 5, 2025
Marek Siarkowicz, lead maintainer of etcd and senior software engineer at Google, discusses how maintainer turnover and lost knowledge put the etcd project at risk. He describes the approaches the team adopted in response, including robustness testing inspired by Jepsen to check system correctness. He also details the collaboration with Antithesis on deterministic simulation testing, which makes failures reproducible and improves reliability. The episode makes the case for rigorous testing in open source projects.
ANECDOTE

Maintainer Turnover Broke Testing Knowledge

  • As maintainers left, crucial unwritten testing knowledge drained from the etcd project.
  • That knowledge loss produced a release with critical reliability issues, including potential data inconsistencies after crashes.
INSIGHT

Linearizability As The Core Goal

  • The etcd team set out to validate linearizability, the 'Holy Grail' guarantee for distributed systems.
  • Achieving that required custom failure-injection tools and teaching the community how to debug complex scenarios.
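To make the linearizability goal concrete, here is a toy sketch of what checking it means for a single register. It is not etcd's actual checker: the `Op` record, the `is_linearizable` function, and the brute-force search over orderings are all assumptions made for illustration, and the approach only scales to very small histories. A history passes if some total order of the operations respects real time (an operation that returned before another was called must come first) and every read returns the latest written value.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    """One client operation observed during a test run (hypothetical record)."""
    kind: str     # "write" or "read"
    value: int    # value written, or value the read returned
    call: float   # time the client sent the request
    ret: float    # time the client received the response


def is_linearizable(history, initial=0):
    """Brute force: try every total order that respects real-time order."""
    for order in permutations(history):
        # Reject orders that place an operation after one that was only
        # called once the first had already returned.
        violates_real_time = any(
            order[j].ret < order[i].call
            for i in range(len(order))
            for j in range(i + 1, len(order))
        )
        if violates_real_time:
            continue

        # Replay the order against a sequential register.
        state = initial
        valid = True
        for op in order:
            if op.kind == "write":
                state = op.value
            elif op.value != state:
                valid = False
                break
        if valid:
            return True
    return False


# A read that overlaps a write may see the new value.
overlapping = [Op("write", 1, call=0, ret=2), Op("read", 1, call=1, ret=3)]

# A read that starts after the write returned must not see the old value.
stale_read = [Op("write", 1, call=0, ret=1), Op("read", 0, call=2, ret=3)]
```

With these two histories, `is_linearizable(overlapping)` holds and `is_linearizable(stale_read)` does not, which is the kind of stale read a failure-injection test would want to surface.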
ADVICE

Adopt Deterministic Simulation Testing

  • Use deterministic simulation testing to make executions fully reproducible and avoid flaky, hard-to-reproduce failures.
  • Codify implicit properties as assertions so tests reliably catch subtle race conditions and regressions.
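The two points above can be sketched with a toy simulation. This is an illustrative assumption, not how Antithesis or the etcd tests are implemented: the `simulate` and `find_failing_seed` functions are hypothetical, and the "system" is just a shared counter with a non-atomic increment. The idea it shows is that when a seeded scheduler is the only source of nondeterminism, any seed that trips an assertion replays the exact same interleaving every time.

```python
import random


def simulate(seed, clients=2):
    """Run one non-atomic increment per client under a seeded scheduler."""
    rng = random.Random(seed)   # the only source of nondeterminism
    counter = 0
    local = {}                  # value each client read but has not written back
    remaining = {c: ["read", "write"] for c in range(clients)}
    trace = []

    while remaining:
        # The scheduler picks which client takes its next step.
        c = rng.choice(sorted(remaining))
        step = remaining[c].pop(0)
        if step == "read":
            local[c] = counter
        else:
            counter = local[c] + 1
        trace.append((c, step))
        if not remaining[c]:
            del remaining[c]

    return counter, trace


def find_failing_seed(max_seeds=1000, clients=2):
    """Search seeds for a run that violates the codified property."""
    for seed in range(max_seeds):
        counter, _ = simulate(seed, clients)
        # The implicit property, written down as a check:
        # no increment may be lost.
        if counter != clients:
            return seed
    return None


failing_seed = find_failing_seed()
```

Calling `simulate(failing_seed)` again returns the same counter value and the same trace, so the race can be stepped through as often as needed instead of waiting for it to reappear by chance.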