Oxide and Friends

Grown-up ZFS Data Corruption Bug

Nov 26, 2025
Join Alan Hanson, an Oxide engineer skilled in debugging storage issues, Andy Fiddaman, a ZFS-analysis expert, and Matt Ahrens, co-inventor of ZFS, as they unravel an 18-year-old ZFS data corruption bug. They discuss the confusion over the initial corruption and how it was traced to write-ordering mishaps during ZIL transaction replay. Insights from Ahrens' earlier work aided in diagnosing the persistent problem. Tune in for tales of cross-team collaboration and technical breakthroughs that improve data integrity in storage systems.
INSIGHT

Hidden Corruption Reveals Delayed Failures

  • Data corruption can be discovered long after it occurred, making root cause timelines hard to establish.
  • Alan Hanson found a 32K-zero block at a file start, which became a key fingerprint for diagnosis.
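A fingerprint like this can be turned into a quick scan over suspect files. The following is a minimal sketch, not the tooling used in the episode: it assumes "32K" means 32 KiB (32,768 bytes), and the function name `starts_with_zero_block` is hypothetical. A file can legitimately begin with zeros, so a match is only a hint that needs further checking.

```python
# Assumption: "32K" is read as 32 KiB; the episode summary does not say.
ZERO_RUN = 32 * 1024


def starts_with_zero_block(data: bytes, size: int = ZERO_RUN) -> bool:
    """Return True if the first `size` bytes exist and are all zero."""
    head = data[:size]
    # A short file cannot contain a full zero block, so require the full length.
    return len(head) == size and head == bytes(size)


def file_starts_with_zero_block(path: str, size: int = ZERO_RUN) -> bool:
    """Same check, reading only the first `size` bytes of a file on disk."""
    with open(path, "rb") as f:
        return starts_with_zero_block(f.read(size), size)
```

Reading only the leading bytes keeps the scan cheap enough to run across many files.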
ADVICE

Instrument With Lightweight Debug Data

  • Add debug metadata early when reproducing intermittent corruption to gather ordering and state information.
  • Matt Keeter recorded a flush number into spare bytes so future incidents would reveal write ordering.
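The idea of stamping a flush number into spare bytes can be sketched as follows. This is an illustration under stated assumptions, not the actual on-disk format discussed in the episode: the 16-byte record layout (an 8-byte checksum followed by 8 previously unused bytes) and all function names are hypothetical.

```python
import struct

# Hypothetical record: 8-byte checksum, then 8 bytes that used to be padding.
RECORD = struct.Struct("<Q8s")
# The spare bytes reinterpreted as a little-endian 64-bit flush counter.
FLUSH_FIELD = struct.Struct("<Q")


def pack_record(checksum: int, flush_number: int) -> bytes:
    """Build a record, stamping the current flush number into the spare bytes."""
    return RECORD.pack(checksum, FLUSH_FIELD.pack(flush_number))


def unpack_record(raw: bytes) -> tuple[int, int]:
    """Return (checksum, flush_number) from a stamped record."""
    checksum, spare = RECORD.unpack(raw)
    (flush_number,) = FLUSH_FIELD.unpack(spare)
    return checksum, flush_number


def stale_indices(records: list[bytes], expected_flush: int) -> list[int]:
    """Indices of records stamped with a flush number older than expected."""
    return [
        i for i, raw in enumerate(records)
        if unpack_record(raw)[1] < expected_flush
    ]
```

Because the record size does not change, readers that ignore the spare bytes keep working, while a later investigation can tell which flush produced each record.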
ANECDOTE

Jepsen Testing Uncovered A Separate Bug

  • Justin's Jepsen-style testing revealed a different, reproducible corruption pattern scattered through files.
  • That led Matt Keeter to find a misuse of ZFS ordering guarantees causing scattered decryption failures.