
Oxide and Friends: Grown-up ZFS Data Corruption Bug
Nov 26, 2025
Join Alan Hanson, an Oxide engineer skilled in debugging storage issues; Andy Fiddaman, a ZFS-analysis expert; and Matt Ahrens, co-inventor of ZFS, as they unravel an 18-year-old ZFS data corruption bug. They discuss the confusion over the initial corruption and how the investigation revealed write-ordering mishaps during ZIL transaction replay. Insights from Ahrens' earlier work helped diagnose the long-standing problem. Tune in for tales of cross-team collaboration and technical breakthroughs that improve data integrity in storage systems.
AI Snips
Hidden Corruption Reveals Delayed Failures
- Data corruption can be discovered long after it occurred, making root cause timelines hard to establish.
- Alan Hanson found a 32K block of zeros at the start of a file, which became a key fingerprint for diagnosis.
Instrument With Lightweight Debug Data
- Add debug metadata early when reproducing intermittent corruption to gather ordering and state information.
- Matt Keeter recorded a flush number into spare bytes so future incidents would reveal write ordering.
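As a rough illustration of the idea in this snip, the sketch below stamps a flush number into otherwise unused bytes at the end of a fixed-size block so that it can be read back later. The block size, the 8-byte trailer layout, and the function names are all assumptions made up for this example; they are not the format used in the episode.

```python
import struct

# Hypothetical layout: a 4096-byte block whose last 8 bytes are spare.
BLOCK_SIZE = 4096
TRAILER = struct.Struct("<Q")  # little-endian unsigned 64-bit flush number
PAYLOAD_SIZE = BLOCK_SIZE - TRAILER.size


def stamp_block(payload: bytes, flush_number: int) -> bytes:
    """Zero-pad the payload and write the flush number into the spare trailer bytes."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload does not fit in block")
    return payload.ljust(PAYLOAD_SIZE, b"\x00") + TRAILER.pack(flush_number)


def read_flush_number(block: bytes) -> int:
    """Recover the flush number that was current when the block was written."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("unexpected block size")
    (flush_number,) = TRAILER.unpack_from(block, PAYLOAD_SIZE)
    return flush_number
```

With a stamp like this, comparing the flush numbers of neighbouring blocks after an incident gives a hint about the order in which they were written, at the cost of a few bytes per block.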
Jepsen Testing Uncovered A Separate Bug
- Justin's Jepsen-style testing revealed a different, reproducible corruption pattern scattered through files.
- That led Matt Keeter to find a misuse of ZFS ordering guarantees causing scattered decryption failures.
