

Adventures in Data Corruption
Jul 10, 2025
John Gallagher and Rain Paharia, both software engineers at Oxide Computer Company, join the show to unpack a bizarre data corruption mystery that surfaced during a simple network transfer. They walk through their painstaking debugging journey, covering issues such as CPU speculation and its impact on data integrity. They also share strategies for troubleshooting non-deterministic bugs and describe surprising connections to memory management vulnerabilities. Expect some humor along the way as they draw parallels between tech challenges and nostalgic pop culture.
AI Snips
Sled Recovery Process Explained
- John Gallagher described the sled recovery process in detail, highlighting its critical role and complexity.
- Recovery relies on streaming a minimal OS over the network so that non-bootable sleds can be brought back safely and reliably.
Hash Checks Unveil Corruption
- Adding hash checks before and after disk writes helps catch elusive data corruption.
- Comparing the two hashes gives an early signal for distinguishing true data corruption from write errors.
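The idea of hashing before and after a disk write can be sketched as follows. This is a minimal illustration in Python, not the code discussed in the episode: the function names, the exception classes, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


class CorruptBeforeWrite(ValueError):
    """The in-memory buffer did not match the expected hash (hypothetical name)."""


class CorruptAfterWrite(OSError):
    """The data read back from disk did not match the expected hash (hypothetical name)."""


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_with_hash_checks(path: str, data: bytes, expected_hex: str) -> None:
    # Check 1: hash the buffer before it touches the disk. A mismatch here
    # means the data was already wrong on arrival (network, memory, ...).
    before = sha256_hex(data)
    if before != expected_hex:
        raise CorruptBeforeWrite(f"expected {expected_hex}, got {before}")

    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device

    # Check 2: read the file back and hash again. A mismatch here points at
    # the write path instead. Note that the read may be served from the OS
    # page cache, so this sketch does not prove what is on the physical media.
    with open(path, "rb") as f:
        after = sha256_hex(f.read())
    if after != expected_hex:
        raise CorruptAfterWrite(f"expected {expected_hex}, got {after}")


if __name__ == "__main__":
    payload = b"example image contents" * 1000
    with tempfile.TemporaryDirectory() as tmp:
        write_with_hash_checks(os.path.join(tmp, "image.bin"), payload, sha256_hex(payload))
```

Raising two distinct exception types keeps the two failure modes separate in logs, which is the point of checking on both sides of the write.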
Zero Runs Reveal Data Corruption
- Rain Paharia added checks for long runs of zeros in network data buffers to catch corruption.
- The checks exposed intermittent zero-filled regions that shouldn’t have been present, confirming that data was being corrupted in transit or on reception.
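A zero-run check of this kind might look like the sketch below. It is written in Python for illustration only; the function names and the 512-byte threshold are assumptions, since the summary does not say what length counted as a "long" run. Legitimate data can contain zeros, so a check like this is a heuristic that fits best when the payload is known not to have large zeroed regions.

```python
from typing import Tuple


def longest_zero_run(buf: bytes) -> Tuple[int, int]:
    """Return (offset, length) of the longest run of zero bytes, or (-1, 0) if none."""
    best_off, best_len = -1, 0
    i, n = 0, len(buf)
    while i < n:
        if buf[i] == 0:
            j = i
            while j < n and buf[j] == 0:
                j += 1
            if j - i > best_len:
                best_off, best_len = i, j - i
            i = j  # skip past the run we just measured
        else:
            i += 1
    return best_off, best_len


def check_zero_runs(buf: bytes, threshold: int = 512) -> None:
    """Raise if the buffer contains a zero run of at least `threshold` bytes.

    The threshold is a hypothetical value chosen for this example.
    """
    off, length = longest_zero_run(buf)
    if length >= threshold:
        raise ValueError(f"suspicious run of {length} zero bytes at offset {off}")
```

Reporting the offset and length, not just a pass or fail result, makes it possible to see whether the zeroed regions line up with page or packet boundaries.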