Data Engineering Podcast

Unfreezing The Data Lake: The Future-Proof File Format

Dec 29, 2025
Xinyu Zeng, a PhD student and database researcher, discusses F3, the 'future-proof file format' he is developing. He describes the limitations of existing formats such as Parquet and ORC, including CPU-bound decoding and metadata overhead. F3 rethinks the file layout and uses WebAssembly for self-decoding. Xinyu also covers the importance of decoupling formats, support for multimodal data, and future directions, including integration with existing technologies to enhance data lakes.
ANECDOTE

Research Journey To F3

  • Xinyu traced F3 back to his PhD work and a benchmark paper showing Parquet's shortcomings.
  • He initially tried a community effort but built F3 after consensus and legal hurdles slowed progress.
INSIGHT

Why F3 Was Born

  • Parquet's CPU-heavy decoding and historical baggage motivated F3's reimagining of file formats.
  • F3 targets efficiency, interoperability, and extensibility to match modern hardware and workloads.
INSIGHT

Modern Hardware Changes The Tradeoffs

  • Hardware and workload shifts made I/O less dominant and CPU the new bottleneck.
  • F3 emphasizes lightweight, parallel-friendly encodings and layout changes for ML and wide-table workloads.