Unfreezing The Data Lake: The Future-Proof File Format

Dec 29, 2025

Xinyu Zeng, a PhD student and database researcher, dives deep into F3, the 'future-proof file format' he’s developing. He highlights the limitations of existing formats like Parquet and ORC, including CPU-bound decoding and metadata overhead, and explains how F3 rethinks the file layout and uses WebAssembly for self-decoding. Xinyu also discusses the importance of decoupling formats, support for multimodal data, and future directions, including integration with existing technologies to enhance data lakes.

ANECDOTE

Research Journey To F3

Xinyu traced F3 back to his PhD work and a benchmark paper showing Parquet's shortcomings.
He first tried a community effort, then built F3 after difficulty reaching consensus and legal hurdles slowed progress.

INSIGHT

Why F3 Was Born

Parquet's CPU-heavy decoding and historical baggage motivated F3's reimagining of file formats.
F3 targets efficiency, interoperability, and extensibility to match modern hardware and workloads.

INSIGHT

Modern Hardware Changes The Tradeoffs

Hardware and workload shifts made I/O less dominant and CPU the new bottleneck.
F3 emphasizes lightweight, parallel-friendly encodings and layout changes for ML and wide-table workloads.
