
DataNation - Podcast for Data Engineers, Analysts and Scientists 32 – Data Versioning Solutions (Apache Iceberg, Project Nessie, LakeFS)
Mar 10, 2023
Dive into the world of data versioning solutions! Discover why versioning is vital for managing data changes and the pros and cons of tools like Iceberg, Project Nessie, and LakeFS. Learn about the differences between table-level, catalog-level, and file-system-level versioning. Get insights into Iceberg’s branching features, Nessie's catalog state snapshots, and LakeFS's Git-like file tracking. Plus, hear recommendations for choosing the best approach based on specific needs. It's a must-listen for data enthusiasts!
AI Snips
Iceberg Table-Level Branching
- Iceberg will support branching and tagging snapshots at the table level, creating temporary alternate snapshot paths for experimentation.
- Branches expire by default (e.g., after 30 days); tagging a specific snapshot preserves it for longer.
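
To make the idea of "alternate snapshot paths" concrete, here is a minimal toy model in Python. It is not Iceberg's API: the `Table` class, its methods, and the integer snapshot IDs are all hypothetical, and expiry is left out. It only illustrates that a branch is a named pointer into one table's snapshot history, so writes to an experiment branch never move `main`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Table:
    """Toy model of a single table's snapshot history (hypothetical, not Iceberg's API)."""
    snapshots: Dict[int, Optional[int]] = field(default_factory=dict)  # snapshot id -> parent id
    refs: Dict[str, int] = field(default_factory=dict)  # branch/tag name -> snapshot id
    next_id: int = 1

    def commit(self, branch: str = "main") -> int:
        """Append a new snapshot on top of the branch head and advance the branch."""
        parent = self.refs.get(branch)
        sid = self.next_id
        self.next_id += 1
        self.snapshots[sid] = parent
        self.refs[branch] = sid
        return sid

    def create_branch(self, name: str, from_ref: str = "main") -> None:
        """A new branch starts as a second pointer to an existing snapshot."""
        self.refs[name] = self.refs[from_ref]

    def history(self, ref: str) -> List[int]:
        """Walk parent links from the ref's head back to the first snapshot."""
        out: List[int] = []
        sid = self.refs.get(ref)
        while sid is not None:
            out.append(sid)
            sid = self.snapshots[sid]
        return out


t = Table()
t.commit()                      # snapshot 1 on main
t.commit()                      # snapshot 2 on main
t.create_branch("experiment")   # experiment -> snapshot 2
t.commit("experiment")          # snapshot 3, visible only on experiment
t.commit("main")                # snapshot 4, visible only on main

print(t.history("main"))        # [4, 2, 1]
print(t.history("experiment"))  # [3, 2, 1]
```

Because every ref lives inside one table's metadata in this model, nothing ties a branch on one table to a branch on another, which is the limitation the next snip addresses.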
Choose Iceberg For Lightweight Table Experiments
- Use Iceberg native branching for lightweight per-table experimentation where you don't need multi-table consistency.
- Avoid it when you expect to need coordinated multi-table transactions or large-scale rollbacks.
Catalog-Level Versioning With Project Nessie
- Project Nessie versions the entire catalog state, capturing pointers for all tables so branches represent catalog snapshots.
- This enables multi-table transactions and large-scale rollbacks by reverting the catalog commit.
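
The catalog-level idea can be sketched the same way. The following is a toy model under stated assumptions, not Nessie's API or storage format: the `Catalog` class and its methods are hypothetical, and each commit simply stores the full mapping from table name to current snapshot ID. It shows why moving one branch pointer rolls back every table together.

```python
from typing import Dict, List


class Catalog:
    """Toy catalog where each commit records pointers for all tables (hypothetical)."""

    def __init__(self) -> None:
        # Commit 0 is the empty catalog; each entry maps table name -> snapshot id.
        self.commits: List[Dict[str, int]] = [{}]
        self.branches: Dict[str, int] = {"main": 0}

    def commit(self, updates: Dict[str, int], branch: str = "main") -> int:
        """Apply updates to several tables as one atomic catalog commit."""
        state = dict(self.commits[self.branches[branch]])
        state.update(updates)
        self.commits.append(state)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def rollback(self, commit_id: int, branch: str = "main") -> None:
        """Revert the branch to an earlier commit; all table pointers move at once."""
        self.branches[branch] = commit_id

    def state(self, branch: str = "main") -> Dict[str, int]:
        """Return the table pointers visible on the branch."""
        return dict(self.commits[self.branches[branch]])


cat = Catalog()
good = cat.commit({"orders": 1, "customers": 1})  # both tables change in one commit
cat.commit({"orders": 2, "customers": 2})         # a later multi-table change
cat.rollback(good)                                # one pointer move reverts both tables

print(cat.state())  # {'orders': 1, 'customers': 1}
```

Storing a full copy of the mapping per commit keeps the sketch short; a real catalog would need a more compact representation.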
