DataNation - Podcast for Data Engineers, Analysts and Scientists

32 – Data Versioning Solutions (Apache Iceberg, Project Nessie, LakeFS)

Mar 10, 2023
Dive into the world of data versioning solutions! Discover why versioning is vital for managing data changes and the pros and cons of tools like Iceberg, Project Nessie, and LakeFS. Learn about the differences between table-level, catalog-level, and file-system-level versioning. Get insights into Iceberg’s branching features, Nessie's catalog state snapshots, and LakeFS's Git-like file tracking. Plus, hear recommendations for choosing the best approach based on specific needs. It's a must-listen for data enthusiasts!
INSIGHT

Iceberg Table-Level Branching

  • Iceberg will support branching and tagging snapshots at the table level, creating temporary alternate snapshot paths for experimentation.
  • Branches expire by default (e.g., after 30 days); tagging a specific snapshot preserves it for longer.
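The branch and tag lifecycle described above can be sketched with a toy model. This is not Iceberg's API: the `Ref` and `Table` classes, the method names, and the 30-day default are all hypothetical, chosen only to illustrate refs that point at snapshots and expire unless they are kept longer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Ref:
    """A named pointer to a snapshot, with an optional maximum age."""
    snapshot_id: int
    created_at: datetime
    max_age: Optional[timedelta]  # None means the ref never expires

    def is_expired(self, now: datetime) -> bool:
        return self.max_age is not None and now - self.created_at > self.max_age


@dataclass
class Table:
    """Toy table: one current snapshot on main plus named refs (branches, tags)."""
    name: str
    main_snapshot: int = 0
    refs: Dict[str, Ref] = field(default_factory=dict)

    def create_branch(self, ref_name: str, now: datetime,
                      max_age: Optional[timedelta] = timedelta(days=30)) -> None:
        # A branch starts at the current main snapshot and expires by default.
        self.refs[ref_name] = Ref(self.main_snapshot, now, max_age)

    def create_tag(self, ref_name: str, snapshot_id: int, now: datetime,
                   max_age: Optional[timedelta] = None) -> None:
        # In this sketch a tag pins one snapshot and is kept until removed.
        self.refs[ref_name] = Ref(snapshot_id, now, max_age)

    def expire_refs(self, now: datetime) -> List[str]:
        # Drop every ref whose age exceeds its max_age; return the dropped names.
        expired = sorted(n for n, r in self.refs.items() if r.is_expired(now))
        for ref_name in expired:
            del self.refs[ref_name]
        return expired


start = datetime(2023, 3, 10)
orders = Table("orders", main_snapshot=7)
orders.create_branch("experiment", now=start)
orders.create_tag("keep-me", snapshot_id=7, now=start)

expired = orders.expire_refs(now=start + timedelta(days=31))
remaining = sorted(orders.refs)
```

After 31 days the `experiment` branch is dropped while the `keep-me` tag still points at snapshot 7. Everything here concerns a single table, which is why this level of versioning suits per-table experiments.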
ADVICE

Choose Iceberg For Lightweight Table Experiments

  • Use Iceberg native branching for lightweight per-table experimentation where you don't need multi-table consistency.
  • Avoid it when you expect to need coordinated multi-table transactions or large-scale rollbacks.
INSIGHT

Catalog-Level Versioning With Project Nessie

  • Project Nessie versions the entire catalog state, capturing pointers for all tables so branches represent catalog snapshots.
  • This enables multi-table transactions and large-scale rollbacks by reverting the catalog commit.
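To make the catalog-level idea concrete, here is a minimal sketch in which each commit records a snapshot pointer for every table and a branch is just a pointer to a commit. It is a toy model, not Nessie's API: the `Catalog` class and its `commit`, `create_branch`, and `rollback` methods are hypothetical names used for illustration.

```python
from typing import Dict, List


class Catalog:
    """Toy catalog: each commit is a full map of table name -> snapshot id."""

    def __init__(self, tables: Dict[str, int]):
        self.commits: List[Dict[str, int]] = [dict(tables)]
        self.branches: Dict[str, int] = {"main": 0}  # branch name -> commit index

    def state(self, branch: str) -> Dict[str, int]:
        # Return a copy so callers cannot mutate a stored commit.
        return dict(self.commits[self.branches[branch]])

    def commit(self, branch: str, updates: Dict[str, int]) -> int:
        # One commit can move several table pointers at once (multi-table change).
        new_state = self.state(branch)
        new_state.update(updates)
        self.commits.append(new_state)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def create_branch(self, name: str, source: str = "main") -> None:
        # A new branch starts at the same commit as its source branch.
        self.branches[name] = self.branches[source]

    def rollback(self, branch: str, commit_id: int) -> None:
        # Reverting the branch pointer reverts every table together.
        self.branches[branch] = commit_id


catalog = Catalog({"orders": 1, "customers": 1})
before = catalog.branches["main"]

# Update both tables in a single catalog commit.
catalog.commit("main", {"orders": 2, "customers": 2})
after_commit = catalog.state("main")

# Roll the whole catalog back by moving the branch pointer.
catalog.rollback("main", before)
after_rollback = catalog.state("main")
```

Because the commit holds pointers for all tables, moving one branch pointer back restores `orders` and `customers` to a consistent earlier state in a single step.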