
Confluent Developer ft. Tim Berglund, Adi Polak & Viktor Gamov Git for Data: Managing Data like Code with lakeFS
Jan 19, 2023 AI Snips
Git-Like Versioning For Object Stores
- lakeFS provides Git-like data versioning for object stores using branches, commits, and merges without copying all data.
- It maps Git semantics onto petabyte-scale object stores by storing pointers and metadata rather than file contents.
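
The pointer-based idea above can be illustrated with a small in-memory model. This is a toy sketch, not the lakeFS API: the `Commit` and `Repo` classes, the method names, and the `s3://bucket/...` addresses are all hypothetical. It assumes a commit is a map from logical paths to physical object addresses, so creating a branch copies no data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot: logical path -> physical object address."""
    pointers: Dict[str, str]
    message: str
    parent: Optional["Commit"] = None


@dataclass
class Repo:
    """Toy repository: a branch is just a name that points at a commit."""
    branches: Dict[str, Commit] = field(default_factory=dict)

    def create_branch(self, name: str, source: str) -> None:
        # No objects are copied: the new branch shares the source's commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, changes: Dict[str, str], message: str) -> Commit:
        head = self.branches[branch]
        # Only the pointer map is rebuilt; unchanged objects keep their address.
        new = Commit({**head.pointers, **changes}, message, parent=head)
        self.branches[branch] = new
        return new

    def merge(self, source: str, dest: str) -> Commit:
        # Simplified: source entries win. A real merge needs conflict detection.
        return self.commit(dest, dict(self.branches[source].pointers), f"merge {source}")


# Hypothetical object addresses, for illustration only.
repo = Repo({"main": Commit({"events/part-0.parquet": "s3://bucket/obj-a1"}, "initial")})
repo.create_branch("experiment", "main")
repo.commit("experiment", {"events/part-1.parquet": "s3://bucket/obj-b2"}, "add part 1")
```

In this model, `main` is unchanged by the commit on `experiment`, and both branches refer to the same address for `events/part-0.parquet`.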
Merkle Trees Track Data State
- lakeFS uses a Merkle tree (a tree of cryptographic hashes) over pointers to objects, so it can detect changes without storing file contents inside the system.
- Writes are copy-on-write: the system updates the tree, while the objects themselves stay in the underlying S3-compatible storage.
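
As a rough illustration of how a hash tree over pointers can detect changes, here is a minimal sketch of a plain binary Merkle tree. It is not lakeFS's actual tree layout, which the episode summary does not describe. The function names, the SHA-256 choice, and the example addresses are assumptions.

```python
import hashlib
from typing import Dict, List


def _h(data: str) -> str:
    """SHA-256 hex digest of a string."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def merkle_root(pointers: Dict[str, str]) -> str:
    """Hash each (path, address) entry, then pair hashes up to a single root."""
    level: List[str] = [
        _h(path + "\x00" + addr) for path, addr in sorted(pointers.items())
    ]
    if not level:
        return _h("")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical pointer maps: only the address of b.parquet differs.
before = {"a.parquet": "s3://bucket/obj-1", "b.parquet": "s3://bucket/obj-2"}
after = {**before, "b.parquet": "s3://bucket/obj-3"}
changed = merkle_root(before) != merkle_root(after)
```

Only paths and addresses are hashed here, never file contents, so comparing two roots is enough to tell whether anything in the snapshot changed.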
Reproducibility Through Point-in-Time Versions
- Versioning enables exact reproducibility and point-in-time recovery, and makes it possible to pinpoint when data corruption occurred.
- Engineers can roll back to a historical commit to reproduce and fix issues in data pipelines.
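
The workflow above (find when the data went bad, then go back to the last good version) might look like the following sketch. It models history as an ordered list of snapshots rather than using any real lakeFS call. `first_bad_commit`, `looks_valid`, and the object addresses are hypothetical, and the validity check is a stand-in for a real data-quality test.

```python
from typing import Callable, Dict, List, Optional

Snapshot = Dict[str, str]  # logical path -> physical object address


def first_bad_commit(
    history: List[Snapshot], is_valid: Callable[[Snapshot], bool]
) -> Optional[int]:
    """Return the index of the earliest commit failing the check, or None."""
    for index, snapshot in enumerate(history):
        if not is_valid(snapshot):
            return index
    return None


def looks_valid(snapshot: Snapshot) -> bool:
    # Stand-in check; a real pipeline would validate the data itself.
    return not snapshot["orders.parquet"].endswith("corrupt")


# Oldest commit first; addresses are made up for illustration.
history: List[Snapshot] = [
    {"orders.parquet": "s3://bucket/obj-1"},
    {"orders.parquet": "s3://bucket/obj-2"},
    {"orders.parquet": "s3://bucket/obj-corrupt"},
    {"orders.parquet": "s3://bucket/obj-corrupt", "users.parquet": "s3://bucket/obj-9"},
]

bad = first_bad_commit(history, looks_valid)
# Roll back to the last commit before the corruption, if one exists.
restored = history[bad - 1] if bad is not None and bad > 0 else None
```

Because every commit is an immutable snapshot, the restored state is exactly what the pipeline saw at that point in time.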
