Confluent Developer ft. Tim Berglund, Adi Polak & Viktor Gamov

Git for Data: Managing Data like Code with lakeFS

Jan 19, 2023
INSIGHT

Git-Like Versioning For Object Stores

  • lakeFS provides Git-like data versioning for object stores, with branches, commits, and merges that do not duplicate the underlying data.
  • It maps Git semantics onto petabyte-scale object stores by storing pointers and metadata rather than file contents.
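The pointer-based idea in the bullets above can be illustrated with a toy model. This is a minimal sketch, not the lakeFS API or its storage format: `ToyRepo`, `Commit`, and the `s3://...` keys are hypothetical names invented for the example. It assumes a commit is just a mapping from logical paths to physical object keys, so creating a branch copies one pointer and no data.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Commit:
    """Hypothetical commit: logical path -> physical object key (a pointer)."""
    parent: Optional["Commit"]
    entries: Dict[str, str]
    message: str


class ToyRepo:
    """Toy illustration only; real systems also handle conflicts and deletes."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {"main": Commit(None, {}, "init")}
        self.staged: Dict[str, Dict[str, str]] = {"main": {}}

    def create_branch(self, name: str, source: str) -> None:
        # Branching copies a single reference, never the objects themselves.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def put(self, branch: str, path: str, object_key: str) -> None:
        # Stage a pointer to an object that already lives in the object store.
        self.staged[branch][path] = object_key

    def commit(self, branch: str, message: str) -> Commit:
        head = self.branches[branch]
        new = Commit(head, {**head.entries, **self.staged[branch]}, message)
        self.branches[branch] = new
        self.staged[branch] = {}
        return new

    def merge(self, source: str, dest: str) -> Commit:
        # Naive merge: source pointers win; no conflict detection.
        src, dst = self.branches[source], self.branches[dest]
        merged = Commit(dst, {**dst.entries, **src.entries}, f"merge {source}")
        self.branches[dest] = merged
        return merged
```

In this sketch, writing to an experiment branch leaves `main` untouched until `merge` is called, which mirrors the isolation that branches give in Git.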
INSIGHT

Merkle Trees Track Data State

  • lakeFS uses a Merkle tree (a hash tree) of pointers and hashes to detect changes without storing file contents inside the system.
  • Writes update the tree in a copy-on-write fashion, while the objects themselves stay in the underlying S3-compatible storage.
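To show why a hash tree makes change detection cheap, here is a small generic Merkle tree over object paths. It is a sketch under stated assumptions, not the on-disk format lakeFS uses: the functions `build_tree`, `node_hash`, and `changed_paths` are hypothetical, and each leaf is assumed to be a precomputed content hash of an object that lives in the object store.

```python
import hashlib
from typing import Dict, List, Union

Tree = Dict[str, Union[str, "Tree"]]


def build_tree(entries: Dict[str, str]) -> Tree:
    """Turn {path: content_hash} into a nested directory tree."""
    root: Tree = {}
    for path, obj_hash in entries.items():
        *dirs, leaf = path.split("/")
        node = root
        for d in dirs:
            node = node.setdefault(d, {})
        node[leaf] = obj_hash
    return root


def node_hash(node: Union[str, Tree]) -> str:
    """A leaf is its own hash; a directory hashes its sorted children."""
    if isinstance(node, str):
        return node
    h = hashlib.sha256()
    for name in sorted(node):
        h.update(name.encode() + b"\0" + node_hash(node[name]).encode() + b"\n")
    return h.hexdigest()


def changed_paths(old: Tree, new: Tree, prefix: str = "") -> List[str]:
    """Compare two trees, skipping any subtree whose hashes match."""
    if node_hash(old) == node_hash(new):
        return []
    changed: List[str] = []
    for name in sorted(set(old) | set(new)):
        a, b = old.get(name), new.get(name)
        path = prefix + name
        if isinstance(a, dict) and isinstance(b, dict):
            changed += changed_paths(a, b, path + "/")
        elif a != b:
            changed.append(path)
    return changed
```

Because a directory's hash depends only on its children, two versions that share an unchanged subtree can be compared without visiting anything beneath it, and no file contents are read at any point.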
INSIGHT

Reproducibility Through Point-in-Time Versions

  • Versioning enables exact reproducibility and point-in-time recovery, which helps pinpoint when data corruption first occurred.
  • Engineers can roll back to a historical commit to reproduce and fix issues in data pipelines.
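One way to find when corruption occurred is to binary-search the commit history with a data-quality check, similar in spirit to `git bisect`. The sketch below is illustrative only: `first_bad_commit` and the `is_healthy` callback are hypothetical, and it assumes a linear history ordered oldest to newest in which every commit after the first bad one is also bad.

```python
from typing import Callable, List, Optional, TypeVar

C = TypeVar("C")


def first_bad_commit(
    history: List[C], is_healthy: Callable[[C], bool]
) -> Optional[C]:
    """Return the earliest commit that fails the check, or None if all pass.

    Assumes `history` is ordered oldest to newest and that once a commit is
    unhealthy, all later commits are unhealthy too.
    """
    lo, hi = 0, len(history)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy(history[mid]):
            lo = mid + 1  # corruption, if any, was introduced later
        else:
            hi = mid      # this commit or an earlier one introduced it
    return history[lo] if lo < len(history) else None


# Hypothetical history: each commit carries a measured null rate.
history = [
    {"id": "c1", "null_rate": 0.01},
    {"id": "c2", "null_rate": 0.01},
    {"id": "c3", "null_rate": 0.02},
    {"id": "c4", "null_rate": 0.40},
    {"id": "c5", "null_rate": 0.41},
]
bad = first_bad_commit(history, lambda c: c["null_rate"] < 0.05)
```

The commit just before the one returned is a natural rollback target, since it is the last version that passed the check.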