Nessie is a Git-like versioned catalog for data lakes that use Apache Iceberg. Its key feature is the ability to create branches and commits at the catalog level, which brings Git-like semantics to DataOps practices such as disaster recovery, changes how developers interact with their data, and enables new data lakehouse patterns.
Nessie makes data rollback easier: the entire catalog can be rolled back to a clean commit, which is simpler than rolling back tables one at a time. Branches provide isolated environments without duplicating existing data, which reduces storage costs. Branching and merging also make it possible to run tests and experiments, including stress tests and worst-case scenario simulations, without affecting the main production data.
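The branching and rollback behaviour described above can be illustrated with a toy in-memory model. This is a minimal sketch, not the Nessie API: the `ToyCatalog` and `Commit` classes, their method names, and the `s3://lake/...` paths are all hypothetical. It assumes that a catalog commit is an immutable mapping from table names to metadata file locations, so a branch is only a named pointer to a commit and creating one copies no data.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class Commit:
    """One immutable catalog state: table name -> metadata file location."""
    tables: Mapping[str, str]
    parent: Optional["Commit"]
    message: str


class ToyCatalog:
    """Hypothetical in-memory model of a versioned catalog (not the Nessie API)."""

    def __init__(self) -> None:
        root = Commit(MappingProxyType({}), None, "root")
        self.branches: Dict[str, Commit] = {"main": root}

    def commit(self, branch: str, changes: Dict[str, str], message: str) -> Commit:
        # A new commit records only the changed pointers on top of its parent.
        head = self.branches[branch]
        new = Commit(MappingProxyType({**head.tables, **changes}), head, message)
        self.branches[branch] = new
        return new

    def create_branch(self, name: str, source: str = "main") -> None:
        # A branch is a new name for an existing commit; no data files are copied.
        self.branches[name] = self.branches[source]

    def merge(self, source: str, target: str) -> Commit:
        # Simplified merge: the source's pointers overwrite the target's,
        # with no conflict detection.
        return self.commit(target, dict(self.branches[source].tables), f"merge {source}")

    def rollback(self, branch: str, to: Commit) -> None:
        # Moving the branch head back reverts every table in the catalog at once.
        self.branches[branch] = to


catalog = ToyCatalog()
clean = catalog.commit("main", {"sales": "s3://lake/sales/v1.metadata.json"}, "load sales")

catalog.create_branch("experiment")
catalog.commit("experiment", {"sales": "s3://lake/sales/v2.metadata.json"}, "test rewrite")
main_during_experiment = catalog.branches["main"].tables["sales"]  # still the v1 pointer

catalog.merge("experiment", "main")   # main now points at the v2 metadata
catalog.rollback("main", clean)       # main is back at the clean commit
```

In this model the experiment never touches `main` until the merge, and the rollback is a single pointer move rather than a per-table restore. A real catalog would also need conflict detection on merge, which this sketch leaves out.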
For deployment, Docker containers are available for testing and Helm charts for production. A CLI tool simplifies catalog migration by copying references between catalogs, which helps move existing setups onto Nessie. When Nessie is integrated with tools like Dremio, its versioning and branching primitives can be used directly in existing data workflows, for example within SQL queries.
Future plans for Nessie include enhancing context awareness to facilitate more sophisticated merges and adding high Iceberg support to expand compatibility. The aim is to make Nessie a versatile catalog that can support table formats beyond Iceberg. The project's collaborative environment and open communication channels such as Zulip let the community take part in shaping and advancing the roadmap.
Data lakehouse architectures are gaining popularity due to the flexibility and cost-effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode, Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA