Hannes Mühleisen and Mark Raasveldt, the creators of DuckDB, dive into their latest project, DuckLake, which aims to simplify the lakehouse ecosystem. They discuss how DuckLake stands out by keeping all metadata in a single SQL database, making metadata management far simpler. The duo shares their vision for decentralized processing and local-first data architecture, along with benefits like data inlining and encryption. They also touch on DuckLake's integration with existing systems, showing how it can streamline data workflows and improve the user experience.
INSIGHT
SQL-Backed Metadata Simplifies Lakehouses
DuckLake replaces file-heavy metadata layers with a standard SQL relational database for metadata, keeping data files in object storage.
This simplifies the stack by reducing round trips and coordination complexity compared to Iceberg/Delta.
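Because the catalog is an ordinary SQL database, the lakehouse metadata itself can be inspected with plain queries. A minimal sketch in DuckDB SQL, assuming a local DuckDB file as the catalog; the metadata table names (ducklake_snapshot, ducklake_data_file) follow the published DuckLake spec, but treat the exact schema shown here as illustrative:

    -- Attach a DuckLake catalog: metadata goes into metadata.ducklake,
    -- data files are written as Parquet under ./lake_data/
    INSTALL ducklake;
    ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/');
    CREATE TABLE lake.events (id INTEGER, payload VARCHAR);
    INSERT INTO lake.events VALUES (1, 'hello');
    DETACH lake;

    -- The catalog is just a DuckDB file: open it directly and read the
    -- metadata with regular SQL, with no manifest files to parse
    ATTACH 'metadata.ducklake' AS meta;
    SELECT * FROM meta.ducklake_snapshot;
    SELECT path, record_count FROM meta.ducklake_data_file;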
INSIGHT
Local-First Multiplayer Architecture
DuckLake targets a 'multiplayer' DuckDB experience where compute runs on users' own nodes while metadata can be centralized.
It complements hosted offerings like MotherDuck by letting users self-host compute and control deployment size.
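A hedged sketch of that multiplayer setup: each DuckDB node runs the same attach against a shared PostgreSQL catalog, with data files in object storage. The host, database, bucket names, and the events table below are placeholders:

    -- Runs on every compute node; nodes coordinate only through the
    -- shared Postgres catalog and the S3 bucket
    INSTALL ducklake;
    INSTALL postgres;  -- needed for a Postgres-backed catalog
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=pg.internal' AS lake
        (DATA_PATH 's3://my-bucket/lake/');
    -- Writers commit new snapshots; readers always see a consistent one
    SELECT count(*) FROM lake.events;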
ADVICE
Scale DuckLake To Your Needs
Start small or large: DuckLake scales from a single-line local attach to thousands of nodes and massive storage.
Choose a deployment weight that matches your team's skills and growth plans to avoid unnecessary infrastructure overhead.
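Concretely, that deployment choice largely reduces to the attach string; a sketch of the two extremes (paths and hosts are placeholders), with identical SQL on top of either:

    -- Laptop scale: a DuckDB file as catalog, a local directory for data
    ATTACH 'ducklake:local.ducklake' AS lake (DATA_PATH 'data/');

    -- Team scale: shared Postgres catalog, object-store data
    -- ATTACH 'ducklake:postgres:dbname=lake host=pg.internal' AS lake
    --     (DATA_PATH 's3://my-bucket/lake/');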
Summary
In this episode of the Data Engineering Podcast Hannes Mühleisen and Mark Raasveldt, the creators of DuckDB, share their work on DuckLake, a new entrant in the open lakehouse ecosystem. They discuss how DuckLake focuses on simplicity and flexibility, offering a unified catalog and table format in contrast to other lakehouse formats like Iceberg and Delta. Hannes and Mark share insights into how DuckLake rethinks data architecture by enabling local-first data processing, simplifying the deployment of lakehouse solutions, and offering benefits such as encryption, data inlining, and integration with existing ecosystems.
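The encryption and data-inlining features mentioned above can be sketched as attach options; hedged, since the option names (ENCRYPTED, DATA_INLINING_ROW_LIMIT) are taken from the DuckLake extension documentation at the time of writing and may change:

    -- Encrypt all data files at rest; per-file keys live in the catalog,
    -- so the object store can be untrusted
    ATTACH 'ducklake:metadata.ducklake' AS lake
        (DATA_PATH 's3://my-bucket/lake/', ENCRYPTED);

    -- Inline small inserts as rows in the catalog database instead of
    -- writing a tiny Parquet file for every commit
    ATTACH 'ducklake:metadata2.ducklake' AS lake2
        (DATA_PATH 'data/', DATA_INLINING_ROW_LIMIT 100);
    CREATE TABLE lake2.events (id INTEGER, note VARCHAR);
    INSERT INTO lake2.events VALUES (1, 'small write, stays in the catalog');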
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Hannes Mühleisen and Mark Raasveldt about DuckLake, the latest entrant into the open lakehouse ecosystem
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what DuckLake is and the story behind it?
What are the particular problems that DuckLake is solving for?
How does this compare to the capabilities of MotherDuck?
Iceberg and Delta already have a well-established ecosystem, but so does DuckDB. Who are the primary personas that you are trying to focus on in these early days of DuckLake?
One of the major factors driving the adoption of formats like Iceberg is cost efficiency for large volumes of data. That brings with it challenges of large batch processing of data. How does DuckLake account for these axes of scale?
There is also a substantial investment in the ecosystem of technologies that support Iceberg. The most notable ecosystem challenge for DuckDB and DuckLake is in the query layer. How are you thinking about the evolution and growth of that capability beyond DuckDB (e.g. support in Trino/Spark/Flink)?
What are your opinions on the viability of a future where DuckLake and Iceberg become a unified standard and implementation? (why can't Iceberg REST catalog implementations just use DuckLake under the hood?)
Digging into the specifics of the specification and implementation, what are some of the capabilities that it offers above and beyond Iceberg?
Is it now possible to enforce PK/FK constraints or apply indexing to the underlying data?
Given that DuckDB has a vector type, how do you think about the support for vector storage/indexing?
How do the capabilities of DuckLake and the integration with DuckDB change the ways that data teams design their data architecture and access patterns?
What are your thoughts on the impact of "data gravity" in today's data ecosystem, with engines like DuckDB, KuzuDB, LanceDB, etc. available for embedded and edge use cases?
What are the most interesting, innovative, or unexpected ways that you have seen DuckLake used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckLake?
When is DuckLake the wrong choice?
What do you have planned for the future of DuckLake?
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.