Summary
Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data, they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg offer a viable alternative in the form of data lakehouses that combine the scalability and flexibility of data lakes with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business, Tabular, to make it even easier to implement and maintain.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today.
Your host is Tobias Macey, and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem?
Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations?
What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018?
Around the time that Iceberg was first created at Netflix a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects?
Given the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons?
For someone who wants to manage their data in Iceberg tables, what does the implementation look like?
How does that change based on the type of query/processing engine being used?
Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance?
What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular?
When is Iceberg/Tabular the wrong choice?
What do you have planned for the future of Iceberg/Tabular?
Contact Info
LinkedIn
rdblue on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Iceberg
Podcast Episode
Hadoop
Data Lakehouse
ACID == Atomic, Consistent, Isolated, Durable
Apache Hive
Apache Impala
Bodo
Podcast Episode
StarRocks
Dremio
Podcast Episode
DDL == Data Definition Language
Trino
PrestoDB
Apache Hudi
Podcast Episode
dbt
Apache Flink
TileDB
Podcast Episode
CDC == Change Data Capture
Substrait
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Acryl: ![Acryl](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/2E3zCRd4.png)
The modern data stack needs a reimagined metadata management platform. Acryl Data’s vision is to bring clarity to your data through its next-generation multi-cloud metadata management platform. Founded by the leaders that created projects like LinkedIn DataHub and Airbnb Dataportal, Acryl Data enables delightful search and discovery, data observability, and federated governance across data ecosystems. Sign up for the SaaS product today at [dataengineeringpodcast.com/acryl](https://www.dataengineeringpodcast.com/acryl)