Intro

This chapter focuses on Apache Gravitino, a meta-catalog designed for unified data governance and security. The discussion compares Gravitino with other data catalog solutions, emphasizing its support for multiple engines and formats, as well as the advantages of open-source offerings in data management.

Episode notes

In this episode of The Data Engineering Show, the bros sit down with Lisa Cao, Product Manager at DataStrato, to explore data catalogs and Apache Gravitino, a unified metadata lake used to manage access and data governance across all data sources.

What You’ll Learn:

How Apache Gravitino differs from catalogs such as Unity Catalog and Polaris by supporting multiple catalog systems.
What the “Push-Down Permission Management” security model is and how to implement it across different data systems.
How to maintain consistent governance across various query engines like Spark, Trino, and Flink.
Why interoperability, flexibility, and the open-source ecosystem are becoming more important dynamics in data infrastructure than performance benchmarking.
How to evaluate new data tools based on their real-world adoption rather than social media hype.

Lisa Cao is a Product Manager at DataStrato, specializing in AI/ML product partnerships and developer relations. With deep expertise in data catalog technologies and open-source ecosystems, she plays a key role in developing Apache Gravitino, an ASF incubating project that provides a unified governance and security layer for diverse data systems. Her work in developing extensible catalog frameworks has helped organizations manage complex data environments across multiple platforms.

Episode Highlights:

What is Apache Gravitino? (01:24)

Apache Gravitino is a meta-catalog that serves as a unified data governance and security layer for managing different data systems. Lisa shares that Gravitino was the first to release an Iceberg REST catalog, which was then open-sourced for the general community to use; as time passed, Polaris and Unity Catalog were also announced as open source. She highlights that although Gravitino, Polaris, and Unity Catalog are very similar, Gravitino differs in that it can support multiple catalogs.
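To make the "multiple catalogs" idea concrete, here is a minimal sketch of a meta-catalog as one namespace in front of several underlying catalogs. It is a toy illustration only, not Gravitino's API: the Catalog and MetaCatalog classes, the "catalog.schema.table" naming, and the sample tables are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical stand-in for one underlying catalog (not a real Gravitino type)."""
    name: str
    provider: str  # e.g. "hive" or "iceberg"; labels are illustrative only
    tables: dict[str, list[str]] = field(default_factory=dict)  # "schema.table" -> columns


class MetaCatalog:
    """Toy meta-catalog: a single lookup point in front of many catalogs."""

    def __init__(self) -> None:
        self._catalogs: dict[str, Catalog] = {}

    def register(self, catalog: Catalog) -> None:
        self._catalogs[catalog.name] = catalog

    def resolve(self, qualified_name: str) -> tuple[str, list[str]]:
        """Split 'catalog.schema.table' and look the table up in its owning catalog."""
        catalog_name, _, rest = qualified_name.partition(".")
        catalog = self._catalogs[catalog_name]  # raises KeyError for an unknown catalog
        return catalog.provider, catalog.tables[rest]


# Two made-up catalogs registered behind the same entry point.
meta = MetaCatalog()
meta.register(Catalog("warehouse", "hive", {"sales.orders": ["order_id", "amount"]}))
meta.register(Catalog("lake", "iceberg", {"features.users": ["user_id", "embedding"]}))

provider, columns = meta.resolve("lake.features.users")
print(provider, columns)
```

The point of the sketch is only that callers use one naming scheme and one entry point regardless of which catalog actually owns the table.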

Unifying AI/ML and Big Data Stack (03:15)

One of the interesting things about Gravitino is that it offers more than a data catalog: it also catalogs models, and these model catalogs are a first step toward merging the two worlds of AI/ML and big data. Lisa describes the goal as effective model management, that is, creating a system that can store and manage different types of models, track changes to them, and control access to them.

Simplifying Data Governance (10:49)

Think of Gravitino as a “traffic cop” that helps to manage and secure data from multiple sources. It is crucial to have a system that provides unified access control across all data sources, so that access and governance are managed in one place and ML teams don't have to worry about them. Lisa says that Apache Gravitino is the system that makes data accessible to different teams and users while making sure that it is secured and governed appropriately.
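As a rough illustration of what "unified access control across all data sources" can mean, the sketch below keeps every grant in one place and answers permission checks for resources that live in different catalogs. It is a simplified assumption for this write-up, not how Gravitino implements its security model: the Grant and AccessLayer names, the role names, and the trailing "*" wildcard are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Hypothetical grant record; a trailing '*' in resource matches a name prefix."""
    role: str
    resource: str   # "catalog.schema.table"
    privilege: str  # e.g. "SELECT"


class AccessLayer:
    """Toy central policy store consulted for every data source."""

    def __init__(self) -> None:
        self._grants: set[Grant] = set()

    def grant(self, role: str, resource: str, privilege: str) -> None:
        self._grants.add(Grant(role, resource, privilege))

    def is_allowed(self, role: str, resource: str, privilege: str) -> bool:
        for g in self._grants:
            if g.role != role or g.privilege != privilege:
                continue
            if g.resource == resource:
                return True
            # Prefix match for wildcard grants such as "lake.features.*".
            if g.resource.endswith("*") and resource.startswith(g.resource[:-1]):
                return True
        return False


# Grants are defined once, even though the tables sit in different catalogs.
acl = AccessLayer()
acl.grant("ml_team", "lake.features.*", "SELECT")
acl.grant("analyst", "warehouse.sales.orders", "SELECT")

print(acl.is_allowed("ml_team", "lake.features.user_embeddings", "SELECT"))
print(acl.is_allowed("ml_team", "warehouse.sales.orders", "SELECT"))
```

In this toy version the check happens entirely in the central layer; how permissions are enforced inside each underlying system is outside what the sketch covers.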

Gravitino’s Query Engine Solution (21:34)

Every query engine has its own way of managing data, which makes it difficult to switch between engines because you have to reconfigure everything. Lisa highlights that Gravitino solves the problem by providing a single layer of data governance that works across multiple query engines.

Navigating the Fast-Paced World of Data Engineering (24:41)

Lisa talks about how fast the data engineering space is moving and shares some advice for keeping up:

Don’t try to learn everything at once.
Don’t go too deep into every tool.
Look for real-world adoption.

She warns against social media hype, which can amplify the messaging around new tools and make it seem as if everyone is using them when, in reality, actual adoption is hard to see.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Episode Resources:

Apache Gravitino website

For Feedback & Discussions on Firebolt Core:

Join Firebolt Discord Community
Join Firebolt GitHub Discussions
Firebolt Core GitHub Repository
Benjamin@Firebolt.io

The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at: fame.so

Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes:

Zach Wilson on What Makes a Great Data Engineer
Joe Reis and Matt Housley on The Fundamentals of Data Engineering
Bill Inmon, The Godfather of Data Warehousing
