
Data Engineering Podcast: Unlocking The Power of Data Lineage In Your Platform with OpenLineage
May 18, 2021
Julien Le Dem, a data engineer and CTO of Datakin, discusses the significance of data lineage in understanding data quality and pipeline impacts. He introduces OpenLineage, a project aimed at standardizing lineage metadata across various platforms, promoting collaboration among competing companies. Julien explains its core model and how it benefits data observability, trust, and reliability. He emphasizes the importance of community contributions and outlines the integration process, highlighting the pressing need for better tooling in pipeline observability.
AI Snips
Career Path That Spawned OpenLineage
- Julien traced the career path behind his lineage work, from Yahoo through Pig, Twitter, Parquet, and Arrow to WeWork.
- Building Marquez at WeWork exposed the need that led to Datakin and OpenLineage.
Running Jobs As The Core Model
- OpenLineage models running jobs as the core building block for many lineage use cases.
- Capturing job runs (inputs, outputs, start/finish) makes lineage reusable across governance, operations, and quality tools.
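The "running job" model above can be sketched as a pair of events that share one run ID. This is a minimal illustration, not the official client: the helper names (`dataset`, `run_event`), the `warehouse` namespace, and the dataset names are hypothetical, and the field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`) reflect my reading of the OpenLineage spec, so check them against the spec before relying on them.

```python
import json
import uuid
from datetime import datetime, timezone


def dataset(namespace, name):
    """Identify a dataset by namespace and name (hypothetical helper)."""
    return {"namespace": namespace, "name": name}


def run_event(event_type, run_id, job_name, inputs, outputs, namespace="example"):
    """Build one lineage event describing a state change of a job run."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},  # ties START and COMPLETE together
        "job": {"namespace": namespace, "name": job_name},
        "inputs": inputs,
        "outputs": outputs,
    }


# One run of a hypothetical job that reads raw.orders and writes clean.orders.
run_id = str(uuid.uuid4())
inputs = [dataset("warehouse", "raw.orders")]
outputs = [dataset("warehouse", "clean.orders")]

start = run_event("START", run_id, "clean_orders", inputs, outputs)
complete = run_event("COMPLETE", run_id, "clean_orders", inputs, outputs)

print(json.dumps(complete, indent=2))
```

Because both events carry the same `runId`, a governance, operations, or quality tool can each reconstruct the run (what it read, what it wrote, when it started and finished) from the same stream.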
Begin With A Minimal Core Plus Facets
- Start with the minimal core spec and extend via facets to avoid slow monolithic debates.
- Use OpenLineage clients and Marquez as reference implementations to deploy and test quickly.
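The "minimal core plus facets" idea can be sketched as optional, named metadata attached to a core entity. This is an assumption-laden illustration: `with_facet` and `core_only` are hypothetical helpers, the `schema` payload shape is my approximation of a schema facet, and `acme_ownership` is an invented custom facet, not something defined by the project.

```python
import copy


def with_facet(entity, facet_name, payload):
    """Return a copy of entity with one more facet attached under facet_name."""
    extended = copy.deepcopy(entity)
    extended.setdefault("facets", {})[facet_name] = payload
    return extended


def core_only(entity):
    """Drop all facets, leaving the minimal core a basic consumer needs."""
    return {key: value for key, value in entity.items() if key != "facets"}


# Minimal core: a dataset is just a namespace and a name.
orders = {"namespace": "warehouse", "name": "clean.orders"}

# Each facet is added independently, without changing the core fields.
with_schema = with_facet(
    orders, "schema", {"fields": [{"name": "order_id", "type": "BIGINT"}]}
)
with_owner = with_facet(with_schema, "acme_ownership", {"team": "data-platform"})

print(sorted(with_owner["facets"]))
print(core_only(with_owner))
```

A consumer that only understands the core can ignore the `facets` key entirely, which is what lets new metadata be added without reopening the core spec.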
