Data Engineering Podcast

Unlocking The Power of Data Lineage In Your Platform with OpenLineage

May 18, 2021
Julien Le Dem, a data engineer and CTO of Datakin, discusses the significance of data lineage in understanding data quality and pipeline impacts. He introduces OpenLineage, a project aimed at standardizing lineage metadata across various platforms, promoting collaboration among competing companies. Julien explains its core model and how it benefits data observability, trust, and reliability. He emphasizes the importance of community contributions and outlines the integration process, highlighting the pressing need for better tooling in pipeline observability.
ANECDOTE

Career Path That Spawned OpenLineage

  • Julien traced the career path behind his lineage work: Yahoo and Pig, then Twitter, Parquet, Arrow, and WeWork.
  • Building Marquez at WeWork exposed the need that led to Datakin and OpenLineage.
INSIGHT

Running Jobs As The Core Model

  • OpenLineage models running jobs as the core building block for many lineage use cases.
  • Capturing job runs (inputs, outputs, start/finish) makes lineage reusable across governance, operations, and quality tools.
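
To make the job-run model above concrete, here is a minimal sketch in Python of what a pair of start/finish events for one run might look like. It builds plain dictionaries rather than using any OpenLineage client library. The field names follow my understanding of the OpenLineage run event shape and should be checked against the spec. The namespace, job name, dataset names, and producer URL are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, run_id, job_name, inputs, outputs,
                   namespace="example-namespace"):
    """Build a dict shaped like a lineage run event.

    Illustrative only: this is not validated against the OpenLineage schema.
    """
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},  # one id shared by all events of a run
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        "producer": "https://example.com/hypothetical-producer",
    }


# Hypothetical job: reads one table, writes another.
run_id = str(uuid.uuid4())
start = make_run_event("START", run_id, "daily_orders_rollup",
                       ["raw.orders"], ["analytics.orders_daily"])
complete = make_run_event("COMPLETE", run_id, "daily_orders_rollup",
                          ["raw.orders"], ["analytics.orders_daily"])

print(json.dumps(start, indent=2))
```

Because both events carry the same run id, a consumer can pair them to recover the run's duration, and the input and output lists are enough to draw an edge from `raw.orders` to `analytics.orders_daily`.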
ADVICE

Begin With A Minimal Core Plus Facets

  • Start with the minimal core spec and extend via facets to avoid slow monolithic debates.
  • Use OpenLineage clients and Marquez as reference implementations to deploy and test quickly.
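
The "minimal core plus facets" idea can be sketched as follows: a dataset entry starts with only its identifying fields, and optional metadata is attached under a `facets` key without changing the core shape. This is an illustration with plain Python dictionaries, not the OpenLineage client API. The `with_facet` helper, the `myteam_dataQuality` facet name, and the producer URL are hypothetical, and the `_producer` field reflects my understanding of the facet convention rather than a checked reading of the spec.

```python
import copy


def with_facet(entity, facet_name, facet_body,
               producer="https://example.com/hypothetical-producer"):
    """Return a copy of `entity` with one extra facet attached.

    The original entity is left untouched, so the core stays minimal.
    """
    extended = copy.deepcopy(entity)
    facet = {"_producer": producer, **facet_body}
    extended.setdefault("facets", {})[facet_name] = facet
    return extended


# Core dataset: just enough to identify it.
core_dataset = {"namespace": "example-namespace",
                "name": "analytics.orders_daily"}

# Extension 1: a schema facet listing column names and types.
with_schema = with_facet(core_dataset, "schema", {
    "fields": [
        {"name": "order_date", "type": "DATE"},
        {"name": "total", "type": "DECIMAL"},
    ],
})

# Extension 2: a team-specific facet (hypothetical name and content).
with_custom = with_facet(with_schema, "myteam_dataQuality", {"rowCount": 1024})

print(sorted(with_custom["facets"]))
```

A consumer that only understands the core fields can ignore `facets` entirely, which is what lets new metadata be added without reopening the core model.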