Gnarly Data Waves by Dremio

EP38 - Building a Data Science Platform on Apache Iceberg and Nessie

Dec 11, 2023
In this insightful discussion, Jacopo Tagliabue, founder of Bauplan Labs and former AI/MLOps lead at Coveo, delves into building a modern data science platform. He explains why open-source technologies like Apache Iceberg and Project Nessie are essential for developing efficient pipelines. Jacopo highlights the importance of human-readable code, minimizing infrastructure complexity, and facilitating fast feedback loops. He also discusses how Nessie enables multi-table versioning and reproducibility, revolutionizing data management in machine learning.
ANECDOTE

Founding Story And Motivation

  • Jacopo Tagliabue described Bauplan Labs as a small startup built to close the tooling gaps his team faced when scaling AI at their prior company.
  • He cited the acquisition of their previous startup and his time running AI/MLOps at Coveo as the motivation for Bauplan.
INSIGHT

Pipelines Must Be Human-Readable

  • Data pipelines need readable code because most of the work is people iterating on that code at their machines.
  • Bauplan prioritizes developer-facing pipelines to shorten feedback loops and improve productivity.
ADVICE

Use Mixed-Language, Table-Centric Design

  • Build platforms that support mixed SQL and Python to match real practitioner workflows.
  • Treat tables as first-class artifacts instead of files to simplify mental models and reuse.
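
The two points above can be illustrated with a minimal sketch. It is not Bauplan's API and does not use Iceberg or Nessie: it assumes an in-memory SQLite database from Python's standard library as a stand-in for a table catalog, and the helpers `sql_step` and `python_step`, along with the table names, are hypothetical. The idea it shows is that every step, whether written in SQL or Python, reads named tables and writes a named table, so steps in either language can be chained.

```python
import sqlite3


def sql_step(conn, out_table, query):
    """Hypothetical helper: run a SQL query and materialize the result as a named table."""
    conn.execute(f"DROP TABLE IF EXISTS {out_table}")
    conn.execute(f"CREATE TABLE {out_table} AS {query}")


def python_step(conn, in_table, out_table, columns, fn):
    """Hypothetical helper: read a table, transform each row in Python, write a new table."""
    rows = conn.execute(f"SELECT * FROM {in_table}").fetchall()
    out_rows = [fn(row) for row in rows]
    conn.execute(f"DROP TABLE IF EXISTS {out_table}")
    conn.execute(f"CREATE TABLE {out_table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {out_table} VALUES ({placeholders})", out_rows)


# Stand-in for a table catalog (assumption: SQLite in memory, illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 10.0), ("a", 5.0), ("b", 2.0)],
)

# Step 1 (SQL): aggregate raw events into a table.
sql_step(
    conn,
    "spend_per_user",
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id",
)

# Step 2 (Python): consume the table produced by step 1 and produce another table.
python_step(
    conn,
    "spend_per_user",
    "labeled_users",
    ["user_id", "total", "segment"],
    lambda row: (row[0], row[1], "high" if row[1] >= 10 else "low"),
)

result = conn.execute("SELECT * FROM labeled_users ORDER BY user_id").fetchall()
print(result)
```

Because each step's only contract is "table in, table out", a practitioner can swap a SQL step for a Python one (or the reverse) without touching files or storage paths.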