Breaking Down Data Silos: AI and ML in Master Data Management

Jan 3, 2025

Dan Bruckner, Co-founder and CTO of Tamr and former CERN physicist, shares his insights into master data management (MDM) enhanced by AI and machine learning. He discusses his transition from physics to data science, highlighting challenges in reconciling large data sets. Dan explains how data silos form within organizations and emphasizes the role of large language models in improving user experience and data trust. He advocates for combining AI capabilities with human oversight to ensure accuracy while tackling complex data management issues.

INSIGHT

Conway's Law in Data

Organizational data becomes unwieldy because it reflects team structures, creating data silos.
This causes redundancy and makes high-level decision-making difficult, because teams use inconsistent terminology and share no common identifiers.

ANECDOTE

26 ERP Systems

A large manufacturer with 26 ERP systems tried consolidating data for master data management.
The effort failed because the new system didn't accommodate the existing workflows of the other 17 teams, making the problem worse.

INSIGHT

MDM's Persistent Challenge

Master data management (MDM) remains a challenge despite decades of business intelligence and data warehousing efforts.
Traditional MDM systems, which rely on rules and strict data models, struggle to handle the nuances and evolving nature of organizational data.
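To make the contrast concrete, here is a minimal sketch of why a strict rule can miss records that a similarity score would catch, with uncertain cases routed to a person. It is an illustration only, not the approach described in the episode: the function names, the sample company names, and the 0.9 and 0.6 thresholds are all assumptions, and the scoring uses Python's standard-library difflib rather than a trained model.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(cleaned.split())


def exact_rule_match(a: str, b: str) -> bool:
    """A strict rule: two records match only if the raw strings are identical."""
    return a == b


def similarity(a: str, b: str) -> float:
    """Character-level similarity between 0.0 and 1.0 on normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify(a: str, b: str, auto: float = 0.9, review: float = 0.6) -> str:
    """Hypothetical thresholds: accept automatically, send to a human, or reject."""
    score = similarity(a, b)
    if score >= auto:
        return "match"
    if score >= review:
        return "needs_review"
    return "no_match"


if __name__ == "__main__":
    pairs = [
        ("Acme Corp.", "ACME Corp"),
        ("Acme Corporation", "Acme Corp"),
        ("Acme Corp.", "Globex Industries"),
    ]
    for left, right in pairs:
        print(left, "|", right, "|", exact_rule_match(left, right), classify(left, right))
```

The middle band is the point of the sketch: rather than forcing every pair into match or no match, borderline pairs go to a reviewer, which mirrors the combination of automation and human oversight discussed in the episode.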

Summary
In this episode of the Data Engineering Podcast, Dan Bruckner, co-founder and CTO of Tamr, talks about the application of machine learning (ML) and artificial intelligence (AI) in master data management (MDM). Dan shares his journey from working at CERN to becoming a data expert and discusses the challenges of reconciling large-scale organizational data. He explains how data silos arise from independent teams and highlights the importance of combining traditional techniques with modern AI to address the nuances of data reconciliation. Dan emphasizes the transformative potential of large language models (LLMs) in creating more natural user experiences, improving trust in AI-driven data solutions, and simplifying complex data management processes. He also discusses the balance between using AI for complex data problems and the need for human oversight to ensure accuracy and trust.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
As a listener of the Data Engineering Podcast, you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us, don't miss Data Citizens® Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens® Dialogues, industry leaders unpack data’s impact on the world, as in their episode “The Secret Sauce Behind McDonald’s Data Strategy”, which digs into how AI-driven tools can be used to support crew efficiency and customer interactions. In particular, I appreciate the ability to hear about the challenges that enterprise-scale businesses are tackling in this fast-moving field. The Data Citizens® Dialogues podcast is bringing the data conversation to you, so start listening now! Follow Data Citizens® Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
Your host is Tobias Macey, and today I'm interviewing Dan Bruckner about the application of ML and AI techniques to the challenge of reconciling data at the scale of business.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the different ways that organizational data becomes unwieldy and needs to be consolidated and reconciled?
- How does that reconciliation relate to the practice of "master data management"?
What are the scaling challenges with the current set of practices for reconciling data?
ML has been applied to data cleaning for a long time in the form of entity resolution, etc. How has the landscape evolved or matured in recent years?
- What (if any) transformative capabilities do LLMs introduce?
What are the missing pieces/improvements that are necessary to make current AI systems usable out-of-the-box for data cleaning?
What are the strategic decisions that need to be addressed when implementing ML/AI techniques in the data cleaning/reconciliation process?
What are the risks involved in bringing ML to bear on data cleaning for inexperienced teams?
What are the most interesting, innovative, or unexpected ways that you have seen ML techniques used in data resolution?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on using ML/AI in master data management?
When is ML/AI the wrong choice for data cleaning/reconciliation?
What are your hopes/predictions for the future of ML/AI applications in MDM and data cleaning?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA