183: Why Modern Data Quality Must Move Beyond Traditional Data Management Practices with Chad Sanderson of Gable.ai
Mar 27, 2024
Data expert Chad Sanderson discusses modern data quality and management practices on this podcast. Topics include challenges with the modern data stack, rethinking data catalogs, AI impact on data, incentivizing engineers for data quality, and the role of AI in data semantics. The conversation also touches on data as a product, quantifying the cost of data changes, and the importance of slowing down to go faster in data management.
Maintaining data quality at the source ensures downstream reliability and calls for collaboration among data stakeholders.
Implementing modern data stacks provides initial value but requires effort for long-term maintenance.
Gable.ai addresses data quality challenges by enforcing contracts, improving metadata, and utilizing AI for semantic understanding.
Deep dives
The Need for Data Quality in Data Infrastructure
In this episode, Chad Sanderson argues that reliable data infrastructure depends on data quality. He stresses the need to understand the supply chain around data and the challenges that arise from organizational issues and the interdependence of different engineering teams. In his view, maintaining data quality at the source is what keeps downstream data reliable.
The Evolution of Data Stacks and the Modern Data Stack's Challenges
Sanderson delves into the evolution of data stacks and the challenges faced in modern data stack implementations. He mentions the initial utility and value gained by adopting modern data stacks but highlights the difficulties in maintaining such systems over time. Sanderson refers to data as a supply chain, emphasizing the need for interconnectedness and collaboration among different data stakeholders for effective data management.
Addressing Data Quality Issues with Gable.ai
Sanderson introduces Gable.ai, a platform aimed at tackling the data quality, compliance, and governance issues organizations face. He explains the challenges that arise when upstream data producers are unaware of how their changes affect downstream consumers, and the limits of existing tools in addressing them. Gable.ai aims to act as a data management surface where engineers and data platform managers can monitor data, enforce data contracts, and ensure data quality.
Revamping Data Catalogs and Data Lineage for Improved Data Management
Sanderson discusses the limitations of traditional data catalogs and the importance of enhancing metadata for improved data management. He emphasizes the need for robust data lineage and semantic information to provide context and meaning to data assets. Sanderson highlights the role of AI in addressing semantic metadata challenges and ensuring data trustworthiness and scalability.
Implementing Data Contracts and Treating Data as a Product
Sanderson advocates for treating data as a product and emphasizes the necessity of implementing data contracts to ensure data reliability. He suggests delineating between production and non-production data assets and implementing rigorous quality checks, similar to software development practices. Sanderson recommends creating tier one data services and quantifying the cost of data quality issues to drive awareness and accountability within organizations.
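To make the idea of a data contract concrete, here is a minimal sketch in Python of a quality check that a producer could run before publishing a record. It is an illustration only, not Gable.ai's API or a method described in the episode: the `FieldRule` class, the `validate` function, and the `orders` field names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class FieldRule:
    """One field in a hypothetical data contract."""
    name: str
    dtype: type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None  # optional semantic rule


# Hypothetical contract for an "orders" record; the field names are invented.
ORDERS_CONTRACT = [
    FieldRule("order_id", str),
    FieldRule("amount_cents", int, check=lambda v: v >= 0),
    FieldRule("coupon_code", str, required=False),
]


def validate(record: dict, contract: list) -> list:
    """Return a list of violations; an empty list means the record passes."""
    violations = []
    for rule in contract:
        # Treat an absent key and an explicit None the same way.
        if rule.name not in record or record[rule.name] is None:
            if rule.required:
                violations.append(f"{rule.name}: missing required field")
            continue
        value = record[rule.name]
        if not isinstance(value, rule.dtype):
            violations.append(
                f"{rule.name}: expected {rule.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if rule.check is not None and not rule.check(value):
            violations.append(f"{rule.name}: failed semantic check")
    return violations
```

Run in a producer's test suite or CI pipeline, a check like this would surface a breaking change before it reaches downstream consumers, in the same way a failing unit test blocks a code change.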
Comparing Data Supply Chain to Real-world Supply Chains (4:49)
Overview of Gable.ai (8:05)
Rethinking Data Catalogs (11:42)
New Ideas for Managing Data (15:16)
Data Discovery and Governance Challenges (18:51)
Static Code Analysis and AI Impact on Data (24:55)
Creating Contracts and Defining Data Lineage (27:31)
Data Quality Issues and Upstream Problems (32:32)
Challenges with Third-Party Vendors and External Data (34:29)
Incentivizing Engineers for Data Quality (40:28)
Feedback Loops and Actionability in Data Catalogs (45:30)
Missing Metadata (48:57)
Role of AI in Data Semantics (50:27)
Data as a Product (54:26)
Slowing Down to Go Faster (57:38)
Quantifying the Cost of Data Changes (1:01:24)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.