

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Jan 23, 2022 • 56min
The Importance Of Data Contracts As The Interface For Data Integration With Abhi Sivasailam
Summary
Data platforms are characterized by a complex web of connections that are subject to constantly evolving requirements. In order to make this a tractable problem it is necessary to define boundaries for communication between concerns, which brings with it the need to establish interface contracts for communicating across those boundaries. The recent move toward the data mesh as a formalized architecture that builds on this design provides the language that data teams need to make this a more organized effort. In this episode Abhi Sivasailam shares his experience designing and implementing a data mesh solution with his team at Flexport, and the importance of defining and enforcing data contracts implemented at those domain boundaries.
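As a rough illustration of the idea, the sketch below shows one way a contract might be encoded and checked at a domain boundary: a declared set of fields with types, nullability, and semantic constraints that a producing domain validates before publishing. The field names, contract shape, and enforcement style are assumptions made for this example only, not the implementation Abhi describes at Flexport.

```python
# Minimal sketch of a data contract enforced at a domain boundary.
# Field names, contract shape, and enforcement style are illustrative assumptions.
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: type
    nullable: bool = False
    check: Optional[Callable[[Any], bool]] = None  # optional semantic rule


# Hypothetical contract for an event stream published by a shipping domain.
SHIPMENT_EVENTS_CONTRACT = [
    FieldSpec("shipment_id", str),
    FieldSpec("origin_port", str),
    FieldSpec("weight_kg", float, check=lambda v: v > 0),
    FieldSpec("delivered_at", str, nullable=True),  # ISO-8601 timestamp or None
]


def validate(record: dict, contract: List[FieldSpec]) -> List[str]:
    """Return a list of contract violations for a single record."""
    errors = []
    for spec in contract:
        if spec.name not in record:
            errors.append(f"missing field: {spec.name}")
            continue
        value = record[spec.name]
        if value is None:
            if not spec.nullable:
                errors.append(f"null not allowed: {spec.name}")
            continue
        if not isinstance(value, spec.dtype):
            errors.append(f"wrong type for {spec.name}: {type(value).__name__}")
        elif spec.check is not None and not spec.check(value):
            errors.append(f"constraint failed: {spec.name}={value!r}")
    return errors


if __name__ == "__main__":
    bad_record = {"shipment_id": "S-123", "origin_port": "CNSHA", "weight_kg": -4.0}
    print(validate(bad_record, SHIPMENT_EVENTS_CONTRACT))
    # -> ['constraint failed: weight_kg=-4.0', 'missing field: delivered_at']
```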
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.
Your host is Tobias Macey and today I’m interviewing Abhi Sivasailam about the different social and technical interfaces available for defining and enforcing data contracts
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what your working definition of a "data contract" is?
What are the goals and purpose of these contracts?
What are the locations and methods of defining a data contract?
What kind of information needs to be encoded in a contract definition?
How do you manage enforcement of contracts?
manifestations of contracts in data mesh implementation
ergonomics (technical and social) of data contracts and how to prevent them from hindering productivity
What are the most interesting, innovative, or unexpected approaches to data contracts that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contract implementation?
When are data contracts the wrong choice?
Contact Info
LinkedIn
@_abhisivasailam on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Flexport
Debezium
Podcast Episode
Data Mesh At Flexport Presentation
Data Mesh
Podcast Episode
Column Names As Contracts podcast episode with Emily Riederer
dbtplyr
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 23, 2022 • 53min
Building And Managing Data Teams And Data Platforms In Large Organizations With Ashish Mrig
Ashish Mrig, head of the data analytics platform at Wayfair and organizer of a local data engineering meetup, shares his insights into managing data teams in large organizations. He discusses the challenges of balancing stakeholder demands with technological advancement, particularly in cloud migration. Ashish delves into the evolution of data engineering roles, from analytics to machine learning workloads, and emphasizes the importance of strategic planning in data architecture. He also touches on common pitfalls in data management and the future of data quality and technology.

Jan 15, 2022 • 1h 3min
Automated Data Quality Management Through Machine Learning With Anomalo
Summary
Data quality control is a requirement for being able to trust the various reports and machine learning models that rely on the information that you curate. Rules-based systems are useful for validating known requirements, but with the scale and complexity of data in modern organizations it is impractical, and often impossible, to manually create rules for all potential errors. The team at Anomalo is building a machine learning powered platform for identifying and alerting on anomalous and invalid changes in your data so that you aren’t flying blind. In this episode founders Elliot Shmukler and Jeremy Stanley explain how they have architected the system to work with your data warehouse and let you know about the critical issues hiding in your data without overwhelming you with alerts.
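To make the general idea concrete, here is a deliberately simple sketch of one signal such a system could learn from history without any hand-written rules: flagging a day whose row count departs sharply from recent values, using a robust (median-based) z-score. This is a toy assumption for illustration, not the statistical or machine learning models that Anomalo actually uses.

```python
# Illustrative sketch of rule-free data quality checking: learn what "normal"
# looks like from recent history and flag sharp departures. This toy example
# uses a modified z-score on daily row counts; it is not Anomalo's approach.
import statistics


def is_anomalous(history: list, today: int, threshold: float = 3.5) -> bool:
    """Flag `today` if its modified z-score versus `history` exceeds `threshold`."""
    median = statistics.median(history)
    mad = statistics.median(abs(x - median) for x in history)  # median absolute deviation
    if mad == 0:
        return today != median
    modified_z = 0.6745 * (today - median) / mad
    return abs(modified_z) > threshold


daily_row_counts = [10_120, 9_980, 10_305, 10_050, 9_870, 10_210, 10_140]
print(is_anomalous(daily_row_counts, 10_095))  # False: within normal variation
print(is_anomalous(daily_row_counts, 2_310))   # True: likely a broken or partial load
```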
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Elliot Shmukler and Jeremy Stanley about Anomalo, a data quality platform aiming to automate issue detection with zero setup
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Anomalo is and the story behind it?
Managing data quality is ostensibly about building trust in your data. What are the promises that data teams are able to make about the information in their control when they are using Anomalo?
What are some of the claims that cannot be made unequivocally when relying on data quality monitoring systems?
types of data quality issues identified
utility of automated vs programmatic tests
Can you describe how the Anomalo system is designed and implemented?
How have the design and goals of the platform changed or evolved since you started working on it?
What is your approach for validating changes to the business logic in your platform given the unpredictable nature of the system under test?
model training/customization process
statistical model
seasonality/windowing
CI/CD
With any monitoring system the most challenging thing to do is avoid generating alerts that aren’t actionable or helpful. What is your strategy for helping your customers avoid alert fatigue?
What are the most interesting, innovative, or unexpected ways that you have seen Anomalo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomalo?
When is Anomalo the wrong choice?
What do you have planned for the future of Anomalo?
Contact Info
Elliot
LinkedIn
@eshmu on Twitter
Jeremy
LinkedIn
@jeremystan on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Anomalo
Great Expectations
Podcast Episode
Shapley Values
Gradient Boosted Decision Tree
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 15, 2022 • 50min
An Introduction To Data And Analytics Engineering For Non-Programmers
Summary
Applications of data have grown well beyond the venerable business intelligence dashboards that organizations have relied on for decades. Now data is being used to power consumer facing services, influence organizational behaviors, and build sophisticated machine learning systems. Given this increased level of importance, it has become necessary for everyone in the business to treat data as a product, in the same way that software applications came to be treated as products in the early 2000s. In this episode Brian McMillan shares his work on the book "Building Data Products" and how he is working to educate business users and data professionals about the combination of technical, economic, and business considerations that need to be blended for these projects to succeed.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.
Your host is Tobias Macey and today I’m interviewing Brian McMillan about building data products and his book to introduce the work of data analysts and engineers to non-programmers
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what motivated you to write a book about the work of building data products?
Who is your target audience?
What are the main goals that you are trying to achieve through the book?
What was your approach for determining the structure and contents of the book?
What are the core principles of data engineering that have remained from the original wave of ETL tools and rigid data warehouses?
What are some of the new foundational elements of data products that need to be codified for the next generation of organizations and data professionals?
There is a lot of activity and conversation happening in and around data which can make it difficult to understand which parts are signal and which are noise. What, if anything, do you see as being truly new and/or innovative?
Are there any core lessons or principles that you consider to be at risk of getting drowned out in the current frenzy of activity?
How do the practices for building products with small teams differ from those employed by larger groups?
What do you see as the threshold beyond which a team can no longer be considered "small"?
What are the roles/skills/titles that you view as necessary for building data products in the current phase of maturity for the ecosystem?
What do you see as the biggest risks to engineering and data teams?
What are the most interesting, innovative, or unexpected ways that you have seen the principles in the book used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the book?
Contact Info
Email
twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Building Data Products: Introduction to Data and Analytics Engineering for non-programmers
Theory of Constraints
Throughput Economics
"Swaptronics" – The act of swapping out electronic components until you find a combination that works.
Informatica
SSIS – Microsoft SQL Server Integration Services
3X – Kent Beck
Wardley Maps
Vega Lite
Datasette
Why Use Make – Mike Bostock
Building Production Applications Using Go & SQLite
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 8, 2022 • 45min
Open Source Reverse ETL For Everyone With Grouparoo
Summary
Reverse ETL is a product category that evolved from the landscape of customer data platforms, with a number of companies offering their own implementation of it. While struggling with the work of automating data integration workflows with marketing, sales, and support tools, Brian Leonard accidentally discovered this need himself and turned it into the open source framework Grouparoo. In this episode he explains why he decided to turn these efforts into an open core business, how the platform is implemented, and the benefits of having an open source contender in the landscape of operational analytics products.
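For readers new to the category, the following sketch shows the core loop that a reverse ETL tool performs conceptually: read records from the warehouse, compare them with what the destination already holds, and push only the differences. The function and record names are hypothetical stand-ins and do not reflect Grouparoo's actual architecture or plugin API.

```python
# Toy illustration of the core reverse ETL loop: pull records from the
# warehouse, compare to the destination's current state, and emit the diff.
# All names here are hypothetical, not Grouparoo internals.
from typing import Dict, Iterable, Iterator, Tuple

Profile = Dict[str, str]


def plan_sync(
    warehouse: Iterable[Profile], destination: Dict[str, Profile]
) -> Iterator[Tuple[str, Profile]]:
    """Yield ('create' | 'update', profile) operations needed to converge."""
    for profile in warehouse:
        key = profile["email"]  # assume email is the unique identifier
        existing = destination.get(key)
        if existing is None:
            yield ("create", profile)
        elif existing != profile:
            yield ("update", profile)
        # unchanged records are skipped, keeping destination API calls minimal


warehouse_rows = [
    {"email": "ada@example.com", "plan": "pro", "mrr": "99"},
    {"email": "grace@example.com", "plan": "free", "mrr": "0"},
]
destination_state = {
    "ada@example.com": {"email": "ada@example.com", "plan": "trial", "mrr": "0"},
}

for op, record in plan_sync(warehouse_rows, destination_state):
    print(op, record["email"])  # update ada@example.com, then create grace@example.com
```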
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Brian Leonard about Grouparoo, an open source framework for managing your reverse ETL pipelines
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Grouparoo is and the story behind it?
What are the core requirements for building a reverse ETL system?
What are the additional capabilities that users of the system ask for as they get more advanced in their usage?
Who is your target user for Grouparoo and how does that influence your priorities on feature development and UX design?
What are the benefits of building an open source core for a reverse ETL platform as compared to the other commercial options?
Can you describe the architecture and implementation of the Grouparoo project?
What are the additional systems that you have built to support the hosted offering?
How have the design and goals of the project changed since you first started working on it?
What is the workflow for getting Grouparoo deployed and set up with an initial pipeline?
How does Grouparoo handle model and schema evolution and potential mismatch in the data warehouse and destination systems?
What is the process for building a new integration and getting it included in the official list of plugins?
What is your strategy/philosophy around which features are included in the open source vs. hosted/enterprise offerings?
What are the most interesting, innovative, or unexpected ways that you have seen Grouparoo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Grouparoo?
When is Grouparoo the wrong choice?
What do you have planned for the future of Grouparoo?
Contact Info
LinkedIn
@bleonard on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Grouparoo
GitHub
Task Rabbit
Snowflake
Podcast Episode
Looker
Podcast Episode
Customer Data Platform
Podcast Episode
dbt
Open Source Data Stack Conference
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 8, 2022 • 51min
Data Observability Out Of The Box With Metaplane
Summary
Data observability is a set of technical and organizational capabilities related to understanding how your data is being processed and used so that you can proactively identify and fix errors in your workflows. In this episode Metaplane founder Kevin Hu shares his working definition of the term and explains the work that he and his team are doing to cut down on the time to adoption for this new set of practices. He discusses the factors that influenced his decision to start with the data warehouse, the potential shortcomings of that approach, and where he plans to go from there. This is a great exploration of what it means to treat your data platform as a living system and apply state of the art engineering to it.
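As a small example of the kind of signal a warehouse-centric observability tool can derive from metadata alone, the sketch below checks table freshness against an expected update interval and reports tables that have gone stale. The table names and thresholds are invented for illustration; this is not Metaplane's implementation.

```python
# Simple freshness monitor over warehouse metadata: report tables whose last
# update is older than an expected interval. Table names and thresholds are
# hypothetical examples, not Metaplane configuration.
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

EXPECTED_FRESHNESS = {
    "analytics.orders": timedelta(hours=1),
    "analytics.daily_revenue": timedelta(hours=26),  # daily job plus some slack
}


def stale_tables(
    last_updated: Dict[str, datetime], now: Optional[datetime] = None
) -> List[str]:
    """Return tables whose last update is older than their expected interval."""
    now = now or datetime.now(timezone.utc)
    return [
        table
        for table, updated_at in last_updated.items()
        if now - updated_at > EXPECTED_FRESHNESS.get(table, timedelta(hours=24))
    ]


observed = {
    "analytics.orders": datetime.now(timezone.utc) - timedelta(minutes=20),
    "analytics.daily_revenue": datetime.now(timezone.utc) - timedelta(days=2),
}
print(stale_tables(observed))  # -> ['analytics.daily_revenue']
```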
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Kevin Hu about Metaplane, a platform aiming to provide observability for modern data stacks, from warehouses to BI dashboards and everything in between.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Metaplane is and the story behind it?
Data observability is an area that has seen a huge amount of activity over the past couple of years. What is your working definition of that term?
What are the areas of differentiation that you see across vendors in the space?
Can you describe how the Metaplane platform is architected?
How have the design and goals of Metaplane changed or evolved since you started working on it?
establishing seasonality in data metrics
blind spots from operating at the level of the data warehouse
What are the most interesting, innovative, or unexpected ways that you have seen Metaplane used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaplane?
When is Metaplane the wrong choice?
What do you have planned for the future of Metaplane?
Contact Info
LinkedIn
@kevinzhenghu on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Metaplane
Datadog
Control Theory
James Clerk Maxwell
Centrifugal Governor
Huygens
Amazon ECS
Stop Hiring Devops Experts (And Start Growing Them)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 2, 2022 • 1h 1min
Creating Shared Context For Your Data Warehouse With A Controlled Vocabulary
Summary
Communication and shared context are the hardest parts of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of record that must be consulted to find the necessary information. In this episode Emily Riederer shares her work to create a controlled vocabulary for managing the semantic elements of the data managed by her team and encoding it in the schema definitions in her data warehouse. She also explains how she created the dbtplyr package to simplify the work of creating and enforcing your own controlled vocabularies.
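To give a flavor of the approach, the sketch below encodes a small controlled vocabulary in column name prefixes and derives expectations from the names alone, along with a dplyr-style helper for selecting columns by pattern. The prefix meanings are example assumptions in the spirit of the episode rather than output from the dbtplyr package itself.

```python
# Illustration of a controlled vocabulary encoded in column name prefixes and
# used to derive automated expectations. Prefix meanings are example
# assumptions, not dbtplyr's actual macros or output.
from typing import Dict, List

VOCABULARY = {
    "ID_":  "unique, non-null identifier",
    "IND_": "boolean indicator (0/1)",
    "N_":   "non-negative count",
    "DT_":  "date in ISO-8601 format",
}


def expectations_for(columns: List[str]) -> Dict[str, str]:
    """Map each column to the expectation implied by its prefix, if any."""
    rules = {}
    for col in columns:
        for prefix, meaning in VOCABULARY.items():
            if col.upper().startswith(prefix):
                rules[col] = meaning
                break
        else:
            rules[col] = "no contract: name does not use the vocabulary"
    return rules


def ends_with(columns: List[str], suffix: str) -> List[str]:
    """Select columns by suffix, in the spirit of dplyr-style select helpers."""
    return [c for c in columns if c.upper().endswith(suffix.upper())]


cols = ["ID_ACCOUNT", "IND_ACTIVE", "N_LOGINS", "DT_SIGNUP", "notes"]
for col, rule in expectations_for(cols).items():
    print(f"{col:12s} -> {rule}")
print(ends_with(cols, "_signup"))  # -> ['DT_SIGNUP']
```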
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Emily Riederer about defining and enforcing column contracts and controlled vocabularies for your data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by discussing some of the anti-patterns that you have encountered in data warehouse naming conventions and how it relates to the modeling approach? (e.g. star/snowflake schema, data vault, etc.)
What are some of the types of contracts that can, and should, be defined and enforced in data workflows?
What are the boundaries where we should think about establishing those contracts?
What is the utility of column and table names for defining and enforcing contracts in analytical work?
What is the process for establishing contractual elements in a naming schema?
Who should be involved in that design process?
Who are the participants in the communication paths for column naming contracts?
What are some examples of context and details that can’t be captured in column names?
What are some options for managing that additional information and linking it to the naming contracts?
Can you describe the work that you have done with dbtplyr to make name contracts a supported construct in dbt projects?
How does dbtplyr help in the creation and enforcement of contracts in the development of dbt workflows
How are you using dbtplyr in your own work?
How do you handle the work of building transformations to make data comply with contracts?
What are the supplemental systems/techniques/documentation to work with name contracts and how they are leveraged by downstream consumers?
What are the most interesting, innovative, or unexpected ways that you have seen naming contracts and/or dbtplyr used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbtplyr?
When is dbtplyr the wrong choice?
What do you have planned for the future of dbtplyr?
Contact Info
Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
dbtplyr
Great Expectations
Podcast Episode
Controlled Vocabularies Presentation
dplyr
Data Vault
Podcast Episode
OpenMetadata
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 2, 2022 • 1h 3min
A Reflection On The Data Ecosystem For The Year 2021
Summary
This has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. In an attempt to capture the zeitgeist, Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy join the show to reflect on the past year and share their thoughts on the year to come.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy about the key themes of 2021 in the data ecosystem and what to expect for next year
Interview
Introduction
How did you get involved in the area of data management?
What were the main themes that you saw data practitioners and vendors focused on this year?
What is the major bottleneck for Data teams in 2021? Will it be the same in 2022?
One of the ways to reason about progress in any domain is to look at what the primary bottleneck to further progress (here, data adoption for decision making) was at different points in time. In the data domain we have seen a number of bottlenecks. For example, scaling data platforms, which was answered first by Hadoop and on-prem columnar stores and then by cloud data warehouses such as Snowflake and BigQuery. Then the problem was data integration and transformation, which was addressed by data integration vendors and frameworks such as Fivetran and Airbyte, modern orchestration and transformation frameworks such as Dagster and dbt, and "reverse ETL" tools such as Hightouch. What is the main challenge now?
Will SQL be challenged as a primary interface to analytical data?
In 2020 we saw a few launches of post-SQL languages such as Malloy and Preql, as well as metric layer query languages from Transform and Supergrain.
To what extent does speed matter?
Over the past couple of months, we’ve seen the resurgence of “benchmark wars” between major data warehousing platforms. To what extent do speed benchmarks inform decisions for modern data teams? How important is query speed in a modern data workflow? What needs to be true about your current DWH solution and potential alternatives to make a move?
How has the way data teams work been changing?
In 2020 remote seemed like a temporary emergency state. In 2021, it went mainstream. How has that affected the day-to-day of data teams, how they collaborate internally and with stakeholders?
What’s it like to be a data vendor in 2021?
Vertically integrated vs. modular data stack?
There are multiple forces in play. Will the stack continue to be fragmented? Will we see major consolidation? If so, in which parts of the stack?
Contact Info
Maura
LinkedIn
Website
@outoftheverse on Twitter
David
LinkedIn
@davidjwallace on Twitter
dwallace0723 on GitHub
Benn
LinkedIn
@bennstancil on Twitter
Gleb
LinkedIn
@glebmm on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Patreon
Dutchie
Mode Analytics
Datafold
Podcast Episode
Locally Optimistic
RJ Metrics
Stitch
Mozart Data
Podcast Episode
Dagster
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 27, 2021 • 1h 11min
Revisiting The Technical And Social Benefits Of The Data Mesh
Summary
The data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural pattern in 2019, and since then it has been gaining popularity, with many companies adopting some version of it in their systems. In this episode Zhamak re-joins the show to discuss the real world benefits that have been seen, the lessons that she has learned while working with her clients and the community, and her vision for the future of the data mesh.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m welcoming back Zhamak Dehghani to talk about her work on the data mesh book and the lessons learned over the past 2 years
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving a brief recap of the principles of the data mesh and the story behind it?
How has your view of the principles of the data mesh changed since our conversation in July of 2019?
What are some of the ways that your work on the data mesh book influenced your thinking on the practical elements of implementing a data mesh?
What do you view as the as-yet-unknown elements of the technical and social design constructs that are needed for a sustainable data mesh implementation?
In the opening of your book you state that "Data Mesh is a new approach in sourcing, managing, and accessing data for analytical use cases at scale". As with everything, scale is subjective, but what are some of the heuristics that you rely on for determining when a data mesh is an appropriate solution?
What are some of the ways that data mesh concepts manifest at the boundaries of organizations?
While the idea of federated access to data product quanta reduces the amount of coordination necessary at the organizational level, it raises the spectre of more complex logic required for consumers of multiple quanta. How can data mesh implementations mitigate the impact of this problem?
What are some of the technical components that you have found to be best suited to the implementation of data elements within a mesh?
What are the technological components that are still missing for a mesh-native data platform?
How should an organization that wishes to implement a mesh style architecture think about the roles and skills that they will need on staff?
How can vendors factor into the solution?
What is the role of application developers in a data mesh ecosystem and how do they need to change their thinking around the interfaces that they provide in their products?
What are the most interesting, innovative, or unexpected ways that you have seen data mesh principles used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh implementations?
When is a data mesh the wrong approach?
What do you think the future of the data mesh will look like?
Contact Info
LinkedIn
@zhamakd on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Data Engineering Podcast Data Mesh Interview
Data Mesh Book
Thoughtworks
Expert Systems
OpenLineage
Podcast Episode
Data Mesh Learning
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 27, 2021 • 58min
Exploring The Evolving Role Of Data Engineers
Summary
Data Engineering is still a relatively new field that is going through a continued evolution as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin about the impacts that the evolution of the modern data stack has had on the role and responsibilities of data engineers
Interview
Introduction
How did you get involved in the area of data management?
What is your current working definition of a data engineer?
How has that definition changed since your article on the "rise of the data engineer" and episode 3 of this show about "defining data engineering"?
How has the growing availability of data infrastructure services shifted foundational skills and knowledge that are necessary to be effective?
How should a new/aspiring data engineer focus their time and energy to become effective?
One of the core themes in this current spate of technologies is "democratization of data". In your post on the downfall of the data engineer you called out the pressure on data engineers to maintain control when so many contributors have varying levels of skill and understanding. How well is the "modern data stack" balancing these concerns?
An interesting impact of the growing usage of data is the constrained availability of data engineers. How do you see the effects of the job market on driving evolution of tooling and services?
With the explosion of tools and services for working with data, a new problem has evolved of which ones to use for a given organization. What do you see as an effective and efficient process for enumerating and evaluating the available components for building a stack?
There is also a lot of conversation around the "modern data stack", as well as the need for companies to build a "data platform". What (if any) difference do you see in the implications of those phrases and the skills required to compile a stack vs build a platform?
How do you view the long term viability of templated SQL as a core workflow for transformations?
What is the impact of more accessible and widespread machine learning/deep learning on data engineers/data infrastructure?
How evenly distributed across industries and geographies are the advances in data infrastructure and engineering practices?
What are some of the opportunities that are being missed or squandered during this dramatic shift in the data engineering landscape?
What are the most interesting, innovative, or unexpected ways that you have seen the data ecosystem evolve?
What are the most interesting, unexpected, or challenging lessons that you have learned while contributing to and participating in the data ecosystem?
In episode 3 of this show (almost five years ago) we closed with some predictions for the following years of data engineering, many of which have been proven out. What is your retrospective on those claims, and what are your new predictions for the upcoming years?
Contact Info
LinkedIn
@mistercrunch on Twitter
mistercrunch on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
How the Modern Data Stack is Reshaping Data Engineering
The Rise of the Data Engineer
The Downfall of the Data Engineer
Defining Data Engineering – Data Engineering Podcast
Airflow
Superset
Podcast Episode
Preset
Fivetran
Podcast Episode
Meltano
Podcast Episode
Airbyte
Podcast Episode
Ralph Kimball
Bill Inmon
Feature Store
Prophecy.io
Podcast Episode
Ab Initio
Dremio
Podcast Episode
Data Mesh
Podcast Episode
Firebolt
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast