
Data Engineering Podcast
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Latest episodes

Jan 2, 2022 • 1h 1min
Creating Shared Context For Your Data Warehouse With A Controlled Vocabulary
Summary
Communication and shared context are the hardest part of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of record in order to find the necessary information. In this episode Emily Riederer shares her work to create a controlled vocabulary for managing the semantic elements of the data managed by her team and encoding it in the schema definitions in her data warehouse. She also explains how she created the dbtplyr package to simplify the work of creating and enforcing your own controlled vocabularies.
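As a rough illustration of the idea of a name-based column contract (not Emily's dbtplyr implementation, which ships as dbt macros rather than Python), the sketch below checks hypothetical column names against a small controlled vocabulary of name stubs; the vocabulary entries and column names are invented for the example.

```python
# A minimal sketch of a column-name "contract" check against a hypothetical
# controlled vocabulary of name stubs; dbtplyr expresses the same idea as
# dbt macros operating on warehouse schemas.

CONTROLLED_VOCABULARY = {
    "ID":  "unique identifier, never null",
    "IND": "binary 0/1 indicator",
    "N":   "non-negative count",
    "AMT": "summable monetary amount",
    "DT":  "date in YYYY-MM-DD",
}

def check_name_contracts(column_names):
    """Return columns whose leading stub is not in the controlled vocabulary."""
    violations = []
    for name in column_names:
        stub = name.split("_", 1)[0].upper()
        if stub not in CONTROLLED_VOCABULARY:
            violations.append(name)
    return violations

# Hypothetical warehouse columns: the last one breaks the contract.
print(check_name_contracts(["id_account", "ind_fraud", "amt_payment", "customer_flag"]))
# -> ['customer_flag']
```

In a real project the same convention would typically be enforced inside the warehouse tooling itself, for example as tests generated from the vocabulary.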
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Emily Riederer about defining and enforcing column contracts and controlled vocabularies for your data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by discussing some of the anti-patterns that you have encountered in data warehouse naming conventions and how it relates to the modeling approach? (e.g. star/snowflake schema, data vault, etc.)
What are some of the types of contracts that can, and should, be defined and enforced in data workflows?
What are the boundaries where we should think about establishing those contracts?
What is the utility of column and table names for defining and enforcing contracts in analytical work?
What is the process for establishing contractual elements in a naming schema?
Who should be involved in that design process?
Who are the participants in the communication paths for column naming contracts?
What are some examples of context and details that can’t be captured in column names?
What are some options for managing that additional information and linking it to the naming contracts?
Can you describe the work that you have done with dbtplyr to make name contracts a supported construct in dbt projects?
How does dbtplyr help in the creation and enforcement of contracts in the development of dbt workflows?
How are you using dbtplyr in your own work?
How do you handle the work of building transformations to make data comply with contracts?
What are the supplemental systems/techniques/documentation to work with name contracts and how they are leveraged by downstream consumers?
What are the most interesting, innovative, or unexpected ways that you have seen naming contracts and/or dbtplyr used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbtplyr?
When is dbtplyr the wrong choice?
What do you have planned for the future of dbtplyr?
Contact Info
Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
dbtplyr
Great Expectations
Podcast Episode
Controlled Vocabularies Presentation
dplyr
Data Vault
Podcast Episode
OpenMetadata
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 2, 2022 • 1h 3min
A Reflection On The Data Ecosystem For The Year 2021
Summary
This has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. In an attempt to capture the zeitgeist, Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy join the show to reflect on the past year and share their thoughts on the year to come.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy about the key themes of 2021 in the data ecosystem and what to expect for next year
Interview
Introduction
How did you get involved in the area of data management?
What were the main themes that you saw data practitioners and vendors focused on this year?
What is the major bottleneck for Data teams in 2021? Will it be the same in 2022?
One way to reason about progress in any domain is to look at what the primary bottleneck to further progress (here, data adoption for decision making) was at different points in time. In the data domain we have seen a series of bottlenecks: first scaling data platforms, answered by Hadoop and on-prem columnar stores and then by cloud data warehouses such as Snowflake and BigQuery; then data integration and transformation, addressed by integration vendors and frameworks such as Fivetran and Airbyte, modern orchestration and transformation frameworks such as Dagster and dbt, and “reverse ETL” tools such as Hightouch. What is the main challenge now?
Will SQL be challenged as a primary interface to analytical data?
In 2020 we saw a few launches of post-SQL languages such as Malloy and Preql, as well as metric layer query languages from Transform and Supergrain.
To what extent does speed matter?
Over the past couple of months, we’ve seen the resurgence of “benchmark wars” between major data warehousing platforms. To what extent do speed benchmarks inform decisions for modern data teams? How important is query speed in a modern data workflow? What needs to be true about your current DWH solution and potential alternatives to make a move?
How has the way data teams work been changing?
In 2020 remote seemed like a temporary emergency state. In 2021, it went mainstream. How has that affected the day-to-day of data teams, how they collaborate internally and with stakeholders?
What’s it like to be a data vendor in 2021?
Vertically integrated vs. modular data stack?
There are multiple forces in play. Will the stack continue to be fragmented? Will we see major consolidation? If so, in which parts of the stack?
Contact Info
Maura
LinkedIn
Website
@outoftheverse on Twitter
David
LinkedIn
@davidjwallace on Twitter
dwallace0723 on GitHub
Benn
LinkedIn
@bennstancil on Twitter
Gleb
LinkedIn
@glebmm on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Patreon
Dutchie
Mode Analytics
Datafold
Podcast Episode
Locally Optimistic
RJ Metrics
Stitch
Mozart Data
Podcast Episode
Dagster
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 27, 2021 • 1h 11min
Revisiting The Technical And Social Benefits Of The Data Mesh
Summary
The data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural pattern in 2019, and since then it has been gaining popularity, with many companies adopting some version of it in their systems. In this episode Zhamak re-joins the show to discuss the real-world benefits that have been seen, the lessons that she has learned while working with her clients and the community, and her vision for the future of the data mesh.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m welcoming back Zhamak Dehghani to talk about her work on the data mesh book and the lessons learned over the past 2 years
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving a brief recap of the principles of the data mesh and the story behind it?
How has your view of the principles of the data mesh changed since our conversation in July of 2019?
What are some of the ways that your work on the data mesh book influenced your thinking on the practical elements of implementing a data mesh?
What do you view as the as-yet-unknown elements of the technical and social design constructs that are needed for a sustainable data mesh implementation?
In the opening of your book you state that "Data Mesh is a new approach in sourcing, managing, and accessing data for analytical use cases at scale". As with everything, scale is subjective, but what are some of the heuristics that you rely on for determining when a data mesh is an appropriate solution?
What are some of the ways that data mesh concepts manifest at the boundaries of organizations?
While the idea of federated access to data product quanta reduces the amount of coordination necessary at the organizational level, it raises the spectre of more complex logic required for consumers of multiple quanta. How can data mesh implementations mitigate the impact of this problem?
What are some of the technical components that you have found to be best suited to the implementation of data elements within a mesh?
What are the technological components that are still missing for a mesh-native data platform?
How should an organization that wishes to implement a mesh style architecture think about the roles and skills that they will need on staff?
How can vendors factor into the solution?
What is the role of application developers in a data mesh ecosystem and how do they need to change their thinking around the interfaces that they provide in their products?
What are the most interesting, innovative, or unexpected ways that you have seen data mesh principles used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh implementations?
When is a data mesh the wrong approach?
What do you think the future of the data mesh will look like?
Contact Info
LinkedIn
@zhamakd on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Data Engineering Podcast Data Mesh Interview
Data Mesh Book
Thoughtworks
Expert Systems
OpenLineage
Podcast Episode
Data Mesh Learning
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 27, 2021 • 58min
Exploring The Evolving Role Of Data Engineers
Summary
Data Engineering is still a relatively new field that is going through a continued evolution as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin about the impacts that the evolution of the modern data stack has had on the role and responsibilities of data engineers
Interview
Introduction
How did you get involved in the area of data management?
What is your current working definition of a data engineer?
How has that definition changed since your article on the "rise of the data engineer" and episode 3 of this show about "defining data engineering"?
How has the growing availability of data infrastructure services shifted foundational skills and knowledge that are necessary to be effective?
How should a new/aspiring data engineer focus their time and energy to become effective?
One of the core themes in this current spate of technologies is "democratization of data". In your post on the downfall of the data engineer you called out the pressure on data engineers to maintain control when there are so many contributors with varying levels of skill and understanding. How well is the "modern data stack" balancing these concerns?
An interesting impact of the growing usage of data is the constrained availability of data engineers. How do you see the effects of the job market on driving evolution of tooling and services?
With the explosion of tools and services for working with data, a new problem has evolved of which ones to use for a given organization. What do you see as an effective and efficient process for enumerating and evaluating the available components for building a stack?
There is also a lot of conversation around the "modern data stack", as well as the need for companies to build a "data platform". What (if any) difference do you see in the implications of those phrases and the skills required to compile a stack vs build a platform?
How do you view the long term viability of templated SQL as a core workflow for transformations?
What is the impact of more accessible and widespread machine learning/deep learning on data engineers/data infrastructure?
How evenly distributed across industries and geographies are the advances in data infrastructure and engineering practices?
What are some of the opportunities that are being missed or squandered during this dramatic shift in the data engineering landscape?
What are the most interesting, innovative, or unexpected ways that you have seen the data ecosystem evolve?
What are the most interesting, unexpected, or challenging lessons that you have learned while contributing to and participating in the data ecosystem?
In episode 3 of this show (almost five years ago) we closed with some predictions for the following years of data engineering, many of which have been proven out. What is your retrospective on those claims, and what are your new predictions for the upcoming years?
Contact Info
LinkedIn
@mistercrunch on Twitter
mistercrunch on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
How the Modern Data Stack is Reshaping Data Engineering
The Rise of the Data Engineer
The Downfall of the Data Engineer
Defining Data Engineering – Data Engineering Podcast
Airflow
Superset
Podcast Episode
Preset
Fivetran
Podcast Episode
Meltano
Podcast Episode
Airbyte
Podcast Episode
Ralph Kimball
Bill Inmon
Feature Store
Prophecy.io
Podcast Episode
Ab Initio
Dremio
Podcast Episode
Data Mesh
Podcast Episode
Firebolt
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 21, 2021 • 55min
Fast And Flexible Headless Data Analytics With Cube.JS
Summary
One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artyom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community.
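As a hedged sketch of what "a flexible API endpoint for querying shared definitions" can look like in practice, the snippet below issues a query to Cube's REST load endpoint from Python; the host URL, auth token, and the Orders cube with its measures and dimensions are hypothetical examples rather than part of any real deployment.

```python
# A sketch of querying a Cube.js REST endpoint; host, token, and the "Orders"
# cube are hypothetical placeholders for this example.
import json
import requests

CUBE_URL = "https://analytics.example.com/cubejs-api/v1/load"  # hypothetical host
HEADERS = {"Authorization": "CUBE_API_TOKEN"}                   # placeholder token

query = {
    "measures": ["Orders.count"],
    "dimensions": ["Orders.status"],
    "timeDimensions": [
        {"dimension": "Orders.createdAt", "granularity": "month"}
    ],
}

resp = requests.get(CUBE_URL, headers=HEADERS, params={"query": json.dumps(query)})
resp.raise_for_status()
for row in resp.json()["data"]:
    print(row)
```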
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Artyom Keydunov and Pavel Tiunov about Cube.js, a framework for building analytics APIs to power your applications and BI dashboards
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Cube is and the story behind it?
What are the main use cases and platform architectures that you are focused on?
Who are the target personas that will be using and managing Cube.js?
The name comes from the concept of an OLAP cube. Can you discuss the applications of OLAP cubes and their role in the current state of the data ecosystem?
How does the idea of an OLAP cube compare to the recent focus on a dedicated metrics layer?
What are the pieces of a data platform that might be replaced by Cube.js?
Can you describe the design and architecture of the Cube platform?
How has the focus and target use case for the Cube platform evolved since you first started working on it?
One of the perpetually hard problems in computer science is cache management. How have you approached that challenge in the pre-aggregation layer of the Cube framework?
What is your overarching design philosophy for the API of the Cube system?
Can you talk through the workflow of someone building a cube and querying it from a downstream system?
What do the iteration cycles look like as you go from initial proof of concept to a more sophisticated usage of Cube.js?
What are some of the data modeling steps that are needed in the source systems?
The perennial problem of embedding SQL into another host language or DSL is how to deal with validation and developer tooling. What are the utilities that you and the community have built to reduce friction while writing the definitions of a cube?
What are the methods available for maintaining visibility across all of the cubes defined within and across installations of Cube.js?
What are the opportunities for composing multiple cubes together to form a higher level aggregation?
What are the most interesting, innovative, or unexpected ways that you have seen Cube.js used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube?
When is Cube the wrong choice?
What do you have planned for the future of Cube?
Contact Info
Artyom
keydunov on GitHub
@keydunov on Twitter
LinkedIn
Pavel
LinkedIn
@paveltiunov87 on Twitter
paveltiunov on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Cube.js
Statsbot
chart.js
Highcharts
D3
OLAP Cube
dbt
Superset
Podcast Episode
Streamlit
Podcast.__init__ Episode
Parquet
Hasura
ksqlDB
Podcast Episode
Materialize
Podcast Episode
Meltano
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 20, 2021 • 1h 6min
Building A System Of Record For Your Organization's Data Ecosystem At Metaphor
Summary
Building a well managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor are building a fully connected metadata layer to provide both technical and social intelligence about your data. In this episode Pardhu Gunnam and Mars Lan explain how they have designed the architecture and user experience to allow everyone to collaborate on the data lifecycle and provide opportunities for automation and extensible workflows.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about Metaphor Data, a platform aiming to be the system of record for your data ecosystem
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Metaphor is and the story behind it?
On your site it states that you are aiming to be the "system of record" for your data platform. Can you unpack that statement and its implications?
What are the shortcomings in the "data catalog" approach to metadata collection and presentation?
Who are the target end users of Metaphor and what are the pain points for each persona that you are prioritizing?
How has that focus informed your priorities for user experience design and feature development?
Can you describe how the Metaphor platform is architected?
What are the lessons that you learned from your work at DataHub that have informed your work on Metaphor?
There has been a huge amount of focus on the "modern data stack" with an assumption that there is a cloud data warehouse as the central component that all data flows through. How does Metaphor’s design allow for usage in platforms that aren’t dominated by a cloud data warehouse?
What are some examples of information that you can extract through integrations with an organization’s communication platforms?
Can you talk through a few example workflows where that information is used to inform the actions taken by a team member?
What is your philosophy around data modeling or schema standardization for metadata records?
What are some of the challenges that teams face in stitching together a meaningful set of relations across metadata records in Metaphor?
What are some of the features or potential use cases for Metaphor that are overlooked or misunderstood as you work with your customers?
What are the most interesting, innovative, or unexpected ways that you have seen Metaphor used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaphor?
When is Metaphor the wrong choice?
What do you have planned for the future of Metaphor?
Contact Info
Pardhu
LinkedIn
@PardhuGunnam on Twitter
Mars
LinkedIn
mars-lan on GitHub
@mars_lan on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Metaphor
The Modern Metadata Platform
Why can’t I find the right data?
DataHub
Transform
Podcast Episode
Supergrain
MetriQL
Podcast Episode
dbt
Podcast Interview
OpenMetadata
Podcast Interview
Pegasus Data Language
Modern Data Experience
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 13, 2021 • 42min
Building Auditable Spark Pipelines At Capital One
Summary
Spark is a powerful and battle-tested framework for building highly scalable data pipelines. Because of its proven ability to handle large volumes of data, Capital One has invested in it for their business needs. In this episode Gokul Prabagaren shares how he uses it to calculate rewards points, including the auditing requirements involved and how he designed his pipeline to maintain all of the necessary information through a pattern of data enrichment.
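As a minimal, hypothetical PySpark sketch of the filtering-versus-enrichment distinction discussed in the episode (the column names, eligibility rule, and storage paths are invented, not Capital One's actual logic), the enrichment variant keeps every transaction and tags it, so downstream aggregations and audits can see why each row was included or excluded.

```python
# Contrasting the filtering and enrichment patterns for segmenting transactions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enrichment-sketch").getOrCreate()
txns = spark.read.parquet("s3://example-bucket/transactions/")  # hypothetical path

# Filtering pattern: ineligible rows are dropped and lost to later auditing.
eligible_only = txns.filter(F.col("category") == "eligible")

# Enrichment pattern: every row is kept and tagged, so the full population
# remains available for auditing after rewards are computed.
enriched = txns.withColumn(
    "is_eligible", (F.col("category") == "eligible").cast("boolean")
).withColumn(
    "reward_points",
    F.when(F.col("is_eligible"), F.col("amount") * 0.02).otherwise(F.lit(0.0)),
)

enriched.write.mode("overwrite").parquet("s3://example-bucket/transactions_enriched/")
```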
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Gokul Prabagaren about how he is using Spark for real-world workflows at Capital One
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the types of data and workflows that you are responsible for at Capital One?
In terms of the three "V"s (Volume, Variety, Velocity), what is the magnitude of the data that you are working with?
What are some of the business and regulatory requirements that have to be factored into the solutions that you design?
Who are the consumers of the data assets that you are producing?
Can you describe the technical elements of the platform that you use for managing your data pipelines?
What are the various ways that you are using Spark at Capital One?
You wrote a post and presented at the Databricks conference about your experience moving from a data filtering to a data enrichment pattern for segmenting transactions. Can you give some context as to the use case and what your design process was for the initial implementation?
What were the shortcomings to that approach/business requirements which led you to refactoring the approach to one that maintained all of the data through the different processing stages?
What are some of the impacts on data volumes and processing latencies working with enriched data frames persisted between task steps?
What are some of the other optimizations or improvements that you have made to that pipeline since you wrote the post?
What are some of the limitations of Spark that you have experienced during your work at Capital One?
How have you worked around them?
What are the most interesting, innovative, or unexpected ways that you have seen Spark used at Capital One?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering at Capital One?
What are some of the upcoming projects that you are focused on/excited for?
How has your experience with the filtering vs. enrichment approach influenced your thinking on other projects that you work on?
Contact Info
@gocool_p on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Apache Spark
Blog Post
Databricks Presentation
Delta Lake
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 12, 2021 • 58min
Deliver Personal Experiences In Your Applications With The Unomi Open Source Customer Data Platform
Summary
The core of providing your users with excellent service is to understand them and offer a personalized experience. Unfortunately, many sites and applications take that to the extreme and collect too much information. To make it easier for developers to build customer profiles in a way that respects users’ privacy, Serge Huber helped to create the Apache Unomi framework as an open source customer data platform. In this episode he explains how it can be used to build rich and useful profiles of your users, the system architecture that powers it, and some of the ways that it is being integrated into an organization’s broader data ecosystem.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Serge Huber about Apache Unomi, an open source customer data platform designed to manage customer, lead, and visitor data and help personalize customer experiences
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Unomi is and the story behind it?
What are the goals and target use cases of Unomi?
What are the aspects of collecting and aggregating profile information that present challenges to developers?
How does the design of Unomi reduce that burden?
How does the focus of Unomi compare to systems such as Segment/Rudderstack or Optimizely for collecting user interactions and applying personalization?
How does Unomi fit in the architecture of an application or data infrastructure?
Can you describe how Unomi itself is architected?
How have the goals and design of the project changed or evolved since it started?
What are some of the most complex or challenging engineering projects that you have worked through?
Can you describe the workflow of using Unomi to manage a set of customer profiles?
What are some examples of user experience customization that you can build with Unomi?
What are some alternative architectures that you have seen to produce similar capabilities?
One of the interesting features of Unomi is the end-user profile management. What are some of the system and developer challenges that are introduced by that capability? (e.g. constraints on data manipulation, security, privacy concerns, etc.)
How does Unomi manage privacy concerns and GDPR compliance?
How does Unomi help with the new third-party data restrictions?
Why is access to raw data so important?
Could cloud providers offer Unomi as a service?
How have you used Unomi in your own work?
What are the most interesting, innovative, or unexpected ways that you have seen Unomi used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Unomi?
When is Unomi the wrong choice?
What do you have planned for the future of Unomi?
Contact Info
LinkedIn
@sergehuber on Twitter
@bhillou on Twitter
sergehuber on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Apache Unomi
Jahia
OASIS Open Foundation
Segment
Podcast Episode
Rudderstack
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 4, 2021 • 50min
Data Driven Hiring For Data Professionals With Alooba
Summary
Hiring data professionals is challenging for a multitude of reasons, and as with every interview process there is a potential for bias to creep in. Tim Freestone founded Alooba to provide a more stable reference point for evaluating candidates to ensure that you can make more informed comparisons based on their actual knowledge. In this episode he explains how Alooba got started, how it is being used in the interview process for data-oriented roles, and how it can also provide visibility into your organization’s overall data literacy. The whole process of hiring is an important organizational skill to cultivate, and this is an interesting exploration of the specific challenges involved in finding data professionals.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Tim Freestone about Alooba, an assessment platform for evaluating data and analytics candidates to improve hiring outcomes for data roles.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Alooba is and the story behind it?
What are the main goals that you are trying to achieve with Alooba?
What are the main challenges that employers and candidates face when navigating their respective roles in the hiring process?
What are some of the difficulties that are specific to data oriented roles?
What are some of the complexities involved in designing a user experience that is positive and productive for both candidates and companies?
What are some strategies that you have developed for establishing a fair and consistent baseline of skills to ensure consistent comparison across candidates?
One of the problems that comes from test-based skills assessment is the implicit bias toward candidates who test well. How do you work to mitigate that in the candidate evaluation process?
Can you describe how the Alooba platform itself is implemented?
How have the goals and design of the system changed or evolved since you first started it?
What are some of the ways that you use Alooba internally?
How do you stay up to date with the evolving skill requirements as roles change and new roles are created?
Beyond evaluation of candidates for hiring, what are some of the other features that you have added to Alooba to support organizations in their effort to gain value from their data?
What are the most interesting, innovative, or unexpected ways that you have seen Alooba used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Alooba?
When is Alooba the wrong choice?
What do you have planned for the future of Alooba?
Contact Info
LinkedIn
@timmyfreestone on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Alooba
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 4, 2021 • 58min
Experimentation and A/B Testing For Modern Data Teams With Eppo
Summary
A/B testing and experimentation are the most reliable way to determine whether a change to your product will have the desired effect on your business. Unfortunately, being able to design, deploy, and validate experiments is a complex process that requires a mix of technical capacity and organizational involvement which is hard to come by. Chetan Sharma founded Eppo to provide a system that organizations of every scale can use to reduce the burden of managing experiments so that you can focus on improving your business. In this episode he digs into the technical, statistical, and design requirements for running effective experiments and how he has architected the Eppo platform to make the process more accessible to business and data professionals.
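As a small, self-contained illustration of the frequentist analysis that sits underneath a basic A/B experiment (Eppo's actual statistical engine is considerably more sophisticated, and the conversion counts here are invented), the sketch below computes the absolute lift and two-sided p-value of a two-proportion z-test.

```python
# A two-proportion z-test on hypothetical conversion counts for variants A and B.
from math import sqrt, erfc

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, two-sided p-value) for conversion rates of A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability of the normal
    return p_b - p_a, p_value

# Hypothetical experiment: 10,000 users per variant.
lift, p = two_proportion_z_test(conv_a=1150, n_a=10000, conv_b=1240, n_b=10000)
print(f"absolute lift={lift:.3f}, p-value={p:.3f}")
```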
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Chetan Sharma about Eppo, a platform for building A/B experiments that are easier to manage
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Eppo is and the story behind it?
What are some examples of the kinds of experiments that teams and organizations might want to conduct?
What are the points of friction that teams encounter when trying to run experiments?
What are the steps involved in designing, deploying, and analyzing the outcomes of an A/B experiment?
What are some of the statistical errors that are common when conducting an experiment?
What are the design and UX principles that you have focused on in Eppo to improve the workflow of building and analyzing experiments?
Can you describe the system design of the Eppo platform?
What are the services or capabilities external to Eppo that are required for it to be effective?
What are the integration points for adding Eppo to an organization’s existing platform?
Beyond the technical capabilities for running experiments there are a number of design requirements involved. Can you talk through some of the decisions that need to be made when deciding what to change and how to measure its impact?
Another difficult element of managing experiments is understanding how they all interact with each other when running a large number of simultaneous tests. How does Eppo help with tracking the various experiments and the cohorts that are bucketed into each?
What are some of the ideas or assumptions that you had about the technical and design aspects of running experiments that have been challenged or changed while building Eppo?
What are the most interesting, innovative, or unexpected ways that you have seen Eppo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Eppo?
When is Eppo the wrong choice?
What do you have planned for the future of Eppo?
Contact Info
LinkedIn
@chesharma87 on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Eppo
Knowledge Repo
Apache Hive
Frequentist Statistics
Rudderstack
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast