
Data Engineering Podcast
This show goes behind the scenes of the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Latest episodes

Nov 27, 2021 • 59min
Creating A Unified Experience For The Modern Data Stack At Mozart Data
Summary
The modern data stack has been gaining a lot of attention recently with a rapidly growing set of managed services for different stages of the data lifecycle. With all of the available options it is possible to run a scalable, production-grade data platform with a small team, but there are still sharp edges and integration challenges to work through. Peter Fishman and Dan Silberman experienced these difficulties firsthand and created Mozart Data to provide a single, easy-to-use option for getting started with the modern data stack. In this episode they explain how they designed a user experience that makes working with data accessible to organizations without a data team, while still allowing more advanced users to build out complex workflows. They also share their thoughts on the modern data ecosystem and how it improves the availability of analytics for companies of all sizes.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Peter Fishman and Dan Silberman about Mozart Data and how they are building a unified experience for the modern data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Mozart Data is and the story behind it?
The promise of the "modern data stack" is that it’s all delivered as a service to make it easier to set up. What are the missing pieces that make something like Mozart necessary?
What are the main workflows or industries that you are focusing on?
Who are the main personas that you are building Mozart for?
How has that combination of user persona and industry focus informed your decisions around feature priorities and user experience?
Can you describe how you have architected the Mozart platform?
How have you approached the build vs. buy decision internally?
What are some of the most interesting or challenging engineering projects that you have had to work on while building Mozart?
What are the stages of the data lifecycle that you work the hardest to automate, and which do you focus on exposing to customers?
What are the edge cases in what customers might try to do in the bounds of Mozart, or areas where you have explicitly decided not to include in your features?
What are the options for extensibility, or custom engineering when customers encounter those situations?
What do you see as the next phase in the evolution of the data stack?
What are the most interesting, innovative, or unexpected ways that you have seen Mozart used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Mozart?
When is Mozart the wrong choice?
What do you have planned for the future of Mozart?
Contact Info
Peter
LinkedIn
@peterfishman on Twitter
Dan
LinkedIn
silberman on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Mozart Data
Modern Data Stack
Mode Analytics
Fivetran
Podcast Episode
Snowflake
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 27, 2021 • 59min
Doing DataOps For External Data Sources As A Service at Demyst
Summary
The data that you have access to affects the questions that you can answer. By using external data sources you can drastically increase the range of analysis that is available to your organization. The challenge comes in all of the operational aspects of finding, accessing, organizing, and serving that data. In this episode Mark Hookey discusses how he and his team at Demyst do all of the DataOps for external data sources so that you don’t have to, including the systems necessary to organize and catalog the various collections that they host, the various serving layers to provide query interfaces that match your platform, and the utility of having a single place to access a multitude of information. If you are having trouble answering questions for your business with the data that you generate and collect internally, then it is definitely worthwhile to explore the information available from external sources.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Mark Hookey about Demyst Data, a platform for operationalizing external data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Demyst is and the story behind it?
What are the services and systems that you provide for organizations to incorporate external sources in their data workflows?
Who are your target customers?
What are some examples of data sets that an organization might want to use in their analytics?
How are these different from SaaS data that an organization might integrate with tools such as Stitch and Fivetran?
What are some of the challenges that are introduced by working with these external data sets?
If an organization isn’t using Demyst what are some of the technical and organizational systems that they will need to build and manage?
Can you describe how the Demyst platform is architected?
What have been the most complex or difficult engineering challenges that you have dealt with while building Demyst?
Given the wide variance in the systems that your customers are running, what are some strategies that you have used to provide flexible APIs for accessing the underlying information?
What is the process for you to identify and onboard a new data source in your platform?
What are some of the additional analytical systems that you have to run to manage your business (e.g. usage metering and analytics, etc.)?
What are the most interesting, innovative, or unexpected ways that you have seen Demyst used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Demyst?
When is Demyst the wrong choice?
What do you have planned for the future of Demyst?
Contact Info
LinkedIn
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Demyst Data
LexisNexis
AWS Athena
DataRobot
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 20, 2021 • 1h 5min
Laying The Foundation Of Your Data Platform For The Era Of Big Complexity With Dagster
Summary
The technology for scaling storage and processing of data has gone through massive evolution over the past decade, leaving us with the ability to work with massive datasets at the cost of massive complexity. Nick Schrock created the Dagster framework to help tame that complexity and scale the organizational capacity for working with data. In this episode he shares the journey that he and his team at Elementl have taken to understand the state of the ecosystem and how they can provide a foundational layer for a holistic data platform.
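For listeners who want a concrete feel for the ops/jobs/graphs model that comes up in the interview, below is a minimal sketch using the open source dagster package (0.13+). It is an illustration only, not code from the episode, and the step names and data are invented.

```python
# A minimal, illustrative sketch of Dagster's ops/jobs API (0.13+);
# the step names and data here are made up for the example.
from dagster import job, op


@op
def fetch_events():
    # Stand-in for an extraction step (e.g. pulling rows from an API or table).
    return [{"user_id": 1, "action": "signup"}, {"user_id": 2, "action": "login"}]


@op
def count_events(events):
    # Stand-in for a transformation step.
    return len(events)


@job
def event_pipeline():
    # The graph is defined by composing ops; Dagster tracks the dependencies.
    count_events(fetch_events())


if __name__ == "__main__":
    # Runs the job in-process, which is convenient for local testing.
    result = event_pipeline.execute_in_process()
    print(result.success)
```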
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform and blazing fast NVMe storage there’s nothing slowing you down. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Nick Schrock about the evolution of Dagster and its path forward
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Dagster is and the story behind it?
How has the project and community changed/evolved since we last spoke 2 years ago?
How has the experience of the past 2 years clarified the challenges and opportunities that exist in the data ecosystem?
What do you see as the foundational vs transient complexities that are germane to the industry?
One of the emerging ideas in Dagster is the "software defined data asset" as the central entity in the framework. How has that shifted the way that engineers approach pipeline design and composition?
How did that conceptual shift inform the accompanying refactor of the core principles in the framework? (jobs, ops, graphs)
One of the powerful elements of the Dagster framework is the investment in rich metadata as a foundational principle. What are the opportunities for integrating and extending that context throughout the rest of an organization's data platform?
What do you see as the potential for efforts such as OpenLineage and OpenMetadata to allow for other components in the data platform to create and propagate that context more freely?
What are some of the project architecture/repository structure/pipeline composition patterns that have begun to form in the community and your own internal work with Dagster?
What are some of the anti-patterns that you have seen users fall into when working with Dagster?
Along with your recent refactoring of the core API you have also started to roll out the Dagster Cloud offering. What was your process for determining the path to commercialization for the Dagster project and community?
How are you managing governance and long-term viability of the open source elements of Dagster?
What are your design principles for deciding the boundaries between OSS and commercial features?
What do you see as the role of Dagster in the creation of a data platform architecture?
What are the opportunities that it creates for data platform engineers?
What is your perspective on the tradeoffs of pipelines as software vs. pipelines as "code" vs. low/no-code pipelines?
What (if any) option do you see for language agnostic/multi-language pipeline definitions in Dagster?
What do you see as the biggest threats to the future success of Dagster/Elementl?
You were a relative outsider to the data ecosystem when you first started Dagster/Elementl. What have been the most interesting and surprising experiences as you have invested your time and energy in contributing to the community?
What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dagster?
When is Dagster the wrong choice?
What do you have planned for the future of Dagster?
Contact Info
LinkedIn
@schrockn on Twitter
schrockn on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Elementl
Series A Announcement
Video on software-defined assets
Dagster
Podcast Episode
GraphQL
dbt
Podcast Episode
Open Source Data Stack Conference
Meltano
Podcast Episode
Amundsen
Podcast Episode
DataHub
Podcast Episode
Hashicorp
Vercel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 20, 2021 • 53min
Exploring Processing Patterns For Streaming Data Integration In Your Data Lake
Summary
One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in streaming engines it is now possible to perform all of your data integration in near real time, but it can be challenging to understand the proper processing patterns to make that performant. In this episode Ori Rafael shares his experiences at Upsolver building scalable stream processing for integrating and analyzing data, and the tradeoffs involved when coming from a batch-oriented mindset.
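As a rough, hedged illustration of near-real-time ingestion into a data lake, the sketch below uses Spark Structured Streaming, one of the engines linked in the show notes. It is not Upsolver's product; the broker address, topic, schema, and paths are all hypothetical.

```python
# A rough sketch (not Upsolver's product) of streaming ingestion into a data lake
# with Spark Structured Streaming. Broker address, topic, and paths are made up,
# and the Kafka connector package must be available on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-lake").getOrCreate()

# Read an unbounded stream of events from a (hypothetical) Kafka topic.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Continuously append micro-batches of raw events to the lake as Parquet files,
# keeping the lake close to up to date without a separate batch load.
query = (
    events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .writeStream
    .format("parquet")
    .option("path", "s3a://example-lake/raw/events/")
    .option("checkpointLocation", "s3a://example-lake/_checkpoints/events/")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```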
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Ori Rafael about strategies for building stream and batch processing patterns for data lake analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the state of the market for data lakes today?
What are the prevailing architectural and technological patterns that are being used to manage these systems?
Batch and streaming systems have been used in various combinations since the early days of Hadoop. The Lambda architecture has largely been abandoned, so what is the answer for today’s data lakes?
What are the challenges presented by streaming approaches to data transformations?
The batch model for processing is intuitive despite its latency problems. What are the benefits that it provides?
The core concept for data orchestration is the DAG. How does that manifest in a streaming context?
In batch processing idempotent/immutable datasets are created by re-running the entire pipeline when logic changes need to be made. Given that there is no definitive start or end of a stream, what are the options for amending logical errors in transformations?
What are some of the data processing/integration patterns that are impossible in a batch system?
What are some useful strategies for migrating from a purely batch, or hybrid batch and streaming architecture, to a purely streaming system?
What are some of the changes in technological or organizational patterns that are often overlooked or misunderstood in this shift?
What are some of the most surprising things that you have learned about streaming systems in your time at Upsolver?
What are the most interesting, innovative, or unexpected ways that you have seen streaming architectures used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on streaming data integration?
When are streaming architectures the wrong approach?
What do you have planned for the future of Upsolver to make streaming data easier to work with?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Upsolver
Hive Metastore
Hudi
Podcast Episode
Iceberg
Podcast Episode
Hadoop
Lambda Architecture
Kappa Architecture
Apache Beam
Event Sourcing
Flink
Podcast Episode
Spark Structured Streaming
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 14, 2021 • 59min
Data Quality Starts At The Source
Summary
The most important gauge of success for a data platform is the level of trust in the accuracy of the information that it provides. In order to build and maintain that trust it is necessary to invest in defining, monitoring, and enforcing data quality metrics. In this episode Michael Harper advocates for proactive data quality that starts at the source, rather than being reactive and having to work backward from the point where a problem is found.
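As a small example of the "start at the source" idea, the sketch below shows what a pre-ingestion validation could look like with the open source Great Expectations library (linked below). This is an illustration using assumed column names, not a prescription from the episode.

```python
# A small sketch of validating data before it enters the pipeline, using
# Great Expectations; the dataset and column names here are invented.
import pandas as pd
import great_expectations as ge

raw = pd.DataFrame(
    {"order_id": [1, 2, 3, None], "amount": [10.0, 25.5, -4.0, 12.0]}
)

dataset = ge.from_pandas(raw)
dataset.expect_column_values_to_not_be_null("order_id")
dataset.expect_column_values_to_be_between("amount", min_value=0)

results = dataset.validate()
if not results.success:
    # Fail fast at the source instead of debugging a broken dashboard later.
    raise ValueError("Source data failed quality checks")
```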
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Michael Harper about definitions of data quality and where to define and enforce it in the data platform
Interview
Introduction
How did you get involved in the area of data management?
What is your definition for the term "data quality" and what are the implied goals that it embodies?
What are some ways that different stakeholders and participants in the data lifecycle might disagree about the definitions and manifestations of data quality?
The market for "data quality tools" has been growing and gaining attention recently. How would you categorize the different approaches taken by open source and commercial options in the ecosystem?
What are the tradeoffs that you see in each approach? (e.g. data warehouse as a chokepoint vs quality checks on extract)
What are the difficulties that engineers and stakeholders encounter when identifying and defining information that is necessary to identify issues in their workflows?
Can you describe some examples of adding data quality checks to the beginning stages of a data workflow and the kinds of issues that can be identified?
What are some ways that quality and observability metrics can be aggregated across multiple pipeline stages to identify more complex issues?
In application observability the metrics across multiple processes are often associated with a given service. What is the equivalent concept in data platform observability?
In your work at Databand what are some of the ways that your ideas and assumptions around data quality have been challenged or changed?
What are the most interesting, innovative, or unexpected ways that you have seen Databand used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working at Databand?
When is Databand the wrong choice?
What do you have planned for the future of Databand?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Databand
Clean Architecture (affiliate link)
Great Expectations
Deequ
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 10, 2021 • 1h 7min
Eliminate Friction In Your Data Platform Through Unified Metadata Using OpenMetadata
Summary
A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. After experiencing the impacts of fragmented metadata and previous attempts at building a solution, Suresh Srinivas and Sriharsha Chintalapani created the OpenMetadata project. In this episode they share the lessons that they have learned through their previous attempts and the positive impact that a unified metadata layer had during their time at Uber. They also explain how the OpenMetadata project is aiming to be a common standard for defining and storing metadata for every use case in data platforms and the ways that they are architecting the reference implementation to simplify its adoption. This is an ambitious and exciting project, so listen and try it out today.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Sriharsha Chintalapani and Suresh Srinivas about OpenMetadata, an open standard for metadata and a reference implementation for a central metadata store
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the OpenMetadata project is and the story behind it?
What are the goals of the project?
What are the common challenges faced by engineers and data practitioners in organizing the metadata for their systems?
What are the capabilities that a centralized and holistic view of a platform’s metadata can enable?
How would you characterize the current state and progress on the open source initiative around OpenMetadata?
How does OpenMetadata compare to the OpenLineage project and other similar systems?
What opportunities do you see for collaborating with or learning from their efforts?
What are the schema elements that you have identified as critical to a holistic view of an organization’s metadata?
For an organization with an existing data platform, what is the role that OpenMetadata plays, and what are the points of integration across the different components?
Can you describe the implementation of the OpenMetadata architecture?
What are the user experience and operational characteristics that you are trying to optimize for as you iterate on the project?
What are the challenges that you face in balancing the generality and specificity of the core schemas for metadata objects?
There are a large and growing number of businesses that create systems on top of an organization's metadata in the form of catalogs, observability, governance, data quality, etc. What do you see as the role of the OpenMetadata project across that ecosystem of products?
How has your perspective on the domain of metadata management and the associated challenges changed or evolved as you have been working on this project?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenMetadata?
When is OpenMetadata the wrong choice?
What do you have planned for the future of OpenMetadata?
Contact Info
Suresh
LinkedIn
@suresh_m_s on Twitter
sureshms on GitHub
Sriharsha
LinkedIn
harshach on GitHub
@d3fmacro on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
OpenMetadata
Apache Storm
Apache Kafka
Hortonworks
Apache Atlas
OpenMetadata Sandbox
OpenLineage
Podcast Episode
Egeria
JSON Schema
Amundsen
Podcast Episode
DataHub
Podcast Episode
JanusGraph
Titan Graph Database
HBase
Jetty
DropWizard
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 6, 2021 • 1h 2min
Business Intelligence Beyond The Dashboard With ClicData
Summary
Business intelligence is often equated with a collection of dashboards that show various charts and graphs representing data for an organization. What is overlooked in that characterization is the level of complexity and effort that are required to collect and present that information, and the opportunities for providing those insights in other contexts. In this episode Telmo Silva explains how he co-founded ClicData to bring full featured business intelligence and reporting to every organization without having to build and maintain that capability on their own. This is a great conversation about the technical and organizational operations involved in building a comprehensive business intelligence system and the current state of the market.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Telmo Silva about ClicData, a full-featured business intelligence and reporting platform delivered as a managed service
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what ClicData is and the story behind it?
How would you characterize the current state of the market for business intelligence?
What are the systems/capabilities that are required to run a full-featured BI system?
What are the challenges that businesses face in developing in-house capacity for business intelligence?
Can you describe how the ClicData platform is architected?
How has it changed or evolved since you first began working on it?
How are you approaching schema design and evolution in the storage layer?
How do you handle questions of data security/privacy/regulations given that you are storing the information on behalf of the business?
In your work with clients what are some of the challenges that businesses are facing when attempting to answer questions and gain insights from their data in a repeatable fashion?
What are some strategies that you have found useful for structuring schemas or dashboards to make iterative exploration of data effective?
What are the most interesting, innovative, or unexpected ways that you have seen ClicData used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on ClicData?
When is ClicData the wrong choice?
What do you have planned for the future of ClicData?
Contact Info
LinkedIn
@telmo_clicdata on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
ClicData
Tableau
Superset
Podcast Episode
Pentaho
D3.js
Informatica
Talend
TIBCO Spotfire
Looker
Podcast Episode
Bullet Chart
PostgreSQL
Podcast Episode
Azure
Crystal Reports
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 5, 2021 • 1h 2min
Exploring The Evolution And Adoption of Customer Data Platforms and Reverse ETL
Summary
The precursor to widespread adoption of cloud data warehouses was the creation of customer data platforms. Acting as a centralized repository of information about how your customers interact with your organization, they drove a wave of analytics about how to improve products based on actual usage data. A natural outgrowth of that capability is the more recent growth of reverse ETL systems that use those analytics to feed back into the operational systems used to engage with the customer. In this episode Tejas Manohar and Rachel Bradley-Haas share the story of their own careers and experiences coinciding with these trends. They also discuss the current state of the market for these technological patterns and how to take advantage of them in your own work.
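To make the reverse ETL pattern concrete, here is a minimal, hedged sketch of the idea: read an aggregate from the warehouse and push it back into an operational tool. It is not Hightouch's API; the connection string, table, endpoint, and field names are all hypothetical.

```python
# A minimal sketch of the reverse ETL idea (not Hightouch's API): query the
# warehouse, then push the result into an operational tool. The DSN, table,
# endpoint, and field names below are all hypothetical.
import os

import psycopg2  # stand-in warehouse driver; swap in your warehouse's client
import requests

conn = psycopg2.connect(os.environ["WAREHOUSE_DSN"])
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT email, lifetime_value
        FROM analytics.customer_summary
        """
    )
    rows = cur.fetchall()

for email, lifetime_value in rows:
    # Push each enriched record back to a (hypothetical) CRM endpoint so the
    # teams working in that tool see the same numbers as the warehouse.
    requests.post(
        "https://crm.example.com/api/contacts",
        json={"email": email, "lifetime_value": float(lifetime_value)},
        timeout=10,
    )
```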
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Go to dataengineeringpodcast.com/montecarlo and start trusting your data with Monte Carlo today!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Rachel Bradley-Haas and Tejas Manohar about the combination of operational analytics and the customer data platform
Interview
Introduction
How did you get involved in the area of data management?
Can we start by discussing what it means to have a "customer data platform"?
What are the challenges that organizations face in establishing a unified view of their customer interactions?
How do the presence of multiple product lines impact the ability to understand the relationship with the customer?
We have been building data warehouses and business intelligence systems for decades. How does the idea of a CDP differ from the approaches of those previous generations?
A recent outgrowth of the focus on creating a CDP is the introduction of "operational analytics", which was initially termed "reverse ETL". What are your opinions on the semantics and importance of these names?
What is the relationship between a CDP and operational analytics? (can you have one without the other?)
How have the capabilities of operational analytics systems changed or evolved in the past couple of years?
What new use cases or capabilities have been unlocked as a result of these changes?
What are the opportunities over the medium to long term for operational analytics and customer data platforms?
What are the most interesting, innovative, or unexpected ways that you have seen operational analytics and CDPs used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on operational analytics?
When is a CDP the wrong choice?
What other industry trends are you keeping an eye on? What do you anticipate will be the next breakout product category?
Contact Info
Rachel
LinkedIn
Tejas
LinkedIn
@tejasmanohar on Twitter
tejasmanohar on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Big-Time Data
Hightouch
Podcast Episode
Segment
Podcast Episode
Customer Data Platform
Treasure Data
Rudderstack
Airflow
DBT Cloud
Fivetran
Podcast Episode
Stitch
PLG == Product Led Growth
ABM == Account Based Marketing
Materialize
Podcast Episode
Transform
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 29, 2021 • 1h 9min
Removing The Barrier To Exploratory Analytics with Activity Schema and Narrator
Summary
The perennial question of data warehousing is how to model the information that you are storing. This has given rise to methods as varied as star and snowflake schemas, data vault modeling, and wide tables. The challenge with many of those approaches is that they are optimized for answering known questions but brittle and cumbersome when exploring unknowns. In this episode Ahmed Elsamadisi shares his journey to find a more flexible and universal data model in the form of the "activity schema" that is powering the Narrator platform, and how it has allowed his customers to perform self-service exploration of their business domains without being blocked by schema evolution in the data warehouse. This is a fascinating exploration of what can be done when you challenge your assumptions about what is possible.
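For a rough sense of what an activity schema can look like in practice, the sketch below builds a single time-series activity table and answers a question by querying activities instead of evolving a dimensional model. The column set is simplified and illustrative, not the exact Narrator specification.

```python
# A simplified sketch of the activity schema idea: every customer action lands
# in one append-only, time-series table, and questions are answered by querying
# activities rather than reworking a star schema. Column names are illustrative,
# not the exact Narrator specification.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE activity_stream (
        activity_id TEXT,
        ts          TEXT,
        customer    TEXT,
        activity    TEXT,
        feature_1   TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO activity_stream VALUES (?, ?, ?, ?, ?)",
    [
        ("a1", "2021-10-01T10:00:00", "alice@example.com", "visited_site", "/pricing"),
        ("a2", "2021-10-01T10:05:00", "alice@example.com", "started_trial", "pro"),
        ("a3", "2021-10-02T09:00:00", "bob@example.com", "visited_site", "/docs"),
    ],
)

# How many customers who visited the site went on to start a trial afterwards?
cursor = conn.execute(
    """
    SELECT COUNT(DISTINCT v.customer)
    FROM activity_stream v
    JOIN activity_stream t
      ON t.customer = v.customer
     AND t.activity = 'started_trial'
     AND t.ts > v.ts
    WHERE v.activity = 'visited_site'
    """
)
print(cursor.fetchone()[0])
```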
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Ahmed Elsamadisi about Narrator, a platform to enable anyone to go from question to data-driven decision in minutes
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Narrator is and the story behind it?
What are the challenges that you have seen organizations encounter when attempting to make analytics a self-serve capability?
What are the use cases that you are focused on?
How does Narrator fit within the data workflows of an organization?
How is the Narrator platform implemented?
How has the design and focus of the technology evolved since you first started working on Narrator?
The core element of the analyses that you are building is the "activity schema". Can you describe the design process that led you to that format?
What are the challenges that are posed by more widely used modeling techniques such as star/snowflake or data vault?
How does the activity schema address those challenges?
What are the performance characteristics of deriving models from an activity schema/timeseries table?
For someone who wants to use Narrator, what is involved in transforming their data to map into the activity schema?
Can you talk through the domain modeling that needs to happen when determining what entities and actions to capture?
What are the most interesting, innovative, or unexpected ways that you have seen Narrator used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Narrator?
When is Narrator the wrong choice?
What do you have planned for the future of Narrator?
Contact Info
LinkedIn
@ae4ai on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Narrator
DARPA Challenge
Fivetran
Luigi
Chartio
Airflow
Domain Driven Design
Data Vault
Snowflake Schema
Event Sourcing
Census
Podcast Episode
Hightouch
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 29, 2021 • 1h 10min
Streaming Data Pipelines Made SQL With Decodable
Summary
Streaming data systems have been growing more capable and flexible over the past few years. Despite this, it is still challenging to build reliable pipelines for stream processing. In this episode Eric Sammer discusses the shortcomings of the current set of streaming engines and how they force engineers to work at an extremely low level of abstraction. He also explains why he started Decodable to address that limitation and the work that he and his team have done to let data engineers build streaming pipelines entirely in SQL.
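To give a general flavor of expressing a streaming pipeline entirely in SQL, here is a hedged sketch using Apache Flink's SQL interface through PyFlink. It is a generic illustration, not Decodable's actual interface, and the source is synthetic so the snippet is self-contained.

```python
# A generic illustration of a streaming pipeline written entirely in SQL, using
# Apache Flink's Python Table API (pyflink) -- not Decodable's actual interface.
# The table and column names are made up, and the source generates synthetic data.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A synthetic, unbounded source so the example runs without external systems.
t_env.execute_sql(
    """
    CREATE TABLE clicks (
        user_id BIGINT,
        url STRING,
        proc_time AS PROCTIME()
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
    """
)

# The transformation itself is just SQL over the stream; this prints a
# continuously updating count per user until the process is interrupted.
t_env.execute_sql(
    """
    SELECT user_id, COUNT(*) AS clicks
    FROM clicks
    GROUP BY user_id
    """
).print()
```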
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Eric Sammer about Decodable, a platform for simplifying the work of building real-time data pipelines
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Decodable is and the story behind it?
Who are the target users, and how has that focus informed your prioritization of features at launch?
What are the complexities that data engineers encounter when building pipelines on streaming systems?
What are the distributed systems concepts and design optimizations that are often skipped over or misunderstood by engineers who are using them? (e.g. backpressure, exactly once semantics, isolation levels, etc.)
How do those mismatches in understanding and expectation impact the correctness and reliability of the workflows that they are building?
Can you describe how you have architected the Decodable platform?
What have been the most complex or time consuming engineering challenges that you have dealt with so far?
What are the points of integration that you expose for engineers to wire in their existing infrastructure and data systems?
What has been your process for designing the interfaces and abstractions that you are exposing to end users?
What are some of the leaks in those abstractions that have either started to show or are anticipated?
What have you learned about the state of data engineering and the costs and benefits of real-time data while working on Decodable?
What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
When is Decodable the wrong choice?
What do you have planned for the future of Decodable?
Contact Info
esammer on GitHub
@esammer on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Decodable
Cloudera
Kafka
Flink
Podcast Episode
Spark
Snowflake
Podcast Episode
BigQuery
RedShift
kSQLDB
Podcast Episode
dbt
Podcast Episode
Millwheel Paper
Dremel Paper
Timely Dataflow
Materialize
Podcast Episode
Software Defined Networking
Data Mesh
Podcast Episode
OpenLineage
Podcast Episode
DataHub
Podcast Episode
Amundsen
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast