

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Oct 30, 2022 • 40min
Analytics Engineering Without The Friction Of Complex Pipeline Development With Optimus and dbt
Summary
One of the most impactful technologies for data analytics in recent years has been dbt. It’s hard to have a conversation about data engineering or analysis without mentioning it. Despite its widespread adoption, there are still rough edges in its workflow that cause friction for data analysts. To help simplify the adoption and management of dbt projects, Nandam Karthik helped create Optimus. In this episode he shares his experiences working with organizations to adopt analytics engineering patterns and the ways that Optimus and dbt were combined to let data analysts deliver insights without the roadblocks of complex pipeline management.
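For a concrete picture of the workflow being discussed, here is a minimal sketch of the kind of dbt model an analytics engineer would own. It assumes a Spark-backed dbt adapter (dbt 1.3+ Python models), and the model and column names are hypothetical:

```python
# Hypothetical dbt Python model, e.g. models/customer_order_totals.py.
# dbt.ref() resolves the upstream model, just like {{ ref() }} in a SQL
# model; dbt handles materialization and dependency ordering.
def model(dbt, session):
    orders = dbt.ref("stg_orders")  # assumed upstream staging model
    # The "T" in ELT: aggregate inside the warehouse/lakehouse engine.
    return orders.groupBy("customer_id").agg({"order_total": "sum"})
```

Tools like Optimus aim to handle the project scaffolding and pipeline management around models like this one, so that analysts only write the transformation itself.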
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often take hours, days, or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Nandam Karthik about his experiences building analytics projects with dbt and Optimus for his clients at Sigmoid.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Sigmoid is and the types of projects that you are involved in?
What are some of the core challenges that your clients are facing when they start working with you?
An ELT workflow with dbt as the transformation utility has become a popular pattern for building analytics systems. Can you share some examples of projects that you have built with this approach?
What are some of the ways that this pattern becomes bespoke as you start exploring a project more deeply?
What are the sharp edges/white spaces that you encountered across those projects?
Can you describe what Optimus is?
How does Optimus improve the user experience of teams working in dbt?
What are some of the tactical/organizational practices that you have found most helpful when building with dbt and Optimus?
What are the most interesting, innovative, or unexpected ways that you have seen Optimus/dbt used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbt/Optimus projects?
When is Optimus/dbt the wrong choice?
What are your predictions for how "best practices" for analytics projects will change/evolve in the near/medium term?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Sigmoid
Optimus
dbt
Podcast Episode
Airflow
AWS Glue
BigQuery
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 23, 2022 • 1h 12min
How To Bring Agile Practices To Your Data Projects
Summary
Agile methodologies have been adopted by a majority of teams for building software applications. Applying those same practices to data can prove challenging due to the number of systems that need to be included to implement a complete feature. In this episode Shane Gibson draws on his years of experience as a consultant and engineer working in data to share practical advice on adopting agile principles in your data work, so that you can move faster and provide more value to the business while building systems that are maintainable and adaptable.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Shane Gibson about how to bring Agile practices to your data management workflows
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what AgileData is and the story behind it?
What are the main industries and/or use cases that you are focused on supporting?
The data ecosystem has been trying on different paradigms from software development for some time now (e.g. DataOps, version control, etc.). What are the aspects of Agile that do and don’t map well to data engineering/analysis?
One of the perennial challenges of data analysis is how to approach data modeling. How do you balance the need to provide value with the long-term impacts of incomplete or underinformed modeling decisions made in haste at the beginning of a project?
How do you design in affordances for refactoring of the data models without breaking downstream assets?
Another aspect of implementing data products/platforms is how to manage permissions and governance. What are the incremental ways that those principles can be incorporated early and evolved along with the overall analytical products?
What are some of the organizational design strategies that you find most helpful when establishing or training a team who is working on data products?
In order to have a useful target to work toward it’s necessary to understand what the data consumers are hoping to achieve. What are some of the challenges of doing requirements gathering for data products? (e.g. not knowing what information is available, consumers not understanding what’s hard vs. easy, etc.)
How do you work with the "customers" to help them understand what a reasonable scope is and translate that to the actual project stages for the engineers?
What are some of the perennial questions or points of confusion that you have had to address with your clients on how to design and implement analytical assets?
What are the most interesting, innovative, or unexpected ways that you have seen agile principles used for data?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AgileData?
When is agile the wrong choice for a data project?
What do you have planned for the future of AgileData?
Contact Info
LinkedIn
@shagility on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
AgileData
OptimalBI
How To Make Toast
Data Mesh
Information Product Canvas
DataKitchen
Podcast Episode
Great Expectations
Podcast Episode
Soda Data
Podcast Episode
Google DataStore
Unfix.work
Activity Schema
Podcast Episode
Data Vault
Podcast Episode
Star Schema
Lean Methodology
Scrum
Kanban
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 23, 2022 • 52min
Going From Transactional To Analytical And Self-managed To Cloud On One Database With MariaDB
Summary
The database market has seen unprecedented activity in recent years, with new options addressing a variety of needs being introduced on a nearly constant basis. Despite that, there are a handful of databases that continue to be adopted due to their proven reliability and robust features. MariaDB is one of those default options that has continued to grow and innovate while offering a familiar and stable experience. In this episode, field CTO Manjot Singh shares his experiences as an early user of MySQL and MariaDB and explains how the suite of products built on top of the open source foundation addresses the growing need for advanced storage and analytical capabilities.
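As a hedged sketch of that transactional-to-analytical range, the following assumes a running MariaDB server with the ColumnStore plugin installed, the MariaDB Connector/Python package, and illustrative credentials and table names; pluggable per-table storage engines are the mechanism at work:

```python
import mariadb  # MariaDB Connector/Python (pip install mariadb)

# Connection details are illustrative.
conn = mariadb.connect(user="app", password="secret", database="shop")
cur = conn.cursor()

# Transactional workload: the default InnoDB engine.
cur.execute(
    "CREATE TABLE orders (id INT PRIMARY KEY, total DECIMAL(10,2)) ENGINE=InnoDB"
)
# Analytical workload: the columnar ColumnStore engine (requires the plugin),
# created and queried with the same SQL dialect.
cur.execute(
    "CREATE TABLE order_history (id INT, total DECIMAL(10,2)) ENGINE=ColumnStore"
)
conn.commit()
```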
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Manjot Singh about MariaDB, one of the leading open source database engines
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what MariaDB is and the story behind it?
MariaDB started as a fork of the MySQL engine. What are the notable differences that have evolved between the two projects?
How have the MariaDB team worked to maintain compatibility for users who want to switch from MySQL?
What are the unique capabilities that MariaDB offers?
Beyond the core open source project you have built a suite of commercial extensions. What are the use cases/capabilities that you are targeting with those products?
How do you balance the time and effort invested in the open source engine against the commercial projects to ensure that the overall effort is sustainable?
What are your guidelines for what features and capabilities are released in the community edition and which are more suited to the commercial products?
For your managed cloud service, what are the differentiating factors for that versus the database services provided by the major cloud platforms?
What do you see as the future of the database market and how we interact and integrate with them?
What are the most interesting, innovative, or unexpected ways that you have seen MariaDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on MariaDB?
When is MariaDB the wrong choice?
What do you have planned for the future of MariaDB?
Contact Info
LinkedIn
@ManjotSingh on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
MariaDB
HTML Goodies
MySQL
PHP
MySQL/MariaDB Pluggable Storage
InnoDB
MyISAM
Aria Storage
SQL/PSM
MyRocks
MariaDB XPand
BSL == Business Source License
Paxos
MariaDB MongoDB Compatibility
Vertica
MariaDB Spider Storage Engine
IHME == Institute for Health Metrics and Evaluation
Rundeck
MaxScale
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 16, 2022 • 1h 3min
Speeding Up The Time To Insight For Supply Chains And Logistics With The Pathway Database That Thinks
Summary
Logistics and supply chains have come under increased stress and scrutiny in recent years. In order to stay ahead of customer demands, businesses need to be able to react quickly and intelligently to changes, which requires fast and accurate insights into their operations. Pathway is a streaming database engine that embeds artificial intelligence into the storage layer, with functionality designed to support the spatiotemporal data that is crucial for shipping and logistics. In this episode Adrian Kosowski explains how the Pathway product got started, how its design simplifies the creation of data products that support supply chain operations, and how developers can help to build an ecosystem of applications that allow businesses to accelerate their time to insight.
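The links below mention Dynamic Time Warping, one building block for comparing spatiotemporal traces that are sampled at different rates. As a rough, self-contained sketch (this is textbook DTW, not Pathway’s API):

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW for 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a step in a
                                 cost[i][j - 1],      # skip a step in b
                                 cost[i - 1][j - 1])  # align the two steps
    return cost[n][m]

# Two shipment ETA traces sampled at different rates (illustrative values).
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # -> 0.0
```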
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Adrian Kosowski about Pathway, an AI-powered database and streaming framework used for analyzing and optimizing supply chains and logistics in real time.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Pathway is and the story behind it?
What are the primary challenges that you are working to solve?
Who are the target users of the Pathway product and how does it fit into their work?
Your tagline is that Pathway is "the database that thinks". What are some of the ways that existing database and stream-processing architectures introduce friction on the path to analysis?
How does Pathway incorporate computational capabilities into its engine to address those challenges?
What are the types of data that Pathway is designed to work with?
Can you describe how the Pathway engine is implemented?
What are some of the ways that the design and goals of the product have shifted since you started working on it?
What are some of the ways that Pathway can be integrated into an analytical system?
What is involved in adapting its capabilities to different industries?
What are the most interesting, innovative, or unexpected ways that you have seen Pathway used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pathway?
When is Pathway the wrong choice?
What do you have planned for the future of Pathway?
Contact Info
Adrian Kosowski
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Pathway
Pathway for developers
SPOJ.com – competitive programming community
Spatiotemporal Data
Pointers in programming
Clustering
The Halting Problem
Pytorch
Podcast.__init__ Episode
Tensorflow
Markov Chains
NetworkX
Finite State Machine
DTW == Dynamic Time Warping
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 16, 2022 • 51min
An Exploration Of The Open Data Lakehouse And Dremio's Contribution To The Ecosystem
Summary
The "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be "open" and describes the different components that the Dremio team build and contribute to.
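A small, hedged illustration of what "open" buys you: data written once in an open format such as Parquet can be read by any engine (Dremio, Spark, DuckDB, and so on). This sketch assumes the pyarrow package; the file name and columns are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny table to an open columnar format.
table = pa.table({"order_id": [1, 2, 3], "total": [9.99, 24.50, 5.00]})
pq.write_table(table, "orders.parquet")

# Any Parquet-aware engine can now read the same file; here, pyarrow itself.
print(pq.read_table("orders.parquet").to_pydict())
```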
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Jason Hughes about the work that Dremio is doing to support the open lakehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Dremio is and the story behind it?
What are some of the notable changes in the Dremio product and related ecosystem over the past ~4 years?
How has the advent of the lakehouse paradigm influenced the product direction?
What are the main benefits that a lakehouse design offers to a data platform?
What are some of the architectural patterns that are only possible with a lakehouse?
What is the distinction you make between a lakehouse and an open lakehouse?
What are some of the unique features that Dremio offers for lakehouse implementations?
What are some of the investments that Dremio has made to the broader open source/open lakehouse ecosystem?
How are those projects/investments being used in the commercial offering?
What is the purchase/usage model that customers expect for lakehouse implementations?
How have those expectations shifted since the first iterations of Dremio?
Dremio has its ancestry in the Drill project. How has that history influenced the capabilities (e.g. integrations, scalability, deployment models, etc.) and evolution of Dremio compared to systems like Trino/Presto and Spark SQL?
What are the most interesting, innovative, or unexpected ways that you have seen Dremio used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dremio?
When is Dremio the wrong choice?
What do you have planned for the future of Dremio?
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Dremio
Podcast Episode
Dremio Sonar
Dremio Arctic
DML == Data Manipulation Language
Spark
Data Lake
Trino
Presto
Dremio Data Reflections
Tableau
Delta Lake
Podcast Episode
Apache Impala
Apache Arrow
DuckDB
Podcast Episode
Google BigLake
Project Nessie
Apache Iceberg
Podcast Episode
Hive Metastore
AWS Glue Catalog
Dremel
Apache Drill
Arrow Gandiva
dbt
Airbyte
Podcast Episode
Singer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 10, 2022 • 55min
Making The Open Data Lakehouse Affordable Without The Overhead At Iomete
Summary
The core of any data platform is the centralized storage and processing layer. For many that is a data warehouse, but in order to support a diverse and constantly changing set of uses and technologies, the data lakehouse is a paradigm that offers a useful balance of scale and cost with performance and ease of use. In order to make the data lakehouse available to a wider audience, the team at Iomete built an all-in-one service that handles management and integration of the various technologies so that you can focus on answering important business questions. In this episode Vusal Dadalov explains how the platform is implemented, the motivation for a truly open architecture, and how they have invested in integrating with the broader ecosystem to make it easy for you to get started.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Vusal Dadalov about Iomete, an open and affordable lakehouse platform
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Iomete is and the story behind it?
The selection of the storage/query layer is the most impactful decision in the implementation of a data platform. What do you see as the most significant factors that are leading people to Iomete/lakehouse structures rather than a more traditional db/warehouse?
The principle of the Lakehouse architecture has been gaining popularity recently. What are some of the complexities/missing pieces that make its implementation a challenge?
What are the hidden difficulties/incompatibilities that come up for teams who are investing in data lake/lakehouse technologies?
What are some of the shortcomings of lakehouse architectures?
What are the fundamental capabilities that are necessary to run a fully functional lakehouse?
Can you describe how the Iomete platform is implemented?
What was your process for deciding which elements to adopt off the shelf vs. building from scratch?
What do you see as the strengths of Spark as the query/execution engine as compared to e.g. Presto/Trino or Dremio?
What are the integrations and ecosystem investments that you have had to prioritize to simplify adoption of Iomete?
What have been the most challenging aspects of building a competitive business in such an active product category?
What are the most interesting, innovative, or unexpected ways that you have seen Iomete used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iomete?
When is Iomete the wrong choice?
What do you have planned for the future of Iomete?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Iomete
Fivetran
Podcast Episode
Airbyte
Podcast Episode
Snowflake
Podcast Episode
Databricks
Collibra
Podcast Episode
Talend
Parquet
Trino
Spark
Presto
Snowpark
Iceberg
Podcast Episode
Iomete dbt adapter
Singer
Meltano
Podcast Episode
AWS Interface Gateway
Apache Hudi
Podcast Episode
Delta Lake
Podcast Episode
Amundsen
Podcast Episode
AWS EMR
AWS Athena
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 10, 2022 • 41min
Investing In Understanding The Customer Journey At American Express
Summary
For any business that wants to stay in operation, the most important thing it can do is understand its customers. American Express has invested substantial time and effort in its Customer 360 product to achieve that understanding. In this episode Purvi Shah, the VP of Enterprise Big Data Platforms at American Express, explains how they have invested in the cloud to power this visibility, and describes the complex suite of integrations they have built and maintained across legacy and modern systems to make it possible.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Purvi Shah about building the Customer 360 data product for American Express and migrating their enterprise data platform to the cloud
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the Customer 360 project is and the story behind it?
What are the types of questions and insights that the C360 project is designed to answer?
Can you describe the types of information and data sources that you are relying on to feed this project?
What are the different axes of scale that you have had to address in the design and architecture of the C360 project? (e.g. geographical, volume/variety/velocity of data, scale of end-user access and data manipulation, etc.)
What are some of the challenges that you have had to address in order to build and maintain the map between organizational and technical requirements/semantics in the platform?
What were some of the early wins that you targeted, and how did the lessons from those successes drive the product design going forward?
Can you describe the platform architecture for your data systems that are powering the C360 product?
How have the design/goals/requirements of the system changed since you first started working on it?
How have you approached the integration and migration of legacy data systems and assets into this new platform?
What are some of the ongoing maintenance challenges that the legacy platforms introduce?
Can you describe how you have approached the question of data quality/observability and the validation/verification of the generated assets?
What are the aspects of governance and access control that you need to deal with being part of a financial institution?
Now that the C360 product has been in use for a few years, what are the strategic and tactical aspects of the ongoing evolution and maintenance of the product which you have had to address?
What are the most interesting, innovative, or unexpected ways that you have seen the C360 product used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on C360 for American Express?
When is a C360 project the wrong choice?
What do you have planned for the future of C360 and enterprise data platforms at American Express?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Data Stewards
Hadoop
SBA Paycheck Protection
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 3, 2022 • 56min
Make Data Lineage A Ubiquitous Part Of Your Work By Simplifying Its Implementation With Alvin
Summary
Data lineage is something that has grown from a convenient feature to a critical need as data systems have grown in scale, complexity, and centrality to business. Alvin is a platform that aims to provide a low-effort solution for data lineage capabilities, focused on simplifying the work of data engineers. In this episode co-founder Martin Sahlen explains the impact that easy access to lineage information can have on the work of data engineers and analysts, and how he and his team have designed their platform to offer that information to engineers and stakeholders in the places where they interact with data.
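To make that concrete, here is a toy, hedged sketch (nothing like Alvin’s actual implementation) that uses the sqlparse library mentioned in the links below to pull source tables out of a query; pairs of source and target tables discovered this way are the raw material for a lineage graph:

```python
import sqlparse
from sqlparse.sql import Identifier
from sqlparse.tokens import Keyword

def source_tables(sql):
    """Toy lineage extraction: names of tables read in FROM/JOIN clauses."""
    tables, expecting_table = set(), False
    for tok in sqlparse.parse(sql)[0].tokens:
        if tok.ttype is Keyword and tok.value.upper() in ("FROM", "JOIN"):
            expecting_table = True
        elif expecting_table and isinstance(tok, Identifier):
            tables.add(tok.get_real_name())
            expecting_table = False
    return tables

print(source_tables("SELECT o.id FROM orders o JOIN customers c ON o.cid = c.id"))
# -> {'orders', 'customers'} (set order may vary)
```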
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Martin Sahlen about his work on data lineage at Alvin and how it factors into the day-to-day work of data engineers
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Alvin is and the story behind it?
What is the core problem that you are trying to solve at Alvin?
Data lineage has quickly become an overloaded term. What are the elements of lineage that you are focused on addressing?
What are some of the other sources/pieces of information that you integrate into the lineage graph?
How does data lineage show up in the work of data engineers?
In what ways does your focus on data engineers inform the way that you model the lineage information?
As with every data asset/product, the lineage graph is only as useful as the data that it stores. What are some of the ways that you focus on establishing and ensuring a complete view of lineage?
How do you account for assets (e.g. tables, dashboards, exports, etc.) that are created outside of the "officially supported" methods? (e.g. someone manually runs a SQL create statement, etc.)
Can you describe how you have implemented the Alvin platform?
How have the design and goals shifted from when you first started exploring the problem?
What are the types of data systems/assets that you are focused on supporting? (e.g. data warehouses vs. lakes, structured vs. unstructured, which BI tools, etc.)
How does Alvin fit into the workflow of data engineers and their downstream customers/collaborators?
What are some of the design choices (both visual and functional) that you focused on to avoid friction in the data engineer’s workflow?
What are some of the open questions/areas for investigation/improvement in the space of data lineage?
What are the factors that contribute to the difficulty of a truly holistic and complete view of lineage across an organization?
What are the most interesting, innovative, or unexpected ways that you have seen Alvin used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Alvin?
When is Alvin the wrong choice?
What do you have planned for the future of Alvin?
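Since several of these questions come down to parsing SQL to recover table-level dependencies, here is a toy sketch of that technique, assuming the open source sqlparse library listed in the links below. Alvin's actual parsers are built on full grammars (the links also mention ANTLR), so treat this as an illustration of the general idea rather than their implementation; the function and table names are hypothetical.

```python
# A toy lineage extractor built on sqlparse (see Links). It handles
# only simple INSERT ... SELECT statements with unqualified table
# names; production lineage tools use complete SQL grammars.
import sqlparse
from sqlparse.tokens import Keyword, Name

def extract_lineage(sql: str) -> dict:
    """Return the target table and source tables of one statement."""
    statement = sqlparse.parse(sql)[0]
    target, sources = None, []
    pending = None  # slot that the next bare identifier should fill
    for token in statement.flatten():
        word = token.value.upper()
        if token.ttype is Keyword and word in ("INTO", "TABLE"):
            pending = "target"
        elif token.ttype is Keyword and word in ("FROM", "JOIN"):
            pending = "source"
        elif token.ttype in Name and pending is not None:
            if pending == "target":
                target = token.value
            else:
                sources.append(token.value)
            pending = None
    return {"target": target, "sources": sources}

print(extract_lineage(
    "INSERT INTO revenue_daily "
    "SELECT o.order_date, sum(p.amount) FROM orders o "
    "JOIN payments p ON o.id = p.order_id GROUP BY o.order_date"
))
# -> {'target': 'revenue_daily', 'sources': ['orders', 'payments']}
```

Even this tiny example hints at why a complete lineage view is hard: qualified names, CTEs, views, and dialect differences all break naive token scanning, which is where grammar-based parsing and query-log ingestion come in.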
Contact Info
LinkedIn
@martinsahlen on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Alvin
Unacast
sqlparse Python library
Cython
Podcast.__init__ Episode
Antlr
Kotlin programming language
PostgreSQL
Podcast Episode
OpenSearch
ElasticSearch
Redis
Kubernetes
Airflow
BigQuery
Spark
Looker
Mode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 3, 2022 • 1h
Gain Visibility And Insight Into Your Supply Chains Through Operational Analytics Powered By Roambee
Summary
The global economy is dependent on complex and dynamic networks of supply chains powered by sophisticated logistics. This requires a significant amount of data to track shipments and the operational characteristics of materials and goods. Roambee is a platform that collects, integrates, and analyzes all of that information to provide companies with the critical insights they need to stay running, especially in a time of such constant change. In this episode Roambee CEO Sanjay Sharma shares the types of questions that companies are asking about their logistics, the technical work that they do to provide ways to answer those questions, and how they approach the challenge of data quality in its many forms.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business-critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Sanjay Sharma about how Roambee is using data to bring visibility into shipping and supply chains.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Roambee is and the story behind it?
Who are the personas that are looking to Roambee for insights?
What are some of the questions that they are asking about the state of their assets?
Can you describe the types of information sources and the format of the data that you are working with?
What are the types of SLAs that you are focused on delivering to your customers? (e.g. latency from recorded event to analytics, accuracy, etc.)
Can you describe how the Roambee platform is implemented?
How have the evolving landscape of sensor and data technologies influenced the evolution of your service?
Given your support for customer-created integrations and user-generated inputs on shipment updates, how do you manage data quality and consistency?
How do you approach customer onboarding, and what is your approach to reducing the time to value?
What are the most interesting, innovative, or unexpected ways that you have seen the Roambee platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Roambee?
When is Roambee the wrong choice?
What do you have planned for the future of Roambee?
Contact Info
LinkedIn
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Roambee
RFID == Radio Frequency Identification
EDI == Electronic Data Interchange
Digital Twin
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 26, 2022 • 50min
Power Your Real-Time Analytics Without The Headache Using Fivetran's Change Data Capture Integrations
Summary
Data integration from source systems to their downstream destinations is the foundational step for any data product. With the increasing expectation for information to be instantly accessible, the need for reliable change data capture has grown. The team at Fivetran recently introduced that functionality to power real-time data products. In this episode Mark Van de Wiel explains how they integrated CDC functionality into their existing product and discusses the nuances of different approaches to change data capture across various sources.
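To make the mechanism concrete before the interview, here is a minimal sketch of log-based change data capture against PostgreSQL, assuming the psycopg2 driver, a server with logical replication enabled, and the built-in test_decoding output plugin; the connection string and slot name are hypothetical, and this illustrates the general technique rather than Fivetran's proprietary connectors.

```python
# A minimal log-based CDC consumer for PostgreSQL using psycopg2.
# Assumes wal_level=logical on the server; the dbname, user, and
# slot name below are hypothetical.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=shop user=cdc_reader",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# The slot makes the server retain write-ahead log segments until the
# consumer acknowledges them, so restarts never miss a committed change.
cur.create_replication_slot("demo_slot", output_plugin="test_decoding")
cur.start_replication(slot_name="demo_slot", decode=True)

def consume(msg):
    # One decoded message per row change (INSERT/UPDATE/DELETE); a real
    # pipeline would batch these into the destination warehouse.
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # acknowledge

cur.consume_stream(consume)  # blocks, streaming changes as they commit
```

The replication slot is the crucial design element: because acknowledgement is explicit, the consumer gets at-least-once delivery across restarts without rescanning source tables, which is the property that distinguishes log-based CDC from polling.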
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Mark Van de Wiel about Fivetran’s implementation of change data capture and the state of streaming data integration in the modern data stack.
Interview
Introduction
How did you get involved in the area of data management?
What are some of the notable changes/advancements at Fivetran in the last 3 years?
How has the scale and scope of usage for real-time data changed in that time?
What are some of the differences in usage between real-time CDC data and the event streams that have been the driving force behind most real-time data?
What are some of the architectural shifts that are necessary in an organization’s data platform to take advantage of CDC data streams?
What are some of the shifts in e.g. cloud data warehouses that have happened/are happening to allow for ingestion and timely processing of these data feeds?
What are some of the different ways that CDC is implemented in different source systems?
What are some of the ways that CDC principles might start to bleed into e.g. APIs/SaaS systems to allow for more unified processing patterns across data sources?
What are some of the architectural/design changes that you have had to make to provide CDC for your customers at Fivetran?
What are the most interesting, innovative, or unexpected ways that you have seen CDC used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDC at Fivetran?
When is CDC the wrong choice?
What do you have planned for the future of CDC at Fivetran?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Fivetran
Podcast Episode
HVR Software
Change Data Capture
Debezium
Podcast Episode
LogMiner
Materialize
Podcast Episode
Kafka
Kinesis
dbt
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast