

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Mar 30, 2021 • 58min
Data Quality Management For The Whole Team With Soda Data
Summary
Data quality is at the top of everyone’s mind recently, but getting it right is as challenging as ever. One of the contributing factors is the number of people who are involved in the process and the potential impact on the business if something goes wrong. In this episode Maarten Masschelein and Tom Baeyens share the work they are doing at Soda to bring everyone on board to make your data clean and reliable. They explain how they started down the path of building a solution for managing data quality, their philosophy of how to empower data engineers with well-engineered open source tools that integrate with the rest of the platform, and how to bring all of the stakeholders onto the same page to make your data great. There are many aspects of data quality management and it’s always a treat to learn from people who are dedicating their time and energy to solving it for everyone.
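To make the kind of checks discussed here concrete, the sketch below runs two basic data quality assertions (row count and missing values) against a small example table. This is a minimal, self-contained illustration of the idea rather than Soda’s actual API or configuration format; the table, column, and thresholds are hypothetical.

```python
import sqlite3

# Minimal sketch of the kind of checks a data quality tool automates.
# The table and column names here are hypothetical, not Soda's API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com")],
)

def run_checks(table: str) -> dict:
    """Run a couple of simple data quality checks and report pass/fail."""
    cur = conn.cursor()
    row_count = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    missing_emails = cur.execute(
        f"SELECT COUNT(*) FROM {table} WHERE email IS NULL"
    ).fetchone()[0]
    return {
        "row_count > 0": row_count > 0,
        "missing_count(email) == 0": missing_emails == 0,
    }

for check, passed in run_checks("customers").items():
    print(f"{check}: {'PASS' if passed else 'FAIL'}")
```

In practice a tool like Soda expresses assertions like these declaratively, runs them on a schedule against the warehouse, and alerts the team when a check fails.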
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Maarten Masschelein and Tom Baeyens about the work they are doing at Soda to power data quality management
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Soda?
What problem are you trying to solve?
And how are you solving that problem?
What motivated you to start a business focused on data monitoring and data quality?
The data monitoring and broader data quality space is a segment of the industry that is seeing a huge increase in attention recently. Can you share your perspective on the current state of the ecosystem and how your approach compares to other tools and products?
Who have you created Soda for (e.g. platform engineers, data engineers, data product owners, etc.) and what is a typical workflow for each of them?
How do you go about integrating Soda into your data infrastructure?
How has the Soda platform been architected?
Why is this architecture important?
How have the goals and design of the system changed or evolved as you worked with early customers and iterated toward your current state?
What are some of the challenges associated with the ongoing monitoring and testing of data?
What are some of the tools or techniques for data testing used in conjunction with Soda?
What are some of the most interesting, innovative, or unexpected ways that you have seen Soda being used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building the technology and business for Soda?
When is Soda the wrong choice?
What do you have planned for the future?
Contact Info
Maarten
LinkedIn
@masscheleinm on Twitter
Tom
LinkedIn
@tombaeyens on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Soda Data
Soda SQL
RedHat
Collibra
Spark
Getting Things Done by David Allen (affiliate link)
Slack
OpsGenie
DBT
Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 23, 2021 • 50min
Real World Change Data Capture At Datacoral
Summary
The world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems. In this episode Raghu Murthy, founder and CEO of Datacoral, does a deep dive on how he and his team manage change data capture pipelines in production.
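For listeners new to the idea, the sketch below shows the essence of log-based CDC: the source database emits an ordered stream of change events (inserts, updates, deletes), and a consumer replays them against the destination to keep it in sync. The event shape and field names are hypothetical and simplified, not the format used by Datacoral, Debezium, or any other specific tool.

```python
# Hypothetical, simplified change events in commit order; "row" is the new
# row image and is None for deletes.
events = [
    {"op": "insert", "id": 1, "row": {"id": 1, "status": "pending"}},
    {"op": "update", "id": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "delete", "id": 2, "row": None},
]

# The destination table, keyed by primary key (seeded from an initial snapshot).
destination = {2: {"id": 2, "status": "cancelled"}}

def apply_event(table: dict, event: dict) -> None:
    """Replay a single change event against the destination table."""
    if event["op"] in ("insert", "update"):
        table[event["id"]] = event["row"]  # upsert the latest row image
    else:
        table.pop(event["id"], None)       # delete the row if present

for event in events:
    apply_event(destination, event)

print(destination)  # {1: {'id': 1, 'status': 'shipped'}}
```

The hard parts discussed in the episode are everything around this loop: taking the initial snapshot without overloading the source, handling schema changes, ordering and deduplicating events, and monitoring the whole flow.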
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Raghu Murthy about his recent work of making change data capture more accessible and maintainable
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what CDC is and when it is useful?
What are the alternatives to CDC?
What are the cases where a more batch-oriented approach would be preferable?
What are the factors that you need to consider when deciding whether to implement a CDC system for a given data integration?
What are the barriers to entry?
What are some of the common mistakes or misconceptions about CDC that you have encountered in your own work and while working with customers?
How does CDC fit into a broader data platform, particularly where there are likely to be other data integration pipelines in operation? (e.g. Fivetran/Airbyte/Meltano/custom scripts)
What are the moving pieces in a CDC workflow that need to be considered as you are designing the system?
What are some examples of the configuration changes necessary in source systems to provide the needed log data?
How would you characterize the current landscape of tools available off the shelf for building a CDC pipeline?
What are your predictions about the potential for a unified abstraction layer for log-based CDC across databases?
What are some of the potential performance/uptime impacts on source databases, both during the initial historical sync and once you hit a steady state?
How can you mitigate the impacts of the CDC pipeline on the source databases?
What are some of the implementation details that application developers and DBAs need to be aware of for data modeling in the source systems to allow for proper replication via CDC?
Are there any performance challenges that need to be addressed in the consumers or destination systems? (e.g. parallelism)
Can you describe the technical implementation and architecture that you use for implementing CDC?
How has the design evolved as you have grown the scale and sophistication of your system?
In the destination system, what data modeling decisions need to be made to ensure that the replicated information is usable for analytics?
What additional attributes need to be added to track things like row modifications, deletions, schema changes, etc.?
How do you approach treatment of data copies in the DWH? (e.g. ELT – keep all source tables and use DBT for converting relevant tables into star/snowflake/data vault/wide tables)
What are your thoughts on the viability of a data lake as the destination system? (e.g. S3/Parquet or Trino/Drill/etc.)
CDC is a topic that is generally reserved for conversations about databases, but what are some of the other systems where we could think about implementing CDC? (e.g. APIs and third party data sources)
How can we integrate CDC into metadata/lineage tooling?
How do you handle observability of CDC flows?
What is involved in debugging a replication flow?
How can we build data quality checks into CDC workflows?
What are some of the most interesting, innovative, or unexpected ways that you have seen CDC used?
What are the most interesting, unexpected, or challenging lessons that you have learned from digging deep into CDC implementation?
When is CDC the wrong choice?
What are some of the industry or technology trends around CDC that you are most excited by?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataCoral
Podcast Episode
DataCoral Blog
3 Steps To Build A Modern Data Stack
Change Data Capture: Overview
Hive
Hadoop
DBT
Podcast Episode
FiveTran
Podcast Episode
Change Data Capture
Metadata First Blog Post
Debezium
Podcast Episode
UUID == Universally Unique Identifier
Airflow
Oracle Goldengate
Parquet
Trino
AWS Lambda
Data Mesh
Podcast Episode
Enterprise Message Bus
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 16, 2021 • 46min
Managing The DoorDash Data Platform
Summary
The team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business the data team has to build a platform that analysts and data scientists can use in a self-service manner. In this episode the head of data platform for DoorDash, Sudhir Tonse, discusses the technologies that they are using, the approach that they take to adding new systems, and how they think about priorities for what to support for the whole company vs what to leave as a specialized concern for a single team. This is a valuable look at how to manage a large and growing data platform that supports a variety of teams with varied and evolving needs.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Sudhir Tonse about how the team at DoorDash designed their data platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving a quick overview of what you do at DoorDash?
What are some of the ways that data is used to power the business?
How has the pandemic affected the scale and volatility of the data that you are working with?
Can you describe the type(s) of data that you are working with?
What are the primary sources of data that you collect?
What secondary or third party sources of information do you rely on?
Can you give an overview of the collection process for that data?
In selecting the technologies for the various components in your data stack, what are the primary factors that you consider when evaluating the build vs. buy decision?
In your recent post about how you are scaling the capabilities and capacity of your data platform you mentioned the concept of maintaining a "paved path" of supported technologies to simplify integration across teams. What are the technologies that you use and rely on for the "paved path"?
How are you managing quality and consistency of your data across its lifecycle?
What are some of the specific data quality solutions that you have integrated into the platform and "paved path"?
What are some of the technologies that were used early on at DoorDash that failed to keep up as the business scaled?
How do you manage the migration path for adopting new technologies or techniques?
In the same post you mentioned the tendency to allow for building point solutions before deciding whether to generalize a given use case into a platform capability. Can you give some examples of cases where a point solution remains a one-off versus when it needs to be expanded into a widely used component?
How do you identify and track cost factors in the data platform?
What do you do with that information?
What is your approach for identifying and measuring useful OKRs (Objectives and Key Results)?
How do you quantify potentially subjective metrics such as reliability and quality?
How have you designed the organizational structure for your data teams?
What are the responsibilities and organizational interfaces for data engineers within the company?
How have the organizational structures/patterns shifted or changed at different levels of scale/maturity for the business?
What are some of the most interesting, useful, unexpected, or challenging lessons that you have learned during your time as a data professional at DoorDash?
What are some of the upcoming projects or changes that you anticipate in the near to medium future?
Contact Info
LinkedIn
@stonse on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
How DoorDash is Scaling its Data Platform to Delight Customers and Meet our Growing Demand
DoorDash
Uber
Netscape
Netflix
Change Data Capture
Debezium
Podcast Episode
SnowflakeDB
Podcast Episode
Airflow
Podcast.__init__ Episode
Kafka
Flink
Podcast Episode
Pinot
GDPR
CCPA
Data Governance
AWS
LightGBM
XGBoost
Big Data Landscape
Kinesis
Kafka Connect
Cassandra
PostgreSQL
Podcast Episode
Amundsen
Podcast Episode
SQS
Feature Toggles
BigEye
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 9, 2021 • 52min
Leave Your Data Where It Is And Automate Feature Extraction With Molecula
Summary
A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, and building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges but at a massive scale, leading him to question if there is a better way. After tasking some of his top engineers to consider the problem in a new light they created the Pilosa engine. In this episode H.O. explains how, using Pilosa as the core, he built the Molecula platform to eliminate the need to copy data between systems in order to make it accessible for analytical and machine learning purposes. He also discusses the challenges that he faces in helping potential users and customers understand the shift in thinking that this creates, and how the system is architected to make it possible. This is a fascinating conversation about what the future looks like when you revisit your assumptions about how systems are designed.
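The bitmap index at the heart of Pilosa is easier to reason about with a toy example. The sketch below keeps one bitmap per attribute value, with bit i set when record i has that value, so that combining features becomes a cheap bitwise operation. This is only an illustration of the general technique with invented data, not Pilosa’s or Molecula’s actual implementation.

```python
# Toy bitmap index: one integer bitmap per attribute value, with bit i set
# when record i has that value. Illustrative only; not Pilosa's internals.
records = [
    {"id": 0, "country": "US", "plan": "pro"},
    {"id": 1, "country": "BE", "plan": "free"},
    {"id": 2, "country": "US", "plan": "free"},
    {"id": 3, "country": "US", "plan": "pro"},
]

index: dict = {}
for rec in records:
    for field in ("country", "plan"):
        key = f"{field}={rec[field]}"
        index[key] = index.get(key, 0) | (1 << rec["id"])

# Which records are US customers on the pro plan? One bitwise AND answers it.
matches = index["country=US"] & index["plan=pro"]
matching_ids = [i for i in range(len(records)) if matches & (1 << i)]
print(matching_ids)  # [0, 3]
```

Because each feature is already stored as a bitmap, set operations like this can be computed where the data lives, which is the property Molecula leans on to avoid copying data into yet another system.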
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing H.O. Maycotte about Molecula, a cloud based feature store based on the open source Pilosa project
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Molecula and the story behind it?
What are the additional capabilities that Molecula offers on top of the open source Pilosa project?
What are the problems/use cases that Molecula solves for?
What are some of the technologies or architectural patterns that Molecula might replace in a company’s data platform?
One of the use cases that is mentioned on the Molecula site is as a feature store for ML and AI. This is a category that has been seeing a lot of growth recently. Can you provide some context for how Molecula fits in that market and how it compares to options such as Tecton, Iguazio, Feast, etc.?
What are the benefits of using a bitmap index for identifying and computing features?
Can you describe how the Molecula platform is architected?
How has the design and goal of Molecula changed or evolved since you first began working on it?
For someone who is using Molecula, can you describe the process of integrating it with their existing data sources?
Can you describe the internal data model of Pilosa/Molecula?
How should users think about data modeling and architecture as they are loading information into the platform?
Once a user has data in Pilosa, what are the available mechanisms for performing analyses or feature engineering?
What are some of the most underutilized or misunderstood capabilities of Molecula?
What are some of the most interesting, unexpected, or innovative ways that you have seen the Molecula platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned from building and scaling Molecula?
When is Molecula the wrong choice?
What do you have planned for the future of the platform and business?
Contact Info
LinkedIn
@maycotte on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Molecula
Pilosa
Podcast Episode
The Social Dilemma
Feature Store
Cassandra
Elasticsearch
Podcast Episode
Druid
MongoDB
SwimOS
Podcast Episode
Kafka
Kafka Schema Registry
Podcast Episode
Homomorphic Encryption
Lucene
Solr
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 2, 2021 • 1h 6min
Bridging The Gap Between Machine Learning And Operations At Iguazio
Summary
The process of building and deploying machine learning projects requires a staggering number of systems and stakeholders to work in concert. In this episode Yaron Haviv, co-founder of Iguazio, discusses the complexities inherent to the process, as well as how he has worked to democratize the technologies necessary to make machine learning operations maintainable.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Yaron Haviv about Iguazio, a platform for end to end automation of machine learning applications using MLOps principles.
Interview
Introduction
How did you get involved in the area of data science & analytics?
Can you start by giving an overview of what Iguazio is and the story of how it got started?
How would you characterize your target or typical customer?
What are the biggest challenges that you see around building production grade workflows for machine learning?
How does Iguazio help to address those complexities?
For customers who have already invested in the technical and organizational capacity for data science and data engineering, how does Iguazio integrate with their environments?
What are the responsibilities of a data engineer throughout the different stages of the lifecycle for a machine learning application?
Can you describe how the Iguazio platform is architected?
How has the design of the platform evolved since you first began working on it?
How have the industry best practices around bringing machine learning to production changed?
How do you approach testing/validation of machine learning applications and releasing them to production environments? (e.g. CI/CD)
Once a model is in production, what are the types and sources of information that you collect to monitor their performance?
What are the factors that contribute to model drift?
What are the remaining gaps in the tooling or processes available for managing the lifecycle of machine learning projects?
What are the most interesting, innovative, or unexpected ways that you have seen the Iguazio platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building and scaling the Iguazio platform and business?
When is Iguazio the wrong choice?
What do you have planned for the future of the platform?
Contact Info
LinkedIn
@yaronhaviv on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Iguazio
MLOps
Oracle Exadata
SAP HANA
Mellanox
NVIDIA
Multi-Model Database
Nuclio
MLRun
Jupyter Notebook
Pandas
Scala
Feature Imputing
Feature Store
Parquet
Spark
Apache Flink
Podcast Episode
Apache Beam
NLP (Natural Language Processing)
Deep Learning
BERT
Airflow
Podcast.__init__ Episode
Dagster
Data Engineering Podcast Episode
Podcast.__init__ Episode
Kubeflow
Argo
AWS Step Functions
Presto/Trino
Podcast Episode
Dask
Podcast Episode
Hadoop
Sagemaker
Tecton
Podcast Episode
Seldon
DataRobot
RapidMiner
H2O.ai
Grafana
Storey
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 23, 2021 • 52min
Self Service Open Source Data Integration With AirByte
Summary
Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrassingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy to use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open source data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Airbyte is and the story behind it?
Businesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space?
How would you characterize your target users?
How have those personas instructed the priorities and design of Airbyte?
What do you see as the benefits and tradeoffs of a UI oriented data integration platform as compared to a code first approach?
What are the complex/challenging elements of data integration that make it such a slippery problem?
Motivation for creating open source ELT as a business
Can you describe how the Airbyte platform is implemented?
What was your motivation for choosing Java as the primary language?
Incidental complexity of forcing all connectors to be packaged as containers
Shortcomings of the Singer specification/motivation for creating a backwards incompatible interface
Perceived potential for community adoption of Airbyte specification
Tradeoffs of using JSON as interchange format vs. e.g. protobuf/gRPC/Avro/etc.
Information lost when converting records to JSON types/how to preserve that information (e.g. field constraints, valid enums, etc.)
Interfaces/extension points for integrating with other tools, e.g. Dagster
Abstraction layers for simplifying implementation of new connectors
Tradeoffs of storing all connectors in a monorepo with the Airbyte core
Impact of community adoption/contributions
What is involved in setting up an Airbyte installation?
What are the available axes for scaling an Airbyte deployment?
Challenges of setting up and maintaining a CI environment for Airbyte
How are you managing governance and long term sustainability of the project?
What are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte?
When is Airbyte the wrong choice?
What do you have planned for the future of the project?
Contact Info
Michel
LinkedIn
@MichelTricot on Twitter
michel-tricot on GitHub
John
LinkedIn
@JeanLafleur on Twitter
johnlafleur on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Airbyte
Liveramp
Fivetran
Podcast Episode
Stitch Data
Matillion
DataCoral
Podcast Episode
Singer
Meltano
Podcast Episode
Airflow
Podcast.__init__ Episode
Kotlin
Docker
Monorepo
Airbyte Specification
Great Expectations
Podcast Episode
Dagster
Data Engineering Podcast Episode
Podcast.__init__ Episode
Prefect
Podcast Episode
DBT
Podcast Episode
Kubernetes
Snowflake
Podcast Episode
Redshift
Presto
Spark
Parquet
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 16, 2021 • 52min
Building The Foundations For Data Driven Businesses at 5xData
Tarush Aggarwal, Founder of 5xData, discusses the core elements necessary for businesses to be data-driven. He emphasizes the importance of building foundational capabilities, offering collaborative workshops to assist in setting up technical and organizational systems. He also highlights the ongoing support provided through mastermind groups and the initial steps for making data-informed decisions.

Feb 9, 2021 • 47min
How Shopify Is Building Their Production Data Warehouse Using DBT
Summary
With all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how they structured the project to allow multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they have run into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real world use of a specific technology and how well it lives up to its promises.
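One of the CI optimizations discussed in the episode is rebuilding and testing only the models affected by a change rather than the whole project. The sketch below shows the underlying idea using file hashes; it is a generic illustration with hypothetical model files, not Shopify’s Seamster tooling or DBT’s built-in state selection.

```python
import hashlib
import tempfile
from pathlib import Path

def file_hash(path: Path) -> str:
    """Content hash used to detect whether a model's SQL changed."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_models(model_dir: Path, previous: dict) -> list:
    """Return the models whose SQL differs from the last recorded hashes."""
    return [
        p.stem
        for p in sorted(model_dir.glob("*.sql"))
        if previous.get(p.name) != file_hash(p)
    ]

# Self-contained demo with two hypothetical model files.
with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp)
    (models / "orders.sql").write_text("select * from raw.orders")
    (models / "customers.sql").write_text("select * from raw.customers")

    # Pretend the previous CI run recorded only the hash of orders.sql.
    previous = {"orders.sql": file_hash(models / "orders.sql")}
    print(changed_models(models, previous))  # ['customers']
```

In a real pipeline the changed models and their downstream dependents would then be built into a scratch schema and tested against only that subset, which is what keeps CI feedback fast and warehouse costs down.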
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.
Your host is Tobias Macey and today I’m interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what the Shopify platform is?
What kinds of data sources are you working with?
Can you share some examples of the types of analysis, decisions, and products that you are building with the data that you manage?
How have you structured your data teams to be able to deliver those projects?
What are the systems that you have in place, technological or otherwise, to allow you to support the needs of the various data professionals and business users?
What was the tipping point that led you to reconsider your system design and start down the road of architecting a data warehouse?
What were your criteria when selecting a platform for your data warehouse?
What decision did that criteria lead you to make?
Once you decided to orient a large portion of your reporting around a data warehouse, what were the biggest unknowns that you were faced with while deciding how to structure the workflows and access policies?
What were your criteria for determining what toolchain to use for managing the data warehouse?
You ultimately decided to standardize on DBT. What were the other options that you explored and what were the requirements that you had for determining the candidates?
What was your process for onboarding users into the DBT toolchain and determining how to structure the project layout?
What are some of the shortcomings or edge cases that you ran into?
Rather than rely on the vanilla DBT workflow, you created a wrapper project to add additional functionality. What were some of the features that you needed to add to suit your particular needs?
What has been your experience with extending and integrating with DBT to customize it for your environment?
Can you talk through how you manage testing of your DBT pipelines and the tables that it is responsible for?
How much of the testing are you able to do with out-of-the-box functionality from DBT?
What are the additional capabilities that you have bolted on to provide a more robust and scalable means of verifying your pipeline changes?
Can you share how you manage the CI/CD process for changes in your data warehouse?
What kinds of monitoring or metrics collection do you perform on the execution of your DBT pipelines?
How do you integrate the management of your data warehouse and DBT workflows with your broader data platform?
Now that you have been using DBT in production for a while, what are the challenges that you have encountered when using it at scale?
Are there any patterns that you and your team have found useful that are worth digging into for other teams who are considering DBT or are actively using it?
What are the opportunities and available mechanisms that you have found for introducing abstraction layers to reduce the maintenance burden for your data warehouse?
What is the data modeling approach that you are using? (e.g. Data Vault, Star/Snowflake Schema, wide tables, etc.)
As you continue to work with DBT and rely on the data warehouse for production use cases, what are some of the additional features/improvements that you have planned?
What are some of the unexpected/innovative/surprising use cases that you and your team have found for the Seamster tool or the data models that it generates?
What are the cases where you think that DBT or data warehousing is the wrong answer and teams should be looking to other solutions?
What are the most interesting, unexpected, or challenging lessons that you learned while working through the process of migrating a portion of your data workloads into the data warehouse and managing them with DBT?
Contact Info
Zeeshan
@zeeshanq on Twitter
Website
Michelle
@michellearky on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
How to Build a Production Grade Workflow with SQL Modelling
Shopify
JRuby
PySpark
Druid
Amplitude
Mode
Snowflake Schema
Data Vault
Podcast Episode
BigQuery
Amazon Redshift
CI/CD
Great Expectations
Podcast Episode
Master Data Management
Podcast Episode
Flink SQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 2, 2021 • 1h 5min
System Observability For The Cloud Native Era With Chronosphere
Summary
Collecting and processing metrics for monitoring use cases is an interesting data problem. It is eminently possible to generate millions or billions of data points per second; the information needs to be propagated to a central location, processed, and analyzed in timeframes on the order of milliseconds or single-digit seconds; and the consumers of the data need to be able to query the information quickly and flexibly. As the systems that we build continue to grow in scale and complexity, the need for reliable and manageable monitoring platforms increases proportionately. In this episode Rob Skillington, CTO of Chronosphere, shares his experiences building metrics systems that provide observability to companies that are operating at extreme scale. He describes how the M3DB storage engine is designed to manage the pressures of a critical system component, the inherent complexities of working with telemetry data, and the motivating factors that are contributing to the growing need for flexibility in querying the collected metrics. This is a fascinating conversation about an area of data management that is often taken for granted.
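To ground the discussion of cardinality and query flexibility, the sketch below models a metric sample as a name plus a set of labels, a timestamp, and a value, and shows how each unique label combination becomes its own series. The sample data and label names are invented for illustration; this is not M3DB’s data model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (metric name, labels, timestamp, value).
samples = [
    ("http_requests", frozenset({("service", "api"), ("status", "200")}), 0, 120.0),
    ("http_requests", frozenset({("service", "api"), ("status", "200")}), 10, 130.0),
    ("http_requests", frozenset({("service", "api"), ("status", "500")}), 10, 3.0),
]

# Every unique (name, labels) pair is a distinct series; the total number of
# series is the cardinality that makes metrics storage hard at scale.
series = defaultdict(list)
for name, labels, ts, value in samples:
    series[(name, labels)].append((ts, value))

print(f"cardinality: {len(series)} series")
for (name, labels), points in series.items():
    print(name, dict(labels), "avg =", mean(v for _, v in points))
```

Every new label value multiplies the number of possible series, which is why an engine like M3DB puts so much effort into indexing and compressing series data to keep both the write path and ad-hoc queries fast.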
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.
Your host is Tobias Macey and today I’m interviewing Rob Skillington about Chronosphere, a scalable, reliable, and customizable monitoring-as-a-service platform purpose-built for cloud-native applications.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Chronosphere and your motivation for turning it into a business?
What are the biggest challenges inherent to monitoring use cases?
How does the advent of cloud native environments complicate things further?
While you were at Uber you helped to create the M3 storage engine. There are a wide array of time series databases available, including many purpose built for metrics use cases. What were the missing pieces that made it necessary to create a new system?
How do you handle schema design/data modeling for metrics storage?
How do the usage patterns of metrics systems contribute to the complexity of building a storage layer to support them?
What are the optimizations that need to be made for the read and write paths in M3?
How do you handle high cardinality of metrics and ad-hoc queries to understand system behaviors?
What are the scaling factors for M3?
Can you describe how you have architected the Chronosphere platform?
What are the convenience features built on top of M3 that you are creating at Chronosphere?
How do you handle deployment and scaling of your infrastructure given the scale of the businesses that you are working with?
Beyond just server infrastructure and application behavior, what are some of the other sources of metrics that you and your users are sending into Chronosphere?
How do those alternative metrics sources complicate the work of generating useful insights from the data?
In addition to the read and write loads, metrics systems also need to be able to identify patterns, thresholds, and anomalies in the data to alert on it with minimal latency. How do you handle that in the Chronosphere platform?
What are some of the most interesting, innovative, or unexpected ways that you have seen Chronosphere/M3 used?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Chronosphere?
When is Chronosphere the wrong choice?
What do you have planned for the future of the platform and business?
Contact Info
LinkedIn
@roskilli on Twitter
robskillington on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Chronosphere
Lidar
Cloud Native
M3DB
OpenTracing
Metrics/Telemetry
Graphite
Podcast.__init__ Episode
InfluxDB
Clickhouse
Podcast Episode
Prometheus
Inverted Index
Druid
Cardinality
Apache Flink
Podcast Episode
HDFS
Avro
Podcast Episode
Grafana
Tecton
Podcast Episode
Datadog
Podcast Episode
Kubernetes
Sourcegraph
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 26, 2021 • 34min
Making It Easier To Stick B2B Data Integration Pipelines Together With Hotglue
Summary
Businesses often need to be able to ingest data from their customers in order to power the services that they provide. For each new source that they need to integrate with, it is another custom set of ETL tasks that they need to maintain. In order to reduce the friction involved in supporting new data transformations, David Molot and Hassan Syyid built the Hotglue platform. In this episode they describe the data integration challenges facing many B2B companies, how their work on the Hotglue platform simplifies their efforts, and how they have designed the platform to make these ETL workloads embeddable and self-service for end users.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
This episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog.
Your host is Tobias Macey and today I’m interviewing David Molot and Hassan Syyid about Hotglue, an embeddable data integration tool for B2B developers built on the Python ecosystem.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Hotglue?
What was your motivation for starting a business to address this particular problem?
Who is the target user of Hotglue and what are their biggest data problems?
What are the types and sources of data that they are likely to be working with?
How are they currently handling solutions for those problems?
How does the introduction of Hotglue simplify or improve their work?
What is involved in getting Hotglue integrated into a given customer’s environment?
How is Hotglue itself implemented?
How has the design or goals of the platform evolved since you first began building it?
What were some of the initial assumptions that you had at the outset and how well have they held up as you progressed?
Once a customer has set up Hotglue what is their workflow for building and executing an ETL workflow?
What are their options for working with sources that aren’t supported out of the box?
What are the biggest design and implementation challenges that you are facing given the need for your product to be embedded in customer platforms and exposed to their end users?
What are some of the most interesting, innovative, or unexpected ways that you have seen Hotglue used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Hotglue?
When is Hotglue the wrong choice?
What do you have planned for the future of the product?
Contact Info
David
@davidmolot on Twitter
LinkedIn
Hassan
hsyyid on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Hotglue
Python
The Python Podcast.__init__
B2B == Business to Business
Meltano
Podcast Episode
Airbyte
Singer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast