

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes on the tools, techniques, and challenges of the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

May 2, 2022 • 53min
Leading The Charge For The ELT Data Integration Pattern For Cloud Data Warehouses At Matillion
Summary
The predominant pattern for data integration in the cloud has become extract, load, and then transform, or ELT. Matillion was an early innovator of that approach, and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own needs to reduce the maintenance burden of data integration workflows.
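For listeners who want to see the pattern in miniature, the following is a hypothetical sketch of ELT in Python: the raw records land in the warehouse unchanged, and the modeling happens afterward as SQL inside the engine. sqlite3 stands in for a cloud warehouse, and the table and column names are invented for illustration; this is a sketch of the general pattern, not of Matillion’s API.

```python
# Minimal ELT sketch: load raw data first, transform inside the engine.
# sqlite3 is a stand-in for a cloud warehouse; the schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the source records untouched.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "complete"), (2, 840, "cancelled"), (3, 2200, "complete")],
)

# Transform: modeling runs in the warehouse, after the load, as plain SQL.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status = 'complete'
    """
)
print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 12.5), (3, 22.0)]
```

The contrast with traditional ETL is that the pipeline only has to move bytes; all of the modeling logic lives in the warehouse, where it can be versioned and re-run.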
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting, often taking hours to days or even weeks. By the time errors have made their way into production, it’s often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Your host is Tobias Macey and today I’m interviewing Ed Thompson about Matillion, a cloud-native data integration platform for accelerating your time to analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Matillion is and the story behind it?
What are the use cases and user personas that you are focused on supporting?
How does that influence the focus and pace of your feature development and priorities?
How is Matillion architected?
How have the design and goals of the system changed since you started working on it?
The ecosystems of both cloud technologies and data processing have been rapidly growing and evolving, with new patterns and paradigms being introduced. What are the elements of your product focus and messaging that you have had to update and what are the core principles that have stayed the same?
What have been the most challenging integrations to build and support?
What is a typical workflow for integrating Matillion into an organization and building a set of pipelines?
What are some of the patterns that have been useful for managing incidental complexity as usage scales?
What are the most interesting, innovative, or unexpected ways that you have seen Matillion used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Matillion?
When is Matillion the wrong choice?
What do you have planned for the future of Matillion?
Contact Info
LinkedIn
Matillion Contact
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Matillion
Twitter
IBM DB2
Cognos
Talend
Redshift
AWS Marketplace
AWS Re:Invent
Azure
GCP == Google Cloud Platform
Informatica
SSIS == SQL Server Integration Services
PCRE == Perl Compatible Regular Expressions
Teradata
Tomcat
Collibra
Alation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

May 2, 2022 • 1h 4min
Evolving And Scaling The Data Platform at Yotpo
Summary
Building a data platform is an iterative and evolutionary process that requires collaboration with internal stakeholders to ensure that their needs are being met. Yotpo has been on a journey to evolve and scale their data platform to continue serving the needs of their organization as it increases the scale and sophistication of data usage. In this episode Doron Porat and Liran Yogev explain how they arrived at their current architecture, the capabilities that they are optimizing for, and the complex process of identifying and evaluating new components to integrate into their systems. This is an excellent exploration of the decisions and tradeoffs that need to be made while building such a complex system.
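Several of the building blocks referenced in this conversation (see the CDC, Debezium, and Outbox Pattern links below) benefit from a concrete illustration. The following is a minimal, hypothetical sketch of the transactional outbox: the business write and its change event are committed atomically, and a CDC connector such as Debezium can then tail the outbox table. sqlite3 stands in for the production database, and the schema is invented.

```python
# Transactional outbox sketch: the business row and its change event commit
# in one transaction, so downstream consumers never see one without the other.
# sqlite3 is a stand-in for the production database; the schema is invented.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT)"
)

with conn:  # one atomic transaction
    conn.execute("INSERT INTO orders VALUES (?, ?)", (1, 99.5))
    conn.execute(
        "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
        ("orders.created", json.dumps({"id": 1, "amount": 99.5})),
    )

# A CDC connector (e.g. Debezium) would tail this table and publish the events.
print(conn.execute("SELECT topic, payload FROM outbox").fetchall())
```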
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Doron Porat and Liran Yogev about their experiences designing and implementing a self-serve data platform at Yotpo
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Yotpo is and the role that data plays in the organization?
What are the core data types and sources that you are working with?
What kinds of data assets are being produced and how do those get consumed and re-integrated into the business?
What are the user personas that you are supporting and what are the interfaces that they are comfortable interacting with?
What is the size of your team and how is it structured?
You recently posted about the current architecture of your data platform. What was the starting point on your platform journey?
What did the early stages of feature and platform evolution look like?
What was the catalyst for making a concerted effort to integrate your systems into a cohesive platform?
What was the scope and directive of the project for building a platform?
What are the metrics and capabilities that you are optimizing for in the structure of your data platform?
What are the organizational or regulatory constraints that you needed to account for?
What are some of the early decisions that affected your available choices in later stages of the project?
What does the current state of your architecture look like?
How long did it take to get to where you are today?
What were the factors that you considered in the various build vs. buy decisions?
How did you manage cost modeling to understand the true savings on either side of that decision?
If you were to start from scratch on a new data platform today what might you do differently?
What are the decisions that proved helpful in the later stages of your platform development?
What are the most interesting, innovative, or unexpected ways that you have seen your platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing your platform?
What do you have planned for the future of your platform infrastructure?
Contact Info
Doron
LinkedIn
Liran
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Yotpo
Data Platform Architecture Blog Post
Greenplum
Databricks
Metorikku
Apache Hive
CDC == Change Data Capture
Debezium
Podcast Episode
Apache Hudi
Podcast Episode
Upsolver
Podcast Episode
Spark
PrestoDB
Snowflake
Podcast Episode
Druid
Rockset
Podcast Episode
dbt
Podcast Episode
Acryl
Podcast Episode
Atlan
Podcast Episode
OpenLineage
Podcast Episode
Okera
Shopify Data Warehouse Episode
Redshift
Delta Lake
Podcast Episode
Iceberg
Podcast Episode
Outbox Pattern
Backstage
Roadie
Nomad
Kubernetes
Deequ
Great Expectations
Podcast Episode
LakeFS
Podcast Episode
2021 Recap Episode
Monte Carlo
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Apr 24, 2022 • 1h 11min
Operational Analytics At Speed With Minimal Busy Work Using Incorta
Summary
A huge amount of effort goes into modeling and shaping data to make it available for analytical purposes. This is often due to the need to simplify the final queries so that they are performant for visualization or limited exploration. In order to cut down the level of effort involved in making data usable, Matthew Halliday and his co-founders created Incorta as an end-to-end, in-memory analytical engine that removes barriers to insights on your data. In this episode he explains how the system works, the use cases that it empowers, and how you can start using it for your own analytics today.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting, often taking hours to days or even weeks. By the time errors have made their way into production, it’s often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Your host is Tobias Macey and today I’m interviewing Matthew Halliday about Incorta, an in-memory, unified data and analytics platform as a service
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Incorta is and the story behind it?
What are the use cases and customers that you are focused on?
How does that focus inform the design and priorities of functionality in the product?
What are the technologies and workflows that Incorta might replace?
What are the systems and services that it is intended to integrate with and extend?
Can you describe how Incorta is implemented?
What are the core technological decisions that were necessary to make the product successful?
How have the design and goals of the system changed and evolved since you started working on it?
Can you describe the workflow for building an end-to-end analysis using Incorta?
What are some of the new capabilities or use cases that Incorta enables which are impractical or intractable with other combinations of tools in the ecosystem?
How do the features of Incorta influence the approach that teams take for data modeling?
What are the points of collaboration and overlap between organizational roles while using Incorta?
What are the most interesting, innovative, or unexpected ways that you have seen Incorta used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Incorta?
When is Incorta the wrong choice?
What do you have planned for the future of Incorta?
Contact Info
LinkedIn
@layereddelay on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Incorta
3rd Normal Form
Parquet
Podcast Episode
Delta Lake
Podcast Episode
Iceberg
Podcast Episode
PrestoDB
PySpark
Dataiku
Angular
React
Apache ECharts
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 24, 2022 • 59min
Gain Visibility Into Your Entire Machine Learning System Using Data Logging With WhyLogs
Summary
There are very few tools which are equally useful for data engineers, data scientists, and machine learning engineers. WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data from source to productionized model. In this episode Andy Dang explains why the project was created, how you can apply it to your existing data systems, and how it functions to provide detailed context for being able to gain insight into all of your data processes.
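As a concrete illustration of data logging, the following is a minimal sketch of profiling a dataframe, assuming the whylogs v1 Python API and pandas are installed; the column names and values are invented for the example.

```python
# Data logging sketch: whylogs records statistical profiles (counts, null
# ratios, distribution sketches) rather than the raw rows themselves.
# Assumes the whylogs v1 API; the dataframe contents are illustrative.
import pandas as pd
import whylogs as why

df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "purchase_amount": [20.5, 13.0, None, 8.75],
    }
)

results = why.log(df)          # profile the dataframe
profile_view = results.view()  # immutable, mergeable summary of the data

# Inspect locally, or ship the profile to a monitoring backend.
print(profile_view.to_pandas())
```

Because only summaries are captured, the same instrumentation works for sensitive datasets and makes it cheap to compare profiles across pipeline runs or between training and serving data.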
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Andy Dang about powering observability of AI systems with the whylogs data logging library
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Whylabs is and the story behind it?
How is "data logging" differentiated from logging for the purpose of debugging and observability of software logic?
What are the use cases that you are aiming to support with Whylogs?
How does it compare to libraries and services like Great Expectations/Monte Carlo/Soda Data/Datafold etc.?
Can you describe how Whylogs is implemented?
How have the design and goals of the project changed or evolved since you started working on it?
How do you maintain feature parity between the Python and Java integrations?
How do you structure the log events and metadata to provide detail and context for data applications?
How does that structure support aggregation and interpretation/analysis of the log information?
What is the process for integrating Whylogs into an existing project?
Once you have the code instrumented with log events, what is the workflow for using Whylogs to debug and maintain a data application?
What have you found to be useful heuristics for identifying what to log?
What are some of the strategies that teams can use to maintain a balance of signal vs. noise in the events that they are logging?
How is the Whylogs governance set up and how are you approaching sustainability of the open source project?
What are the additional utilities and services that you anticipate layering on top of/integrating with Whylogs?
What are the most interesting, innovative, or unexpected ways that you have seen Whylogs used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Whylabs?
When is Whylogs/Whylabs the wrong choice?
What do you have planned for the future of Whylabs?
Contact Info
LinkedIn
@andy_dng on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Whylogs
Whylabs
Spark
Airflow
Pandas
Podcast Episode
Data Sketches
Grafana
Great Expectations
Podcast Episode
Monte Carlo
Podcast Episode
Soda Data
Podcast Episode
Datafold
Podcast Episode
Delta Lake
Podcast Episode
HyperLogLog
MLFlow
Flyte
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 18, 2022 • 40min
Connecting To The Next Frontier Of Computing With Quantum Networks
Summary
The next paradigm shift in computing is coming in the form of quantum technologies. Quantum processors have gained significant attention for their speed and computational power. The next frontier is quantum networking, which promises highly secure communications and the ability to distribute workloads across quantum processing units without costly translation between quantum and classical systems. In this episode Prineha Narang, co-founder and CTO of Aliro, explains how these systems work, the capabilities that they can offer, and how you can start preparing for a post-quantum future for your data systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting, often taking hours to days or even weeks. By the time errors have made their way into production, it’s often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Your host is Tobias Macey and today I’m interviewing Dr. Prineha Narang about her work at Aliro building quantum networking technologies and how it impacts the capabilities of data systems
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Aliro is and the story behind it?
What are the use cases that you are focused on?
What is the impact of quantum networks on distributed systems design? (what limitations does it remove?)
What are the failure modes of quantum networks?
How do they differ from classical networks?
How can network technologies bridge between classical and quantum connections and where do those transitions happen?
What are the latency/bandwidth capacities of quantum networks?
How does it influence the network protocols used during those communications?
How much error correction is necessary during the quantum communication stages of network transfers?
How does quantum computing technology change the landscape for AI technologies?
How does that impact the work of data engineers who are building the systems that power the data feeds for those models?
What are the most interesting, innovative, or unexpected ways that you have seen quantum technologies used for data systems?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aliro and your academic research?
When are quantum technologies the wrong choice?
What do you have planned for the future of Aliro and your research efforts?
Contact Info
LinkedIn
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Aliro Quantum
Harvard University
CalTech
Quantum Computing
Quantum Repeater
ARPANet
Trapped Ion Quantum Computer
Photonic Computing
SDN == Software Defined Networking
QPU == Quantum Processing Unit
IEEE
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 16, 2022 • 1h 16min
What Does It Really Mean To Do MLOps And What Is The Data Engineer's Role?
Summary
Putting machine learning models into production and keeping them there requires investing in well-managed systems that cover the full lifecycle of data cleaning, training, deployment, and monitoring, along with a repeatable and evolvable set of processes to keep it all functional. The term MLOps has been coined to encapsulate all of these principles, and the broader data community is working to establish a set of best practices and useful guidelines for streamlining adoption. In this episode Demetrios Brinkmann and David Aponte share their perspectives on this rapidly changing space and what they have learned from their work building the MLOps community through blog posts, podcasts, and discussion forums.
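To ground the discussion, here is a minimal sketch of one core MLOps capability, experiment tracking, using MLflow as a representative tool rather than anything recommended by the guests; the parameter and metric values are invented for illustration.

```python
# Experiment tracking sketch: record the configuration and results of a
# training run so it can be compared and reproduced later. MLflow is used
# here as a representative tool; the values are invented for illustration.
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)  # training configuration
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.92)      # evaluation result
    # Models, plots, and data snapshots can be logged as artifacts so any
    # result can be traced back to the code and data that produced it.
```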
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing Demetrios Brinkmann and David Aponte about what you need to know about MLOps as a data engineer
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what MLOps is?
How does it relate to DataOps? DevOps? (is it just another buzzword?)
What is your interest and involvement in the space of MLOps?
What are the open and active questions in the MLOps community?
Who is responsible for MLOps in an organization?
What is the role of the data engineer in that process?
What are the core capabilities that are necessary to support an "MLOps" workflow?
How do the current platform technologies support the adoption of MLOps workflows?
What are the areas that are currently underdeveloped/underserved?
Can you describe the technical and organizational design/architecture decisions that need to be made when endeavoring to adopt MLOps practices?
What are some of the common requirements for supporting ML workflows?
What are some of the ways that requirements become bespoke to a given organization or project?
What are the opportunities for standardization or consolidation in the tooling for MLOps?
What are the pieces that are always going to require custom engineering?
What are the most interesting, innovative, or unexpected approaches to MLOps workflows/platforms that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on supporting the MLOps community?
What are your predictions for the future of MLOps?
What are you keeping a close eye on?
Contact Info
Demetrios
LinkedIn
@Dpbrinkm on Twitter
Medium
David
LinkedIn
@aponteanalytics on Twitter
aponte411 on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
MLOps Community
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz (affiliate link)
MLOps
DataOps
DevOps
The Sequence Newsletter
Neptune.ai
Algorithmia
Kubeflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 11, 2022 • 58min
DataOps As A Service For Your Data Integration Workflows With Rivery
Summary
Data engineering is a multi-faceted practice that requires integration with a large number of systems. This often means working across multiple tools to get the job done, which can introduce a significant cost to productivity due to the number of context switches. Rivery is a platform designed to reduce this incidental complexity and provide a single system for working across the different stages of the data lifecycle. In this episode CEO and founder Itamar Ben Hemo explains how his experiences in the industry led to his vision for the Rivery platform as a single place to build end-to-end analytical workflows, including how it is architected and how you can start using it today for your own work.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting, often taking hours to days or even weeks. By the time errors have made their way into production, it’s often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Itamar Ben Hemo about Rivery, a SaaS platform designed to provide an end-to-end solution for Ingestion, Transformation, Orchestration, and Data Operations
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Rivery is and the story behind it?
What are the primary goals of Rivery as a platform and company?
What are the target personas for the Rivery platform?
What are the points of interaction/workflows for each of those personas?
What are some of the positive and negative sources of inspiration that you looked to while deciding on the scope of the platform?
The majority of recently formed companies are focused on narrow and composable concerns of data management. What do you see as the shortcomings of that approach?
What are some of the tradeoffs between integrating independent tools vs buying into an ecosystem?
How is the Rivery platform designed and implemented?
How have the design and goals of the platform changed or evolved since you began working on it?
What were your criteria for the MVP that would allow you to test your hypothesis?
How has the evolution of the ecosystem influenced your product strategy?
One of the interesting features that you offer is the catalog of "kits" to quickly set up common workflows. How do you manage regression/integration testing for those kits as the Rivery platform evolves?
What are the most interesting, innovative, or unexpected ways that you have seen Rivery used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rivery?
When is Rivery the wrong choice?
What do you have planned for the future of Rivery?
Contact Info
LinkedIn
@ItamarBenHemo on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Rivery
Matillion
BigQuery
Snowflake
Podcast Episode
dbt
Podcast Episode
Fivetran
Podcast Episode
Snowpark
Postman
Debezium
Podcast Episode
Snowflake Partner Connect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 10, 2022 • 49min
Synthetic Data As A Service For Simplifying Privacy Engineering With Gretel
Summary
Any time that you are storing data about people there are a number of privacy and security considerations that come with it. Privacy engineering is a growing field in data management that focuses on how to protect attributes of personal data so that the containing datasets can be shared safely. In this episode Gretel co-founder and CTO John Myers explains how they are building tools for data engineers and analysts to incorporate privacy engineering techniques into their workflows and validate the safety of their data against re-identification attacks.
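Gretel’s own APIs are not shown here, but as a point of contrast with the learned synthetic data discussed in the episode, the following is a minimal sketch of rule-based fake data generation using the Faker library (linked below); the fields are invented for illustration.

```python
# Rule-based fake data with Faker: plausible values, but none of the
# statistical structure of a real dataset is preserved. This contrasts
# with learned synthetic data, which aims to keep that structure while
# protecting against re-identification. The fields are illustrative.
from faker import Faker

Faker.seed(0)  # deterministic output for reproducibility
fake = Faker()

records = [
    {"name": fake.name(), "email": fake.email(), "ssn": fake.ssn()}
    for _ in range(3)
]
for row in records:
    print(row)
```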
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing John Myers about privacy engineering and use cases for synthetic data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Gretel is and the story behind it?
How do you define "privacy engineering"?
In an organization or data team, who is typically responsible for privacy engineering?
How would you characterize the current state of the art and adoption for privacy engineering?
Who are the target users of Gretel and how does that inform the features and design of the product?
What are the stages of the data lifecycle where Gretel is used?
Can you describe a typical workflow for integrating Gretel into data pipelines for business analytics or ML model training?
How is the Gretel platform implemented?
How have the design and goals of the system changed or evolved since you started working on it?
What are some of the nuances of synthetic data generation or masking that data engineers/data analysts need to be aware of as they start using Gretel?
What are the most interesting, innovative, or unexpected ways that you have seen Gretel used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gretel?
When is Gretel the wrong choice?
What do you have planned for the future of Gretel?
Contact Info
LinkedIn
@jtm_tech on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Gretel
Privacy Engineering
Weights and Biases
Red Team/Blue Team
Generative Adversarial Network
Capture The Flag in application security
CVE == Common Vulnerabilities and Exposures
Machine Learning Cold Start Problem
Faker
Mockaroo
Kaggle
Sentry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 3, 2022 • 43min
Accelerate Development Of Enterprise Analytics With The Coalesce Visual Workflow Builder
Summary
The flexibility of software-oriented data workflows is useful for fulfilling complex requirements, but for simple and repetitious use cases it adds significant complexity. Coalesce is a platform designed to reduce repetitive work for common workflows by adopting a visual pipeline builder to support your data warehouse transformations. In this episode Satish Jayanthi explains how he is building a framework that allows enterprises to move quickly while maintaining guardrails for data workflows, so that everyone in the business can participate in data analysis in a sustainable manner.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting, often taking hours to days or even weeks. By the time errors have made their way into production, it’s often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Satish Jayanthi about how organizations can use data architectural patterns to stay competitive in today’s data-rich environment
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Coalesce and the story behind it?
What are the core problems that you are focused on solving with Coalesce?
The platform appears to be fairly opinionated in the workflow. What are the design principles and philosophies that you have embedded into the user experience?
Can you describe how Coalesce is implemented?
What are the pitfalls in data architecture patterns that you commonly see organizations fall prey to?
How do the pre-built transformation templates in Coalesce help to guide users in a more maintainable direction?
The platform is currently tied to Snowflake as the underlying engine. How much effort will it be to expand your integrations and the scope of Coalesce?
What are the most interesting, innovative, or unexpected ways that you have seen Coalesce used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Coalesce?
When is Coalesce the wrong choice?
What do you have planned for the future of Coalesce?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Coalesce
Data Warehouse Toolkit
Wherescape
dbt
Podcast Episode
Type 2 Dimensions
Firebase
Kubernetes
Star Schema
Data Vault
Podcast Episode
Data Mesh
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 3, 2022 • 47min
Repeatable Patterns For Designing Data Platforms And When To Customize Them
Summary
Building a data platform for your organization is a challenging undertaking. Building multiple data platforms for other organizations as a service without burning out is another thing entirely. In this episode Brandon Beidel from Red Ventures shares his experiences as a data product manager in charge of helping his customers build scalable analytics systems that fit their needs. He explains the common patterns that have been useful across multiple use cases, as well as when and how to build customized solutions.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl.
Hey Data Engineering Podcast listeners, want to learn how the Joybird data team reduced their time spent building new integrations and managing data pipelines by 93%? Join our live webinar on April 20th. Joybird director of analytics, Brett Trani, will walk through how retooling their data stack with RudderStack, Snowflake, and Iterable made this possible. Visit www.rudderstack.com/joybird to register today.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Brandon Beidel about his data platform journey at Red Ventures
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Red Ventures is and your role there?
Given the relative newness of data product management, where do you draw inspiration and direction for how to approach your work?
What are the primary categories of data product that your data consumers are building/relying on?
What are the types of data sources that you are working with to power those downstream use cases?
Can you describe the size and composition/organization of your data team(s)?
How do you approach the build vs. buy decision while designing and evolving your data platform?
What are the tools/platforms/architectural and usage patterns that you and your team have developed for your platform?
What are the primary goals and constraints that have contributed to your decisions?
How have the goals and design of the platform changed or evolved since you started working with the team?
You recently went through the process of establishing and reporting on SLAs for your data products. Can you describe the approach you took and the useful lessons that were learned? (A minimal freshness-check sketch follows this list.)
What are the technical and organizational components of the data work at Red Ventures that have proven most difficult?
What excites you most about the future of data engineering?
What are the most interesting, innovative, or unexpected ways that you have seen teams building more reliable data systems?
What aspects of data tooling or processes are still missing for most data teams?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data products at Red Ventures?
What do you have planned for the future of your data platform?
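As promised above, the following is a minimal, hypothetical sketch of the kind of freshness SLA check discussed in the interview: compare a table’s most recent update against an agreed threshold. sqlite3 stands in for the warehouse, and the table, column, and six-hour SLA are assumptions rather than Red Ventures’ actual implementation.

```python
# Freshness SLA sketch: alert when a table's newest record is older than the
# agreed threshold. The warehouse, schema, and six-hour SLA are hypothetical.
import sqlite3
from datetime import datetime, timedelta, timezone

SLA = timedelta(hours=6)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.execute(
    "INSERT INTO orders VALUES (1, ?)",
    (datetime.now(timezone.utc).isoformat(),),
)

(latest,) = conn.execute("SELECT MAX(updated_at) FROM orders").fetchone()
lag = datetime.now(timezone.utc) - datetime.fromisoformat(latest)
print(f"freshness lag: {lag}, SLA met: {lag <= SLA}")
```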
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Red Ventures
Monte Carlo
Opportunity Cost
dbt
Podcast Episode
Apache Ranger
Privacera
Podcast Episode
Segment
Fivetran
Podcast Episode
Databricks
Bigquery
Redshift
Hightouch
Podcast Episode
Airflow
Astronomer
Podcast Episode
Airbyte
Podcast Episode
Clickhouse
Podcast Episode
Presto
Podcast Episode
Trino
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast


