

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

May 9, 2022 • 1h 1min
Exploring The Insights And Impact Of Dan Delorey's Distinguished Career In Data
Summary
Dan Delorey helped to build the core technologies of Google’s cloud data services for many years before embarking on his latest adventure as the VP of Data at SoFi. From being an early engineer on the Dremel project, to helping launch and manage BigQuery, to helping enterprises adopt Google’s data products, he learned all of the critical details of how to run services used by data platform teams. Now he is a consumer of many of the tools that his work inspired. In this episode he takes a trip down memory lane to weave an interesting and informative narrative about the broader themes throughout his work and their echoes in the modern data ecosystem.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Dan Delorey about his journey through the data ecosystem as the current head of data at SoFi, prior engineering leader with the BigQuery team, and early engineer on Dremel
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing what your current relationship to the data ecosystem is and the CliffsNotes version of how you ended up there?
Dremel was a ground-breaking technology at the time. What do you see as its lasting impression on the landscape of data both in and outside of Google?
You were instrumental in crafting the vision behind "querying data in place" (what was then called federated data) at Dremel and BigQuery. What do you mean by this? How has this approach evolved? What are some challenges with this approach? (See the sketch after this question list.)
How well did the Drill project capture the core principles of Dremel as outlined in the eponymous white paper?
Following your work on Drill you were involved with the development and growth of BigQuery and the broader suite of Google Cloud’s data platform. What do you see as the influence that those tools had on the evolution of the broader data ecosystem?
How have your experiences at Google influenced your approach to platform and organizational design at SoFi?
What’s in SoFi’s data stack? How do you decide what technologies to buy vs. build in-house?
How does your team solve for data quality and governance?
What are the dominating factors that you consider when deciding on project/product priorities for your team?
When you’re not building industry-defining data tooling or leading data strategy, you spend time thinking about the ethics of data. Can you elaborate a bit about your research and interest there?
You also have some ideas about data marketplaces, which is a hot topic these days with companies like Snowflake and Databricks breaking into this economy. What’s your take on the evolution of this space?
What are the most interesting, innovative, or unexpected data systems that you have encountered?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building and supporting data systems?
What are the areas that you are paying the most attention to?
What interesting predictions do you have for the future of data systems and their applications?
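To make the "querying data in place" idea concrete, here is a minimal sketch that runs an aggregate over a BigQuery external table, i.e. data that stays in Cloud Storage rather than being loaded into managed storage first. The project, dataset, and table names are hypothetical placeholders, not details from the episode:

```python
# Sketch: querying data "in place" through a BigQuery external table.
# Assumes `my_project.my_dataset.events_external` was already defined over
# files sitting in Cloud Storage (all names here are hypothetical).
from google.cloud import bigquery

client = bigquery.Client(project="my_project")  # uses default credentials

query = """
    SELECT user_id, COUNT(*) AS event_count
    FROM `my_project.my_dataset.events_external`
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
"""

# The engine reads the files where they live; nothing is copied in beforehand.
for row in client.query(query).result():
    print(row.user_id, row.event_count)
```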
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
SoFi
BigQuery
Dremel
Brigham Young University
Empirical Software Engineering
Map/Reduce
Hadoop
Sawzall
VLDB Test Of Time Award Paper
GFS
Colossus
Partitioned Hash Join
Google BigTable
HBase
AWS Athena
Snowflake
Podcast Episode
Data Vault
Star Schema
Privacy Vault
Homomorphic Encryption
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

May 9, 2022 • 40min
Scaling Analysis of Connected Data And Modeling Complex Relationships With The TigerGraph Graph Database
Summary
Many of the events, ideas, and objects that we try to represent through data have a high degree of connectivity in the real world. These connections are best represented and analyzed as graphs to provide efficient and accurate analysis of their relationships. TigerGraph is a leading database that offers a highly scalable and performant native graph engine for powering graph analytics and machine learning. In this episode Jon Herke shares how TigerGraph customers are taking advantage of those capabilities to achieve meaningful discoveries in their fields, the utilities that it provides for modeling and managing your connected data, and some of his own experiences working with the platform before joining the company.
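For a flavor of what interacting with TigerGraph looks like from Python, here is a minimal sketch that uses the pyTigerGraph client to call a query previously written in GSQL and installed on the server. The connection details, graph name, query name, and parameters are all hypothetical placeholders rather than anything discussed in the episode:

```python
# Sketch: running an installed GSQL query with the pyTigerGraph client.
# Every connection detail and name below is a hypothetical placeholder.
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(
    host="https://my-instance.i.tgcloud.io",
    graphname="Social",
    username="tigergraph",
    password="password",
)
conn.getToken(conn.createSecret())  # authenticate the REST++ endpoints

# Invoke a query that was authored in GSQL and installed on the graph,
# e.g. one that traverses friend edges out from a given person vertex.
results = conn.runInstalledQuery("friends_of_friends", params={"person": "Alice"})
print(results)
```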
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Your host is Tobias Macey and today I’m interviewing Jon Herke about TigerGraph, a distributed native graph database
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what TigerGraph is and the story behind it?
What are some of the core use cases that you are focused on supporting?
How has TigerGraph changed over the past 4 years since I spoke with Todd Blaschka at the Open Data Science Conference?
How has the ecosystem of graph databases changed in usage and design in recent years?
What are some of the persistent areas of confusion or misinformation that you encounter when explaining graph databases and TigerGraph to potential users?
The tagline on your website says that TigerGraph is "The Only Scalable Graph Database for the Enterprise". Can you unpack that claim and explain what is necessary for a graph database to be suitable for enterprise use?
What are some of the application and system architectures that you typically see for end-users of TigerGraph? (e.g. polyglot persistence, etc.)
What are the cases where TigerGraph should be the system of record as opposed to an optimization option for addressing highly connected data?
What are the data modeling considerations that end-users should be thinking of when planning their storage structures in TigerGraph?
What are the most interesting, innovative, or unexpected ways that you have seen TigerGraph used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on TigerGraph?
When is TigerGraph the wrong choice?
What do you have planned for the future of TigerGraph?
Contact Info
LinkedIn
@jonherke on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
TigerGraph
GraphQL
Kafka
GQL (Graph Query Language)
LDBC (Linked Data Benchmark Council)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

May 2, 2022 • 53min
Leading The Charge For The ELT Data Integration Pattern For Cloud Data Warehouses At Matillion
Summary
The predominant pattern for data integration in the cloud has become extract, load, and then transform, or ELT. Matillion was an early innovator of that approach, and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration.
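To make the ELT pattern itself concrete, here is a minimal sketch, independent of Matillion's implementation, that lands raw records in a database first and only then transforms them with SQL inside the engine. SQLite stands in for a cloud warehouse and the schema is made up for illustration:

```python
# Sketch of ELT: load the raw data first, transform inside the engine after.
# SQLite stands in for a cloud data warehouse; the schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the records exactly as they arrive, no cleanup yet.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "10.50", "us"), (2, "3.25", "US"), (3, "7.00", "ca")],
)

# Transform: casting, normalization, and aggregation happen in SQL, inside
# the warehouse, after loading -- the "T" comes last in ELT.
cursor = conn.execute(
    """
    SELECT UPPER(country) AS country, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_orders
    GROUP BY UPPER(country)
    """
)
for country, revenue in cursor:
    print(country, revenue)
```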
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Your host is Tobias Macey and today I’m interviewing Ed Thompson about Matillion, a cloud-native data integration platform for accelerating your time to analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Matillion is and the story behind it?
What are the use cases and user personas that you are focused on supporting?
How does that influence the focus and pace of your feature development and priorities?
How is Matillion architected?
How have the design and goals of the system changed since you started working on it?
The ecosystems of both cloud technologies and data processing have been rapidly growing and evolving, with new patterns and paradigms being introduced. What are the elements of your product focus and messaging that you have had to update and what are the core principles that have stayed the same?
What have been the most challenging integrations to build and support?
What is a typical workflow for integrating Matillion into an organization and building a set of pipelines?
What are some of the patterns that have been useful for managing incidental complexity as usage scales?
What are the most interesting, innovative, or unexpected ways that you have seen Matillion used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Matillion?
When is Matillion the wrong choice?
What do you have planned for the future of Matillion?
Contact Info
LinkedIn
Matillion Contact
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
Matillion
Twitter
IBM DB2
Cognos
Talend
Redshift
AWS Marketplace
AWS Re:Invent
Azure
GCP == Google Cloud Platform
Informatica
SSIS == SQL Server Integration Services
PCRE == Perl Compatible Regular Expressions
Teradata
Tomcat
Collibra
Alation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

May 2, 2022 • 1h 4min
Evolving And Scaling The Data Platform at Yotpo
Summary
Building a data platform is an iterative and evolutionary process that requires collaboration with internal stakeholders to ensure that their needs are being met. Yotpo has been on a journey to evolve and scale their data platform to continue serving the needs of their organization as it increases the scale and sophistication of data usage. In this episode Doron Porat and Liran Yogev explain how they arrived at their current architecture, the capabilities that they are optimizing for, and the complex process of identifying and evaluating new components to integrate into their systems. This is an excellent exploration of the decisions and tradeoffs that need to be made while building such a complex system.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Doron Porat and Liran Yogev about their experiences designing and implementing a self-serve data platform at Yotpo
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Yotpo is and the role that data plays in the organization?
What are the core data types and sources that you are working with?
What kinds of data assets are being produced and how do those get consumed and re-integrated into the business?
What are the user personas that you are supporting and what are the interfaces that they are comfortable interacting with?
What is the size of your team and how is it structured?
You recently posted about the current architecture of your data platform. What was the starting point on your platform journey?
What did the early stages of feature and platform evolution look like?
What was the catalyst for making a concerted effort to integrate your systems into a cohesive platform?
What was the scope and directive of the project for building a platform?
What are the metrics and capabilities that you are optimizing for in the structure of your data platform?
What are the organizational or regulatory constraints that you needed to account for?
What are some of the early decisions that affected your available choices in later stages of the project?
What does the current state of your architecture look like?
How long did it take to get to where you are today?
What were the factors that you considered in the various build vs. buy decisions?
How did you manage cost modeling to understand the true savings on either side of that decision?
If you were to start from scratch on a new data platform today what might you do differently?
What are the decisions that proved helpful in the later stages of your platform development?
What are the most interesting, innovative, or unexpected ways that you have seen your platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing your platform?
What do you have planned for the future of your platform infrastructure?
Contact Info
Doron
LinkedIn
Liran
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
Yotpo
Data Platform Architecture Blog Post
Greenplum
Databricks
Metorikku
Apache Hive
CDC == Change Data Capture
Debezium
Podcast Episode
Apache Hudi
Podcast Episode
Upsolver
Podcast Episode
Spark
PrestoDB
Snowflake
Podcast Episode
Druid
Rockset
Podcast Episode
dbt
Podcast Episode
Acryl
Podcast Episode
Atlan
Podcast Episode
OpenLineage
Podcast Episode
Okera
Shopify Data Warehouse Episode
Redshift
Delta Lake
Podcast Episode
Iceberg
Podcast Episode
Outbox Pattern
Backstage
Roadie
Nomad
Kubernetes
Deequ
Great Expectations
Podcast Episode
LakeFS
Podcast Episode
2021 Recap Episode
Monte Carlo
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 24, 2022 • 1h 11min
Operational Analytics At Speed With Minimal Busy Work Using Incorta
Summary
A huge amount of effort goes into modeling and shaping data to make it available for analytical purposes. This is often due to the need to simplify the final queries so that they are performant for visualization or limited exploration. In order to cut down the level of effort involved in making data usable, Matthew Halliday and his co-founders created Incorta as an end-to-end, in-memory analytical engine that removes barriers to insights on your data. In this episode he explains how the system works, the use cases that it empowers, and how you can start using it for your own analytics today.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Your host is Tobias Macey and today I’m interviewing Matthew Halliday about Incorta, an in-memory, unified data and analytics platform as a service
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Incorta is and the story behind it?
What are the use cases and customers that you are focused on?
How does that focus inform the design and priorities of functionality in the product?
What are the technologies and workflows that Incorta might replace?
What are the systems and services that it is intended to integrate with and extend?
Can you describe how Incorta is implemented?
What are the core technological decisions that were necessary to make the product successful?
How have the design and goals of the system changed and evolved since you started working on it?
Can you describe the workflow for building an end-to-end analysis using Incorta?
What are some of the new capabilities or use cases that Incorta enables which are impractical or intractable with other combinations of tools in the ecosystem?
How do the features of Incorta influence the approach that teams take for data modeling?
What are the points of collaboration and overlap between organizational roles while using Incorta?
What are the most interesting, innovative, or unexpected ways that you have seen Incorta used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Incorta?
When is Incorta the wrong choice?
What do you have planned for the future of Incorta?
Contact Info
LinkedIn
@layereddelay on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
Incorta
3rd Normal Form
Parquet
Podcast Episode
Delta Lake
Podcast Episode
Iceberg
Podcast Episode
PrestoDB
PySpark
Dataiku
Angular
React
Apache ECharts
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 24, 2022 • 59min
Gain Visibility Into Your Entire Machine Learning System Using Data Logging With WhyLogs
Summary
There are very few tools which are equally useful for data engineers, data scientists, and machine learning engineers. whylogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data, from source to productionized model. In this episode Andy Dang explains why the project was created, how you can apply it to your existing data systems, and how it provides detailed context so that you can gain insight into all of your data processes.
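As a taste of what data logging looks like in practice, here is a minimal sketch that profiles a pandas DataFrame with the whylogs Python library (v1-style API); the dataset and column names are made up for illustration:

```python
# Sketch: profiling a DataFrame with whylogs (v1-style API).
# The dataset and its column names are hypothetical.
import pandas as pd
import whylogs as why

df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "purchase_amount": [10.5, 3.25, 7.0, 42.0],
        "country": ["US", "US", "CA", "MX"],
    }
)

# Logging produces a compact statistical profile (counts, types,
# distributions) instead of row-level records -- the core idea of
# data logging as opposed to debug logging.
results = why.log(df)
profile_view = results.view()

# Inspect the per-column summary metrics as a DataFrame.
print(profile_view.to_pandas())
```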
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Andy Dang about powering observability of AI systems with the whylogs data logging library
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Whylabs is and the story behind it?
How is "data logging" differentiated from logging for the purpose of debugging and observability of software logic?
What are the use cases that you are aiming to support with Whylogs?
How does it compare to libraries and services like Great Expectations, Monte Carlo, Soda Data, Datafold, etc.?
Can you describe how Whylogs is implemented?
How have the design and goals of the project changed or evolved since you started working on it?
How do you maintain feature parity between the Python and Java integrations?
How do you structure the log events and metadata to provide detail and context for data applications?
How does that structure support aggregation and interpretation/analysis of the log information?
What is the process for integrating Whylogs into an existing project?
Once you have the code instrumented with log events, what is the workflow for using Whylogs to debug and maintain a data application?
What have you found to be useful heuristics for identifying what to log?
What are some of the strategies that teams can use to maintain a balance of signal vs. noise in the events that they are logging?
How is the Whylogs governance set up and how are you approaching sustainability of the open source project?
What are the additional utilities and services that you anticipate layering on top of/integrating with Whylogs?
What are the most interesting, innovative, or unexpected ways that you have seen Whylogs used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Whylabs?
When is Whylogs/Whylabs the wrong choice?
What do you have planned for the future of Whylabs?
Contact Info
LinkedIn
@andy_dng on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
Whylogs
Whylabs
Spark
Airflow
Pandas
Podcast Episode
Data Sketches
Grafana
Great Expectations
Podcast Episode
Monte Carlo
Podcast Episode
Soda Data
Podcast Episode
Datafold
Podcast Episode
Delta Lake
Podcast Episode
HyperLogLog
MLFlow
Flyte
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 18, 2022 • 40min
Connecting To The Next Frontier Of Computing With Quantum Networks
Summary
The next paradigm shift in computing is coming in the form of quantum technologies. Quantum processors have gained significant attention for their speed and computational power. The next frontier is quantum networking, for highly secure communications and the ability to distribute workloads across quantum processing units without costly translation between quantum and classical systems. In this episode Prineha Narang, co-founder and CTO of Aliro, explains how these systems work, the capabilities that they can offer, and how you can start preparing for a post-quantum future for your data systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Your host is Tobias Macey and today I’m interviewing Dr. Prineha Narang about her work at Aliro building quantum networking technologies and how it impacts the capabilities of data systems
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Aliro is and the story behind it?
What are the use cases that you are focused on?
What is the impact of quantum networks on distributed systems design? (what limitations does it remove?)
What are the failure modes of quantum networks?
How do they differ from classical networks?
How can network technologies bridge between classical and quantum connections and where do those transitions happen?
What are the latency/bandwidth capacities of quantum networks?
How does it influence the network protocols used during those communications?
How much error correction is necessary during the quantum communication stages of network transfers?
How does quantum computing technology change the landscape for AI technologies?
How does that impact the work of data engineers who are building the systems that power the data feeds for those models?
What are the most interesting, innovative, or unexpected ways that you have seen quantum technologies used for data systems?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aliro and your academic research?
When are quantum technologies the wrong choice?
What do you have planned for the future of Aliro and your research efforts?
Contact Info
LinkedIn
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Aliro Quantum
Harvard University
CalTech
Quantum Computing
Quantum Repeater
ARPANet
Trapped Ion Quantum Computer
Photonic Computing
SDN == Software Defined Networking
QPU == Quantum Processing Unit
IEEE
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 16, 2022 • 1h 16min
What Does It Really Mean To Do MLOps And What Is The Data Engineer's Role?
Summary
Putting machine learning models into production and keeping them there requires investing in well-managed systems that handle the full lifecycle of data cleaning, training, deployment, and monitoring, along with a repeatable and evolvable set of processes to keep everything functional. The term MLOps has been coined to encapsulate all of these principles, and the broader data community is working to establish a set of best practices and useful guidelines for streamlining adoption. In this episode Demetrios Brinkmann and David Aponte share their perspectives on this rapidly changing space and what they have learned from their work building the MLOps community through blog posts, podcasts, and discussion forums.
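As one small, concrete slice of the lifecycle described above, here is a minimal experiment-tracking sketch using MLflow; it is an illustrative stand-in rather than a tool endorsed in the episode, and the parameter and metric names are hypothetical:

```python
# Sketch: tracking a training run with MLflow, one narrow slice of MLOps.
# MLflow is an illustrative stand-in; names and values are hypothetical.
import mlflow

with mlflow.start_run(run_name="baseline-model"):
    # Record the configuration that produced this model...
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)

    # ...train and evaluate the model here (elided)...

    # ...then record how it performed, so the run stays reproducible
    # and comparable against future runs.
    mlflow.log_metric("validation_auc", 0.87)
```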
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing Demetrios Brinkmann and David Aponte about what you need to know about MLOps as a data engineer
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what MLOps is?
How does it relate to DataOps? DevOps? (is it just another buzzword?)
What is your interest and involvement in the space of MLOps?
What are the open and active questions in the MLOps community?
Who is responsible for MLOps in an organization?
What is the role of the data engineer in that process?
What are the core capabilities that are necessary to support an "MLOps" workflow?
How do the current platform technologies support the adoption of MLOps workflows?
What are the areas that are currently underdeveloped/underserved?
Can you describe the technical and organizational design/architecture decisions that need to be made when endeavoring to adopt MLOps practices?
What are some of the common requirements for supporting ML workflows?
What are some of the ways that requirements become bespoke to a given organization or project?
What are the opportunities for standardization or consolidation in the tooling for MLOps?
What are the pieces that are always going to require custom engineering?
What are the most interesting, innovative, or unexpected approaches to MLOps workflows/platforms that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on supporting the MLOps community?
What are your predictions for the future of MLOps?
What are you keeping a close eye on?
Contact Info
Demetrios
LinkedIn
@Dpbrinkm on Twitter
Medium
David
LinkedIn
@aponteanalytics on Twitter
aponte411 on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
MLOps Community
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz (affiliate link)
MLOps
DataOps
DevOps
The Sequence Newsletter
Neptune.ai
Algorithmia
Kubeflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 11, 2022 • 58min
DataOps As A Service For Your Data Integration Workflows With Rivery
Summary
Data engineering is a practice that is multi-faceted and requires integration with a large number of systems. This often means working across multiple tools to get the job done, which can introduce a significant productivity cost due to the number of context switches. Rivery is a platform designed to reduce this incidental complexity and provide a single system for working across the different stages of the data lifecycle. In this episode CEO and founder Itamar Ben Hemo explains how his experiences in the industry led to his vision for the Rivery platform as a single place to build end-to-end analytical workflows, including how it is architected and how you can start using it today for your own work.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying: you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Itamar Ben Hemo about Rivery, a SaaS platform designed to provide an end-to-end solution for Ingestion, Transformation, Orchestration, and Data Operations
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Rivery is and the story behind it?
What are the primary goals of Rivery as a platform and company?
What are the target personas for the Rivery platform?
What are the points of interaction/workflows for each of those personas?
What are some of the positive and negative sources of inspiration that you looked to while deciding on the scope of the platform?
The majority of recently formed companies are focused on narrow and composable concerns of data management. What do you see as the shortcomings of that approach?
What are some of the tradeoffs between integrating independent tools vs buying into an ecosystem?
How is the Rivery platform designed and implemented?
How have the design and goals of the platform changed or evolved since you began working on it?
What were your criteria for the MVP that would allow you to test your hypothesis?
How has the evolution of the ecosystem influenced your product strategy?
One of the interesting features that you offer is the catalog of "kits" to quickly set up common workflows. How do you manage regression/integration testing for those kits as the Rivery platform evolves?
What are the most interesting, innovative, or unexpected ways that you have seen Rivery used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rivery?
When is Rivery the wrong choice?
What do you have planned for the future of Rivery?
Contact Info
LinkedIn
@ItamarBenHemo on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
Rivery
Matillion
BigQuery
Snowflake
Podcast Episode
dbt
Podcast Episode
Fivetran
Podcast Episode
Snowpark
Postman
Debezium
Podcast Episode
Snowflake Partner Connect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 10, 2022 • 49min
Synthetic Data As A Service For Simplifying Privacy Engineering With Gretel
Summary
Any time that you are storing data about people there are a number of privacy and security considerations that come with it. Privacy engineering is a growing field in data management that focuses on how to protect attributes of personal data so that the containing datasets can be shared safely. In this episode Gretel co-founder and CTO John Myers explains how they are building tools for data engineers and analysts to incorporate privacy engineering techniques into their workflows and validate the safety of their data against re-identification attacks.
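For contrast with the model-based synthesis discussed in the episode, here is a minimal sketch of rule-based fake data using Faker (one of the tools linked below). Unlike ML-driven synthetic data, it preserves no statistical properties of any real dataset, which is why it is mainly suited to simple test fixtures:

```python
# Sketch: rule-based fake records with Faker. Unlike ML-driven synthetic
# data, this preserves no statistical relationships from a real dataset.
from faker import Faker

fake = Faker()
Faker.seed(42)  # make the output reproducible

records = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        "ssn": fake.ssn(),
    }
    for _ in range(3)
]

for record in records:
    print(record)
```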
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing John Myers about privacy engineering and use cases for synthetic data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Gretel is and the story behind it?
How do you define "privacy engineering"?
In an organization or data team, who is typically responsible for privacy engineering?
How would you characterize the current state of the art and adoption for privacy engineering?
Who are the target users of Gretel and how does that inform the features and design of the product?
What are the stages of the data lifecycle where Gretel is used?
Can you describe a typical workflow for integrating Gretel into data pipelines for business analytics or ML model training?
How is the Gretel platform implemented?
How have the design and goals of the system changed or evolved since you started working on it?
What are some of the nuances of synthetic data generation or masking that data engineers/data analysts need to be aware of as they start using Gretel?
What are the most interesting, innovative, or unexpected ways that you have seen Gretel used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gretel?
When is Gretel the wrong choice?
What do you have planned for the future of Gretel?
Contact Info
LinkedIn
@jtm_tech on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
Gretel
Privacy Engineering
Weights and Biases
Red Team/Blue Team
Generative Adversarial Network
Capture The Flag in application security
CVE == Common Vulnerabilities and Exposures
Machine Learning Cold Start Problem
Faker
Mockaroo
Kaggle
Sentry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast