

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

Feb 28, 2022 • 55min
Manage Your Unstructured Data Assets Across Cloud And Hybrid Environments With Komprise
Summary
There is a wealth of options for managing structured and textual data, but unstructured binary data assets are not as well supported across the ecosystem. As organizations start to adopt cloud technologies, they need a way to manage the distribution, discovery, and collaboration of data across their operating environments. To help solve this complicated challenge, Krishna Subramanian and her co-founders at Komprise built a system that allows you to treat, use, and secure your data wherever it lives, and track copies across environments without requiring manual intervention. In this episode she explains the difficulties that everyone faces as they scale beyond a single operating environment, and how the Komprise platform reduces the burden of managing large and heterogeneous collections of unstructured files.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today's episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Krishna Subramanian about her work at Komprise to generate value from unstructured file and object data across storage formats and locations
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Komprise is and the story behind it?
Who are the target customers of the Komprise platform?
What are the core use cases that you are focused on supporting?
How would you characterize the common approaches to managing file storage solutions for hybrid cloud environments?
What are some of the shortcomings of the enterprise storage providers’ methods for managing storage tiers when trying to use that data for analytical workloads?
Given the growth in popularity and capabilities of cloud solutions, how have you approached the strategic positioning of your product to capitalize on the market?
Can you describe how the Komprise platform is architected?
What are some of the most complex considerations that you have had to engineer for when dealing with enterprise data distribution in hybrid cloud environments?
What are the data replication and consistency guarantees that you are able to offer while spanning across on-premise and cloud systems/block and object storage? (e.g. eventual consistency vs. read-after-write, low latency replication on data changes vs. scheduled syncing, etc.)
How do you determine and validate the heuristics that you use for understanding how/when to distribute files across storage systems?
How does the specific workload that you are powering influence the specific operations/capabilities that your customers take advantage of?
What are the most interesting, innovative, or unexpected ways that you have seen Komprise used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Komprise?
When is Komprise the wrong choice?
What do you have planned for the future of Komprise?
Contact Info
LinkedIn
@cloudKrishna on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Komprise
Unstruk
Podcast Episode
SMB
NFS
S3
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 28, 2022 • 40min
Reflections On Designing A Data Platform From Scratch
Summary
Building a data platform is a complex journey that requires a significant amount of planning to do well. It requires knowledge of the available technologies, the requirements of the operating environment, and the expectations of the stakeholders. In this episode Tobias Macey, the host of the show, reflects on his plans for building a data platform and how what he has learned from running the podcast is influencing his choices.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
I’m your host, Tobias Macey, and today I’m sharing the approach that I’m taking while designing a data platform
Interview
Introduction
How did you get involved in the area of data management?
What are the components that need to be considered when designing a solution?
Data integration (extract and load)
What are your data sources?
Batch or streaming (acceptable latencies)
Data storage (lake or warehouse)
How is the data going to be used?
What other tools/systems will need to integrate with it?
The warehouse (Bigquery, Snowflake, Redshift) has become the focal point of the "modern data stack"
Data orchestration
Who will be managing the workflow logic?
Metadata repository
Types of metadata (catalog, lineage, access, queries, etc.)
Semantic layer/reporting
Data applications
Implementation phases
Build a single end-to-end workflow of a data application using a single category of data across sources
Validate the ability for an analyst/data scientist to self-serve a notebook powered analysis
Iterate
Risks/unknowns
Data modeling requirements
Specific implementation details as integrations across components are built
When to use a vendor and risk lock-in vs. spend engineering time
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Presto
Podcast Episode
Trino
Podcast Episode
Dagster
Podcast Episode
Prefect
Podcast Episode
Dremio
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 21, 2022 • 43min
Understanding The Immune System With Data At ImmunAI
Summary
The life sciences industry has seen incredible growth in scale and sophistication, along with advances in data technology that make it possible to analyze massive amounts of genomic information. In this episode Guy Yachdav, director of software engineering for ImmunAI, shares the complexities inherent in managing data workflows for bioinformatics. He also explains how he has architected the systems that ingest, process, and distribute the data that he is responsible for, and the requirements that are introduced when collaborating with researchers, domain experts, and machine learning developers.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Today's episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
Your host is Tobias Macey and today I’m interviewing Guy Yachdav, Director of Software Engineering at Immunai, about his work at Immunai to wrangle biological data for advancing research into the human immune system.
Interview
Introduction (see Guy’s bio below)
How did you get involved in the area of data management?
Can you describe what Immunai is and the story behind it?
What are some of the categories of information that you are working with?
What kinds of insights are you trying to power/questions that you are trying to answer with that data?
Who are the stakeholders that you are working with and how does that influence your approach to the integration/transformation/presentation of the data?
What are some of the challenges unique to the biological data domain that you have had to address?
What are some of the limitations in the off-the-shelf tools when applied to biological data?
How have you approached the selection of tools/techniques/technologies to make your work maintainable for your engineers and accessible for your end users?
Can you describe the platform architecture that you are using to support your stakeholders?
What are some of the constraints or requirements (e.g. regulatory, security, etc.) that you need to account for in the design?
What are some of the ways that you make your data accessible to AI/ML engineers?
What are the most interesting, innovative, or unexpected ways that you have seen Immunai used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working at Immunai?
What do you have planned for the future of the Immunai data platform?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
ImmunAI
Apache Arrow
Columbia Genome Center
Dagster
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 21, 2022 • 1h 1min
Build Your Python Data Processing Your Way And Run It Anywhere With Fugue
Summary
Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning, to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.
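To make the idea above concrete, here is a minimal sketch of the kind of engine-agnostic workflow Fugue enables. It assumes Fugue's top-level transform() function; the DataFrame, column names, and cleaning logic are hypothetical, and the commented-out line illustrates how the same function could be dispatched to a distributed engine (the exact engine argument, such as a SparkSession, depends on your Fugue version and cluster setup).

    import pandas as pd
    from fugue import transform

    def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
        # Plain pandas logic, written and tested once on a small sample
        return df.fillna({"value": 0.0})

    sample = pd.DataFrame({"id": [1, 2, 3], "value": [3.0, None, 7.5]})

    # Run locally on pandas; schema="*" keeps the input schema unchanged
    local_result = transform(sample, fill_missing, schema="*")

    # The same call can target a distributed engine without rewriting fill_missing,
    # e.g. transform(sample, fill_missing, schema="*", engine=spark_session)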
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Every data project starts with collecting the information that will provide answers to your questions or inputs to your models. The web is the largest trove of information on the planet and Oxylabs helps you unlock its potential. With the Oxylabs scraper APIs you can extract data from even javascript heavy websites. Combined with their residential proxies you can be sure that you’ll have reliable and high quality data whenever you need it. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies.
Your host is Tobias Macey and today I’m interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Fugue is and the story behind it?
What are the core goals of the Fugue project?
Who are the target users for Fugue and how does that influence the feature priorities and API design?
How does Fugue compare to projects such as Modin, etc. for abstracting over the execution engine?
What are some of the sharp edges that contribute to the engineering effort required to migrate from a single machine to Spark or Dask?
What are some of the determining factors that will influence the decision of whether to use Pandas, Spark, or Dask?
Can you describe how Fugue is implemented?
How have the design and goals of the project changed or evolved since you started working on it?
How do you ensure the consistency of logic across execution engines?
Can you describe the workflow of integrating Fugue into an existing or greenfield project?
How have you approached the work of automating logic optimization across execution contexts?
What are some of the risks or error conditions that you have to guard against?
How do you manage validation of those optimizations, particularly as the different engines release new versions or capabilities?
What are the most interesting, innovative, or unexpected ways that you have seen Fugue used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Fugue?
When is Fugue the wrong choice?
What do you have planned for the future of Fugue?
Contact Info
LinkedIn
Email
Fugue Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Fugue
Fugue Tutorials
Prefect
Podcast Episode
Bodo
Podcast Episode
Pandas
DuckDB
Koalas
Dask
Podcast Episode
Spark
Modin
Podcast.__init__ Episode
Fugue SQL
Flink
PyCaret
ANTLR
OmniSci
Ibis
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 14, 2022 • 1h 2min
Bring Your Code To Your Streaming And Static Data Without Effort With The Deephaven Real Time Query Engine
Summary
Streaming data sources are becoming more widely available as tools to handle their storage and distribution mature. However, it is still a challenge to analyze this data as it arrives, while supporting integration with static data in a unified syntax. Deephaven is a project that was designed from the ground up to offer an intuitive way for you to bring your code to your data, whether it is streaming or static, without having to know which is which. In this episode Pete Goddard, founder and CEO of Deephaven, shares his journey with the technology that powers the platform and how he and his team are pouring their energy into the community edition so that you can use it freely in your own work.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
StreamSets DataOps Platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end-to-end data integration platform that's built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, you get a single pane of glass for operating and monitoring all your data pipelines, with the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets' Professional Tier receive 2 months free after their first month.
Your host is Tobias Macey and today I’m interviewing Pete Goddard about his work at Deephaven, a query engine optimized for manipulating and merging streaming and static data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Deephaven is and the story behind it?
What is the role of Deephaven in the context of an organization’s data platform?
What are the upstream and downstream systems and teams that it is likely to be integrated with?
Who are the target users of Deephaven and how does that influence the feature priorities and design of the platform?
comparison of use cases/experience with Materialize
What are the different components that comprise the suite of functionality in Deephaven?
How have you architected the system?
What are some of the ways that the goals/design of the platform have changed or evolved since you started working on it?
What are some of the impedance mismatches that you have had to address between supporting different language environments and data access patterns? (e.g. batch/streaming/ML and Python/Java/R)
Can you describe some common workflows that a data engineer might build with Deephaven?
What are the avenues for collaboration across data roles and stakeholders?
licensing choice/governance model
What are the most interesting, innovative, or unexpected ways that you have seen Deephaven used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Deephaven?
When is Deephaven the wrong choice?
What do you have planned for the future of Deephaven?
Contact Info
@pete_paco on Twitter
@deephaven on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Deephaven
GitHub
Materialize
Podcast Episode
Arrow Flight
kSQLDB
Podcast Episode
Redpanda
Podcast Episode
Pandas
Podcast Episode
NumPy
Numba
Barrage
Debezium
Podcast Episode
JPy
Sabermetrics
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 14, 2022 • 48min
Build Your Own End To End Customer Data Platform With Rudderstack
Summary
Collecting, integrating, and activating data are all challenging activities. When that data pertains to your customers, it can become even more complex. To simplify the work of managing the full flow of your customer data and keep you in full control, the team at Rudderstack created their eponymous open source platform that allows you to work with first- and third-party data, as well as build and manage reverse ETL workflows. In this episode CEO and founder Soumyadeb Mitra explains how Rudderstack compares to the various other tools and platforms that share some overlap, how to set it up for your own data needs, and how it is architected to scale to meet demand.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today's episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Soumyadeb Mitra about his experience as the founder of Rudderstack and its role in your data platform
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Rudderstack is and the story behind it?
What are the main use cases that Rudderstack is designed to support?
Who are the target users of Rudderstack?
How does the availability of the managed cloud service change the user profiles that you can target?
How do these user profiles influence your focus and prioritization of features and user experience?
How would you characterize the position of Rudderstack in the current data ecosystem?
What other tools/systems might you replace with Rudderstack?
How do you think about the application of Rudderstack compared to tools for data integration (e.g. Singer, Stitch, Fivetran) and reverse ETL (e.g. Grouparoo, Hightouch, Census)?
Can you describe how the Rudderstack platform is designed and implemented?
How have the goals/design/use cases of Rudderstack changed or evolved since you first started working on it?
What are the different extension points available for engineers to extend and customize Rudderstack?
Working with customer data is a core capability in Rudderstack. How do you manage the identity resolution of users as they transition back and forth between anonymous and identified?
What are some of the data privacy primitives that you include to assist with data security/regulatory concerns?
What is the process of getting started with Rudderstack as a software or data platform engineer?
What are some of the operational challenges related to running your own deployment of Rudderstack?
What are some of the overlooked/underemphasized capabilities of Rudderstack?
How have you approached the governance model/boundaries between OSS and commercial for Rudderstack?
What are the most interesting, innovative, or unexpected ways that you have seen Rudderstack used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rudderstack?
When is Rudderstack the wrong choice?
What do you have planned for the future of Rudderstack?
Contact Info
LinkedIn
@soumyadeb_mitra on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Rudderstack
Hadoop
Spark
Segment
Podcast Episode
Grouparoo
Podcast Episode
Fivetran
Podcast Episode
Stitch
Singer
Podcast Episode
Census
Podcast Episode
Hightouch
Podcast Episode
LiveRamp
Airbyte
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 7, 2022 • 60min
Scale Your Spatial Analysis By Building It In SQL With Syntax Extensions
Summary
Along with the globalization of our societies comes the need to analyze the geospatial and geotemporal data required to manage growth in commerce, communications, and other activities. To make geospatial analytics more maintainable and scalable, a growing number of database engines provide extensions to their SQL syntax that support manipulation of spatial data. In this episode Matthew Forrest shares his experience working in the domain of geospatial analytics and applying SQL dialects to his analysis.
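As a flavor of the SQL dialects discussed in this episode, the sketch below assembles a PostGIS-style query from Python. ST_DWithin, ST_SetSRID, and ST_MakePoint are standard PostGIS spatial functions; the stores and customers tables and their columns are hypothetical, and how you execute the query (psycopg2, SQLAlchemy, a cloud warehouse client) is left open.

    # PostGIS-flavored spatial SQL, assembled as a plain string for illustration.
    # Counts customers within 1 km of each store using geography-typed distance checks.
    query = """
    SELECT s.name,
           COUNT(*) AS nearby_customers
    FROM stores AS s
    JOIN customers AS c
      ON ST_DWithin(
           s.geom::geography,
           ST_SetSRID(ST_MakePoint(c.lon, c.lat), 4326)::geography,
           1000)  -- distance threshold in meters
    GROUP BY s.name
    ORDER BY nearby_customers DESC;
    """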
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
StreamSets DataOps Platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end-to-end data integration platform that's built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, you get a single pane of glass for operating and monitoring all your data pipelines, with the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets' Professional Tier receive 2 months free after their first month.
Your host is Tobias Macey and today I’m interviewing Matthew Forrest about doing spatial analysis in SQL
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what spatial SQL is and some of the use cases that it is relevant for?
compatibility with/comparison to syntax from PostGIS
What is involved in implementation of spatial logic in database engines
mapping geospatial concepts into declarative syntax
foundational data types
data modeling
workflow for analyzing spatial data sets outside of database engines
translating from e.g. geopandas to SQL
level of support in database engines for spatial data types
What are the most interesting, innovative, or unexpected ways that you have seen spatial SQL used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with spatial SQL?
When is SQL the wrong choice for spatial analysis?
What do you have planned for the future of spatial analytics support in SQL for the Carto platform?
Contact Info
LinkedIn
Website
@mbforr on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Carto
Spatial SQL Blog Post
Spatial Analysis
PostGIS
QGIS
KML
Shapefile
GeoJSON
Paul Ramsey’s Blog
Norwegian SOSI
GDAL
Google Cloud Dataflow
GeoBEAM
Carto Data Observatory
WGS84 Projection
EPSG Code
PySAL
GeoMesa
Uber H3 Spatial Indexing
PGRouting
Spatialite
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 6, 2022 • 1h
Scalable Strategies For Protecting Data Privacy In Your Shared Data Sets
Summary
There are many dimensions to the work of protecting the privacy of users in our data. When you need to share a data set with other teams, departments, or businesses, it is of utmost importance that you eliminate or obfuscate personal information. In this episode Will Thompson explores the many ways that sensitive data can be leaked, re-identified, or otherwise put at risk, as well as the different strategies that can be employed to mitigate those attack vectors. He also explains how he and his team at Privacy Dynamics are working to make those strategies more accessible to organizations so that you can focus on all of the other tasks required of you.
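For a sense of what one of the simpler anonymization strategies looks like in practice, here is a minimal pandas sketch of generalizing quasi-identifiers (age bands, truncated zip codes) before sharing a data set. The column names and binning choices are hypothetical, and this is a generic illustration of the technique rather than how Privacy Dynamics implements it.

    import pandas as pd

    users = pd.DataFrame({
        "age": [23, 37, 41, 58],
        "zip_code": ["02139", "02141", "94103", "94107"],
        "diagnosis": ["A", "B", "A", "C"],
    })

    # Generalize exact ages into coarse bands and truncate zip codes so that
    # individual records are harder to re-identify while aggregates still hold.
    anonymized = users.assign(
        age_band=pd.cut(users["age"], bins=[0, 30, 45, 60, 120],
                        labels=["<30", "30-44", "45-59", "60+"]),
        zip_prefix=users["zip_code"].str[:3] + "**",
    ).drop(columns=["age", "zip_code"])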
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today's episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Will Thompson about managing data privacy concerns for data sets used in analytics and machine learning
Interview
Introduction
How did you get involved in the area of data management?
Data privacy is a multi-faceted problem domain. Can you start by enumerating the different categories of privacy concern that are involved in analytical use cases?
Can you describe what Privacy Dynamics is and the story behind it?
Which categor(y|ies) are you focused on addressing?
What are some of the best practices in the definition, protection, and enforcement of data privacy policies?
Is there a data security/privacy equivalent to the OWASP top 10?
What are some of the techniques that are available for anonymizing data while maintaining statistical utility/significance?
What are some of the engineering/systems capabilities that are required for data (platform) engineers to incorporate these practices in their platforms?
What are the tradeoffs of encryption vs. obfuscation when anonymizing data?
What are some of the types of PII that are non-obvious?
What are the risks associated with data re-identification, and what are some of the vectors that might be exploited to achieve that?
How can privacy risks mitigation be maintained as new data sources are introduced that might contribute to these re-identification vectors?
Can you describe how Privacy Dynamics is implemented?
What are the most challenging engineering problems that you are dealing with?
How do you approach validation of a data set’s privacy?
What have you found to be useful heuristics for identifying private data?
What are the risks of false positives vs. false negatives?
Can you describe what is involved in integrating the Privacy Dynamics system into an existing data platform/warehouse?
What would be required to integrate with systems such as Presto, Clickhouse, Druid, etc.?
What are the most interesting, innovative, or unexpected ways that you have seen Privacy Dynamics used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacy Dynamics?
When is Privacy Dynamics the wrong choice?
What do you have planned for the future of Privacy Dynamics?
Contact Info
LinkedIn
@willseth on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Privacy Dynamics
Pandas
Podcast Episode – Pandas For Data Engineering
Homomorphic Encryption
Differential Privacy
Immuta
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 31, 2022 • 42min
A Reflection On Learning A Lot More Than 97 Things Every Data Engineer Should Know
Summary
The Data Engineering Podcast has been going for five years now and has included conversations and interviews with a huge number of guests, covering a broad range of topics. In addition to that, the host curated the essays contained in the book "97 Things Every Data Engineer Should Know", using the knowledge and context gained from running the show to inform the selection process. In this episode he shares some reflections on producing the podcast, compiling the book, and relevant trends in the ecosystem of data engineering. He also provides some advice for those who are early in their career of data engineering and looking to advance in their roles.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
StreamSets DataOps Platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end-to-end data integration platform that's built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, you get a single pane of glass for operating and monitoring all your data pipelines, with the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets' Professional Tier receive 2 months free after their first month.
Your host is Tobias Macey and today I’m doing something a bit different. I’m going to talk about some of the lessons that I have learned while running the podcast, compiling the book "97 Things Every Data Engineer Should Know", and some of the themes that I’ve observed throughout.
Interview
Introduction
How did you get involved in the area of data management?
Overview of the 97 things book
How the project came about
Goals of the book
What are the paths into data engineering?
What are some of the macroscopic themes in the industry?
What are some of the microscopic details that are useful/necessary to succeed as a data engineer?
What are some of the career/team/organizational details that are helpful for data engineers?
What are the most interesting, innovative, or unexpected outcomes/feedback that I have seen from running the podcast and working on the book?
What are the most interesting, unexpected, or challenging lessons that I have learned while working on the Data Engineering Podcast and 97 things book?
What do I have planned for the future of the podcast?
Contact Info
LinkedIn
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
97 Things Every Data Engineer Should Know
Buy on Amazon (affiliate link)
Read on O’Reilly Learning
O’Reilly Learning 30 Day Free Trial
Podcast.__init__
Pipeline Academy data engineering bootcamp
Podcast Episode
Hadoop
Object Relational Mapper (ORM)
Singer
Podcast Episode
Airbyte
Podcast Episode
Data Mesh
Podcast Episode
Data Contracts Episode
Designing Data Intensive Applications
Data Council
2022 Conference
Data Engineering Weekly Newsletter
Data Mesh Learning
MLOps Community
Analytics Engineering Newsletter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 31, 2022 • 1h
Effective Pandas Patterns For Data Engineering
Summary
Pandas is a powerful tool for cleaning, transforming, manipulating, or enriching data, among many other potential uses. As a result it has become a standard tool for data engineers for a wide range of applications. Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. He recently wrote a book on effective patterns for Pandas code, and in this episode he shares advice on how to write efficient data processing routines that will scale with your data volumes, while being understandable and maintainable.
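One of the patterns discussed in the episode (and linked below as Pandas Method Chaining) is writing transformations as a single readable chain instead of a series of intermediate variables. The sketch below is a hedged illustration on a made-up sales DataFrame; the column names and cleaning steps are hypothetical.

    import pandas as pd

    raw = pd.DataFrame({
        "order_date": ["2022-01-03", "2022-01-04", None],
        "region": ["east", "WEST", "east"],
        "amount": ["10.5", "7", "3.25"],
    })

    # Each step is one link in the chain, so the transformation reads top to bottom
    clean = (
        raw
        .dropna(subset=["order_date"])                             # drop rows missing a date
        .assign(
            order_date=lambda d: pd.to_datetime(d["order_date"]),  # parse timestamps
            region=lambda d: d["region"].str.lower().astype("category"),
            amount=lambda d: pd.to_numeric(d["amount"]),
        )
        .sort_values("order_date")
        .reset_index(drop=True)
    )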
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today's episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Matt Harrison about useful tips for using Pandas for data engineering projects
Interview
Introduction
How did you get involved in the area of data management?
What are the main tasks that you have seen Pandas used for in a data engineering context?
What are some of the common mistakes that can lead to poor performance when scaling to large data sets?
What are some of the utility features that you have found most helpful for data processing?
One of the interesting add-ons to Pandas is its integration with Arrow. What are some of the considerations for how and when to use the Arrow capabilities vs. out-of-the-box Pandas?
Pandas is a tool that spans data processing and data science. What are some of the ways that data engineers should think about writing their code to make it accessible to data scientists for supporting collaboration across data workflows?
Pandas is often used for transformation logic. What are some of the ways that engineers should approach the design of their code to make it understandable and maintainable?
How can data engineers support testing their transformations?
There are a number of projects that aim to scale Pandas logic across cores and clusters. What are some of the considerations for when to use one of these tools, and how to select the proper framework? (e.g. Dask, Modin, Ray, etc.)
What are some anti-patterns that engineers should guard against when using Pandas for data processing?
What are the most interesting, innovative, or unexpected ways that you have seen Pandas used for data processing?
When is Pandas the wrong choice for data processing?
What are some of the projects related to Pandas that you are keeping an eye on?
Contact Info
@__mharrison__ on Twitter
metasnake
Effective Pandas Bundle (affiliate link with 20% discount code applied)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Metasnake
Snowflake Schema
OLAP
Panel Data
NumPy
Dask
Podcast Episode
Parquet
Arrow
Feather
Zen of Python
Joel Grus’ I Don’t Like Notebooks presentation
Pandas Method Chaining
Effective Pandas Book (affiliate link with 20% discount code applied)
Podcast.__init__ Episode
pytest
Podcast.__init__ Episode
Great Expectations
Podcast Episode
Hypothesis
Podcast.__init__ Episode
Papermill
Podcast Episode
Jupytext
Koalas
Modin
Podcast.__init__ Episode
Spark
Ray
Podcast.__init__ Episode
Spark Pandas API
Vaex
Rapids
Terality
H2O
H2O DataTable
Fugue
Ibis
Multi-process Pandas
PandaPy
Polars
Google Colab
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast