
Data Engineering Podcast
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Latest episodes

Mar 14, 2022 • 54min
Accelerating Adoption Of The Modern Data Stack At 5X Data
Summary
The modern data stack is a constantly moving target, which makes it difficult to adopt without prior experience. In order to accelerate the time to deliver useful insights at organizations of all sizes that are looking to take advantage of these new and evolving architectures, Tarush Aggarwal founded 5X Data. In this episode he explains how he works with these companies to deploy the technology stack and pairs them with an experienced engineer who assists with the implementation and training to let them realize the benefits of this architecture. He also shares his thoughts on the current state of the ecosystem for modern data vendors and trends to watch as we move into the future.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Tarush Aggarwal about how he and his team are helping organizations streamline adoption of the modern data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are doing at 5x and the story behind it?
How has your focus and operating model shifted since we spoke a year ago?
What are the biggest shifts in the market for data management that you have seen in that time?
What are the main challenges that your customers are facing when they start working with you?
What are the components that you are relying on to build repeatable data platforms for your customers?
What are the sharp edges that you have had to smooth out to scale your implementation of those systems?
What do you see as the white spaces that still exist in the offerings available for the "modern data stack"?
With the rapid introduction of so many new products in the data ecosystem, what are the categories that you see as being a long-term necessity?
What are the areas that you predict will merge and consolidate over the next 3 – 5 years?
What are the most interesting, innovative, or unexpected types of problems that you and your collaborators have had the opportunity to work on?
What are the most interesting, unexpected, or challenging lessons that you have learned while building the 5x organization?
When is 5x the wrong choice?
What do you have planned for the future of 5x?
Contact Info
LinkedIn
@tarush on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
5X Data
Podcast Interview
Snowflake
Podcast Interview
dbt
Podcast Interview
Fivetran
Podcast Interview
Looker
Podcast Interview
Matt Turck State of Data
Mixpanel
Amplitude
Heap
Podcast Episode
BigQuery
Narrator
Podcast Episode
Marquez
Podcast Episode
Atlan
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 14, 2022 • 1h 3min
Taking A Multidimensional Approach To Data Observability At Acceldata
Summary
Data observability is a term that has been co-opted by numerous vendors with varying ideas of what it should mean. At Acceldata, they view it as a holistic approach to understanding the computational and logical elements that power your analytical capabilities. In this episode Tristan Spaulding, head of product at Acceldata, explains the multi-dimensional nature of gaining visibility into your running data platform and how they have architected their platform to assist in that endeavor.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
Your host is Tobias Macey and today I’m interviewing Tristan Spaulding about Acceldata, a platform offering multidimensional data observability for modern data infrastructure
Interview
Introduction
How did you get involved in the area of data?
Can you describe what Acceldata is and the story behind it?
What does it mean for a data observability platform to be "multidimensional"?
How do the architectural characteristics of the "modern data stack" influence the requirements and implementation of data observability strategies?
The data observability ecosystem has seen a lot of activity over the past ~2-3 years. What are the unique capabilities/use cases that Acceldata supports?
Who are your target users and how does that focus influence the way that you have approached feature and design priorities?
What are some of the ways that you are using the Acceldata platform to run Acceldata?
Can you describe how the Acceldata platform is implemented?
How have the design and goals of the system changed or evolved since you started working on it?
How are you managing the definition, collection, and correlation of events across stages of the data lifecycle?
What are some of the ways that performance data can feed back into the debugging and maintenance of an organization’s data ecosystem?
What are the challenges that data platform owners face when trying to interpret the metrics and events that are available in a system like Acceldata?
What are the most interesting, innovative, or unexpected ways that you have seen Acceldata used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Acceldata?
When is Acceldata the wrong choice?
What do you have planned for the future of Acceldata?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Acceldata
Semantic Web
Hortonworks
dbt
Podcast Episode
Firebolt
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 5, 2022 • 1h 17min
Move Your Database To The Data And Speed Up Your Analytics With DuckDB
Summary
When you think about selecting a database engine for your project you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single user workloads. DuckDB is an in-process database engine optimized for OLAP applications that speeds up your analytical queries and meets you where you are, whether that’s Python, R, Java, or even the web. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.
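To make the in-process model concrete, here is a minimal sketch in Python, assuming only that the duckdb and pandas packages are installed; the table contents and file name are invented for illustration:

```python
import duckdb
import pandas as pd

# An in-memory DuckDB database lives inside the Python process itself;
# there is no server to install, configure, or run.
con = duckdb.connect()  # duckdb.connect("analytics.db") would persist to a file

# Pandas DataFrames in scope can be queried as if they were tables.
events = pd.DataFrame({
    "event_type": ["click", "view", "click", "purchase"],
    "value": [1.0, 0.5, 1.2, 20.0],
})
top = con.execute("""
    SELECT event_type, count(*) AS n, sum(value) AS total
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""").fetchdf()
print(top)

# DuckDB can also write Parquet and scan it in place, with no load step.
con.execute("COPY (SELECT * FROM events) TO 'events.parquet' (FORMAT PARQUET)")
print(con.execute("SELECT count(*) FROM 'events.parquet'").fetchall())
```

Because the engine runs inside the host process, results come back as ordinary Pandas objects with no network round trip or serialization layer in between.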
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing Hannes Mühleisen about DuckDB, an in-process embedded database engine for columnar analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what DuckDB is and the story behind it?
Where did the name come from?
What are some of the use cases that DuckDB is designed to support?
The interface for DuckDB is similar (at least in spirit) to SQLite. What are the deciding factors for when to use one vs. the other?
How might they be used in concert to take advantage of their relative strengths?
What are some of the ways that DuckDB can be used to better effect than options provided by different language ecosystems?
Can you describe how DuckDB is implemented?
How has the design and goals of the project changed or evolved since you began working on it?
What are some of the optimizations that you have had to make in order to support performant access to data that exceeds available memory?
Can you describe a typical workflow of incorporating DuckDB into an analytical project?
What are some of the libraries/tools/systems that DuckDB might replace in the scope of a project or team?
What are some of the overlooked/misunderstood/under-utilized features of DuckDB that you would like to highlight?
What is the governance model and the plan for long-term sustainability of the project?
What are the most interesting, innovative, or unexpected ways that you have seen DuckDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckDB?
When is DuckDB the wrong choice?
What do you have planned for the future of DuckDB?
Contact Info
Hannes Mühleisen
@hfmuehleisen on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
DuckDB
CWI
SQLite
OLAP == Online Analytical Processing
Duck Typing
ZODB
Teradata
HTAP == Hybrid Transactional/Analytical Processing
Pandas
Podcast.__init__ Episode
Apache Arrow
Julia Language
Voltron Data
Parquet
Thrift
Protobuf
Vectorized Query Processor
LLVM
DuckDB Labs
DuckDB Foundation
MIT Open Courseware (OCW)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 5, 2022 • 50min
Developer Friendly Application Persistence That Is Fast And Scalable With HarperDB
Summary
Databases are an important component of application architectures, but they are often difficult to work with. HarperDB was created with the core goal of being a developer-friendly database engine. In the process they ended up creating a scalable distributed engine that works across edge and datacenter environments to support a variety of novel use cases. In this episode co-founder and CEO Stephen Goldberg shares the history of the project, how it is architected to achieve their goals, and how you can start using it today.
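For a feel of that developer-first interface, the sketch below posts operation documents to a HarperDB instance over HTTP using Python. The URL, credentials, and schema/table names are placeholders, and payload details may differ between versions, so treat this as illustrative rather than definitive:

```python
import requests

# Placeholders for a local HarperDB instance; every operation is a JSON
# document POSTed to a single endpoint.
URL = "http://localhost:9925"
AUTH = ("admin", "password")

def hdb(operation: dict) -> dict:
    """Send one operation document to HarperDB and return the JSON response."""
    resp = requests.post(URL, json=operation, auth=AUTH)
    resp.raise_for_status()
    return resp.json()

# Insert records into a table; with a dynamic schema, new attributes
# are created on the fly as records arrive.
hdb({
    "operation": "insert",
    "schema": "dev",
    "table": "dog",
    "records": [{"id": 1, "name": "Harper", "age": 5}],
})

# The same endpoint also accepts SQL against those dynamic tables.
print(hdb({"operation": "sql", "sql": "SELECT name, age FROM dev.dog"}))
```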
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Stephen Goldberg about HarperDB, a developer-friendly distributed database engine designed to scale across edge and cloud environments
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what HarperDB is and the story behind it?
There has been an explosion of database engines over the past 5 – 10 years, with each entrant offering specific capabilities. What are the use cases that HarperDB is focused on addressing?
What are the issues that you experienced with existing database engines that led to the creation of HarperDB?
In what ways does HarperDB address those issues?
What are some of the ways that the focus on developers has influenced the interfaces and features of HarperDB?
What is your view on the role of the database in the near to medium future?
Can you describe how HarperDB is implemented?
How have the design and goals changed from when you first started working on it?
One of the common difficulties in document-oriented databases is being able to conduct performant joins. What are the considerations that users need to be aware of as they are designing their data models?
What are some examples of deployment topologies that HarperDB can support given the pub/sub replication model?
What are some of the data modeling/database design strategies that users of HarperDB should know in order to take full advantage of its capabilities?
With the dynamic schema capabilities allowing developers to add attributes and mutate the table structure at any point, what are the options for schema enforcement? (e.g. add an integer attribute and another record tries to write a string to that attribute location)
What are the most interesting, innovative, or unexpected ways that you have seen HarperDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on HarperDB?
When is HarperDB the wrong choice?
What do you have planned for the future of HarperDB?
Contact Info
LinkedIn
@sgoldberg on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
HarperDB
@harperdbio on Twitter
Mulesoft
Zapier
LMDB
SocketIO
SocketCluster
MongoDB
CouchDB
PostgreSQL
VoltDB
Heroku
SAP/Hana
NodeJS
DynamoDB
CockroachDB
Podcast Episode
Fastify
HTAP == Hybrid Transactional Analytical Processing
Splunk
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 28, 2022 • 55min
Manage Your Unstructured Data Assets Across Cloud And Hybrid Environments With Komprise
Summary
There are a wealth of options for managing structured and textual data, but unstructured binary data assets are not as well supported across the ecosystem. As organizations start to adopt cloud technologies they need a way to manage the distribution, discovery, and collaboration of data across their operating environments. To help solve this complicated challenge Krishna Subramanian and her co-founders at Komprise built a system that allows you to use and secure your data wherever it lives, and track copies across environments without requiring manual intervention. In this episode she explains the difficulties that everyone faces as they scale beyond a single operating environment, and how the Komprise platform reduces the burden of managing large and heterogeneous collections of unstructured files.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Krishna Subramanian about her work at Komprise to generate value from unstructured file and object data across storage formats and locations
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Komprise is and the story behind it?
Who are the target customers of the Komprise platform?
What are the core use cases that you are focused on supporting?
How would you characterize the common approaches to managing file storage solutions for hybrid cloud environments?
What are some of the shortcomings of the enterprise storage providers’ methods for managing storage tiers when trying to use that data for analytical workloads?
Given the growth in popularity and capabilities of cloud solutions, how have you approached the strategic positioning of your product to capitalize on the market?
Can you describe how the Komprise platform is architected?
What are some of the most complex considerations that you have had to engineer for when dealing with enterprise data distribution in hybrid cloud environments?
What are the data replication and consistency guarantees that you are able to offer while spanning across on-premise and cloud systems/block and object storage? (e.g. eventual consistency vs. read-after-write, low latency replication on data changes vs. scheduled syncing, etc.)
How do you determine and validate the heuristics that you use for understanding how/when to distribute files across storage systems?
How does the specific workload that you are powering influence the specific operations/capabilities that your customers take advantage of?
What are the most interesting, innovative, or unexpected ways that you have seen Komprise used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Komprise?
When is Komprise the wrong choice?
What do you have planned for the future of Komprise?
Contact Info
LinkedIn
@cloudKrishna on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Komprise
Unstruk
Podcast Episode
SMB
NFS
S3
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 28, 2022 • 40min
Reflections On Designing A Data Platform From Scratch
Summary
Building a data platform is a complex journey that requires a significant amount of planning to do well. It requires knowledge of the available technologies, the requirements of the operating environment, and the expectations of the stakeholders. In this episode Tobias Macey, the host of the show, reflects on his plans for building a data platform and what he has learned from running the podcast that is influencing his choices.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
I’m your host, Tobias Macey, and today I’m sharing the approach that I’m taking while designing a data platform
Interview
Introduction
How did you get involved in the area of data management?
What are the components that need to be considered when designing a solution?
Data integration (extract and load)
What are your data sources?
Batch or streaming (acceptable latencies)
Data storage (lake or warehouse)
How is the data going to be used?
What other tools/systems will need to integrate with it?
The warehouse (BigQuery, Snowflake, Redshift) has become the focal point of the "modern data stack"
Data orchestration
Who will be managing the workflow logic?
Metadata repository
Types of metadata (catalog, lineage, access, queries, etc.)
Semantic layer/reporting
Data applications
Implementation phases
Build a single end-to-end workflow of a data application using a single category of data across sources
Validate the ability for an analyst/data scientist to self-serve a notebook powered analysis
Iterate
Risks/unknowns
Data modeling requirements
Specific implementation details as integrations across components are built
When to use a vendor and risk lock-in vs. spend engineering time
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Presto
Podcast Episode
Trino
Podcast Episode
Dagster
Podcast Episode
Prefect
Podcast Episode
Dremio
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 21, 2022 • 43min
Understanding The Immune System With Data At ImmunAI
Summary
The life sciences as an industry has seen incredible growth in scale and sophistication, along with the advances in data technology that make it possible to analyze massive amounts of genomic information. In this episode Guy Yachdav, director of software engineering for ImmunAI, shares the complexities that are inherent to managing data workflows for bioinformatics. He also explains how he has architected the systems that ingest, process, and distribute the data that he is responsible for and the requirements that are introduced when collaborating with researchers, domain experts, and machine learning developers.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
Your host is Tobias Macey and today I’m interviewing Guy Yachdav, Director of Software Engineering at Immunai, about his work wrangling biological data to advance research into the human immune system.
Interview
Introduction (see Guy’s bio below)
How did you get involved in the area of data management?
Can you describe what Immunai is and the story behind it?
What are some of the categories of information that you are working with?
What kinds of insights are you trying to power/questions that you are trying to answer with that data?
Who are the stakeholders that you are working with and how does that influence your approach to the integration/transformation/presentation of the data?
What are some of the challenges unique to the biological data domain that you have had to address?
What are some of the limitations in the off-the-shelf tools when applied to biological data?
How have you approached the selection of tools/techniques/technologies to make your work maintainable for your engineers and accessible for your end users?
Can you describe the platform architecture that you are using to support your stakeholders?
What are some of the constraints or requirements (e.g. regulatory, security, etc.) that you need to account for in the design?
What are some of the ways that you make your data accessible to AI/ML engineers?
What are the most interesting, innovative, or unexpected ways that you have seen Immunai used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working at Immunai?
What do you have planned for the future of the Immunai data platform?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
ImmunAI
Apache Arrow
Columbia Genome Center
Dagster
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 21, 2022 • 1h 1min
Build Your Python Data Processing Your Way And Run It Anywhere With Fugue
Summary
Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning, to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge, the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.
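As a rough sketch of that idea, Fugue’s transform() function lets you keep the business logic in plain Pandas and pick the execution engine at the call site. The column names and computation below are hypothetical, and the engine arguments assume the corresponding backends are installed:

```python
import pandas as pd
from fugue import transform

# Ordinary single-machine logic, written and tested against plain Pandas.
def add_total(df: pd.DataFrame) -> pd.DataFrame:
    df["total"] = df["price"] * df["quantity"]  # hypothetical columns
    return df

df = pd.DataFrame({"price": [1.0, 2.0], "quantity": [3, 4]})

# Run locally: with no engine specified, Fugue executes on Pandas.
local = transform(df, add_total, schema="*, total:double")
print(local)

# The same function, unchanged, can be dispatched to a cluster by swapping
# the engine, e.g. a SparkSession for Spark or engine="dask" for Dask:
# distributed = transform(df, add_total, schema="*, total:double", engine="dask")
```

The schema hint ("*, total:double") tells Fugue the output shape up front, so the same logic can be validated and scheduled on any of the supported engines.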
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Every data project starts with collecting the information that will provide answers to your questions or inputs to your models. The web is the largest trove of information on the planet and Oxylabs helps you unlock its potential. With the Oxylabs scraper APIs you can extract data from even JavaScript-heavy websites. Combined with their residential proxies you can be sure that you’ll have reliable and high quality data whenever you need it. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies.
Your host is Tobias Macey and today I’m interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Fugue is and the story behind it?
What are the core goals of the Fugue project?
Who are the target users for Fugue and how does that influence the feature priorities and API design?
How does Fugue compare to projects such as Modin, etc. for abstracting over the execution engine?
What are some of the sharp edges that contribute to the engineering effort required to migrate from a single machine to Spark or Dask?
What are some of the determining factors that will influence the decision of whether to use Pandas, Spark, or Dask?
Can you describe how Fugue is implemented?
How have the design and goals of the project changed or evolved since you started working on it?
How do you ensure the consistency of logic across execution engines?
Can you describe the workflow of integrating Fugue into an existing or greenfield project?
How have you approached the work of automating logic optimization across execution contexts?
What are some of the risks or error conditions that you have to guard against?
How do you manage validation of those optimizations, particularly as the different engines release new versions or capabilities?
What are the most interesting, innovative, or unexpected ways that you have seen Fugue used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Fugue?
When is Fugue the wrong choice?
What do you have planned for the future of Fugue?
Contact Info
LinkedIn
Email
Fugue Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Fugue
Fugue Tutorials
Prefect
Podcast Episode
Bodo
Podcast Episode
Pandas
DuckDB
Koalas
Dask
Podcast Episode
Spark
Modin
Podcast.__init__ Episode
Fugue SQL
Flink
PyCaret
ANTLR
OmniSci
Ibis
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 14, 2022 • 1h 2min
Bring Your Code To Your Streaming And Static Data Without Effort With The Deephaven Real Time Query Engine
Summary
Streaming data sources are becoming more widely available as tools to handle their storage and distribution mature. However it is still a challenge to analyze this data as it arrives, while supporting integration with static data in a unified syntax. Deephaven is a project that was designed from the ground up to offer an intuitive way for you to bring your code to your data, whether it is streaming or static, without having to know which is which. In this episode Pete Goddard, founder and CEO of Deephaven, shares his journey with the technology that powers the platform and how he and his team are pouring their energy into the community edition so that you can use it freely in your own work.
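The sketch below, written against the Deephaven Community Python API (names have shifted between releases, and the script must run inside a Deephaven server session), illustrates the unified syntax: a ticking table and a static table joined with the same operations, using invented symbols and rates:

```python
from deephaven import time_table, new_table
from deephaven.column import string_col, double_col

# A static reference table: a hypothetical fee rate per symbol.
rates = new_table([
    string_col("Sym", ["AAPL", "MSFT"]),
    double_col("Rate", [0.01, 0.02]),
])

# A ticking table that appends one row per second; `ii` is the row index
# and backticks delimit string literals in Deephaven formulas.
ticks = time_table("PT1S").update([
    "Sym = ii % 2 == 0 ? `AAPL` : `MSFT`",
    "Price = 100.0 + ii",
])

# The join makes no distinction between streaming and static inputs;
# the result keeps updating as new ticks arrive.
enriched = ticks.natural_join(rates, on=["Sym"]).update(["Fee = Price * Rate"])
```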
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, you get one single pane of glass for operating and monitoring all your data pipelines, with the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier receive 2 months free after their first month.
Your host is Tobias Macey and today I’m interviewing Pete Goddard about his work at Deephaven, a query engine optimized for manipulating and merging streaming and static data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Deephaven is and the story behind it?
What is the role of Deephaven in the context of an organization’s data platform?
What are the upstream and downstream systems and teams that it is likely to be integrated with?
Who are the target users of Deephaven and how does that influence the feature priorities and design of the platform?
comparison of use cases/experience with Materialize
What are the different components that comprise the suite of functionality in Deephaven?
How have you architected the system?
What are some of the ways that the goals/design of the platform have changed or evolved since you started working on it?
What are some of the impedance mismatches that you have had to address between supporting different language environments and data access patterns? (e.g. batch/streaming/ML and Python/Java/R)
Can you describe some common workflows that a data engineer might build with Deephaven?
What are the avenues for collaboration across data roles and stakeholders?
licensing choice/governance model
What are the most interesting, innovative, or unexpected ways that you have seen Deephaven used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Deephaven?
When is Deephaven the wrong choice?
What do you have planned for the future of Deephaven?
Contact Info
@pete_paco on Twitter
@deephaven on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Deephaven
GitHub
Materialize
Podcast Episode
Arrow Flight
kSQLDB
Podcast Episode
Redpanda
Podcast Episode
Pandas
Podcast Episode
NumPy
Numba
Barrage
Debezium
Podcast Episode
JPy
Sabermetrics
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 14, 2022 • 48min
Build Your Own End To End Customer Data Platform With Rudderstack
Summary
Collecting, integrating, and activating data are all challenging activities. When that data pertains to your customers it can become even more complex. To simplify the work of managing the full flow of your customer data and keep you in full control, the team at Rudderstack created their eponymous open source platform that allows you to work with first and third party data, as well as build and manage reverse ETL workflows. In this episode CEO and founder Soumyadeb Mitra explains how Rudderstack compares to the various other tools and platforms that share some overlap, how to set it up for your own data needs, and how it is architected to scale to meet demand.
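As a minimal sketch of the event collection side, this example uses RudderStack’s server-side Python SDK; the write key, data plane URL, user ID, and event names are placeholders, and module naming has varied across SDK releases:

```python
import rudder_analytics  # RudderStack's server-side Python SDK

# Placeholders: point the SDK at your own data plane and source write key.
rudder_analytics.write_key = "YOUR_WRITE_KEY"
rudder_analytics.data_plane_url = "https://your-data-plane.example.com"

# identify() attaches traits to a known user; track() records an event.
rudder_analytics.identify("user-123", {"email": "jane@example.com", "plan": "pro"})
rudder_analytics.track("user-123", "Order Completed", {
    "order_id": "A-1001",
    "revenue": 42.50,
})

# Events are buffered and sent in the background; flush before exiting.
rudder_analytics.flush()
```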
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Soumyadeb Mitra about his experience as the founder of Rudderstack and its role in your data platform
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Rudderstack is and the story behind it?
What are the main use cases that Rudderstack is designed to support?
Who are the target users of Rudderstack?
How does the availability of the managed cloud service change the user profiles that you can target?
How do these user profiles influence your focus and prioritization of features and user experience?
How would you characterize the position of Rudderstack in the current data ecosystem?
What other tools/systems might you replace with Rudderstack?
How do you think about the application of Rudderstack compared to tools for data integration (e.g. Singer, Stitch, Fivetran) and reverse ETL (e.g. Grouparoo, Hightouch, Census)?
Can you describe how the Rudderstack platform is designed and implemented?
How have the goals/design/use cases of Rudderstack changed or evolved since you first started working on it?
What are the different extension points available for engineers to extend and customize Rudderstack?
Working with customer data is a core capability in Rudderstack. How do you manage the identity resolution of users as they transition back and forth between anonymous and identified?
What are some of the data privacy primitives that you include to assist with data security/regulatory concerns?
What is the process of getting started with Rudderstack as a software or data platform engineer?
What are some of the operational challenges related to running your own deployment of Rudderstack?
What are some of the overlooked/underemphasized capabilities of Rudderstack?
How have you approached the governance model/boundaries between OSS and commercial for Rudderstack?
What are the most interesting, innovative, or unexpected ways that you have seen Rudderstack used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rudderstack?
When is Rudderstack the wrong choice?
What do you have planned for the future of Rudderstack?
Contact Info
LinkedIn
@soumyadeb_mitra on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Rudderstack
Hadoop
Spark
Segment
Podcast Episode
Grouparoo
Podcast Episode
Fivetran
Podcast Episode
Stitch
Singer
Podcast Episode
Census
Podcast Episode
Hightouch
Podcast Episode
LiveRamp
Airbyte
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast