
Data Engineering Podcast
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Latest episodes

Sep 5, 2022 • 54min
Introduce Climate Analytics Into Your Data Platform Without The Heavy Lifting Using Sust Global
Summary
The global climate impacts everyone, and the rate of change introduces many questions that businesses need to consider. Getting answers to those questions is challenging, because the climate is a multidimensional and constantly evolving system. Sust Global was created to provide curated data sets for organizations to be able to analyze climate information in the context of their business needs. In this episode Gopal Erinjippurath discusses the data engineering challenges of building and serving those data sets, and how they are distilling complex climate information into consumable facts so you don’t have to be an expert to understand it.
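To make the use case concrete, here is a minimal sketch of the kind of analysis a curated climate data set enables: joining per-location hazard projections onto a table of business assets. The column names and risk scores are hypothetical stand-ins, not Sust Global's actual schema or API.

```python
# Hypothetical example: attach projected climate hazard scores to business
# assets and flag the high-risk ones. Columns and values are illustrative.
import pandas as pd

assets = pd.DataFrame({
    "asset_id": ["wh-01", "wh-02"],
    "region": ["US-CA", "US-FL"],
})
hazards = pd.DataFrame({
    "region": ["US-CA", "US-CA", "US-FL", "US-FL"],
    "year": [2030, 2050, 2030, 2050],
    "wildfire_risk": [0.42, 0.57, 0.05, 0.08],
    "flood_risk": [0.10, 0.12, 0.61, 0.74],
})

# One row per asset per projection year, with hazard scores attached
exposure = assets.merge(hazards, on="region")
flagged = exposure[(exposure["wildfire_risk"] > 0.5) | (exposure["flood_risk"] > 0.5)]
print(flagged[["asset_id", "year", "wildfire_risk", "flood_risk"]])
```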
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Gopal Erinjippurath about his work at Sust Global building data sets from geospatial and satellite information to power climate analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Sust Global is and the story behind it?
What audience(s) are you focused on?
Climate change is obviously a huge topic in the zeitgeist and has been growing in importance. What are the data sources that you are working with to derive climate information?
What role do you view Sust Global having in addressing climate change?
How are organizations using your climate information assets to inform their analytics and business operations?
What are the types of questions that they are asking about the role of climate (present and future) for their business activities?
How can they use the climate information that you provide to understand their impact on the planet?
What are some of the educational efforts that you need to undertake to ensure that your end-users understand the context and appropriate semantics of the data that you are providing? (e.g. concepts around climate science, statistically meaningful interpretations of aggregations, etc.)
Can you describe how you have architected the Sust Global platform?
What are some examples of the types of data workflows and transformations that are necessary to maintain your customer-facing services?
How have you approached the question of modeling for the data that you provide to end-users to make it straightforward to integrate and analyze the information?
What is your process for determining relevant granularities of data and normalizing scales? (e.g. time and distance)
What is involved in integrating with the Sust Global platform and how does it fit into the workflow of data engineers/analysts/data scientists at your customer organizations?
Any analytical task is an exercise in story-telling. What are some of the techniques that you and your customers have found useful to make climate data relatable and understandable?
What are some of the challenges involved in mapping between micro and macro level insights and translating them effectively for the consumer?
How do the increasing sensor capabilities and scale of coverage manifest in your data?
How do you account for increasing coverage when analyzing across longer historical time scales?
How do you balance the need to build a sustainable business with the importance of access to the information that you are working with?
What are the most interesting, innovative, or unexpected ways that you have seen Sust Global used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Sust Global?
When is Sust the wrong choice?
What do you have planned for the future of Sust Global?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Sust Global
Planet Labs
Carbon Capture
IPCC
Data Lodge(?)
6th Assessment Report
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 5, 2022 • 59min
A Reflection On Data Observability As It Reaches Broader Adoption
Summary
Data observability is a product category that has seen massive growth and adoption in recent years. Monte Carlo is in the vanguard of companies who have been enabling data teams to observe and understand their complex data systems. In this episode founders Barr Moses and Lior Gavish rejoin the show to reflect on the evolution and adoption of data observability technologies and the capabilities that are being introduced as the broader ecosystem adopts the practices.
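As background for the conversation, here is a minimal, generic sketch of what a data observability check does under the hood: compare a table's freshness and row volume against expectations and alert on anomalies. This illustrates the general technique only; it is not Monte Carlo's API, and the table, columns, and thresholds are hypothetical.

```python
# A generic freshness-and-volume check, using SQLite as a stand-in for a
# real warehouse. A commercial platform learns these thresholds automatically.
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, ?)", (datetime.utcnow().isoformat(),))

def check_table(conn, table, max_staleness, min_rows):
    """Return a list of anomaly descriptions for one table."""
    max_ts, row_count = conn.execute(
        f"SELECT MAX(updated_at), COUNT(*) FROM {table}"
    ).fetchone()
    problems = []
    if max_ts is None or datetime.fromisoformat(max_ts) < datetime.utcnow() - max_staleness:
        problems.append(f"freshness: newest row is older than {max_staleness}")
    if row_count < min_rows:
        problems.append(f"volume: {row_count} rows, expected at least {min_rows}")
    return problems

for problem in check_table(conn, "orders", timedelta(hours=6), min_rows=10):
    print(f"ALERT [orders]: {problem}")  # real platforms route this to Slack, PagerDuty, etc.
```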
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about the state of the market for data observability and their own work at Monte Carlo
Interview
Introduction
How did you get involved in the area of data management?
Can you give the elevator pitch for Monte Carlo?
What are the notable changes in the Monte Carlo product and business since our last conversation in October 2020?
You were one of the early entrants in the market of data quality/data observability products. In your work to gain visibility and traction you invested substantially in content creation (blog posts, presentations, round table conversations, etc.). How would you summarize the focus of your initial efforts?
Why do you think data observability has really taken off? A few years ago, the category barely existed – what’s changed?
There’s a larger debate within the data engineering community regarding whether it makes sense to go deep or go broad when it comes to monitoring your data. In other words, do you start with a few important data sets, or do you attempt to cover the entire ecosystem? What is your take?
For engineers and teams who are just now investigating and investing in observability/quality automation for their data, what are their motivations?
How has the conversation around the value/motivating factors matured or changed over the past couple of years?
In what way have the requirements and capabilities of data observability platforms shifted?
What are the forces in the ecosystem that have driven those changes?
How has the scope and vision for your work at Monte Carlo evolved as the understanding and impact of data quality have become more widespread?
When teams invest in data quality/observability what are some of the ways that the insights gained influence their other priorities and design choices? (e.g. platform design, pipeline design, data usage, etc.)
When it comes to selecting what parts of the data stack to invest in, how do data leaders prioritize? For instance, when does it make sense to build or buy a data catalog? A data observability platform?
The adoption of any tool that adds constraints is a delicate balance. What have you found to be the predominant patterns for teams who are incorporating Monte Carlo? (e.g. maintaining delivery velocity and adding safety/trust)
A corollary to data engineers’ goal of higher reliability and visibility is the need for business/team leadership to identify "return on investment". How do you and your customers think about the useful metrics and measurement goals to justify the time spent on "non-functional" requirements?
What are the most interesting, innovative, or unexpected ways that you have seen Monte Carlo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Monte Carlo?
When is Monte Carlo the wrong choice?
What do you have planned for the future of Monte Carlo?
Contact Info
Barr
LinkedIn
@BM_DataDowntime on Twitter
Lior
LinkedIn
@lgavish on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Monte Carlo
Podcast Episode
App Dynamics
Datadog
New Relic
Data Quality Fundamentals book
State Of Data Quality Survey
dbt
Podcast Episode
Airflow
Dagster
Podcast Episode
Episode: Incident Management For Data Teams
Databricks Delta
Patch.tech Snowflake APIs
Hightouch
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 29, 2022 • 1h 4min
An Exploration Of What Data Automation Can Provide To Data Engineers And Ascend's Journey To Make It A Reality
Summary
The dream of every engineer is to automate all of their tasks. For data engineers, this is a monumental undertaking. Orchestration engines are one step in that direction, but they are not a complete solution. In this episode Sean Knapp shares his views on what constitutes proper automation and the work that he and his team at Ascend are doing to help make it a reality.
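One recurring idea in data automation, touched on in this conversation, is fingerprinting: hash each stage's logic together with its inputs so the platform can decide what actually needs to run. The sketch below shows the general technique under simplified assumptions; it is not Ascend's implementation.

```python
# Fingerprint a pipeline stage from its logic plus its inputs' fingerprints,
# then recompute only when the fingerprint changes.
import hashlib
import json

def fingerprint(code: str, input_hashes: list) -> str:
    """SHA-256 over a stage's logic and the fingerprints of its inputs."""
    payload = json.dumps({"code": code, "inputs": input_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Fingerprint saved from the previous run (placeholder value)
previous = {"clean_orders": "0" * 64}

code = "SELECT * FROM raw_orders WHERE amount > 0"
current = fingerprint(code, input_hashes=["a1b2c3"])  # upstream data fingerprint

if previous["clean_orders"] != current:
    print("logic or inputs changed; recomputing clean_orders")
else:
    print("fingerprint unchanged; reusing the materialized result")
```

Because a stage's fingerprint incorporates its upstream fingerprints, a change anywhere in the graph propagates recomputation only to the affected downstream stages.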
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Sean Knapp about the role of data automation in building maintainable systems
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you mean by the term "data automation" and the assumptions that it includes?
One of the perennial challenges of automation is that there are always steps that are resistant to being performed without human involvement. What are some of the tasks that you have found to be common problems in that sense?
What are the different concerns that need to be included in a stack that supports fully automated data workflows?
There was recently an interesting article suggesting that the "left-to-right" approach to data workflows is backwards. In your experience, what would be required to allow for triggering data processes based on the needs of the data consumers? (e.g. "make sure that this BI dashboard is up to date every 6 hours")
What are the tasks that are most complex to build automation for?
What are some companies or tools/platforms that you consider to be exemplars of "data automation done right"?
What are the common themes/patterns that they build from?
How have you approached the need for data automation in the implementation of the Ascend product?
How have the requirements for data automation changed as data plays a more prominent role in a growing number of businesses?
What are the foundational elements that are unchanging?
What are the most interesting, innovative, or unexpected ways that you have seen data automation implemented?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data automation at Ascend?
What are some of the ways that data automation can go wrong?
What are you keeping an eye on across the data ecosystem?
Contact Info
@seanknapp on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Ascend
Podcast Episode
Google Sawzall
CI/CD
Airflow
Kubernetes
Ascend FlexCode
MongoDB
SHA == Secure Hash Algorithm
dbt
Podcast Episode
Materialized View
Great Expectations
Podcast Episode
Monte Carlo
Podcast Episode
OpenLineage
Podcast Episode
Open Metadata
Podcast Episode
Egeria
OOM == Out Of Memory
Five Whys
Data Mesh
Data Fabric
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By: Bigeye:
Bigeye is an industry-leading data observability platform that gives data engineering and science teams the tools they need to ensure their data is always fresh, accurate and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye’s automated data quality monitoring, ML-powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business.
Go to dataengineeringpodcast.com/bigeye today and start trusting your data.
Support Data Engineering Podcast

Aug 28, 2022 • 1h 10min
Alumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations
Summary
AirBnB pioneered a number of the organizational practices that have become the goal of modern data teams. Out of that culture a number of successful businesses were created to provide the tools and methods to a broader audience. In this episode several alumni of AirBnB’s formative years who have gone on to found their own companies join the show to reflect on their shared successes, missed opportunities, and lessons learned.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Lindsay Pettingill, Chetan Sharma, Swaroop Jagadish, Maxime Beauchemin, and Nick Handel about the lessons that they learned in their time at AirBnB and how they are carrying that forward to their respective companies
Interview
Introduction
How did you get involved in the area of data management?
You all worked at AirBnB in overlapping time frames and then went on to found data-focused companies that are finding success in their respective categories. Do you consider that an outgrowth of the specific company culture and work involved, or a product of that particular moment in time for the data industry?
What are the elements of AirBnB’s data culture that you feel were done right?
What do you see as the critical decisions/inflection points in the company’s growth that led you down that path?
Every journey has its detours and dead-ends. What are the mistakes that were made (individual and collective) that were most instructive for you?
What about that experience, and your other experiences, led each of you to go your respective directions with data startups?
Was your motivation to start a company addressing the work that you did at AirBnB due to the desire to build on existing success, or the need to fix a nagging frustration?
What are the critical lessons for data teams that you are focused on teaching to engineers inside and outside your company?
What are your predictions for the next 5 years of data?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on translating your experiences at AirBnB into successful products?
Contact Info
Lindsay
LinkedIn
@lpettingill on Twitter
Chetan
LinkedIn
@chesharma87 on Twitter
Maxime
mistercrunch on GitHub
LinkedIn
@mistercrunch on Twitter
Swaroop
swaroopjagadish on GitHub
LinkedIn
@arudis on Twitter
Nick
LinkedIn
@NicholasHandel on Twitter
nhandel on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Iggy
Eppo
Podcast Episode
Acryl
Podcast Episode
DataHub
Preset
Superset
Podcast Episode
Airflow
Transform
Podcast Episode
Deutsche Bank
Ubisoft
BlackRock
Kafka
Pinot
Stata
R
Knowledge-Repo
Podcast.__init__ Episode
AirBnB Almond Flour Cookie Recipe
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 22, 2022 • 1h 6min
An Exploration Of The Expectations, Ecosystem, and Realities Of Real-Time Data Applications
Summary
Data has permeated every aspect of our lives and the products that we interact with. As a result, end users and customers have come to expect interactions and updates with services and analytics to be fast and up to date. In this episode Shruti Bhat gives her view on the state of the ecosystem for real-time data and the work that she and her team at Rockset are doing to make it easier for engineers to build those experiences.
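As a point of reference for the discussion, "real-time" is usually quantified as the lag between when an event is created and when it becomes visible to a query. A trivial sketch, with illustrative names, times, and SLO:

```python
# Quantifying "real-time": the lag between event creation and visibility.
from datetime import datetime, timezone

def data_latency_seconds(event_time: datetime, visible_time: datetime) -> float:
    """End-to-end lag for one event: creation -> queryable."""
    return (visible_time - event_time).total_seconds()

event_time = datetime(2022, 8, 22, 12, 0, 0, tzinfo=timezone.utc)    # produced upstream
visible_time = datetime(2022, 8, 22, 12, 0, 2, tzinfo=timezone.utc)  # first returned by a query

lag = data_latency_seconds(event_time, visible_time)
print(f"data latency: {lag:.1f}s")
print("within SLO" if lag < 5 else "stale")  # seconds for real-time apps, hours for batch
```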
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Shruti Bhat about the growth of real-time data applications and the systems required to support them
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what is driving the adoption of real-time analytics?
architectural patterns for real-time analytics
sources of latency in the path from data creation to end-user
end-user/customer expectations for time to insight
differing expectations between internal and external consumers
scales of data that are reasonable for real-time vs. batch
What are the most interesting, innovative, or unexpected ways that you have seen real-time architectures implemented?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rockset?
When is Rockset the wrong choice?
What do you have planned for the future of Rockset?
Contact Info
LinkedIn
@shrutibhat on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Rockset
Podcast Episode
Embedded Analytics
Confluent
Kafka
AWS Kinesis
Lambda Architecture
Data Observability
Data Mesh
DynamoDB Streams
MongoDB Change Streams
Bigeye
Monte Carlo Data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 22, 2022 • 47min
Understanding The Role Of The Chief Data Officer
Summary
The position of Chief Data Officer (CDO) is relatively new in the business world and has not been universally adopted. As a result, not everyone understands what the responsibilities of the role are, when you need one, and how to hire for it. In this episode Tracy Daniels, CDO of Truist, shares her journey into the position, her responsibilities, and her relationship to the data professionals in her organization.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Tracy Daniels about the role and responsibilities of the Chief Data Officer and how it is evolving along with the ecosystem
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your path to CDO of Truist has been?
As a CDO, what are your responsibilities and scope of influence?
Not every organization has an explicit position for the CDO. What are the factors that determine when that should be a distinct role?
What is the relationship and potential overlap with a CTO?
As the CDO of Truist, what are some of the projects/activities that are vying for your time and attention?
Can you share the composition of your teams and how you think about organizational structure and integration for data professionals in your company?
What are the industry and business trends that are having the greatest impact on your work as a CDO?
How has your role evolved over the past few years?
What are some of the organizational politics/pressures that you have had to navigate to achieve your objectives?
What are some of the ways that priorities at the C-level can be at cross purposes to that of the CDO?
What are some of the skills and experiences that you have found most useful in your work as CDO?
What are the most interesting, innovative, or unexpected ways that you have seen the CDO position/responsibilities addressed in other organizations?
What are the most interesting, unexpected, or challenging lessons that you have learned while working as a CDO?
When is a distinct CDO position the wrong choice for an organization?
What advice do you have for anyone who is interested in charting a career path to the CDO seat?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Truist
Chief Data Officer
Chief Analytics Officer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 14, 2022 • 1h 20min
Bringing Automation To Data Labeling For Machine Learning With Watchful
Summary
Data engineers have typically left the process of data labeling to data scientists or other roles because of its nature as a manual and process-heavy undertaking, focusing instead on building automation and repeatable systems. Watchful is a platform to make labeling a repeatable and scalable process that relies on codifying domain expertise. In this episode founder Shayan Mohanty explains how he and his team are bringing software best practices and automation to the world of machine learning data preparation and how it allows data engineers to be involved in the process.
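For readers unfamiliar with programmatic labeling, here is a minimal sketch of what "codifying domain expertise" can look like: small rule functions vote on each record instead of humans labeling everything by hand. This illustrates the general weak-supervision idea, not Watchful's actual interface.

```python
# Experts encode their knowledge as small labeling functions; a simple
# majority vote turns their outputs into a training label.
import re

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_mentions_prize(text: str) -> int:
    return SPAM if re.search(r"\b(winner|prize|free)\b", text, re.I) else ABSTAIN

def lf_has_unsubscribe(text: str) -> int:
    return SPAM if "unsubscribe" in text.lower() else ABSTAIN

def lf_is_reply(text: str) -> int:
    return HAM if text.lower().startswith("re:") else ABSTAIN

def label(text: str) -> int:
    votes = [lf(text) for lf in (lf_mentions_prize, lf_has_unsubscribe, lf_is_reply)]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)  # simple majority vote

print(label("You are a WINNER! Claim your free prize"))  # -> 1 (SPAM)
print(label("Re: quarterly pipeline review notes"))      # -> 0 (HAM)
```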
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Shayan Mohanty about Watchful, a data-centric platform for labeling your machine learning inputs
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Watchful is and the story behind it?
What are your core goals at Watchful?
What problem are you solving and who are the people most impacted by that problem?
What is the role of the data engineer in the process of getting data labeled for machine learning projects?
Data labeling is a large and competitive market. How do you characterize the different approaches offered by the various platforms and services?
What are the main points of friction involved in getting data labeled?
How do the types of data and its applications factor into how those challenges manifest?
What does Watchful provide that allows it to address those obstacles?
Can you describe how Watchful is implemented?
What are some of the initial ideas/assumptions that you have had to re-evaluate?
What are some of the ways that you have had to adjust the design of your user experience flows since you first started?
What is the workflow for teams who are adopting Watchful?
What are the types of collaboration that need to happen in the data labeling process?
What are some of the elements of shared vocabulary that different stakeholders in the process need to establish to be successful?
What are the most interesting, innovative, or unexpected ways that you have seen Watchful used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Watchful?
When is Watchful the wrong choice?
What do you have planned for the future of Watchful?
Contact Info
LinkedIn
@shayanjm on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Watchful
Entity Resolution
Supervised Machine Learning
BERT
CLIP
LabelBox
Label Studio
Snorkel AI
Machine Learning Podcast Episode
RegEx == Regular Expression
REPL == Read Evaluate Print Loop
IDE == Integrated Development Environment
Turing Completeness
Clojure
Rust
Named Entity Recognition
The Halting Problem
NP Hard
Lidar
Shayan: Arguments Against Hand Labeling
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 14, 2022 • 53min
Collecting And Retaining Contextual Metadata For Powerful And Effective Data Discovery
Summary
Data is useless if it isn’t being used, and you can’t use it if you don’t know where it is. Data catalogs were the first solution to this problem, but they are only helpful if you know what you are looking for. In this episode Shinji Kim discusses the challenges of data discovery and how to collect and preserve additional context about each piece of information so that you can find what you need when you don’t even know what you’re looking for yet.
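One concrete source of that additional context is query log mining: which tables are actually queried, and by whom. The sketch below uses a simplified log format and a regex as stand-ins for what a discovery platform derives from warehouse metadata.

```python
# Rank tables by real usage, a key signal for data discovery search ranking.
import re
from collections import Counter

query_log = [
    ("ana@corp.com", "SELECT * FROM sales.orders o JOIN sales.customers c ON ..."),
    ("bob@corp.com", "SELECT count(*) FROM sales.orders WHERE ..."),
    ("ana@corp.com", "SELECT region FROM sales.customers"),
]

table_usage = Counter()
users_per_table = {}
for user, sql in query_log:
    for table in re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I):
        table_usage[table] += 1
        users_per_table.setdefault(table, set()).add(user)

# Popularity and audience are exactly the context that makes search useful
for table, count in table_usage.most_common():
    print(f"{table}: {count} queries by {len(users_per_table[table])} users")
```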
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Shinji Kim about data discovery and what is required to build and maintain useful context for your information assets
Interview
Introduction
How did you get involved in the area of data management?
Can you share your definition of "data discovery" and the technical/social/process components that are required to make it viable?
What are the differences between "data discovery" and the capabilities of a "data catalog" and how do they overlap?
discovery of assets outside the bounds of the warehouse
capturing and codifying tribal knowledge
creating a useful structure/framework for capturing data context and operationalizing it
What are the most interesting, innovative, or unexpected ways that you have seen data discovery implemented?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data discovery at SelectStar?
When might a data discovery effort be more work than is required?
What do you have planned for the future of SelectStar?
Contact Info
LinkedIn
@shinjikim on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Select Star
Podcast Episode
Fivetran
Podcast Episode
Airbyte
Podcast Episode
Tableau
PowerBI
Podcast Episode
Looker
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 6, 2022 • 49min
Useful Lessons And Repeatable Patterns Learned From Data Mesh Implementations At AgileLab
Summary
Data mesh is a frequent topic of conversation in the data community, with many debates about how and when to employ this architectural pattern. The team at AgileLab has first-hand experience helping large enterprise organizations evaluate and implement their own data mesh strategies. In this episode Paolo Platter shares the lessons they have learned in that process, the Data Mesh Boost platform that they have built to reduce some of the boilerplate required to make it successful, and some of the considerations to make when deciding if a data mesh is the right choice for you.
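To ground the idea of reducing boilerplate, here is a toy sketch of a standardized data product descriptor that a mesh platform could validate in CI before allowing deployment. The fields are hypothetical and are not Data Mesh Boost's actual specification.

```python
# A hypothetical data product descriptor with the platform-enforced checks
# a mesh implementation might run before deployment.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    domain: str
    owner: str                                         # accountable team, required in a mesh
    output_ports: list = field(default_factory=list)   # e.g., table or API names
    freshness_sla_hours: int = 24

    def validate(self) -> list:
        """Checks the platform could run in CI before allowing deployment."""
        errors = []
        if not self.owner:
            errors.append("every data product needs an accountable owner")
        if not self.output_ports:
            errors.append("a product with no output ports serves no consumers")
        return errors

product = DataProduct(name="customer-360", domain="marketing", owner="growth-team",
                      output_ports=["marketing.customer_360_v1"])
print(product.validate() or "descriptor OK")
```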
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Paolo Platter about AgileLab’s lessons learned through helping large enterprises establish their own data mesh
Interview
Introduction
How did you get involved in the area of data management?
Can you share your experiences working with data mesh implementations?
What were the stated goals of project engagements that led to data mesh implementations?
What are some examples of projects where you explored data mesh as an option and decided that it was a poor fit?
What are some of the technical and process investments that are necessary to support a mesh strategy?
When implementing a data mesh what are some of the common concerns/requirements for building and supporting data products?
What is the general shape that a product will take in a mesh environment?
What are the features that are necessary for a product to be an effective component in the mesh?
What are some of the aspects of a data product that are unique to a given implementation?
You built a platform for implementing data meshes. Can you describe the technical elements of that system?
What were the primary goals that you were addressing when you decided to invest in building Data Mesh Boost?
How does Data Mesh Boost help in the implementation of a data mesh?
Code review is a common practice in the construction and maintenance of software systems. How does that activity map to data systems/products?
What are some of the challenges that you have encountered around CI/CD for data products?
What are the persistent pain points involved in supporting pre-production validation of changes to data products?
Beyond the initial work of building and deploying a data product there is the ongoing lifecycle management. How do you approach refactoring old data products to match updated practices/templates?
What are some of the indicators that tell you when an organization is at a level of sophistication that can support a data mesh approach?
What are the most interesting, innovative, or unexpected ways that you have seen Data Mesh Boost used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Mesh Boost?
When is Data Mesh (Boost) the wrong choice?
What do you have planned for the future of Data Mesh Boost?
Contact Info
LinkedIn
@axlpado on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
AgileLab
Spark
Cloudera
Zhamak Dehghani
Data Mesh
Data Fabric
Data Virtualization
q-lang
Data Mesh Boost
Data Mesh Marketplace
SourceGraph
OpenMetadata
Egeria
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 6, 2022 • 59min
Optimize Your Machine Learning Development And Serving With The Open Source Vector Database Milvus
Summary
The optimal format for storage and retrieval of data depends on how it is going to be used. For analytical systems there are decades of investment in data warehouses and various modeling techniques. For machine learning applications, relational models require additional processing to be directly useful, which is why there has been a growth in the use of vector databases. These platforms store direct representations of the vector embeddings that machine learning models rely on for computing relevant predictions, so that there is no additional processing required to go from input data to inference output. In this episode Frank Liu explains how the open source Milvus vector database is implemented to speed up machine learning development cycles, how to think about proper storage and scaling of these vectors, and how data engineering and machine learning teams can collaborate on the creation and maintenance of these data sets.
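To make the similarity-search workflow concrete, here is a minimal sketch using the pymilvus Python client: it creates a collection of embeddings, builds an index, and queries for nearest neighbors. It assumes a Milvus instance reachable at localhost:19530; the collection name, 128-dimensional vectors, and IVF_FLAT/L2 index settings are illustrative choices, not details from the episode.

# A minimal sketch of storing and searching embeddings with pymilvus.
# Assumes Milvus is running at localhost:19530; names and parameters
# below are hypothetical examples.
import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

# Each row stores one embedding; the primary key is auto-generated.
schema = CollectionSchema([
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
])
collection = Collection(name="demo_embeddings", schema=schema)

# Insert 1,000 random vectors standing in for model-produced embeddings.
rng = np.random.default_rng(seed=0)
collection.insert([rng.random((1000, 128)).tolist()])
collection.flush()

# Build an approximate-nearest-neighbor index and load it for search.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2",
                  "params": {"nlist": 128}},
)
collection.load()

# Retrieve the five nearest neighbors of a query vector by L2 distance.
results = collection.search(
    data=[rng.random(128).tolist()],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
)
print(results[0].ids, results[0].distances)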
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Frank Liu about the open source vector database Milvus and how it simplifies the work of supporting ML teams
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Milvus is and the story behind it?
What are the goals of the project?
Who is the target audience for this database?
What are the use cases for a vector database and similarity search of vector embeddings?
What are some of the unique capabilities that this category of database engine introduces?
Can you describe how Milvus is architected?
What are the primary system requirements that have influenced the design choices?
How have the goals and implementation evolved since you started working on it?
What are some of the interesting details that you have had to address in the storage layer to allow for fast and efficient retrieval of vector embeddings?
What are the limitations that you have had to impose on size or dimensionality of vectors to allow for a consistent user experience in a running system?
The reference material states that similarity between two vectors implies similarity in the source data. What are some of the characteristics of vector embeddings that might make them immune or susceptible to confusion of similarity across different source data types that share some implicit relationship due to specifics of their vectorized representation? (e.g. an image vs. an audio file, etc.)
What are the available deployment models/targets and how does that influence potential use cases?
What is the workflow for someone who is building an application on top of Milvus?
What are some of the data management considerations that are introduced by vector databases? (e.g. manage versions of vectors, metadata management, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen Milvus used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Milvus?
When is Milvus the wrong choice?
What do you have planned for the future of Milvus?
Contact Info
LinkedIn
fzliu on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Milvus
Zilliz
Linux Foundation/AI & Data
MySQL
PostgreSQL
CockroachDB
Pilosa
Podcast Episode
Pinecone Vector DB
Podcast Episode
Vector Embedding
Reverse Image Search
Vector Arithmetic
Vector Distance
SIGMOD
Tensor
Rotation Matrix
L2 Distance
Cosine Distance
OpenAI CLIP
Knowhere
Kafka
Pulsar
Podcast Episode
CAP Theorem
Milvus Helm Chart
Zilliz Cloud
MinIO
Towhee
Attu
Feder
FPGA == Field Programmable Gate Array
TPU == Tensor Processing Unit
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast