
Data Engineering Podcast
This show goes behind the scenes of the tools, techniques, and challenges involved in the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Latest episodes

Jun 27, 2022 • 1h 9min
Strategies And Tactics For A Successful Master Data Management Implementation
Summary
The most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics. In this episode Malcolm Hawker shares his years of experience working in this domain to explore the combination of technical and social skills that are necessary to make an MDM project successful both at the outset and over the long term.
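The links below reference the Levenshtein distance and Soundex algorithms, which typically come into play in MDM when matching near-duplicate records ahead of merging them into a golden record. As a flavor of that matching work, here is a minimal illustrative sketch in Python; the field names and edit-distance threshold are arbitrary assumptions, not anything prescribed in the episode:

```python
# Illustrative sketch of fuzzy record matching for MDM deduplication.
# The field names and similarity threshold are arbitrary assumptions.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def likely_same_customer(rec_a: dict, rec_b: dict, max_edits: int = 2) -> bool:
    """Flag two records as probable duplicates when their names are close."""
    name_a = rec_a["name"].strip().lower()
    name_b = rec_b["name"].strip().lower()
    return levenshtein(name_a, name_b) <= max_edits

print(likely_same_customer({"name": "Jon Smyth"}, {"name": "John Smith"}))  # True
```

Production MDM tools layer phonetic encodings such as Soundex on top of edit distance, and add survivorship rules that decide which source attributes win in the merged record.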
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Random data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account, go to dataengineeringpodcast.com/tonic today to give it a try!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Malcolm Hawker about master data management strategies for the enterprise
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving your definition of what MDM is and the scope of activities/functions that it includes?
How have evolutions in the data landscape shifted the conversation around MDM?
Can you describe what Profisee is and the story behind it?
What was your path to joining Profisee and what is your role in the business?
Who are the target customers for Profisee?
What are the challenges that they typically experience that leads them to MDM as a solution for their problems?
How does the narrative around data observability/data quality from tools such as Great Expectations, Monte Carlo, etc. differ from the data quality benefits of a MDM strategy?
How do recent conversations around semantic/metrics layers compare to the way that MDM approaches the problem of domain modeling?
What are the steps to defining an MDM strategy for an organization or business unit?
Once there is a strategy, what are the tactical elements of the implementation?
What is the role of the toolchain in that implementation? (e.g. Spark, dbt, Airflow, etc.)
Can you describe how Profisee is implemented?
How does the customer base inform the architectural approach that Profisee has taken?
Can you describe the adoption process for an organization that is using Profisee for their MDM?
Once an organization has defined and adopted an MDM strategy, what are the ongoing maintenance tasks related to the domain models?
What are the most interesting, innovative, or unexpected ways that you have seen MDM used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working in MDM?
When is Profisee the wrong choice?
What do you have planned for the future of Profisee?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Profisee
MDM == Master Data Management
CRM == Customer Relationship Management
ERP == Enterprise Resource Planning
Levenshtein Distance Algorithm
Soundex
CDP == Customer Data Platform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 27, 2022 • 1h 7min
Bring Geospatial Analytics Across Disparate Datasets Into Your Toolkit With The Unfolded Platform
Summary
The proliferation of sensors and GPS devices has dramatically increased the number of applications for spatial data, and the need for scalable geospatial analytics. In order to reduce the friction involved in aggregating disparate data sets that share geographic similarities the Unfolded team built a platform that supports working across raster, vector, and tabular data in a single system. In this episode Isaac Brodsky explains how the Unfolded platform is architected, their experience joining the team at Foursquare, and how you can start using it for analyzing your spatial data today.
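One of the links below is the H3 hexagonal grid library, a common building block for exactly this kind of aggregation across disparate spatial datasets. As a small illustration of hexagonal indexing, here is a sketch that buckets GPS points into H3 cells using the h3 Python bindings (v3 API); the coordinates and resolution are made-up examples:

```python
# Sketch: aggregating GPS points into H3 hexagonal cells for spatial analytics.
# Uses the h3 Python bindings (v3 API); coordinates and resolution are
# made-up examples, not data from the episode.
from collections import Counter

import h3

points = [
    (40.7484, -73.9857),  # (lat, lng) pairs
    (40.7487, -73.9851),
    (40.6892, -74.0445),
]

RESOLUTION = 9  # roughly 0.1 km^2 hexagons

# Map each point to its containing hexagon and count points per cell.
counts = Counter(h3.geo_to_h3(lat, lng, RESOLUTION) for lat, lng in points)
for cell, n in counts.items():
    print(cell, n)  # hexagon ID and how many points fell inside it
```

Because every dataset indexed this way shares the same cell IDs, combining raster-derived, vector-derived, and tabular values becomes an ordinary key join on the hexagon ID, which is the intuition behind the Hex Tiles format linked below.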
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3d point clouds and geospatial records, to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.
Your host is Tobias Macey and today I’m interviewing Isaac Brodsky about Foursquare’s Unfolded platform for working with spatial data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the Unfolded platform is and the story behind it?
What are some of the core challenges of working with spatial data?
What are some of the sources that organizations rely on for collecting or generating those data sets?
What are the capabilities that the Unfolded platform offers for spatial analytics?
What use cases are you primarily focused on supporting?
What (if any) are the datasets or analyses that you are consciously not investing in supporting?
Can you describe how the Unfolded platform is implemented?
How have the design and goals shifted or evolved since you started working on Unfolded?
What are the new constraints or opportunities that are available after the merger with Foursquare?
Can you describe a typical workflow for someone using Unfolded to manage their spatial information and build an analysis on top of it?
What are some of the data modeling considerations that are necessary when populating a custom data set with Unfolded?
What are some of the techniques that you needed to build to allow for loading large data sets into a user’s browser while maintaining sufficient performance?
What are the most interesting, innovative, or unexpected ways that you have seen Unfolded used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Unfolded?
When is Unfolded the wrong choice?
What do you have planned for the future of Unfolded?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Unfolded Platform
H3 Hexagonal Map Tiles Library
Carto
Mapbox
Open Street Map
Raster Files
Hex Tiles
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By: Unstruk
Unstruk Data offers an API-driven solution to simplify the process of transforming unstructured data files into actionable intelligence about real-world assets without writing a line of code – putting insights generated from this data at enterprise teams’ fingertips. The company was founded in 2021 by Kirk Marple after his tenure as CTO of Kespry. Kirk possesses extensive industry knowledge including over 25 years of experience building and architecting scalable SaaS platforms and applications, prior successful startup exits, and deep unstructured and perception data experience. Unstruk investors include 8VC, Preface Ventures, Valia Ventures, Shell Ventures and Stage Venture Partners.
Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business!
Support Data Engineering Podcast

Jun 19, 2022 • 53min
Level Up Your Data Platform With Active Metadata
Summary
Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems. In order to level up their value a new trend of active metadata is being implemented, allowing use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance. In this episode Prukalpa Sankar joins the show to talk about the work she and her team at Atlan are doing to push this capability into the mainstream.
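The distinguishing move of "active" metadata is that metadata triggers actions in downstream tools instead of only accumulating in a catalog. Here is a hypothetical sketch that reacts to a failed table-freshness check by pushing an alert to a Slack incoming webhook; the webhook URL, table name, and SLA are placeholders, not an actual Atlan integration:

```python
# Hypothetical sketch of "active" metadata: a freshness event triggers a
# downstream notification rather than only being stored in a catalog.
# The webhook URL, table name, and SLA are placeholders.
from datetime import datetime, timedelta, timezone

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
MAX_STALENESS = timedelta(hours=6)  # assumed freshness SLA

def check_freshness(table: str, last_loaded: datetime) -> None:
    """Push an alert downstream when a table exceeds its freshness SLA."""
    age = datetime.now(timezone.utc) - last_loaded
    if age > MAX_STALENESS:
        requests.post(
            SLACK_WEBHOOK,
            json={"text": f":warning: `{table}` is stale: last loaded {age} ago."},
            timeout=10,
        )

check_freshness("analytics.orders", datetime(2022, 6, 19, 2, 0, tzinfo=timezone.utc))
```

The same event could just as easily drive a warehouse resize or a ticket in a governance queue; the pattern is outbound delivery of metadata, whatever the receiving system.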
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
Your host is Tobias Macey and today I’m interviewing Prukalpa Sankar about how data platforms can benefit from the idea of "active metadata" and the work that she and her team at Atlan are doing to make it a reality
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what "active metadata" is and how it differs from the current approaches to metadata systems?
What are some of the use cases that "active metadata" can enable for data producers and consumers?
What are the points of friction that those users encounter in the current formulation of metadata systems?
Central metadata systems/data catalogs came about as a solution to the challenge of integrating every data tool with every other data tool, giving a single place to integrate. What are the lessons that are being learned from the "modern data stack" that can be applied to centralized metadata?
Can you describe the approach that you are taking at Atlan to enable the adoption of "active metadata"?
What are the architectural capabilities that you had to build to power the outbound traffic flows?
How are you addressing the N x M integration problem for pushing metadata into the necessary contexts at Atlan?
What are the interfaces that are necessary for receiving systems to be able to make use of the metadata that is being delivered?
How does the type/category of metadata impact the type of integration that is necessary?
What are some of the automation possibilities that metadata activation offers for data teams?
What are the cases where you still need a human in the loop?
What are the most interesting, innovative, or unexpected ways that you have seen active metadata capabilities used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on activating metadata for your users?
When is an active approach to metadata the wrong choice?
What do you have planned for the future of Atlan and active metadata?
Contact Info
LinkedIn
@prukalpa on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Atlan
What is Active Metadata?
Segment
Podcast Episode
Zapier
ArgoCD
Kubernetes
Wix
AWS Lambda
Modern Data Culture Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 19, 2022 • 43min
Combining The Simplicity Of Spreadsheets With The Power Of Modern Data Infrastructure At Canvas
Summary
Data analysis is a valuable exercise that is often out of reach of non-technical users as a result of the complexity of data systems. In order to lower the barrier to entry Ryan Buick created the Canvas application with a spreadsheet oriented workflow that is understandable to a wide audience. In this episode Ryan explains how he and his team have designed their platform to bring everyone onto a level playing field and the benefits that it provides to the organization.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3d point clouds and geospatial records, to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.
Your host is Tobias Macey and today I’m interviewing Ryan Buick about Canvas, a spreadsheet interface for your data that lets everyone on your team explore data without having to learn SQL
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Canvas is and the story behind it?
The "modern data stack" has enabled organizations to analyze unparalleled volumes of data. What are the shortcomings in the operating model that keeps business users dependent on engineers to answer their questions?
Why is the spreadsheet such a popular and persistent metaphor for working with data?
What are the biggest issues that existing spreadsheet software runs up against as it scales both technically and organizationally?
What are the new metaphors/design elements that you needed to develop to extend the existing capabilities and use cases of spreadsheets while keeping them familiar?
Can you describe how the Canvas platform is implemented?
How have the design and goals of the product changed/evolved since you started working on it?
What is the workflow for a business user that is using Canvas to iterate on a series of questions?
What are the collaborative features that you have built into Canvas and who are they for? (e.g. other business users, data engineers <-> business users, etc.)
What are the situations where the spreadsheet abstraction starts to break down?
What are the extension points/escape hatches that you have built into the product for when that happens?
What are the most interesting, innovative, or unexpected ways that you have seen Canvas used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Canvas?
When is Canvas the wrong choice?
What do you have planned for the future of Canvas?
Contact Info
LinkedIn
@ryanjbuick on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Canvas
Flexport
Podcast Episode about their data mesh implementation
Excel
Lightdash
Podcast Episode
dbt
Podcast Episode
Figma
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 13, 2022 • 49min
Discover And De-Clutter Your Unstructured Data With Aparavi
Summary
Unstructured data takes many forms in an organization. From a data engineering perspective that often means things like JSON files, audio or video recordings, images, etc. Another category of unstructured data that every business deals with is PDFs, Word documents, workstation backups, and countless other types of information. Aparavi was created to tame the sprawl of information across machines, datacenters, and clouds so that you can reduce the amount of duplicate data and save time and money on managing your data assets. In this episode Rod Christensen shares the story behind Aparavi and how you can use it to cut costs and gain value for the long tail of your unstructured data.
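The links below mention SHA-512, the kind of content hash commonly used to detect byte-identical copies across file stores. As a generic illustration of that technique (not Aparavi's implementation), here is a minimal sketch that groups duplicate files by their hash; the directory path is an example:

```python
# Minimal sketch of content-hash duplicate detection, the general technique
# behind de-duplicating unstructured files. Not Aparavi's implementation;
# the directory path is just an example.
import hashlib
from collections import defaultdict
from pathlib import Path

def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files never load fully into memory."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root: str) -> dict:
    """Group every file under root by content hash; keep only the collisions."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_hash[sha512_of(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

for digest, paths in find_duplicates("/data/archive").items():
    print(digest[:16], [str(p) for p in paths])
```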
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Rod Christensen about Aparavi, a platform designed to find and unlock the value of data, no matter where it lives
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Aparavi is and the story behind it?
Who are the target customers for Aparavi and how does that inform your product roadmap and messaging?
What are some of the insights that you are able to provide about an organization’s data?
Once you have generated those insights, what are some of the actions that they typically catalyze?
What are the types of storage and data systems that you integrate with?
Can you describe how the Aparavi platform is implemented?
How do the trends in cloud storage and data systems influence the ways that you evolve the system?
Can you describe a typical workflow for an organization using Aparavi?
What are the mechanisms that you use for categorizing data assets?
What are the interfaces that you provide for data owners and operators to provide heuristics to customize classification/cataloging of data?
How can teams integrate with Aparavi to expose its insights to other tools for uses such as automation or data catalogs?
What are the most interesting, innovative, or unexpected ways that you have seen Aparavi used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aparavi?
When is Aparavi the wrong choice?
What do you have planned for the future of Aparavi?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Aparavi
SHA-512
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 13, 2022 • 1h 1min
Hire And Scale Your Data Team With Intention
Summary
Building a well rounded and effective data team is an iterative process, and the first hire can set the stage for future success or failure. Trupti Natu has been the first data hire multiple times and gone through the process of building teams across the different stages of growth. In this episode she shares her thoughts and insights on how to be intentional about establishing your own data team.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3d point clouds and geospatial records, to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.
Your host is Tobias Macey and today I’m interviewing Trupti Natu about strategies for building your team, from the first data hire to post-acquisition
Interview
Introduction
How did you get involved in the area of FinTech & Data Science (management)?
How would you describe your overall career trajectory in data?
Can you describe what your experience has been as a data professional at different stages of company growth?
What are the traits that you look for in a first or second data hire at an organization?
What are useful metrics for success to help gauge the effectiveness of hires at this early stage of data capabilities?
What are the broad goals and projects that early data hires should be focused on?
What are the indicators that you look for to determine when to scale the team?
As you are building a team of data professionals, what are the organizational topologies that you have found most effective? (e.g. centralized vs. embedded data pros, etc.)
What are the recruiting and screening/interviewing techniques that you have found most helpful given the relative scarcity of experienced data practitioners?
What are the organizational and technical structures that are helpful to establish early in the organization’s data journey to reduce the onboarding time for new hires?
Your background has primarily been in FinTech. How does the business domain influence the types of background and domain expertise that you look for?
You recently went through an acquisition at the startup you were with. Can you describe the data-related projects that were required during the merger?
What are the impedance mismatches that you have had to resolve in your data systems, moving from a fast-moving startup into a larger, more established organization?
Being a FinTech company, what are some of the categories of regulatory considerations that you had to deal with during the integration process?
What are the most interesting, unexpected, or challenging lessons that you have learned along your career journey?
What are some of the pieces of advice that you wished you knew at the beginning of your career, and that you would like to share with others in that situation?
Contact Info
LinkedIn
@truptinatu on Twitter
Trupti is hiring for multiple product data science roles. Feel free to DM her on Twitter or LinkedIn to find out more
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
SumoLogic
FinTech
PII == Personally Identifiable Information
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 6, 2022 • 54min
Simplify Data Security For Sensitive Information With The Skyflow Data Privacy Vault
Summary
The best way to make sure that you don’t leak sensitive data is to never have it in the first place. The team at Skyflow decided that the second best way is to build a storage system dedicated to securely managing your sensitive information and making it easy to integrate with your applications and data systems. In this episode Sean Falconer explains the idea of a data privacy vault and how this new architectural element can drastically reduce the potential for making a mistake with how you manage regulated or personally identifiable information.
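To make the vault idea concrete, here is a generic sketch of the tokenization pattern it rests on: sensitive values live only inside the vault, while every other system persists opaque tokens. This is a conceptual, in-memory illustration with invented names, not Skyflow's actual API; a real vault adds encryption at rest, access policies, and audit logging:

```python
# Conceptual sketch of the tokenization pattern behind a data privacy vault.
# In-memory illustration with invented names, not Skyflow's actual API.
import secrets

class PrivacyVault:
    def __init__(self) -> None:
        self._store: dict = {}  # token -> raw sensitive value

    def tokenize(self, sensitive_value: str) -> str:
        """Store the raw value and hand back an opaque token."""
        token = "tok_" + secrets.token_urlsafe(16)
        self._store[token] = sensitive_value
        return token

    def detokenize(self, token: str) -> str:
        """Resolve a token back to the raw value (authorization checks elided)."""
        return self._store[token]

vault = PrivacyVault()
token = vault.tokenize("123-45-6789")   # e.g. a social security number
print(token)                    # safe to store in application tables and warehouses
print(vault.detokenize(token))  # only callable inside the trust boundary
```

Downstream analytics can then join, count, and group on the tokens themselves, touching raw values only for the narrow operations that truly need them.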
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking all of that information into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how you can take advantage of active metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Sean Falconer about the idea of a data privacy vault and how the Skyflow team are working to make it turn-key
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Skyflow is and the story behind it?
What is a "data privacy vault" and how does it differ from strategies such as privacy engineering or existing data governance patterns?
What are the primary use cases and capabilities that you are focused on solving for with Skyflow?
Who is the target customer for Skyflow (e.g. how does it enter an organization)?
How is the Skyflow platform architected?
How have the design and goals of the system changed or evolved over time?
Can you describe the process of integrating with Skyflow at the application level?
For organizations that are building analytical capabilities on top of the data managed in their applications, what are the interactions with Skyflow at each of the stages in the data lifecycle?
One of the perennial problems with distributed systems is the challenge of joining data across machine boundaries. How do you mitigate that problem?
On your website there are different "vaults" advertised in the form of healthcare, fintech, and PII. What are the different requirements across each of those problem domains?
What are the commonalities?
As a relatively new company in an emerging product category, what are some of the customer education challenges that you are facing?
What are the most interesting, innovative, or unexpected ways that you have seen Skyflow used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Skyflow?
When is Skyflow the wrong choice?
What do you have planned for the future of Skyflow?
Contact Info
LinkedIn
@seanfalconer on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Skyflow
Privacy Engineering
Data Governance
Homomorphic Encryption
Polymorphic Encryption
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 6, 2022 • 59min
Bringing The Modern Data Stack To Everyone With Y42
Summary
Cloud services have made highly scalable and performant data platforms economical and manageable for data teams. However, they are still challenging to work with and manage for anyone who isn’t in a technical role. Hung Dang understood the need to make data more accessible to the entire organization and created Y42 as a better user experience on top of the "modern data stack". In this episode he shares how he designed the platform to support the full spectrum of technical expertise in an organization and the interesting engineering challenges involved.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Hung Dang about Y42, the full-stack data platform that anyone can run
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Y42 is and the story behind it?
How would you characterize your positioning in the data ecosystem?
What are the problems that you are trying to solve?
Who are the personas that you optimize for and how does that manifest in your product design and feature priorities?
How is the Y42 platform implemented?
What are the core engineering problems that you have had to address in order to tie together the various underlying services that you integrate?
How have the design and goals of the product changed or evolved since you started working on it?
What are the sharp edges and failure conditions that you have had to automate around in order to support non-technical users?
What is the process for integrating Y42 with an organization’s data systems?
What is the story for onboarding from existing systems and importing workflows (e.g. Airflow dags and dbt models)?
With your recent shift to using Git as the store of platform state, how do you approach the problem of reconciling branched changes with side effects from changes (e.g. creating tables or mutating table structures in the warehouse)?
Can you describe a typical workflow for building or modifying a business dashboard or activating data in the warehouse?
What are the interfaces and abstractions that you have built into the platform to support collaboration across roles and levels of experience? (technical or organizational)
With your focus on end-to-end support for data analysis, what are the extension points or escape hatches for use cases that you can’t support out of the box?
What are the most interesting, innovative, or unexpected ways that you have seen Y42 used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Y42?
When is Y42 the wrong choice?
What do you have planned for the future of Y42?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Y42
CDTM (Center for Digital Technology and Management)
Meltano
Podcast Episode
Airflow
Singer
dbt
Podcast Episode
Great Expectations
Podcast Episode
Airbyte
Podcast Episode
Grouparoo
Podcast Episode
Terraform
OpenTelemetry
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By: PostHog
PostHog is an open source product analytics platform. PostHog enables software teams to understand user behavior – auto-capturing events, performing product analytics and dashboarding, enabling video replays, and rolling out new features behind feature flags, all based on their single open source platform. The product’s open source approach enables companies to self-host, removing the need to send data externally. Try it out today at dataengineeringpodcast.com/posthog
Support Data Engineering Podcast

May 30, 2022 • 41min
A Multipurpose Database For Transactions And Analytics To Simplify Your Data Architecture With SingleStore
Summary
A large fraction of data engineering work involves moving data from one storage location to another in order to support different access and query patterns. SingleStore aims to cut down on the number of database engines that you need to run so that you can reduce the amount of copying that is required. By supporting fast, in-memory row-based queries and columnar on-disk representation, it lets your transactional and analytical workloads run in the same database. In this episode SVP of engineering Shireesh Thota describes the impact that SingleStore can have on your overall system architecture and the benefits of using a cloud-native database engine for your next application.
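SingleStore speaks the MySQL wire protocol, so an ordinary MySQL client can exercise both halves of a hybrid workload against a single table. The sketch below uses pymysql under that assumption; the connection details and schema are placeholders, and the DDL sticks to generic SQL rather than SingleStore-specific storage clauses:

```python
# Hedged sketch of a hybrid transactional/analytical workload on one table.
# SingleStore is MySQL wire-protocol compatible, so pymysql works as a client.
# Connection details and schema are placeholders; the DDL is generic SQL.
import pymysql

conn = pymysql.connect(host="svc-singlestore.example.com", user="app",
                       password="secret", database="shop", autocommit=True)
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS orders (
            id BIGINT PRIMARY KEY,
            customer_id BIGINT,
            amount DECIMAL(10, 2),
            created_at DATETIME
        )
    """)
    # Transactional side: point writes as orders arrive.
    cur.execute("INSERT INTO orders VALUES (%s, %s, %s, NOW())", (1, 42, 19.99))
    # Analytical side: an aggregate scan over the same table, with no ETL copy
    # into a separate analytical store.
    cur.execute("""
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
    """)
    for customer_id, total in cur.fetchall():
        print(customer_id, total)
conn.close()
```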
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Shireesh Thota about SingleStore (formerly MemSQL), the industry’s first modern relational database for multi-cloud, hybrid, and on-premises workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SingleStore is and the story behind it?
The database market has gotten very crowded, with different areas of specialization and nuance being the differentiating factors. What are the core sets of workloads that SingleStore is aimed at addressing?
What are some of the capabilities that it offers to reduce the need to incorporate multiple data stores for application and analytical architectures?
What are some of the most valuable lessons that you learned in your time at Microsoft that are applicable to SingleStore’s product focus and direction?
Nikita Shamgunov joined the show in October of 2018 to talk about what was then MemSQL. What are the notable changes in the engine and business that have occurred in the intervening time?
What are the macroscopic trends in data management and application development that are having the most impact on product direction?
For engineering teams that are already invested in, or considering adoption of, the "modern data stack" paradigm, where does SingleStore fit in that architecture?
What are the services or tools that might be replaced by an installation of SingleStore?
What are the efficiencies or new capabilities that an engineering team might expect by adopting SingleStore?
What are some of the features that are underappreciated/overlooked which you would like to call attention to?
What are the most interesting, innovative, or unexpected ways that you have seen SingleStore used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SingleStore?
When is SingleStore the wrong choice?
What do you have planned for the future of SingleStore?
Contact Info
LinkedIn
@ShireeshThota on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
MemSQL Interview With Nikita Shamgunov
SingleStore
MS SQL Server
Azure Cosmos DB
CitusDB
Podcast Episode
Debezium
Podcast Episode
PostgreSQL
Podcast Episode
MySQL
HTAP == Hybrid Transactional-Analytical Processing
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

May 30, 2022 • 1h 3min
Data Cloud Cost Optimization With Bluesky Data
Summary
The latest generation of data warehouse platforms has brought unprecedented operational simplicity and effectively infinite scale. Along with those benefits, it has also introduced a new consumption model that can lead to incredibly expensive bills at the end of the month. To ensure that you can explore and analyze your data without spending money on inefficient queries, Mingsheng Hong and Zheng Shao created Bluesky Data. In this episode they explain how their platform optimizes your Snowflake warehouses to reduce cost, as well as the improvements that you can make in your own queries to reduce their contribution to your bill.
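Bluesky’s internals aren’t covered in these notes, but the kind of raw signal such a tool starts from can be pulled out of Snowflake’s own metadata. The sketch below is an illustrative approximation, not Bluesky’s implementation: it assumes the snowflake-connector-python package and a role that can read the SNOWFLAKE.ACCOUNT_USAGE schema, and the account identifier and credentials are hypothetical.

```python
# An illustrative sketch (not Bluesky's implementation) of the raw signal
# a cost optimizer starts from: Snowflake's built-in QUERY_HISTORY metadata.
# Assumes the snowflake-connector-python package and a role that can read
# the SNOWFLAKE.ACCOUNT_USAGE schema; account and credentials are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",  # hypothetical account identifier
    user="COST_ANALYST",        # hypothetical credentials
    password="change-me",
)

try:
    cur = conn.cursor()
    # Rank the last week's queries by elapsed time; long-running, scan-heavy
    # queries are the usual candidates for rewrites or warehouse right-sizing.
    cur.execute(
        """
        SELECT query_id,
               warehouse_name,
               total_elapsed_time / 1000 AS elapsed_seconds,
               bytes_scanned,
               LEFT(query_text, 80)      AS query_preview
        FROM snowflake.account_usage.query_history
        WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
        ORDER BY total_elapsed_time DESC
        LIMIT 20
        """
    )
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```

Ranking by elapsed time or bytes scanned is only a crude proxy for spend; turning that raw history into concrete warehouse and query recommendations is the gap a dedicated optimizer like Bluesky aims to fill.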
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open source DataHub is running in production at companies like Peloton, Optum, Udemy, and Zynga. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Mingsheng Hong and Zheng Shao about Bluesky Data, where they are combining domain expertise and machine learning to optimize your cloud warehouse usage and reduce operational costs
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Bluesky is and the story behind it?
What are the platforms/technologies that you are focused on in your current early stage?
What are some of the other targets that you are considering once you validate your initial hypothesis?
Cloud cost optimization is an active area for application infrastructures as well. What are the corollaries and differences between compute and storage optimization strategies and what you are doing at Bluesky?
How have your experiences at hyperscale companies using various combinations of cloud and on-premise data platforms informed your approach to the cost management problem faced by adopters of cloud data systems?
What are the most significant drivers of cost in cloud data systems?
What are the factors (e.g. pricing models, organizational usage, inefficiencies) that lead to such inflated costs?
What are the signals that you collect for identifying targets for optimization and tuning?
Can you describe how the Bluesky mission control platform is architected?
What are the current areas of uncertainty or active research that you are focused on?
What is the workflow for a team or organization that is adding Bluesky to their system?
How does the usage of Bluesky change as teams move from the initial optimization and dramatic cost reduction into a steady state?
What are the most interesting, innovative, or unexpected ways that you have seen teams approaching cost management in the absence of Bluesky?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bluesky?
When is Bluesky the wrong choice?
What do you have planned for the future of Bluesky?
Contact Info
Mingsheng
LinkedIn
@mingshenghong on Twitter
Zheng
LinkedIn
@zshao9 on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Bluesky Data
Get A Free Health Check For Your Snowflake From Bluesky
RocksDB
Snowflake
Podcast Episode
Trino
Podcast Episode
Firebolt
Podcast Episode
BigQuery
Hive
Vertica
Michael Stonebraker
Teradata
C-Store Paper
Ottertune
Podcast Episode
dbt
Podcast Episode
infracost
Subtract: The Untapped Science of Less by Leidy Klotz
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast