

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes of the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Jul 17, 2022 • 1h 7min
Making The Total Cost Of Ownership For External Data Manageable With Crux
Summary
There are extensive and valuable data sets that are available outside the bounds of your organization. Whether that data is public, paid, or scraped, it requires investment and upkeep to acquire and integrate it with your systems. Crux was built to reduce the total cost of acquisition and ownership for integrating external data, offering a fully managed service for delivering those data assets in the manner that best suits your infrastructure. In this episode Crux CTO Mark Etherington discusses the different costs involved in managing external data, how to think about the total return on investment for your data, and how the Crux platform is architected to reduce the toil involved in managing third-party data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
Your host is Tobias Macey and today I’m interviewing Mark Etherington about Crux, a platform that helps organizations scale their most critical data delivery, operations, and transformation needs
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Crux is and the story behind it?
What are the categories of information that organizations use external data sources for?
What are the challenges and long-term costs related to integrating external data sources that are most often overlooked or underestimated?
What are some of the primary risks involved in working with external data sources?
How do you work with customers to help them understand the long-term costs associated with integrating various sources?
How does that play into the broader conversation about assessing the value of a given dataset?
Can you describe how you have architected the Crux platform?
How have the design and goals of the platform changed or evolved since you started working on it?
What are the design choices that have had the most significant impact on your ability to reduce operational complexity and maintenance overhead for the data you are working with?
For teams who are relying on Crux to manage external data, what is involved in setting up the initial integration with your system?
What are the steps to on-board new data sources?
How do you manage data quality/data observability across your different data providers?
What kinds of signals do you propagate to your customers to feed into their operational platforms?
What are the most interesting, innovative, or unexpected ways that you have seen Crux used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Crux?
When is Crux the wrong choice?
What do you have planned for the future of Crux?
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Crux
Thomson Reuters
Goldman Sachs
JP Morgan
Avro
ESG == Environmental, Social, and Governance Data
Selenium
Google Cloud Platform
Cadence
Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By: Shipyard:
Shipyard is an orchestration platform that helps data teams build out solid data operations from the get-go by connecting data tools and streamlining data workflows. Shipyard offers low-code templates that are configured using a visual interface, replacing the need to write code to build workflows while enabling engineers to get their work into production faster. If a solution can’t be built with existing templates, engineers can always automate scripts in the language of their choice to bring any internal or external process into their workflows.
Observability and alerting are built into the Shipyard platform, ensuring that breakages are identified before being discovered downstream by business teams. With a high level of concurrency, scalability, and end-to-end encryption, Shipyard enables data teams to accomplish more without relying on other teams or worrying about infrastructure challenges, while also ensuring that business teams trust the data made available to them. Go to [dataengineeringpodcast.com/shipyard](https://www.dataengineeringpodcast.com/shipyard) to get started automating powerful workflows with their free developer plan today!
Support Data Engineering Podcast

Jul 10, 2022 • 40min
Charting the Path of Riskified's Data Platform Journey
Summary
Building a data platform is a journey, not a destination. Beyond the work of assembling a set of technologies and building integrations across them, there is also the work of growing and organizing a team that can support and benefit from that platform. In this episode Inbar Yogev and Lior Winner share the journey that they and their teams at Riskified have been on for their data platform. They also discuss how they have established a guild system for training and supporting data professionals in the organization.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
Your host is Tobias Macey and today I’m interviewing Inbar Yogev and Lior Winner about the data platform that the team at Riskified are building to power their fraud management service
Interview
Introduction
How did you get involved in the area of data management?
What does Riskified do?
Can you describe the role of data at Riskified?
What are some of the core types and sources of information that you are dealing with?
Who/what are the primary consumers of the data that you are responsible for?
What are the team structures that you have tested for your data professionals?
What is the composition of your data roles? (e.g. ML engineers, data engineers, data scientists, data product managers, etc.)
What are the organizational constraints that have the biggest impact on the design and usage of your data systems?
Can you describe the current architecture of your data platform?
What are some of the most notable evolutions/redesigns that you have gone through?
What is your process for establishing and evaluating selection criteria for any new technologies that you adopt?
How do you facilitate knowledge sharing between data professionals?
What have you found to be the most challenging technological and organizational complexities that you have had to address on the path to your current state?
What are the methods that you use for staying up to date with the data ecosystem? (opportunity to discuss Haya Data conference)
In your role as organizers of the Haya Data conference, what are some of the insights that you have gained into the present state and future trajectory of the data community?
What are the most interesting, innovative, or unexpected ways that you have seen the Riskified data platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data platform for Riskified?
What do you have planned for the future of your data platform?
Contact Info
Inbar
LinkedIn
Lior
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Riskified
ADABAS
Aerospike
Podcast Episode
Neo4J
Kafka
Delta Lake
Podcast Episode
Databricks
Snowflake
Podcast Episode
Tableau
Looker
Podcast Episode
Redshift
Event Sourcing
Avro
hayaData Conference
Data Mesh
Data Catalog
Data Governance
MLOps
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 10, 2022 • 1h 5min
Maintain Your Data Engineers' Sanity By Embracing Automation
Summary
Building and maintaining reliable data assets is the prime directive for data engineers. While it is easy to say, it is endlessly complex to implement, requiring data professionals to be experts in a wide range of disparate topics while designing and implementing complex topologies of information workflows. In order to make this a tractable problem it is essential that engineers embrace automation at every opportunity. In this episode Chris Riccomini shares his experiences building and scaling data operations at WePay and LinkedIn, as well as the lessons he has learned working with other teams as they automated their own systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Chris Riccomini about building awareness of data usage into CI/CD pipelines for application development
Interview
Introduction
How did you get involved in the area of data management?
What are the pieces of data platforms and processing that have been most difficult to scale in an organizational sense?
What are the opportunities for automation to alleviate some of the toil that data and analytics engineers get caught up in?
The application delivery ecosystem has been going through ongoing transformation in the form of CI/CD, infrastructure as code, etc. What are the parallels in the data ecosystem that are still nascent?
What are the principles that still need to be translated for data practitioners? Which are subject to impedance mismatch and may never make sense to translate?
As someone with a software engineering background and extensive experience working in data, what are the missing links to make those teams/objectives work together more seamlessly?
How can tooling and automation help in that endeavor?
A key factor in the adoption of automation for application delivery is automated tests. What are some of the strategies you find useful for identifying scope and targets for testing/monitoring of data products?
As data usage and capabilities grow and evolve in an organization, what are the junction points that are in greatest need of well-defined data contracts?
How can automation aid in enforcing and alerting on those contracts in a continuous fashion?
What are the most interesting, innovative, or unexpected ways that you have seen automation of data operations used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on automation for data systems?
When is automation the wrong choice?
What does the future of data engineering look like?
Contact Info
Website
@criccomini on Twitter
criccomini on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
WePay
Enterprise Service Bus
The Missing README
Hadoop
Confluent Schema Registry
Podcast Episode
Avro
CDC == Change Data Capture
Debezium
Podcast Episode
Data Mesh
What the heck is a data mesh? blog post
SRE == Site Reliability Engineer
Terraform
Chef configuration management tool
Puppet configuration management tool
Ansible configuration management tool
BigQuery
Airflow
Pulumi
Podcast.__init__ Episode
Monte Carlo
Podcast Episode
Bigeye
Podcast Episode
Anomalo
Podcast Episode
Great Expectations
Podcast Episode
Schemata
Data Engineering Weekly newsletter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 3, 2022 • 1h 11min
Be Confident In Your Data Integration By Quickly Validating Matching Records With data-diff
Summary
The perennial challenge of data engineers is ensuring that information is integrated reliably. While it is straightforward to know whether a synchronization process succeeded, it is not always clear whether every record was copied correctly. In order to quickly identify if and how two data systems are out of sync Gleb Mezhanskiy and Simon Eskildsen partnered to create the open source data-diff utility. In this episode they explain how the utility is implemented to run quickly and how you can start using it in your own data workflows to ensure that your data warehouse isn’t missing any records from your source systems.
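The links below mention md5, CRC32, and Merkle trees, which hint at the core trick: checksum whole ranges of primary keys on both databases and only drill into the ranges whose checksums disagree, so matching data never has to be transferred row by row. Below is a minimal, illustrative Python sketch of that bisection idea, not the data-diff implementation itself: the `source` and `target` callables are hypothetical stand-ins for range queries, and the real tool computes checksums inside each database as SQL rather than pulling rows locally as this toy does.

```python
import hashlib

def checksum(rows):
    """One MD5 digest per key range stands in for row-by-row comparison."""
    h = hashlib.md5()
    for pk, value in rows:
        h.update(f"{pk}:{value}".encode())
    return h.hexdigest()

def find_mismatched_rows(source, target, lo, hi, leaf_size=1000):
    """Bisect the primary-key space, recursing only into unequal ranges.

    source/target are callables returning sorted (pk, value) pairs for
    lo <= pk < hi -- hypothetical stand-ins for per-database range queries.
    """
    src_rows, tgt_rows = source(lo, hi), target(lo, hi)
    if checksum(src_rows) == checksum(tgt_rows):
        return []  # whole range matches: prune it without comparing rows
    if hi - lo <= leaf_size:
        return sorted(set(src_rows) ^ set(tgt_rows))  # rows on one side only
    mid = (lo + hi) // 2
    return (find_mismatched_rows(source, target, lo, mid, leaf_size)
            + find_mismatched_rows(source, target, mid, hi, leaf_size))
```

Because equal ranges are pruned after a single checksum comparison, two large tables that differ in only a handful of rows can be reconciled with a number of queries roughly logarithmic in the size of the key space.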
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Random data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account, go to dataengineeringpodcast.com/tonic today to give it a try!
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy and Simon Eskildsen about their work to open source the data diff utility that they have been building at Datafold
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the data diff tool is and the story behind it?
What was your motivation for going through the process of releasing your data diff functionality as an open source utility?
What are some of the ways that data-diff composes with other data quality tools? (e.g. Great Expectations, Soda SQL, etc.)
Can you describe how data-diff is implemented?
Given the target of having a performant and scalable utility how did you approach the question of language selection?
What are some of the ways that you have seen data-diff incorporated in the workflow of data teams?
What were the steps that you needed to do to get the project cleaned up and separated from your internal implementation for release as open source?
What are the most interesting, innovative, or unexpected ways that you have seen data-diff used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data-diff?
When is data-diff the wrong choice?
What do you have planned for the future of data-diff?
Contact Info
Gleb
LinkedIn
@glebmm on Twitter
Simon
Website
@Sirupsen on Twitter
sirupsen on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Datafold
Podcast Episode
data-diff
Autodesk
Airbyte
Podcast Episode
Debezium
Podcast Episode
Napkin Math newsletter
Airflow
Dagster
Podcast Episode
Great Expectations
Podcast Episode
dbt
Podcast Episode
Trino
Preql
Podcast.__init__ Episode
Erez Shinan
Fivetran
Podcast Episode
md5
CRC32
Merkle Tree
Locally Optimistic
Presto
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Special Guest: Gleb Mezhanskiy.
Support Data Engineering Podcast

Jul 3, 2022 • 59min
The View From The Lakehouse Of Architectural Patterns For Your Data Platform
Summary
The ecosystem for data tools has been going through rapid and constant evolution over the past several years. These technological shifts have brought about corresponding changes in data and platform architectures for managing data and analytical workflows. In this episode Colleen Tartow shares her insights into the motivating factors and benefits of the most prominent patterns that are in the popular narrative; data mesh and the modern data stack. She also discusses her views on the role of the data lakehouse as a building block for these architectures and the ongoing influence that it will have as the technology matures.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
Your host is Tobias Macey and today I’m interviewing Colleen Tartow about her views on the forces shaping the current generation of data architectures
Interview
Introduction
How did you get involved in the area of data management?
In your opinion as an astrophysicist, how well does the metaphor of a starburst map to your current work at the company of the same name?
Can you describe what you see as the dominant factors that influence a team’s approach to data architecture and design?
Two of the most repeated (often misattributed) terms in the data ecosystem for the past couple of years are the "modern data stack" and the "data mesh". As someone who is working at a company that can be construed to provide solutions for either/both of those patterns, what are your thoughts on their lasting strength and long-term viability?
What do you see as the strengths of the emerging lakehouse architecture in the context of the "modern data stack"?
What are the factors that have prevented it from being a default choice compared to cloud data warehouses? (e.g. BigQuery, Redshift, Snowflake, Firebolt, etc.)
What are the recent developments that are contributing to its current growth?
What are the weak points/sharp edges that still need to be addressed? (both internal to the platforms and in the external ecosystem/integrations)
What are some of the implementation challenges that teams often experience when trying to adopt a lakehouse strategy as the core building block of their data systems?
What are some of the exercises that they should be performing to help determine their technical and organizational capacity to support that strategy over the long term?
One of the core requirements for a data mesh implementation is to have a common system that allows for product teams to easily build their solutions on top of. How do lakehouse/data virtualization systems allow for that?
What are some of the lessons that need to be shared with engineers to help them make effective use of these technologies when building their own data products?
What are some of the supporting services that are helpful in these undertakings?
What do you see as the forces that will have the most influence on the trajectory of data architectures over the next 2 to 5 years?
What are the most interesting, innovative, or unexpected ways that you have seen lakehouse architectures used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Starburst product?
When is a lakehouse the wrong choice?
What do you have planned for the future of Starburst’s technology platform?
Contact Info
LinkedIn
@ctartow on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Starburst
Trino
Teradata
Cognos
Data Lakehouse
Data Virtualization
Iceberg
Podcast Episode
Hudi
Podcast Episode
Delta
Podcast Episode
Snowflake
Podcast Episode
AWS Lake Formation
Clickhouse
Podcast Episode
Druid
Pinot
Podcast Episode
Starburst Galaxy
Varada
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 27, 2022 • 1h 9min
Strategies And Tactics For A Successful Master Data Management Implementation
Summary
The most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics. In this episode Malcolm Hawker shares his years of experience working in this domain to explore the combination of technical and social skills that are necessary to make an MDM project successful both at the outset and over the long term.
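The links below name the Levenshtein distance and Soundex, two classic building blocks for the record-matching half of an MDM implementation. As a rough illustration only (not how Profisee is implemented), here is a sketch of a match rule that treats two customer records as the same entity when their names are within a small edit distance and their postcodes agree; the field names and threshold are invented for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def same_customer(rec_a: dict, rec_b: dict, max_edits: int = 2) -> bool:
    """Crude match rule: similar names plus an exact postcode match."""
    return (levenshtein(rec_a["name"].lower(), rec_b["name"].lower()) <= max_edits
            and rec_a["postcode"] == rec_b["postcode"])

print(same_customer({"name": "Jon Smyth", "postcode": "02139"},
                    {"name": "John Smith", "postcode": "02139"}))  # True
```

Production MDM tools layer many such rules, weight them, and apply survivorship logic to decide which attribute values win in the resulting golden record.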
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Random data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account, go to dataengineeringpodcast.com/tonic today to give it a try!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Malcolm Hawker about master data management strategies for the enterprise
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving your definition of what MDM is and the scope of activities/functions that it includes?
How have evolutions in the data landscape shifted the conversation around MDM?
Can you describe what Profisee is and the story behind it?
What was your path to joining Profisee and what is your role in the business?
Who are the target customers for Profisee?
What are the challenges that they typically experience that leads them to MDM as a solution for their problems?
How does the narrative around data observability/data quality from tools such as Great Expectations, Monte Carlo, etc. differ from the data quality benefits of a MDM strategy?
How do recent conversations around semantic/metrics layers compare to the way that MDM approaches the problem of domain modeling?
What are the steps to defining an MDM strategy for an organization or business unit?
Once there is a strategy, what are the tactical elements of the implementation?
What is the role of the toolchain in that implementation? (e.g. Spark, dbt, Airflow, etc.)
Can you describe how Profisee is implemented?
How does the customer base inform the architectural approach that Profisee has taken?
Can you describe the adoption process for an organization that is using Profisee for their MDM?
Once an organization has defined and adopted an MDM strategy, what are the ongoing maintenance tasks related to the domain models?
What are the most interesting, innovative, or unexpected ways that you have seen MDM used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working in MDM?
When is Profisee the wrong choice?
What do you have planned for the future of Profisee?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Profisee
MDM == Master Data Management
CRM == Customer Relationship Management
ERP == Enterprise Resource Planning
Levenshtein Distance Algorithm
Soundex
CDP == Customer Data Platform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 27, 2022 • 1h 7min
Bring Geospatial Analytics Across Disparate Datasets Into Your Toolkit With The Unfolded Platform
Summary
The proliferation of sensors and GPS devices has dramatically increased the number of applications for spatial data, and the need for scalable geospatial analytics. In order to reduce the friction involved in aggregating disparate data sets that share geographic similarities the Unfolded team built a platform that supports working across raster, vector, and tabular data in a single system. In this episode Isaac Brodsky explains how the Unfolded platform is architected, their experience joining the team at Foursquare, and how you can start using it for analyzing your spatial data today.
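One concrete mechanism for that kind of cross-dataset aggregation is hexagonal spatial indexing with H3 (linked below): once every record is keyed by the H3 cell that contains it, joins across point, polygon, and raster data reduce to equality on a shared key. A small sketch, assuming the v4 h3-py API (`latlng_to_cell`; earlier releases call it `geo_to_h3`) and made-up sample data:

```python
import h3

# Hypothetical datasets: (lat, lng, value) triples from two unrelated sources.
sensor_readings = [(37.7749, -122.4194, 12.1), (37.7751, -122.4190, 11.8)]
venue_visits = [(37.7750, -122.4193, 340)]

def bucket_by_cell(records, res=9):
    """Key each record by the H3 cell containing it (res 9 hexes are ~0.1 km^2)."""
    buckets = {}
    for lat, lng, value in records:
        cell = h3.latlng_to_cell(lat, lng, res)
        buckets.setdefault(cell, []).append(value)
    return buckets

readings = bucket_by_cell(sensor_readings)
visits = bucket_by_cell(venue_visits)

# Joining the two datasets is now just an intersection of cell keys.
for cell in readings.keys() & visits.keys():
    avg_reading = sum(readings[cell]) / len(readings[cell])
    print(cell, avg_reading, sum(visits[cell]))
```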
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3D point clouds and geospatial records, to industry-specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, third-party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.
Your host is Tobias Macey and today I’m interviewing Isaac Brodsky about Foursquare’s Unfolded platform for working with spatial data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the Unfolded platform is and the story behind it?
What are some of the core challenges of working with spatial data?
What are some of the sources that organizations rely on for collecting or generating those data sets?
What are the capabilities that the Unfolded platform offers for spatial analytics?
What use cases are you primarily focused on supporting?
What (if any) are the datasets or analyses that you are consciously not investing in supporting?
Can you describe how the Unfolded platform is implemented?
How have the design and goals shifted or evolved since you started working on Unfolded?
What are the new constraints or opportunities that are available after the merger with Foursquare?
Can you describe a typical workflow for someone using Unfolded to manage their spatial information and build an analysis on top of it?
What are some of the data modeling considerations that are necessary when populating a custom data set with Unfolded?
What are some of the techniques that you needed to build to allow for loading large data sets into a user’s browser while maintaining sufficient performance?
What are the most interesting, innovative, or unexpected ways that you have seen Unfolded used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Unfolded?
When is Unfolded the wrong choice?
What do you have planned for the future of Unfolded?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Unfolded Platform
H3 Hexagonal Map Tiles Library
Carto
Mapbox
Open Street Map
Raster Files
Hex Tiles
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By: Unstruk:
Unstruk Data offers an API-driven solution to simplify the process of transforming unstructured data files into actionable intelligence about real-world assets without writing a line of code – putting insights generated from this data at enterprise teams’ fingertips. The company was founded in 2021 by Kirk Marple after his tenure as CTO of Kespry. Kirk possesses extensive industry knowledge including over 25 years of experience building and architecting scalable SaaS platforms and applications, prior successful startup exits, and deep unstructured and perception data experience. Unstruk investors include 8VC, Preface Ventures, Valia Ventures, Shell Ventures and Stage Venture Partners.
Go to [dataengineeringpodcast.com/unstruk](https://www.dataengineeringpodcast.com/unstruk) today to transform your messy collection of unstructured data files into actionable assets that power your business!
Support Data Engineering Podcast

Jun 19, 2022 • 53min
Level Up Your Data Platform With Active Metadata
Summary
Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems. In order to level up their value a new trend of active metadata is being implemented, allowing use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance. In this episode Prukalpa Sankar joins the show to talk about the work she and her team at Atlan are doing to push this capability into the mainstream.
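To make the "active" part concrete, the pattern is essentially event-driven: metadata changes are emitted as events, and handlers act on them rather than only cataloging them. The sketch below is purely illustrative; the event shape and both actions are invented for this example and are not Atlan's API.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataEvent:
    asset: str                      # e.g. "warehouse.analytics.orders"
    kind: str                       # e.g. "freshness", "quality", "usage"
    payload: dict = field(default_factory=dict)

def notify_chat(message: str) -> None:  # stand-in for a Slack/Teams integration
    print(f"[chat] {message}")

def suspend_warehouse(asset: str) -> None:  # stand-in for a warehouse API call
    print(f"[warehouse] suspending compute for {asset}")

def handle(event: MetadataEvent) -> None:
    # Push freshness downstream so consumers see stale dashboards flagged.
    if event.kind == "freshness" and event.payload.get("hours_stale", 0) > 24:
        notify_chat(f"{event.asset} is stale: {event.payload}")
    # Use usage metadata to right-size compute automatically.
    if event.kind == "usage" and event.payload.get("queries_last_24h", 1) == 0:
        suspend_warehouse(event.asset)

handle(MetadataEvent("warehouse.analytics.orders", "freshness", {"hours_stale": 36}))
```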
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests, and continuous deployment with a simple-to-use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
Your host is Tobias Macey and today I’m interviewing Prukalpa Sankar about how data platforms can benefit from the idea of "active metadata" and the work that she and her team at Atlan are doing to make it a reality
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what "active metadata" is and how it differs from the current approaches to metadata systems?
What are some of the use cases that "active metadata" can enable for data producers and consumers?
What are the points of friction that those users encounter in the current formulation of metadata systems?
Central metadata systems/data catalogs came about as a solution to the challenge of integrating every data tool with every other data tool, giving a single place to integrate. What are the lessons that are being learned from the "modern data stack" that can be applied to centralized metadata?
Can you describe the approach that you are taking at Atlan to enable the adoption of "active metadata"?
What are the architectural capabilities that you had to build to power the outbound traffic flows?
How are you addressing the N x M integration problem for pushing metadata into the necessary contexts at Atlan?
What are the interfaces that are necessary for receiving systems to be able to make use of the metadata that is being delivered?
How does the type/category of metadata impact the type of integration that is necessary?
What are some of the automation possibilities that metadata activation offers for data teams?
What are the cases where you still need a human in the loop?
What are the most interesting, innovative, or unexpected ways that you have seen active metadata capabilities used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on activating metadata for your users?
When is an active approach to metadata the wrong choice?
What do you have planned for the future of Atlan and active metadata?
Contact Info
LinkedIn
@prukalpa on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Atlan
What is Active Metadata?
Segment
Podcast Episode
Zapier
ArgoCD
Kubernetes
Wix
AWS Lambda
Modern Data Culture Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 19, 2022 • 43min
Combining The Simplicity Of Spreadsheets With The Power Of Modern Data Infrastructure At Canvas
Summary
Data analysis is a valuable exercise that is often out of reach of non-technical users as a result of the complexity of data systems. In order to lower the barrier to entry, Ryan Buick created the Canvas application with a spreadsheet-oriented workflow that is understandable to a wide audience. In this episode Ryan explains how he and his team have designed their platform to bring everyone onto a level playing field and the benefits that it provides to the organization.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3D point clouds and geospatial records, to industry-specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, and consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.
Your host is Tobias Macey and today I’m interviewing Ryan Buick about Canvas, a spreadsheet interface for your data that lets everyone on your team explore data without having to learn SQL
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Canvas is and the story behind it?
The "modern data stack" has enabled organizations to analyze unparalleled volumes of data. What are the shortcomings in the operating model that keeps business users dependent on engineers to answer their questions?
Why is the spreadsheet such a popular and persistent metaphor for working with data?
What are the biggest issues that existing spreadsheet software runs up against as it scales both technically and organizationally?
What are the new metaphors/design elements that you needed to develop to extend the existing capabilities and use cases of spreadsheets while keeping them familiar?
Can you describe how the Canvas platform is implemented?
How have the design and goals of the product changed/evolved since you started working on it?
What is the workflow for a business user that is using Canvas to iterate on a series of questions?
What are the collaborative features that you have built into Canvas and who are they for? (e.g. other business users, data engineers <-> business users, etc.)
What are the situations where the spreadsheet abstraction starts to break down?
What are the extension points/escape hatches that you have built into the product for when that happens?
What are the most interesting, innovative, or unexpected ways that you have seen Canvas used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Canvas?
When is Canvas the wrong choice?
What do you have planned for the future of Canvas?
Contact Info
LinkedIn
@ryanjbuick on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Canvas
Flexport
Podcast Episode about their data mesh implementation
Excel
Lightdash
Podcast Episode
dbt
Podcast Episode
Figma
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 13, 2022 • 49min
Discover And De-Clutter Your Unstructured Data With Aparavi
Summary
Unstructured data takes many forms in an organization. From a data engineering perspective, that often means things like JSON files, audio or video recordings, images, etc. Another category of unstructured data that every business deals with is PDFs, Word documents, workstation backups, and countless other types of information. Aparavi was created to tame the sprawl of information across machines, datacenters, and clouds so that you can reduce the amount of duplicate data and save time and money on managing your data assets. In this episode Rod Christensen shares the story behind Aparavi and how you can use it to cut costs and gain value for the long tail of your unstructured data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Rod Christensen about Aparavi, a platform designed to find and unlock the value of data, no matter where it lives
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Aparavi is and the story behind it?
Who are the target customers for Aparavi and how does that inform your product roadmap and messaging?
What are some of the insights that you are able to provide about an organization’s data?
Once you have generated those insights, what are some of the actions that they typically catalyze?
What are the types of storage and data systems that you integrate with?
Can you describe how the Aparavi platform is implemented?
How do the trends in cloud storage and data systems influence the ways that you evolve the system?
Can you describe a typical workflow for an organization using Aparavi?
What are the mechanisms that you use for categorizing data assets?
What are the interfaces that you provide for data owners and operators to provide heuristics to customize classification/cataloging of data?
How can teams integrate with Aparavi to expose its insights to other tools for uses such as automation or data catalogs?
What are the most interesting, innovative, or unexpected ways that you have seen Aparavi used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aparavi?
When is Aparavi the wrong choice?
What do you have planned for the future of Aparavi?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Aparavi
SHA-512
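To make the SHA-512 reference above concrete: content hashing is a common way to detect duplicate files, since two files with the same digest are, for all practical purposes, identical. The Python sketch below is a minimal illustration of that general technique, not Aparavi's actual implementation; the function names and directory-walking logic are assumptions for demonstration only.

import hashlib
from collections import defaultdict
from pathlib import Path

def sha512_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so large files do not exhaust memory.
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root: Path) -> dict[str, list[Path]]:
    # Group every file under `root` by its content digest.
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_digest[sha512_of_file(path)].append(path)
    # Keep only digests shared by more than one file, i.e. duplicates.
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_duplicates(Path(".")).items():
        print(digest[:16], "->", [str(p) for p in paths])

Hashing file contents rather than names means renamed copies are still caught, and the chunked read keeps memory usage flat even for very large files.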
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast