

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Sep 4, 2021 • 60min
Designing And Building Data Platforms As A Product
Summary
The term "data platform" gets thrown around a lot, but have you stopped to think about what it actually means for you and your organization? In this episode Lior Gavish, Lior Solomon, and Atul Gupte share their view of what it means to have a data platform, discuss their experiences building them at various companies, and provide advice on how to treat them like a software product. This is a valuable conversation about how to approach the work of selecting the tools that you use to power your data systems and considerations for how they can be woven together for a unified experience across your various stakeholders.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Lior Gavish, Lior Solomon, and Atul Gupte about the technical, social, and architectural aspects of building your data platform as a product for your internal customers
Interview
Introduction
How did you get involved in the area of data management? – all
Can we start by establishing a definition of "data platform" for the purpose of this conversation?
Who are the stakeholders in a data platform?
Where does the responsibility lie for creating and maintaining ("owning") the platform?
What are some of the technical and organizational constraints that are likely to factor into the design and execution of the platform?
What are the minimum set of requirements necessary to qualify as a platform? (as opposed to a collection of discrete components)
What are the additional capabilities that should be in place to simplify the use and maintenance of the platform?
How are data platforms managed? Are they managed by technical teams, product managers, etc.? What is the profile for a data product manager? – Atul G.
How do you set SLIs / SLOs with your data platform team when you don’t have clear metrics you’re tracking? – Lior S.
There has been a lot of conversation recently about different interpretations of the "modern data stack". For a team who is just starting to build out their platform, how much credence should they be giving to those debates?
What are the first steps that you recommend for those practitioners?
If an organization already has infrastructure in place for data/analytics, how might they think about building or buying their way toward a well integrated platform?
Once a platform is established, what are some challenges that teams should anticipate in scaling the platform?
Which axes of scale have you found to be most difficult to manage? (scale of infrastructure capacity, scale of organizational/technical complexity, scale of usage, etc.)
Do we think the "data platform" is a skill set? How do we split up the role of the platform? Is there one for real-time? Is there one for ETLs?
How do you handle the quality and reliability of the data powering your solution?
What are helpful techniques that you have used for collecting, prioritizing, and managing feature requests?
How do you justify the budget and resources for your data platform?
How do you measure the success of a data platform?
What is the relationship between a data platform and data products?
Are there any other companies you admire when it comes to building robust, scalable data architecture?
What are the most interesting, innovative, or unexpected ways that you have seen data platforms used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building and operating a data platform?
When is a data platform the wrong choice? (as opposed to buying an integrated solution, etc.)
What are the industry trends that you are monitoring/excited for in the space of data platforms?
Contact Info
Lior Gavish
LinkedIn
@lgavish on Twitter
Lior Solomon
LinkedIn
@liorsolomon on Twitter
Atul Gupte
LinkedIn
@atulgupte on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Monte Carlo
Vimeo
Facebook
Uber
Zynga
Great Expectations
Podcast Episode
Airflow
Podcast.__init__ Episode
Fivetran
Podcast Episode
dbt
Podcast Episode
Snowflake
Podcast Episode
Looker
Podcast Episode
Modern Data Stack Podcast Episode
Stitch
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 2, 2021 • 1h 1min
Presto Powered Cloud Data Lakes At Speed Made Easy With Ahana
Summary
The Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. In recent months the community has focused their efforts on making it the fastest possible option for running your analytics in the cloud. In this episode Dipti Borkar discusses the work that she and her team are doing at Ahana to simplify the work of running your own PrestoDB environment in the cloud. She explains how they are optimizing the runtime to reduce latency and increase query throughput, the ways that they are contributing back to the open source community, and the exciting improvements that are in the works to make Presto an even more powerful option for all of your analytics.
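To make the idea of running SQL analytics directly against a data lake a little more concrete, here is a minimal sketch (not taken from the episode, and not specific to Ahana) of querying a Presto cluster from Python with the open source presto-python-client. The hostname, catalog, schema, and table names are hypothetical placeholders.

```python
# A minimal sketch of querying a Presto coordinator from Python using the
# open source presto-python-client (pip install presto-python-client).
# The host, catalog, schema, and table names below are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # assumed coordinator address
    port=8080,
    user="analyst",
    catalog="hive",    # e.g. a Hive/Glue catalog over Parquet/ORC files in S3
    schema="web",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT event_date, count(*) AS page_views
    FROM page_view_events
    WHERE event_date >= date '2021-08-01'
    GROUP BY event_date
    ORDER BY event_date
    """
)
for event_date, page_views in cur.fetchall():
    print(event_date, page_views)
```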
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Dipti Borkar, co-founder of Ahana, about Presto and Ahana, a SaaS managed service for Presto
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Ahana is and the story behind it?
There has been a lot of recent activity in the Presto community. Can you give an overview of the options that are available for someone wanting to use its SQL engine for querying their data?
What is Ahana’s role in the community/ecosystem?
(happy to skip this question if it’s too contentious) What are some of the notable differences that have emerged over the past couple of years between the Trino (formerly PrestoSQL) and PrestoDB projects?
Another area that has been seeing a lot of activity is data lakes and projects to make them more manageable and feature complete (e.g. Hudi, Delta Lake, Iceberg, Nessie, LakeFS, etc.). How has that influenced your product focus and capabilities?
How does this activity change the calculus for organizations who are deciding on a lake or warehouse for their data architecture?
Can you describe how the Ahana Cloud platform is architected?
What are the additional systems that you have built to manage deployment, scaling, and multi-tenancy?
Beyond the storage and processing, what are the other notable tools and projects that have become part of the overall stack for supporting open analytics?
What are some areas of ongoing activity that you are keeping an eye on as you build out the Ahana offerings?
What are the most interesting, innovative, or unexpected ways that you have seen Ahana/Presto used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ahana?
When is Ahana the wrong choice?
What do you have planned for the future of Ahana?
Contact Info
LinkedIn
@dborkar on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Ahana
Alluxio
Podcast Episode
Couchbase
Kinetica
Tensorflow
PyTorch
Podcast.__init__ Episode
AWS Athena
AWS Glue
Hive Metastore
Clickhouse
Dremio
Podcast Episode
Apache Drill
Teradata
Snowflake
Podcast Episode
BigQuery
RaptorX
Aria Optimizations for Presto
Apache Ranger
Presto Plugin
Trino
Podcast Episode
Starburst
Podcast Episode
Hive
Iceberg
Podcast Episode
Hudi
Podcast Episode
Delta Lake
Podcast Episode
Superset
Podcast.__init__ Episode
Data Engineering Podcast Episode
Nessie
LakeFS
Amundsen
Podcast Episode
DataHub
Podcast Episode
OtterTune
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 28, 2021 • 51min
Do Away With Data Integration Through A Dataware Architecture With Cinchy
Summary
The reason that so much time and energy is spent on data integration is because of how our applications are designed. By making the software be the owner of the data that it generates, we have to go through the trouble of extracting the information to then be used elsewhere. The team at Cinchy are working to bring about a new paradigm of software architecture that puts the data as the central element. In this episode Dan DeMers, Cinchy’s CEO, explains how their concept of a "Dataware" platform eliminates the need for costly and error prone integration processes and the benefits that it can provide for transactional and analytical application design. This is a fascinating and unconventional approach to working with data, so definitely give this a listen to expand your thinking about how to build your systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Dan DeMers about Cinchy, a dataware platform aiming to simplify the work of data integration by eliminating ETL/ELT
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Cinchy is and the story behind it?
In your experience working in data and building complex enterprise-grade systems, what are the shortcomings and negative externalities of an ETL/ELT approach to data integration?
How is a Dataware platform different from a data lake or a data warehouse? What is it used for?
What is Zero-Copy Integration? How does that work?
Can you describe how customers start their Cinchy journey?
What are the main use case patterns that you’re seeing with Dataware?
Your platform offers unlimited users, including business users. What are some of the challenges that you face in building a user experience that doesn’t become overwhelming as an organization scales the number of data sources and processing flows?
What are the most interesting, innovative, or unexpected ways that you have seen Cinchy used?
When is Cinchy the wrong choice for a customer?
Can you describe the technical architecture of the Cinchy platform?
How do you establish connections/relationships among data from disparate sources?
How do you manage schema evolution in source systems?
What are some of the edge cases that users need to consider as they are designing and building those connections?
What are some of the features or capabilities of Cinchy that you think are overlooked or under-utilized?
How has your understanding of the problem space changed since you started working on Cinchy?
How has the architecture and design of the system evolved to reflect that updated understanding?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cinchy?
What do you have planned for the future of Cinchy?
Contact Info
LinkedIn
@dandemers on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Cinchy
Gordon Everest
Data Collaboration Alliance
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 25, 2021 • 58min
Decoupling Data Operations From Data Infrastructure Using Nexla
Summary
The technological and social ecosystem of data engineering and data management has been reaching a stage of maturity recently. As part of this stage in our collective journey the focus has been shifting toward operation and automation of the infrastructure and workflows that power our analytical workloads. It is an encouraging sign for the industry, but it is still a complex and challenging undertaking. In order to make this world of DataOps more accessible and manageable the team at Nexla has built a platform that decouples the logical unit of data from the underlying mechanisms so that you can focus on the problems that really matter to your business. In this episode Saket Saurabh (CEO) and Avinash Shahdadpuri (CTO) share the story behind the Nexla platform, discuss the technical underpinnings, and describe how their concept of a Nexset simplifies the work of building data products for sharing within and between organizations.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Saket Saurabh and Avinash Shahdadpuri about Nexla, a platform for powering data operations and sharing within and across businesses
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Nexla is and the story behind it?
What are the major problems that Nexla is aiming to solve?
What are the components of a data platform that Nexla might replace?
What are the use cases and benefits of being able to publish data sets for use outside and across organizations?
What are the different elements involved in implementing DataOps?
How is the Nexla platform implemented?
What have been the most complex engineering challenges?
How has the architecture changed or evolved since you first began working on it?
What are some of the assumptions that you had at the start which have been challenged or invalidated?
What are some of the heuristics that you have found most useful in generating logical units of data in an automated fashion?
Once a Nexset has been created, what are some of the ways that they can be used or further processed?
What are the attributes of a Nexset? (e.g. access control policies, lineage, etc.)
How do you handle storage and sharing of a Nexset?
What are some of your grand hopes and ambitions for the Nexla platform and the potential for data exchanges?
What are the most interesting, innovative, or unexpected ways that you have seen Nexla used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Nexla?
When is Nexla the wrong choice?
What do you have planned for the future of Nexla?
Contact Info
Saket
LinkedIn
@saketsaurabh on Twitter
Avinash
LinkedIn
@avinashpuri on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Nexla
Nexsets
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 21, 2021 • 28min
Let Your Analysts Build A Data Lakehouse With Cuelake
Summary
Data lakes have been gaining popularity alongside an increase in their sophistication and usability. Despite improvements in performance and data architecture they still require significant knowledge and experience to deploy and manage. In this episode Vikrant Dubey discusses his work on the Cuelake project which allows data analysts to build a lakehouse with SQL queries. By building on top of Zeppelin, Spark, and Iceberg he and his team at Cuebook have built an autoscaled cloud native system that abstracts the underlying complexity.
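Cuelake’s own interface is covered in the conversation; purely as a sketch of the kind of SQL-only workflow that a Zeppelin/Spark/Iceberg stack makes possible, the snippet below upserts staged records into an Iceberg table through Spark SQL. The catalog name, warehouse path, and table names are assumptions for illustration, and it presumes the Iceberg Spark runtime jar is on the classpath and that the referenced tables already exist.

```python
# Not Cuelake itself -- just a sketch of the Spark SQL an analyst notebook on
# a Zeppelin/Spark/Iceberg stack might run. Catalog, path, and table names
# are hypothetical, and the iceberg-spark-runtime package must be available.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sql-sketch")
    # Enable Iceberg's SQL extensions (needed for MERGE INTO) and register a
    # catalog backed by object storage.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-bucket/warehouse")
    .getOrCreate()
)

# An ELT step expressed entirely in SQL: merge the latest order events from a
# staging table into the analytics target table.
spark.sql("""
    MERGE INTO lake.analytics.orders AS t
    USING lake.staging.order_events AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```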
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Vikrant Dubey about Cuebook and their Cuelake project for building ELT pipelines for your data lakehouse entirely in SQL
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Cuelake is and the story behind it?
There are a number of platforms and projects for running SQL workloads and transformations on a data lake. What was lacking in those systems that you are addressing with Cuelake?
Who are the target users of Cuelake and how has that influenced the features and design of the system?
Can you describe how Cuelake is implemented?
What was your selection process for the various components?
What are some of the sharp edges that you have had to work around when integrating these components?
What is involved in getting Cuelake deployed?
How are you using Cuelake in your work at Cuebook?
Given your focus on machine learning for anomaly detection of business metrics, what are the challenges that you faced in using a data warehouse for those workloads?
What are the advantages that a data lake/lakehouse architecture maintains over a warehouse?
What are the shortcomings of the lake/lakehouse approach that are solved by using a warehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Cuelake used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cuelake?
When is Cuelake the wrong choice?
What do you have planned for the future of Cuelake?
Contact Info
LinkedIn
vikrantcue on GitHub
@vkrntd on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Cuelake
Apache Druid
Dremio
Databricks
Zeppelin
Spark
Apache Iceberg
Apache Hudi
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 18, 2021 • 1h 6min
Migrate And Modify Your Data Platform Confidently With Compilerworks
Summary
A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform and the system that they have built to make it a manageable task.
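Compilerworks’ own compilers are proprietary and are discussed in the episode itself; purely to illustrate the general idea of deriving lineage by parsing SQL, here is a toy sketch that uses the unrelated open source sqlglot parser to pull table-level lineage out of a CREATE TABLE AS statement. The SQL and table names are made up, and this is far coarser than the column-level, algebraic analysis described in the interview.

```python
# A toy illustration (not Compilerworks' implementation) of extracting
# table-level lineage by parsing SQL with the open source sqlglot library.
import sqlglot
from sqlglot import exp

sql = """
CREATE TABLE analytics.daily_revenue AS
SELECT o.order_date, SUM(p.amount) AS revenue
FROM sales.orders AS o
JOIN sales.payments AS p ON p.order_id = o.order_id
GROUP BY o.order_date
"""

statement = sqlglot.parse_one(sql)

# The first table in the parsed statement is the one being created (the
# lineage target); every other table referenced feeds it (the sources).
target = statement.find(exp.Table)
sources = {f"{t.db}.{t.name}" for t in statement.find_all(exp.Table)}
sources.discard(f"{target.db}.{target.name}")

print("target :", f"{target.db}.{target.name}")   # analytics.daily_revenue
print("sources:", sorted(sources))                # ['sales.orders', 'sales.payments']
```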
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Shevek about Compilerworks and his work on writing compilers to automate data lineage tracking from your SQL code
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Compilerworks is and the story behind it?
What is a compiler?
How are you applying compilers to the challenges of data processing systems?
What are some use cases that Compilerworks is uniquely well suited to?
There are a number of other methods and systems available for tracking and/or computing data lineage. What are the benefits of the approach that you are taking with Compilerworks?
Can you describe the design and implementation of the Compilerworks platform?
How has the system changed or evolved since you first began working on it?
What programming languages and SQL dialects do you currently support?
Which have been the most challenging to work with?
How do you handle verification/validation of the algebraic representation of SQL code given the variability of implementations and the flexibility of the specification?
Can you talk through the process of getting Compilerworks integrated into a customer’s infrastructure?
What is a typical workflow for someone using Compilerworks to manage their data lineage?
How does Compilerworks simplify the process of migrating between data warehouses/processing platforms?
What are the most interesting, innovative, or unexpected ways that you have seen Compilerworks used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Compilerworks?
When is Compilerworks the wrong choice?
What do you have planned for the future of Compilerworks?
Contact Info
@shevek on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Compilerworks
Compiler
ANSI SQL
Spark SQL
Google Flume Paper
SAS
Informatica
Trie Data Structure
Satisfiability Solver
Lisp
Scheme
Snooker
Qemu Java API
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 15, 2021 • 49min
Prepare Your Unstructured Data For Machine Learning And Computer Vision Without The Toil Using Activeloop
Summary
The vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do when you need to manage unstructured information, or build a computer vision model? Activeloop was created for exactly that purpose. In this episode Davit Buniatyan, founder and CEO of Activeloop, explains why he is spending his time and energy on building a platform to simplify the work of getting your unstructured data ready for machine learning. He discusses the inefficiencies that teams run into from having to reprocess data multiple times, his work on the open source Hub library to solve this problem for everyone, and his thoughts on the vast potential that exists for using computer vision to solve hard and meaningful problems.
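As a rough sketch of the workflow the open source Hub library aims to enable, ingesting images once into an ML-ready format and then streaming them into PyTorch might look roughly like the snippet below. The dataset path, file names, and tensor names are assumptions for illustration, and the exact API can differ between Hub versions.

```python
# A rough sketch of the Hub workflow: store images once in an ML-optimized
# layout, then stream them into a training loop without re-processing the raw
# files. Paths, file names, and tensor names here are hypothetical, and the
# exact API may vary between Hub releases.
import hub

# A local path is used here for simplicity; an S3 or Activeloop-hosted
# "hub://" path works the same way.
ds = hub.dataset("./cats_vs_dogs_hub")

ds.create_tensor("images", htype="image", sample_compression="jpeg")
ds.create_tensor("labels", htype="class_label")

for path, label in [("cat_001.jpg", 0), ("dog_001.jpg", 1)]:
    ds.images.append(hub.read(path))   # lazily read and encode the image file
    ds.labels.append(label)

# Expose the dataset as a PyTorch DataLoader; batches arrive as dictionaries
# keyed by tensor name.
loader = ds.pytorch(batch_size=16, shuffle=True, num_workers=2)
for batch in loader:
    images, labels = batch["images"], batch["labels"]
    break  # training step would go here
```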
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Davit Buniatyan about Activeloop, a platform for hosting and delivering datasets optimized for machine learning
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Activeloop is and the story behind it?
How does the form and function of data storage introduce friction in the development and deployment of machine learning projects?
How does the work that you are doing at Activeloop compare to vector databases such as Pinecone?
You have a focus on image oriented data and computer vision projects. How does the specific applications of ML/DL influence the format and interactions with the data?
Can you describe how the Activeloop platform is architected?
How have the design and goals of the system changed or evolved since you began working on it?
What are the feature and performance tradeoffs between self-managed storage locations (e.g. S3, GCS) and the Activeloop platform?
What is the process for sourcing, processing, and storing data to be used by Hub/Activeloop?
Many data assets are useful across ML/DL and analytical purposes. What are the considerations for managing the lifecycle of data between Activeloop/Hub and a data lake/warehouse?
What do you see as the opportunity and effort to generalize Hub and Activeloop to support arbitrary ML frameworks/languages?
What are the most interesting, innovative, or unexpected ways that you have seen Activeloop and Hub used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Activeloop?
When is Hub/Activeloop the wrong choice?
What do you have planned for the future of Activeloop?
Contact Info
LinkedIn
@DBuniatyan on Twitter
davidbuniat on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Activeloop
Slack Community
Princeton University
ImageNet
Tensorflow
PyTorch
Podcast Episode
Activeloop Hub
Delta Lake
Podcast Episode
Tensor
Wasabi
Ray/Anyscale
Podcast Episode
Humans In The Loop podcast
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 10, 2021 • 53min
Build Trust In Your Data By Understanding Where It Comes From And How It Is Used With Stemma
Summary
All of the fancy data platform tools and shiny dashboards that you use are pointless if the consumers of your analysis don’t have trust in the answers. Stemma helps you establish and maintain that trust by giving visibility into who is using what data, annotating the reports with useful context, and understanding who is responsible for keeping it up to date. In this episode Mark Grover explains what he is building at Stemma, how it expands on the success of the Amundsen project, and why trust is the most important asset for data teams.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Mark Grover about his work at Stemma to bring the Amundsen project to a wider audience and increase trust in their data.
Interview
Introduction
Can you describe what Stemma is and the story behind it?
Can you give me more context into how and why Stemma fits into the current data engineering world? Among today’s popular tools for data warehousing and other products that stitch data together, what is Stemma’s place? Where does it fit into the workflow?
How has the explosion in options for data cataloging and discovery influenced your thinking on the necessary feature set for that class of tools? How do you compare to your competitors?
With how long we have been using data and building systems to analyze it, why do you think that trust in the results is still such a momentous problem?
Tell me more about Stemma and how it compares to Amundsen?
Can you tell me more about the impact of Stemma/Amundsen on companies that use it?
What are the opportunities for innovating on top of Stemma to help organizations streamline communication between data producers and consumers?
Beyond the technological capabilities of a data platform, the bigger question is usually the social/organizational patterns around data. How have the "best practices" around the people side of data changed in the recent past?
What are the points of friction that you continue to see?
A majority of conversations around data catalogs and discovery are focused on analytical usage. How can these platforms be used in ML and AI workloads?
How has the data engineering world changed since you left Lyft/since we last spoke? How do you see it evolving in the future?
Imagine 5 years down the line and let’s say Stemma is a household name. How have data analysts’ lives improved? Data engineers? Data scientists?
What are the most interesting, innovative, or unexpected ways that you have seen Stemma used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stemma?
When is Stemma the wrong choice?
What do you have planned for the future of Stemma?
Contact Info
LinkedIn
Email
@mark_grover on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Stemma
Amundsen
Podcast Episode
CSAT == Customer Satisfaction
Data Mesh
Podcast Episode
Feast open source feature store
Supergrain
Transform
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 7, 2021 • 53min
Data Discovery From Dashboards To Databases With Castor
Summary
Every organization needs to be able to use data to answer questions about their business. The trouble is that the data is usually spread across a wide and shifting array of systems, from databases to dashboards. The other challenge is that even if you do find the information you are seeking, there might not be enough context available to determine how to use it or what it means. Castor is building a data discovery platform aimed at solving this problem, allowing you to search for and document details about everything from a database column to a business intelligence dashboard. In this episode CTO Amaury Dumoulin shares his perspective on the complexity of letting everyone in the company find answers to their questions and how Castor is designed to help.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Amaury Dumoulin about Castor, a managed platform for easy data cataloging and discovery
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Castor is and the story behind it?
The market for data catalogues is nascent but growing fast. What are the broad categories for the different products and projects in the space?
What do you see as the core features that are required to be competitive?
In what ways has that changed in the past 1 – 2 years?
What are the opportunities for innovation and differentiation in the data catalog/discovery ecosystem?
How do you characterize your current position in the market?
Who are the target users of Castor?
Can you describe the technical architecture and implementation of the Castor platform?
How have the goals and design changed since you first began working on it?
Can you talk through the workflow of getting Castor set up in an organization and onboarding the users?
What are the design elements and platform features that allow for serving the various roles and stakeholders in an organization?
What are the organizational benefits that you have seen from users adopting Castor or other data discovery/catalog systems?
What are the most interesting, innovative, or unexpected ways that you have seen Castor used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Castor?
When is Castor the wrong choice?
What do you have planned for the future of Castor?
Contact Info
Amaury Dumoulin
Castor website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Castor
Atlan
Podcast Episode
dbt
Podcast Episode
Monte Carlo
Podcast Episode
Collibra
Podcast Episode
Amundsen
Podcast Episode
Airflow
Podcast Episode
Metabase
Podcast Episode
Airbyte
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 3, 2021 • 1h 10min
Charting A Path For Streaming Data To Fill Your Data Lake With Hudi
Summary
Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large, historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures, and building support for arbitrary update and delete operations, the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.
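For a concrete taste of the incremental upserts described above, here is a minimal PySpark sketch against Hudi's Spark datasource. The table name, storage path, and field names are placeholders, and the matching hudi-spark bundle jar needs to be on the Spark classpath.

```python
# A minimal sketch of upserting a small batch of changed records into an
# existing Hudi table via the Spark datasource. Paths, table, and field names
# are placeholders; the hudi-spark bundle must be on the Spark classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

base_path = "s3a://example-bucket/lake/orders"   # hypothetical table location

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",   # incremental, not a full rewrite
}

# New and changed records arriving from a stream or CDC feed
updates = spark.createDataFrame(
    [("o-1001", "2021-08-01", "2021-08-03 10:15:00", "shipped")],
    ["order_id", "order_date", "updated_at", "status"],
)

# Merge the small batch into the (potentially huge) existing table
updates.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

# Read back the latest snapshot of the table (older Hudi releases may require
# a glob path here instead of the bare base path)
spark.read.format("hudi").load(base_path).show()
```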
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Vinoth Chandar about Apache Hudi, a data lake management layer for supporting fast and incremental updates to your tables.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Hudi is and the story behind it?
What are the use cases that it is focused on supporting?
There have been a number of alternative table formats introduced for data lakes recently. How does Hudi compare to projects like Iceberg, Delta Lake, Hive, etc.?
Can you describe how Hudi is architected?
How have the goals and design of Hudi changed or evolved since you first began working on it?
If you were to start the whole project over today, what would you do differently?
Can you talk through the lifecycle of a data record as it is ingested, compacted, and queried in a Hudi deployment?
One of the capabilities that is interesting to explore is support for arbitrary record deletion. Can you talk through why this is a challenging operation in data lake architectures?
How does Hudi make that a tractable problem?
What are the data platform components that are needed to support an installation of Hudi?
What is involved in migrating an existing data lake to use Hudi?
How would someone approach supporting heterogeneous table formats in their lake?
As someone who has invested a lot of time in technologies for supporting data lakes, what are your thoughts on the tradeoffs of data lake vs data warehouse and the current trajectory of the ecosystem?
What are the most interesting, innovative, or unexpected ways that you have seen Hudi used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hudi?
When is Hudi the wrong choice?
What do you have planned for the future of Hudi?
Contact Info
Linkedin
Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Hudi Docs
Hudi Design & Architecture
Incremental Processing
CDC == Change Data Capture
Podcast Episodes
Oracle GoldenGate
Voldemort
Kafka
Hadoop
Spark
HBase
Parquet
Iceberg Table Format
Data Engineering Episode
Hive ACID
Apache Kudu
Podcast Episode
Vertica
Delta Lake
Podcast Episode
Optimistic Concurrency Control
MVCC == Multi-Version Concurrency Control
Presto
Flink
Podcast Episode
Trino
Podcast Episode
Gobblin
LakeFS
Podcast Episode
Nessie
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast