

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes of the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Sep 26, 2022 • 41min
Build A Common Understanding Of Your Data Reliability Rules With Soda Core and Soda Checks Language
Summary
Regardless of how data is being used, it is critical that the information is trusted. The practice of data reliability engineering has gained momentum recently to address that need. To help support the efforts of data teams, the folks at Soda Data created the Soda Checks Language (SodaCL) and the corresponding Soda Core utility that acts on this new DSL. In this episode Tom Baeyens explains their reasons for creating a new syntax for expressing and validating checks for data assets and processes, as well as how to incorporate it into your own projects.
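For a flavor of the DSL discussed in the episode, here is a minimal sketch of a SodaCL checks file; the table and column names are illustrative rather than taken from the episode:
```yaml
# checks.yml -- a minimal SodaCL sketch; table and column names are illustrative
checks for dim_customer:
  - row_count > 0                     # the table should never be empty
  - missing_count(email) = 0          # every record needs an email address
  - duplicate_count(customer_id) = 0  # primary keys must stay unique
  - freshness(updated_at) < 1d        # data should be no more than a day old
```
Soda Core evaluates a file like this against a configured data source, for example with `soda scan -d my_datasource -c configuration.yml checks.yml`, and reports each check as passed, warned, or failed.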
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business-critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Tom Baeyens about Soda Data’s new DSL for data reliability
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SodaCL is and the story behind it?
What is the scope of functionality that SodaCL is intended to address?
What are the ways that reliability is measured for data assets? (what is the equivalent to site uptime?)
What are the core abstractions that you identified for simplifying the declaration of data validations?
How did you approach the design of the SodaCL syntax to balance flexibility for various use cases, with structure and opinionated application?
Why YAML?
Can you describe how the Soda Core utility is implemented?
How have the design and scope of the SodaCL dialect and the Soda Core framework evolved since you started working on them?
What are the available integration/extension points for teams who are using Soda Core?
Can you describe how SodaCL integrates into the workflow of data and analytics engineers?
What is your process for evolving the SodaCL dialect in a maintainable and sustainable manner?
What are the most interesting, innovative, or unexpected ways that you have seen SodaCL used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SodaCL?
When is SodaCL the wrong choice?
What do you have planned for the future of SodaCL?
Contact Info
LinkedIn
@tombaeyens on Twitter
tombaeyens on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Soda Data
Podcast Episode
Soda Checks Language
Great Expectations
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 19, 2022 • 55min
Building A Shared Understanding Of Data Assets In A Business Through A Single Pane Of Glass With Workstream
Summary
There is a constant tension in business data between growing silos and breaking them down. Even when a tool is designed to integrate information as a guard against data isolation, it can easily become a silo of its own, where you have to make a point of using it to seek out information. In order to help distribute critical context about data assets and their status into the locations where work is being done, Nicholas Freund co-founded Workstream. In this episode he discusses the challenge of maintaining shared visibility and understanding of data work across the various stakeholders and his efforts to make it a seamless experience.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business-critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Nicholas Freund about Workstream, a platform aimed at providing a single pane of glass for analytics in your organization
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Workstream is and the story behind it?
What is the core problem that you are trying to solve at Workstream?
How does that problem manifest for the different stakeholders in an organization?
What are the contributing factors that lead to fragmentation of visibility for data workflows at different stages?
What are the sources of information that you use to build a cohesive view of an organization’s data assets?
What are the lifecycle stages of a data asset that are most often overlooked or un-maintained?
What are the risks and challenges associated with retirement of a data asset?
Can you describe how Workstream is implemented?
How have the design and goals of the system changed since you first started it?
What does the day-to-day interaction with workstream look like for different roles in a company?
What are the long-range impacts on team behaviors/productivity/capacity that you hope to catalyze?
What are the most interesting, innovative, or unexpected ways that you have seen Workstream used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Workstream?
When is Workstream the wrong choice?
What do you have planned for the future of Workstream?
Contact Info
LinkedIn
@nickfreund on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Workstream
Data Catalog
Entropy
CDP == Customer Data Platform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 19, 2022 • 1h 32min
Operational Analytics To Increase Efficiency For Multi-Location Businesses With OpsAnalitica
Summary
In order to improve efficiency in any business, you must first know what is contributing to wasted effort or missed opportunities. When your business operates across multiple locations, it becomes even more challenging and important to gain insights into how work is being done. In this episode Tommy Yionoulis shares his experiences working in the service and hospitality industries and how that led him to found OpsAnalitica, a platform for collecting and analyzing metrics on multi-location businesses and their operational practices. He discusses the challenges of making data collection purposeful and efficient without distracting employees from their primary duties and how business owners can use the provided analytics to support their staff in their work.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
Your host is Tobias Macey and today I’m interviewing Tommy Yionoulis about using data to improve efficiencies in multi-location service businesses with OpsAnalitica
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what OpsAnalitica is and the story behind it?
What are some examples of the types of questions that business owners and site managers need to answer in order to run their operations?
What are the sources of information that are needed to be able to answer these questions?
In the absence of a platform like OpsAnalitica, how are business operations getting the answers to these questions?
What are some of the sources of inefficiency that they are contending with?
How do those inefficiencies compound as you scale the number of locations?
Can you describe how the OpsAnalitica system is implemented?
How have the design and goals of the platform evolved since you started working on it?
Can you describe the workflow for a business using OpsAnalitica?
What are some of the biggest integration challenges that you have to address?
What are some of the design elements that you have invested in to reduce errors and complexity for employees tracking relevant metrics?
What are the most interesting, innovative, or unexpected ways that you have seen OpsAnalitica used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpsAnalitica?
When is OpsAnalitica the wrong choice?
What do you have planned for the future of OpsAnalitica?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
OpsAnalitica
Quiznos
FormRouter
Cooper Atkins(?)
SensorThings API
The Founder movie
Toast
Looker
Podcast Episode
Power BI
Podcast Episode
Pareto Principle
Decisions workflow platform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 12, 2022 • 60min
Build Confidence In Your Data Platform With Schema Compatibility Reports That Span Systems And Domains Using Schemata
Summary
Data engineering systems are complex and interconnected with myriad and often opaque chains of dependencies. As they scale, the problems of visibility and dependency management can increase at an exponential rate. In order to turn this into a tractable problem, one approach is to define and enforce contracts between producers and consumers of data. Ananth Packkildurai created Schemata as a way to make the creation of schema contracts a lightweight process, allowing the dependency chains to be constructed and evolved iteratively and integrating validation of changes into standard delivery systems. In this episode he shares the design of the project and how it fits into your development practices.
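Schemata builds its schema definitions on Protocol Buffers (linked below). As context for the conversation, here is a generic proto3 sketch of the kind of producer/consumer contract under discussion; the message and fields are hypothetical, and the example does not show Schemata's own annotation syntax:
```protobuf
syntax = "proto3";

package commerce.events;

// A hypothetical event contract between a producer service and its
// downstream data consumers. Renumbering or retyping a field breaks
// the contract, which is what automated validation guards against.
message OrderCreated {
  string order_id = 1;     // stable identifier referenced by consumers
  string customer_id = 2;  // links into the customer domain's schemas
  int64 created_at = 3;    // event time as epoch milliseconds
  double total_amount = 4; // order total in the account's currency
}
```
A tool like Schemata can then score and validate proposed changes to such definitions in CI before they reach downstream consumers.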
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business-critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Your host is Tobias Macey and today I’m interviewing Ananth Packkildurai about Schemata, a modeling framework for decentralized domain-driven ownership of data.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Schemata is and the story behind it?
How does the garbage in/garbage out problem manifest in data warehouse/data lake environments?
What are the different places in a data system that schema definitions need to be established?
What are the different ways that schema management gets complicated across those various points of interaction?
Can you walk me through the end-to-end flow of how Schemata integrates with engineering practices across an organization’s data lifecycle?
How does the use of Schemata help with capturing and propagating context that would otherwise be lost or siloed?
How is the Schemata utility implemented?
What are some of the design and scope questions that you had to work through while developing Schemata?
What is the broad vision that you have for Schemata and its impact on data practices?
How are you balancing the need for flexibility/adaptability with the desire for ease of adoption and quick wins?
The core of the utility is the generation of structured messages. How are those messages propagated, stored, and analyzed?
What are the pieces of Schemata and its usage that are still undefined?
What are the most interesting, innovative, or unexpected ways that you have seen Schemata used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Schemata?
When is Schemata the wrong choice?
What do you have planned for the future of Schemata?
Contact Info
ananthdurai on GitHub
@ananthdurai on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Schemata
Data Engineering Weekly
Zendesk
Ralph Kimball
Data Warehouse Toolkit
Iteratively
Podcast Episode
Protocol Buffers (protobuf)
Application Tracing
OpenTelemetry
Django
Spring Framework
Dependency Injection
JSON Schema
dbt
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 12, 2022 • 57min
Building Data Pipelines That Run From Source To Analysis And Activation With Hevo Data
Summary
Any business that wants to understand their operations and customers through data requires some form of pipeline. Building reliable data pipelines is a complex and costly undertaking with many layered requirements. In order to reduce the amount of time and effort required to build pipelines that power critical insights, Manish Jethani co-founded Hevo Data. In this episode he shares his journey from building a consumer product to launching a data pipeline service and how his frustrations as a product owner have informed his work at Hevo Data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Manish Jethani about Hevo Data’s experiences navigating the modern data stack and the role of ELT in data workflows
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Hevo Data is and the story behind it?
What is the core problem that you are trying to solve with the Hevo platform?
What are the target personas of who will bring Hevo into a company and who will be using/interacting with it for their day-to-day?
What are some of the lessons that you learned building a product that relied on data to function which you have carried into your work at Hevo, providing the utilities that enable other businesses and products?
There are numerous commercial and open source options for collecting, transforming, and integrating data. What are the differentiating features of Hevo?
What are your views on the benefits of a vertically integrated platform for data flows in the world of the disaggregated "modern data stack"?
Can you describe how the Hevo platform is implemented?
What are some of the optimizations that you have invested in to support the aggregate load from your customers?
The predominant pattern in recent years for collecting and processing data is ELT. In your work at Hevo, what are some of the nuances and exceptions to that "best practice" that you have encountered?
How have you factored those learnings back into the product?
mechanics of schema mapping
edge cases that require human intervention
how to surface those in a timely fashion
What is the process for onboarding onto the Hevo platform?
Once an organization has adopted Hevo, can you describe the workflow of building/maintaining/evolving data pipelines?
What are the most interesting, innovative, or unexpected ways that you have seen Hevo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hevo?
When is Hevo the wrong choice?
What do you have planned for the future of Hevo?
Contact Info
LinkedIn
@ManishJethani on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Hevo Data
Kafka
MongoDB
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By: Sifflet:
Sifflet is a Full Data Stack Observability platform acting as an overseeing layer to the data stack, ensuring that data is reliable from ingestion to consumption. Whether the data is in transit or at rest, Sifflet is able to detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack.
In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. We also offer a 2-week free trial.
Go to dataengineeringpodcast.com/sifflet to find out more.
Support Data Engineering Podcast

Sep 5, 2022 • 54min
Introduce Climate Analytics Into Your Data Platform Without The Heavy Lifting Using Sust Global
Summary
The global climate impacts everyone, and the rate of change introduces many questions that businesses need to consider. Getting answers to those questions is challenging, because the climate is a multidimensional and constantly evolving system. Sust Global was created to provide curated data sets for organizations to be able to analyze climate information in the context of their business needs. In this episode Gopal Erinjippurath discusses the data engineering challenges of building and serving those data sets, and how they are distilling complex climate information into consumable facts so you don’t have to be an expert to understand it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Gopal Erinjippurath about his work at Sust Global building data sets from geospatial and satellite information to power climate analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Sust Global is and the story behind it?
What audience(s) are you focused on?
Climate change is obviously a huge topic in the zeitgeist and has been growing in importance. What are the data sources that you are working with to derive climate information?
What role do you view Sust Global having in addressing climate change?
How are organizations using your climate information assets to inform their analytics and business operations?
What are the types of questions that they are asking about the role of climate (present and future) for their business activities?
How can they use the climate information that you provide to understand their impact on the planet?
What are some of the educational efforts that you need to undertake to ensure that your end-users understand the context and appropriate semantics of the data that you are providing? (e.g. concepts around climate science, statistically meaningful interpretations of aggregations, etc.)
Can you describe how you have architected the Sust Global platform?
What are some examples of the types of data workflows and transformations that are necessary to maintain your customer-facing services?
How have you approached the question of modeling for the data that you provide to end-users to make it straightforward to integrate and analyze the information?
What is your process for determining relevant granularities of data and normalizing scales? (e.g. time and distance)
What is involved in integrating with the Sust Global platform and how does it fit into the workflow of data engineers/analysts/data scientists at your customer organizations?
Any analytical task is an exercise in storytelling. What are some of the techniques that you and your customers have found useful to make climate data relatable and understandable?
What are some of the challenges involved in mapping between micro and macro level insights and translating them effectively for the consumer?
How do the increasing sensor capabilities and scale of coverage manifest in your data?
How do you account for increasing coverage when analyzing across longer historical time scales?
How do you balance the need to build a sustainable business with the importance of access to the information that you are working with?
What are the most interesting, innovative, or unexpected ways that you have seen Sust Global used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Sust Global?
When is Sust the wrong choice?
What do you have planned for the future of Sust Global?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Sust Global
Planet Labs
Carbon Capture
IPCC
Data Lodge(?)
6th Assessment Report
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 5, 2022 • 59min
A Reflection On Data Observability As It Reaches Broader Adoption
Summary
Data observability is a product category that has seen massive growth and adoption in recent years. Monte Carlo is in the vanguard of companies that have been enabling data teams to observe and understand their complex data systems. In this episode founders Barr Moses and Lior Gavish rejoin the show to reflect on the evolution and adoption of data observability technologies and the capabilities that are being introduced as the broader ecosystem adopts the practices.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of their data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about the state of the market for data observability and their own work at Monte Carlo
Interview
Introduction
How did you get involved in the area of data management?
Can you give the elevator pitch for Monte Carlo?
What are the notable changes in the Monte Carlo product and business since our last conversation in October 2020?
You were one of the early entrants in the market of data quality/data observability products. In your work to gain visibility and traction you invested substantially in content creation (blog posts, presentations, round table conversations, etc.). How would you summarize the focus of your initial efforts?
Why do you think data observability has really taken off? A few years ago, the category barely existed – what’s changed?
There’s a larger debate within the data engineering community regarding whether it makes sense to go deep or go broad when it comes to monitoring your data. In other words, do you start with a few important data sets, or do you attempt to cover the entire ecosystem? What is your take?
For engineers and teams who are just now investigating and investing in observability/quality automation for their data, what are their motivations?
How has the conversation around the value/motivating factors matured or changed over the past couple of years?
In what way have the requirements and capabilities of data observability platforms shifted?
What are the forces in the ecosystem that have driven those changes?
How has the scope and vision for your work at Monte Carlo evolved as the understanding and impact of data quality have become more widespread?
When teams invest in data quality/observability what are some of the ways that the insights gained influence their other priorities and design choices? (e.g. platform design, pipeline design, data usage, etc.)
When it comes to selecting what parts of the data stack to invest in, how do data leaders prioritize? For instance, when does it make sense to build or buy a data catalog? A data observability platform?
The adoption of any tool that adds constraints is a delicate balance. What have you found to be the predominant patterns for teams who are incorporating Monte Carlo? (e.g. maintaining delivery velocity and adding safety/trust)
A corollary to the goal of data engineers for higher reliability and visibility is the need by the business/team leadership to identify "return on investment". How do you and your customers think about the useful metrics and measurement goals to justify the time spent on "non-functional" requirements?
What are the most interesting, innovative, or unexpected ways that you have seen Monte Carlo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Monte Carlo?
When is Monte Carlo the wrong choice?
What do you have planned for the future of Monte Carlo?
Contact Info
Barr
LinkedIn
@BM_DataDowntime on Twitter
Lior
LinkedIn
@lgavish on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Monte Carlo
Podcast Episode
App Dynamics
Datadog
New Relic
Data Quality Fundamentals book
State Of Data Quality Survey
dbt
Podcast Episode
Airflow
Dagster
Podcast Episode
Episode: Incident Management For Data Teams
Databricks Delta
Patch.tech Snowflake APIs
Hightouch
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 29, 2022 • 1h 4min
An Exploration Of What Data Automation Can Provide To Data Engineers And Ascend's Journey To Make It A Reality
Summary
The dream of every engineer is to automate all of their tasks. For data engineers, this is a monumental undertaking. Orchestration engines are one step in that direction, but they are not a complete solution. In this episode Sean Knapp shares his views on what constitutes proper automation and the work that he and his team at Ascend are doing to help make it a reality.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of their data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Sean Knapp about the role of data automation in building maintainable systems
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you mean by the term "data automation" and the assumptions that it includes?
One of the perennial challenges of automation is that there are always steps that are resistant to being performed without human involvement. What are some of the tasks that you have found to be common problems in that sense?
What are the different concerns that need to be included in a stack that supports fully automated data workflows?
There was recently an interesting article suggesting that the "left-to-right" approach to data workflows is backwards. In your experience, what would be required to allow for triggering data processes based on the needs of the data consumers? (e.g. "make sure that this BI dashboard is up to date every 6 hours")
What are the tasks that are most complex to build automation for?
What are some companies or tools/platforms that you consider to be exemplars of "data automation done right"?
What are the common themes/patterns that they build from?
How have you approached the need for data automation in the implementation of the Ascend product?
How have the requirements for data automation changed as data plays a more prominent role in a growing number of businesses?
What are the foundational elements that are unchanging?
What are the most interesting, innovative, or unexpected ways that you have seen data automation implemented?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data automation at Ascend?
What are some of the ways that data automation can go wrong?
What are you keeping an eye on across the data ecosystem?
Contact Info
@seanknapp on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
Ascend
Podcast Episode
Google Sawzall
CI/CD
Airflow
Kubernetes
Ascend FlexCode
MongoDB
SHA == Secure Hash Algorithm
dbt
Podcast Episode
Materialized View
Great Expectations
Podcast Episode
Monte Carlo
Podcast Episode
OpenLineage
Podcast Episode
Open Metadata
Podcast Episode
Egeria
OOM == Out Of Memory
Five Whys
Data Mesh
Data Fabric
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By: Bigeye:
Bigeye is an industry-leading data observability platform that gives data engineering and science teams the tools they need to ensure their data is always fresh, accurate and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye’s automated data quality monitoring, ML-powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business.
Go to [dataengineeringpodcast.com/bigeye](https://www.dataengineeringpodcast.com/bigeye) today and start trusting your data.
Support Data Engineering Podcast

38 snips
Aug 28, 2022 • 1h 10min
Alumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations
Summary
AirBnB pioneered a number of the organizational practices that have become the goal of modern data teams. Out of that culture a number of successful businesses were created to provide those tools and methods to a broader audience. In this episode several alumni of AirBnB’s formative years who have gone on to found their own companies join the show to reflect on their shared successes, missed opportunities, and lessons learned.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Lindsay Pettingill, Chetan Sharma, Swaroop Jagadish, Maxime Beauchemin, and Nick Handel about the lessons that they learned in their time at AirBnB and how they are carrying that forward to their respective companies
Interview
Introduction
How did you get involved in the area of data management?
You all worked at AirBnB in similar time frames and then went on to found data-focused companies that are finding success in their respective categories. Do you consider that an outgrowth of the specific company culture and work involved, or a product of that moment in time for the data industry?
What are the elements of AirBnB’s data culture that you feel were done right?
What do you see as the critical decisions/inflection points in the company’s growth that led you down that path?
Every journey has its detours and dead-ends. What are the mistakes that were made (individual and collective) that were most instructive for you?
What about that experience and your other experiences led you each to go your respective directions with data startups?
Was your motivation to start a company addressing the work that you did at AirBnB due to the desire to build on existing success, or the need to fix a nagging frustration?
What are the critical lessons for data teams that you are focused on teaching to engineers inside and outside your company?
What are your predictions for the next 5 years of data?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on translating your experiences at AirBnB into successful products?
Contact Info
Lindsay
LinkedIn
@lpettingill on Twitter
Chetan
LinkedIn
@chesharma87 on Twitter
Maxime
mistercrunch on GitHub
LinkedIn
@mistercrunch on Twitter
Swaroop
swaroopjagadish on GitHub
LinkedIn
@arudis on Twitter
Nick
LinkedIn
@NicholasHandel on Twitter
nhandel on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Iggy
Eppo
Podcast Episode
Acryl
Podcast Episode
DataHub
Preset
Superset
Podcast Episode
Airflow
Transform
Podcast Episode
Deutsche Bank
Ubisoft
BlackRock
Kafka
Pinot
Stata
R
Knowledge-Repo
Podcast.__init__ Episode
AirBnB Almond Flour Cookie Recipe
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 22, 2022 • 1h 6min
An Exploration Of The Expectations, Ecosystem, and Realities Of Real-Time Data Applications
Summary
Data has permeated every aspect of our lives and the products that we interact with. As a result, end users and customers have come to expect their interactions with services and analytics to be fast and up to date. In this episode Shruti Bhat gives her view on the state of the ecosystem for real-time data and the work that she and her team at Rockset are doing to make it easier for engineers to build those experiences.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Shruti Bhat about the growth of real-time data applications and the systems required to support them
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what is driving the adoption of real-time analytics?
What are the common architectural patterns for real-time analytics? (see the change-stream sketch after this list)
What are the main sources of latency in the path from data creation to the end user?
What do end users and customers expect in terms of time to insight?
How do those expectations differ between internal and external consumers?
What scales of data are reasonable for real-time vs. batch processing?
What are the most interesting, innovative, or unexpected ways that you have seen real-time architectures implemented?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rockset?
When is Rockset the wrong choice?
What do you have planned for the future of Rockset?
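A pattern that recurs behind the latency questions above is consuming a database's change feed instead of batch-extracting snapshots. As a rough sketch, and not Rockset's actual ingestion path, the following tails MongoDB change streams, one of the sources listed below; the `handle_change` consumer is a hypothetical stand-in for forwarding events to a real-time index.

```python
# Rough sketch of change-data-capture for real-time analytics using
# MongoDB change streams. Illustrative only; `handle_change` stands in
# for forwarding events to a real-time index or message bus.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local replica set
orders = client["shop"]["orders"]

def handle_change(change: dict) -> None:
    op = change["operationType"]                    # "insert", "update", "delete", ...
    key = change.get("documentKey", {}).get("_id")  # absent for some event types
    print(f"{op} on order {key}")

# watch() tails the replica set's oplog, so events arrive within moments
# of the write rather than on the next batch-extract schedule.
with orders.watch(full_document="updateLookup") as stream:
    for change in stream:
        handle_change(change)
```

Removing the periodic extract schedule in this way eliminates what is typically a major contributor to end-to-end latency, which is why change streams (and their DynamoDB equivalent, also linked below) show up so often in real-time architectures.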
Contact Info
LinkedIn
@shrutibhat on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Rockset
Podcast Episode
Embedded Analytics
Confluent
Kafka
AWS Kinesis
Lambda Architecture
Data Observability
Data Mesh
DynamoDB Streams
MongoDB Change Streams
Bigeye
Monte Carlo Data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast