
Data Engineering Podcast
This show goes behind the scenes on the tools, techniques, and difficulties that come with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Latest episodes

Oct 29, 2021 • 1h 9min
Removing The Barrier To Exploratory Analytics with Activity Schema and Narrator
Summary
The perennial question of data warehousing is how to model the information that you are storing. This has given rise to methods as varied as star and snowflake schemas, data vault modeling, and wide tables. The challenge with many of those approaches is that they are optimized for answering known questions but brittle and cumbersome when exploring unknowns. In this episode Ahmed Elsamadisi shares his journey to find a more flexible and universal data model in the form of the "activity schema" that is powering the Narrator platform, and how it has allowed his customers to perform self-service exploration of their business domains without being blocked by schema evolution in the data warehouse. This is a fascinating exploration of what can be done when you challenge your assumptions about what is possible.
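For readers who want a concrete picture of the idea before listening, here is a minimal sketch of an activity-schema style table, using Python's built-in sqlite3 module as a stand-in warehouse. The column set is a simplified assumption for illustration rather than Narrator's exact specification; the key idea is that every business event lands in a single append-only stream keyed by customer, activity, and timestamp, so new questions become new queries rather than new tables.

```python
# Minimal sketch of an activity-schema style table (illustrative columns,
# not Narrator's exact spec): one append-only time series of activities.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_stream (
        activity_id     TEXT,   -- unique id for the event
        ts              TEXT,   -- when the activity occurred
        customer        TEXT,   -- entity the activity belongs to
        activity        TEXT,   -- e.g. 'visited_site', 'placed_order'
        feature_1       TEXT,   -- small set of activity-specific attributes
        revenue_impact  REAL,
        link            TEXT    -- pointer back to the source record
    )
    """
)
conn.executemany(
    "INSERT INTO customer_stream VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("a1", "2021-10-01T12:00:00", "cust_1", "visited_site", "ad_click", None, None),
        ("a2", "2021-10-02T09:30:00", "cust_1", "placed_order", "sku_42", 49.0, None),
    ],
)
# Ad-hoc questions become queries over the one stream instead of a new join per question.
for row in conn.execute("SELECT activity, COUNT(*) FROM customer_stream GROUP BY activity"):
    print(row)
```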
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Ahmed Elsamadisi about Narrator, a platform to enable anyone to go from question to data-driven decision in minutes
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Narrator is and the story behind it?
What are the challenges that you have seen organizations encounter when attempting to make analytics a self-serve capability?
What are the use cases that you are focused on?
How does Narrator fit within the data workflows of an organization?
How is the Narrator platform implemented?
How has the design and focus of the technology evolved since you first started working on Narrator?
The core element of the analyses that you are building is the "activity schema". Can you describe the design process that led you to that format?
What are the challenges that are posed by more widely used modeling techniques such as star/snowflake or data vault?
How does the activity schema address those challenges?
What are the performance characteristics of deriving models from an activity schema/timeseries table?
For someone who wants to use Narrator, what is involved in transforming their data to map into the activity schema?
Can you talk through the domain modeling that needs to happen when determining what entities and actions to capture?
What are the most interesting, innovative, or unexpected ways that you have seen Narrator used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Narrator?
When is Narrator the wrong choice?
What do you have planned for the future of Narrator?
Contact Info
LinkedIn
@ae4ai on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Narrator
DARPA Challenge
Fivetran
Luigi
Chartio
Airflow
Domain Driven Design
Data Vault
Snowflake Schema
Event Sourcing
Census
Podcast Episode
Hightouch
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 29, 2021 • 1h 10min
Streaming Data Pipelines Made SQL With Decodable
Summary
Streaming data systems have been growing more capable and flexible over the past few years. Despite this, it is still challenging to build reliable pipelines for stream processing. In this episode Eric Sammer discusses the shortcomings of the current set of streaming engines and how they force engineers to work at an extremely low level of abstraction. He also explains why he started Decodable to address that limitation and the work that he and his team have done to let data engineers build streaming pipelines entirely in SQL.
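As a rough illustration of the gap described here, the sketch below contrasts a declarative, Flink-style SQL statement (illustrative syntax, not necessarily Decodable's dialect) with a toy hand-rolled equivalent. Real engines also have to deal with out-of-order events, state, checkpointing, and backpressure, which is exactly the low-level work the episode argues most data engineers should not have to write by hand.

```python
# The same per-minute aggregation expressed two ways: declaratively, as the
# kind of SQL a streaming platform can run for you (syntax is illustrative),
# and as a toy in-memory loop that ignores everything hard about streaming.
from collections import defaultdict
from datetime import datetime, timedelta

declarative_version = """
    INSERT INTO clicks_per_minute
    SELECT page,
           TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS clicks
    FROM click_events
    GROUP BY page, TUMBLE(event_time, INTERVAL '1' MINUTE)
"""

def clicks_per_minute(events):
    """Toy equivalent; a real engine also handles out-of-order events,
    state checkpointing, backpressure, and recovery."""
    counts = defaultdict(int)
    for page, event_time in events:
        window_start = event_time - timedelta(
            seconds=event_time.second, microseconds=event_time.microsecond
        )
        counts[(page, window_start)] += 1
    return dict(counts)

events = [
    ("/pricing", datetime(2021, 10, 29, 12, 0, 5)),
    ("/pricing", datetime(2021, 10, 29, 12, 0, 40)),
    ("/docs", datetime(2021, 10, 29, 12, 1, 10)),
]
print(clicks_per_minute(events))
```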
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Eric Sammer about Decodable, a platform for simplifying the work of building real-time data pipelines
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Decodable is and the story behind it?
Who are the target users, and how has that focus informed your prioritization of features at launch?
What are the complexities that data engineers encounter when building pipelines on streaming systems?
What are the distributed systems concepts and design optimizations that are often skipped over or misunderstood by engineers who are using them? (e.g. backpressure, exactly once semantics, isolation levels, etc.)
How do those mismatches in understanding and expectation impact the correctness and reliability of the workflows that they are building?
Can you describe how you have architected the Decodable platform?
What have been the most complex or time consuming engineering challenges that you have dealt with so far?
What are the points of integration that you expose for engineers to wire in their existing infrastructure and data systems?
What has been your process for designing the interfaces and abstractions that you are exposing to end users?
What are some of the leaks in those abstractions that have either started to show or are anticipated?
What have you learned about the state of data engineering and the costs and benefits of real-time data while working on Decodable?
What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
When is Decodable the wrong choice?
What do you have planned for the future of Decodable?
Contact Info
esammer on GitHub
@esammer on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Decodable
Cloudera
Kafka
Flink
Podcast Episode
Spark
Snowflake
Podcast Episode
BigQuery
Redshift
ksqlDB
Podcast Episode
dbt
Podcast Episode
MillWheel Paper
Dremel Paper
Timely Dataflow
Materialize
Podcast Episode
Software Defined Networking
Data Mesh
Podcast Episode
OpenLineage
Podcast Episode
DataHub
Podcast Episode
Amundsen
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 23, 2021 • 1h 6min
Data Exploration For Business Users Powered By Analytics Engineering With Lightdash
Summary
The market for business intelligence has been going through an evolutionary shift in recent years. One of the driving forces for that change has been the rise of analytics engineering powered by dbt. Lightdash has fully embraced that shift by building an entire open source business intelligence framework that is powered by dbt models. In this episode Oliver Laslett describes why dashboards aren’t sufficient for business analytics, how Lightdash builds on the dbt modeling work that you are already doing in your data warehouse, and how they are focusing on bridging the divide between data teams and business teams and the differing requirements that each has for data workflows.
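To make the idea of BI powered by dbt models concrete, here is a small, hypothetical sketch: metric and dimension declarations live alongside the model, and the BI layer compiles an exploration into SQL against the already-modeled table. The exact keys that Lightdash reads may differ from this simplified structure, so treat it as an illustration of the approach rather than Lightdash's configuration format.

```python
# Hypothetical sketch of a dbt-powered BI layer: declarations sit next to the
# model (key names are simplified assumptions) and explorations compile to SQL.
model = {
    "name": "orders",
    "columns": {
        "status": {"meta": {"dimension": {"type": "string"}}},
        "amount": {"meta": {"metrics": {"total_revenue": {"type": "sum"}}}},
    },
}

def compile_metric_query(model, metric_name, group_by):
    # Find which column declares the requested metric and its aggregation type.
    for column, spec in model["columns"].items():
        metric = spec["meta"].get("metrics", {}).get(metric_name)
        if metric:
            agg = metric["type"].upper()
            return (
                f"SELECT {group_by}, {agg}({column}) AS {metric_name} "
                f"FROM {model['name']} GROUP BY {group_by}"
            )
    raise KeyError(metric_name)

print(compile_metric_query(model, "total_revenue", group_by="status"))
```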
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Oliver Laslett about Lightdash, an open source business intelligence system powered by your dbt models
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Lightdash is and the story behind it?
What are the main goals of the project?
Who are the target users, and how has that profile informed your feature priorities?
Business intelligence is a market that has gone through several generational shifts, with products targeting numerous personas and purposes. What are the capabilities that make Lightdash stand out from the other options?
Can you describe how Lightdash is architected?
How have the design and goals of the system changed or evolved since you first began working on it?
What have been the most challenging engineering problems that you have dealt with?
How does the approach that you are taking with Lightdash compare to systems such as Transform and Metriql that aim to provide a dedicated metrics layer?
Can you describe the workflow for someone building an analysis in Lightdash?
What are the points of collaboration around Lightdash for different roles in the organization?
What are the methods that you use to expose information about the state of the underlying dbt models to the end users?
How do they use that information in their exploration and decision making?
What was your motivation for releasing Lightdash as open source?
How are you handling the governance and long-term viability of the project?
What are the most interesting, innovative, or unexpected ways that you have seen Lightdash used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Lightdash?
When is Lightdash the wrong choice?
What do you have planned for the future of Lightdash?
Contact Info
LinkedIn
owlas on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Lightdash
Looker
Podcast Episode
Power BI
Podcast Episode
Redash
Podcast Episode
Metabase
Podcast Episode
dbt
Podcast Episode
Superset
Podcast Episode
Streamlit
Podcast Episode
Kubernetes
JDBC
SQLAlchemy
SQLPad
Singer
Podcast Episode
Airbyte
Podcast Episode
Meltano
Podcast Episode
Transform
Podcast Episode
Metriql
Podcast Episode
Cube.js
OpenLineage
Podcast Episode
dbt Packages
Rudderstack
PostHog
Podcast Interview
Firebolt
Podcast Interview
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 21, 2021 • 1h 9min
Completing The Feedback Loop Of Data Through Operational Analytics With Census
Summary
The focus of the past few years has been to consolidate all of the organization’s data into a cloud data warehouse. As a result there have been a number of trends in data that take advantage of the warehouse as a single focal point. Among those trends is the advent of operational analytics, which completes the cycle of data from collection, through analysis, to driving further action. In this episode Boris Jabes, CEO of Census, explains how the work of synchronizing cleaned and consolidated data about your customers back into the systems that you use to interact with those customers allows for a powerful feedback loop that has been missing in data systems until now. He also discusses how Census makes that synchronization easy to manage, how it fits with the growth of data quality tooling, and how you can start using it today.
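For a sense of what completing the feedback loop looks like mechanically, here is a conceptual sketch of an incremental reverse ETL sync, not Census's actual implementation: read the modeled records that changed since the last run from the warehouse, then upsert them into the operational tool keyed on a stable identifier.

```python
# Conceptual reverse ETL sketch (not Census's implementation): incremental
# read from the warehouse, upsert into a destination keyed on a stable id.
import sqlite3

def sync_customers(warehouse, crm_upsert, last_synced_at):
    rows = warehouse.execute(
        "SELECT email, lifecycle_stage, lifetime_value, updated_at "
        "FROM dim_customers WHERE updated_at > ?",
        (last_synced_at,),
    ).fetchall()
    for email, stage, ltv, updated_at in rows:
        # The destination call is a stand-in; a real sync also needs batching,
        # retries, and rate-limit handling.
        crm_upsert(key=email, fields={"lifecycle_stage": stage, "lifetime_value": ltv})
        last_synced_at = max(last_synced_at, updated_at)
    return last_synced_at

# Toy warehouse standing in for Snowflake/BigQuery/Redshift in this example.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE dim_customers (email TEXT, lifecycle_stage TEXT, "
    "lifetime_value REAL, updated_at TEXT)"
)
warehouse.execute(
    "INSERT INTO dim_customers VALUES ('a@example.com', 'customer', 120.0, '2021-10-20')"
)
checkpoint = sync_customers(
    warehouse,
    crm_upsert=lambda key, fields: print("upsert", key, fields),
    last_synced_at="2021-10-01",
)
print("next checkpoint:", checkpoint)
```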
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Boris Jabes about Census and the growing category of operational analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Census is and the story behind it?
The terms "reverse ETL" and "operational analytics" have started being used for similar, and often interchangeable, purposes. What are your thoughts on the semantic and concrete differences between these phrases?
What are the motivating factors for adding operational analytics or "data activation" to an organization’s data platform?
This is a nascent but quickly growing market with a number of products and projects operating in the space. How would you characterize the current state of the segment and Census’ position in it?
Can you describe how the Census platform is implemented?
What are some of the early design choices that have had to be refactored or augmented as you have evolved the product and worked with customers?
What are some of the assumptions that you had about the needs and uses for the platform which have been challenged or changed as you dug deeper into the problem?
Can you describe the workflow for a customer adopting Census?
What are some of the data modeling practices that make it easier to "activate" the organization’s data?
Another recent trend in the data industry is the growth of data quality and data lineage tools. What is involved in using the measured quality or lineage information as a signal in the operational systems, or to prevent a synchronization?
How can users test and validate their workflows in Census?
What are the options for propagating Census’ runtime information back into lineage and data quality tracking?
Census supports incremental syncs from the warehouse. What are the opportunities for bringing streaming architectures to the space of operational analytics?
What are the challenges/complexities in the current set of technologies that act as a barrier?
What are the most interesting, innovative, or unexpected ways that you have seen Census used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Census?
When is Census the wrong choice?
What do you have planned for the future of Census?
Contact Info
LinkedIn
Website
@borisjabes on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Census
Operational Analytics
Fivetran
Podcast Episode
dbt
Podcast Episode
Snowflake
Podcast Episode
Loom
Materialize
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 16, 2021 • 1h 8min
Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data
Summary
The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. It was also designed to work for small-scale systems that are just starting to develop in complexity. In order to support the project and make it even easier to use for organizations of every size, Shirshanka Das and Swaroop Jagadish founded Acryl Data. In this episode they discuss the recent work that has been done by the community, how their work is building on top of that foundation, and how you can get started with DataHub to manage data discovery for your own work today. They also share their ambitions for the near future of adding data observability and data quality management features.
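To give a flavor of how metadata reaches DataHub, the sketch below mirrors the general shape of an ingestion recipe, which names a source to crawl and a sink to push to. The specific source options shown are illustrative assumptions rather than a verified configuration, so consult the DataHub documentation for the connector you actually use.

```python
# Rough sketch of a DataHub ingestion recipe expressed as a Python dict.
# Recipes are normally YAML files run with the `datahub` CLI; the postgres
# connection options below are assumed placeholders, not verified config.
recipe = {
    "source": {
        "type": "postgres",  # crawl table and column metadata from a database
        "config": {
            "host_port": "localhost:5432",
            "database": "analytics",
            "username": "datahub_reader",
            "password": "example-password",  # placeholder credential
        },
    },
    "sink": {
        "type": "datahub-rest",  # push the extracted metadata to the DataHub server
        "config": {"server": "http://localhost:8080"},
    },
}

# Written out as YAML, this is roughly what would be passed to:
#   datahub ingest -c recipe.yml
for section, body in recipe.items():
    print(section, "->", body["type"])
```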
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Your host is Tobias Macey and today I’m interviewing Shirshanka Das and Swaroop Jagadish about Acryl Data, the company driving the open source metadata project DataHub for powering data discovery, data observability and federated data governance.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Acryl Data is and the story behind it?
How has your experience of building and running DataHub at LinkedIn informed your product direction for Acryl?
What are some lessons that your co-founder Swaroop has contributed from his experience at Airbnb?
The data catalog/discovery/quality market has become very active over the past year. What is your perspective on the market, and what are the gaps that are not yet being addressed?
How does the focus of Acryl compare to what the team at Metaphor are building?
How has the DataHub project changed in the past year with more companies outside of LinkedIn using and contributing to it?
What are your plans for Data Observability?
Can you describe the system architecture that you have built at Acryl?
What are the convenience features that you are building to augment the capabilities and integration process for DataHub?
What are some typical workflows that data teams build out when working with Acryl?
What are some examples of automated actions that can be triggered from metadata changes?
What are the available events that can be used to trigger actions?
What are some of the challenges that teams are facing when integrating metadata management and analysis into their data workflows?
What are your thoughts on the potential for the OpenLineage and OpenMetadata projects?
How is the governance of DataHub being managed?
What are the most interesting, innovative, or unexpected ways that you have seen Acryl/DataHub used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Acryl/DataHub?
When is Acryl the wrong choice?
What do you have planned for the future of Acryl?
Contact Info
Shirshanka
LinkedIn
@shirshanka on Twitter
shirshanka on GitHub
Swaroop
LinkedIn
@arudis on Twitter
swaroopjagadish on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Acryl Data
DataHub
Hudi
Podcast Episode
Iceberg
Podcast Episode
Delta Lake
Podcast Episode
Apache Gobblin
Airflow
Superset
Podcast Episode
Collibra
Podcast Episode
Alation
Strata Conference Presentation
Acryl/DataHub Ingestion Framework
Joe Hellerstein
Trifacta
DataHub Roadmap
Data Mesh
OpenLineage
Podcast Episode
OpenMetadata
Egeria Open Metadata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 14, 2021 • 1h 2min
How And Why To Become Data Driven As A Business
Summary
Organizations of all sizes are striving to become data driven, starting in earnest with the rise of big data a decade ago. With the never-ending growth in data sources and methods for aggregating and analyzing them, the use of data to direct the business has become a requirement. Randy Bean has been helping enterprise organizations define and execute their data strategies since before the age of big data. In this episode he discusses his experiences and how he approached the work of distilling them for his book "Fail Fast, Learn Faster". This is an entertaining and enlightening exploration of the business side of data with an industry veteran.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Randy Bean about his recent book focusing on the use of big data and AI for informing data-driven business leadership
Interview
Introduction
How did you get involved in the area of data management?
Can you start by discussing the focus of the book and what motivated you to write it?
Who is the intended audience, and how did that inform the tone and content?
Businesses and their officers have been aiming to be "data driven" for years. In your experience, what are the concrete goals that are implied by that term?
What are the barriers that organizations encounter in the pursuit of those goals?
How have the success rates (real and imagined) shifted in recent years as the level of sophistication of the tools and industry for data management has increased?
What is the state of data initiatives in leading corporations today?
What are the biggest opportunities and risks that organizations focus on related to their use of data?
At what level(s) of the organization do lessons around data ethics need to be embedded?
You have been working with large companies for many years to help them with their adoption of "big data". How has your work on this book shifted or clarified your perspectives on the subject?
What are the main lessons or ideas that you hope readers will take away from the book?
What are the most interesting, innovative, or unexpected ways that you have seen big data applied to business?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on this book?
What are your predictions for the next decade of big data and AI?
Contact Info
@RandyBeanNVP on Twitter
LinkedIn
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Fail Fast, Learn Faster: Lessons in Data-Driven Leadership in an Age of Disruption, Big Data, and AI (affiliate link)
Book Website
Harvard Business Review
MIT Sloan Review
New Vantage Partners
COBOL
Moneyball
Weapons of Math Destruction
The Seven Roles of the Chief Data Officer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 8, 2021 • 44min
Make Your Business Metrics Reusable With Open Source Headless BI Using Metriql
Summary
The key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes. In this episode Burak Kabakcı shares the story behind the project, how you can use it to create your metrics definitions, and the benefits of treating the semantic layer as a dedicated component of your platform.
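A small, hypothetical sketch of the headless BI idea may help: the metric is defined once as data, and any consumer, whether a dashboard, a notebook, or a sync job, asks the metrics layer for it instead of re-implementing the aggregation SQL. The field names below are illustrative rather than Metriql's actual schema.

```python
# Hypothetical headless BI sketch: one metric definition, many consumers.
# The structure is illustrative and does not mirror Metriql's real schema.
import json

METRICS = {
    "weekly_active_users": {
        "dataset": "events",
        "aggregation": "count_distinct",
        "column": "user_id",
        "filters": ["event_name = 'login'"],
    }
}

def render_sql(metric_name, time_grain="week"):
    m = METRICS[metric_name]
    agg = {"count_distinct": "COUNT(DISTINCT {col})"}[m["aggregation"]]
    where = " AND ".join(m["filters"]) or "TRUE"
    return (
        f"SELECT DATE_TRUNC('{time_grain}', occurred_at) AS period, "
        f"{agg.format(col=m['column'])} AS {metric_name} "
        f"FROM {m['dataset']} WHERE {where} GROUP BY 1"
    )

# A BI tool can ask for the SQL, while another service reads the raw definition.
print(render_sql("weekly_active_users"))
print(json.dumps(METRICS["weekly_active_users"], indent=2))
```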
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Your host is Tobias Macey and today I’m interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Metriql is and the story behind it?
What are the characteristics and benefits of a "headless BI" system?
What was your motivation to create and open-source Metriql as an independent project outside of your business?
How are you approaching governance and sustainability of the project?
How does Metriql compare to projects such as Airbnb’s Minerva or Transform’s platform?
How does the industry/vertical of a business impact their ability to benefit from a metrics layer/headless BI?
What are the limitations to the logical complexity that can be applied to the calculation of a given metric/set of metrics?
Can you describe how Metriql is implemented?
How have the design and goals of the project changed or evolved since you began working on it?
What are the most complex/difficult engineering elements of building a metrics layer?
Can you describe the workflow of defining metrics?
What have been your guiding principles in defining the user experience for working with metriql?
What are the opportunities for including business users in the definition of metrics? (e.g. pushing down/generating definitions from a BI layer)
What are the biggest challenges and limitations of creating metrics definitions purely in SQL?
What are the options for exposing metrics back to the warehouse and other operational systems such as reverse ETL vendors?
What are the missing elements in the data ecosystem for taking full advantage of a headless BI/metrics layer?
What are the most interesting, innovative, or unexpected ways that you have seen Metriql used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metriql?
When is Metriql the wrong choice?
What do you have planned for the future of Metriql?
Contact Info
LinkedIn
Website
buremba on GitHub
@bu7emba on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Metriql
Rakam
Hazelcast
Headless BI
Google Data Studio
Superset
Podcast Episode
Podcast.__init__ Episode
Trino
Podcast Episode
Supergrain
The Missing Piece Of The Modern Data Stack article by Benn Stancil
Metabase
Podcast Episode
dbt
Podcast Episode
dbt-metabase
re_data
OpenMetadata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 6, 2021 • 46min
Adding Support For Distributed Transactions To The Redpanda Streaming Engine
Summary
Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.
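Because Redpanda speaks the Kafka API, transactions look the same to client code as they do on any Kafka-compatible broker. The sketch below uses the confluent-kafka Python client and assumes a broker on localhost:9092 with transaction support enabled in the cluster configuration; error handling is trimmed for brevity.

```python
# Minimal transactional produce against a Kafka-API-compatible broker such as
# Redpanda. Assumes localhost:9092 and that transactions are enabled on the
# cluster; consumers must use isolation.level=read_committed to honor them.
from confluent_kafka import Producer, KafkaException

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "orders-writer-1",  # stable id so the broker can fence zombie producers
})

producer.init_transactions()   # register the transactional id with the broker
producer.begin_transaction()
try:
    # Both writes become visible atomically to read-committed consumers, or not at all.
    producer.produce("orders", key=b"order-42", value=b'{"status": "placed"}')
    producer.produce("order-audit", key=b"order-42", value=b'{"event": "created"}')
    producer.commit_transaction()
except KafkaException:
    producer.abort_transaction()
```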
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Denis Rystsov about implementing transactions in the Redpanda streaming engine
Interview
Introduction
How did you get involved in the area of data management?
Can you quickly recap what Redpanda is and the goals of the project?
What are the use cases for transactions in a pub/sub messaging system?
What are the elements of streaming systems that make atomic transactions a complex problem?
What was the motivation for starting down the path of adding transactions to the Redpanda engine?
How did the constraint of supporting the Kafka API influence your implementation strategy for transaction semantics?
Can you talk through the details of how you ended up implementing transactions in Redpanda?
What are some of the roadblocks and complexities that you encountered while working through the implementation?
How did you approach the validation and verification of the transactions?
What other features or capabilities are you planning to work on next?
What are the most interesting, innovative, or unexpected ways that you have seen transactions in Redpanda used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on transactions for Redpanda?
When are transactions the wrong choice?
What do you have planned for the future of transaction support in Redpanda?
Contact Info
@rystsov on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Vectorized
Redpanda
Podcast Episode
Redpanda Transactions Post
Yandex
Cassandra
MongoDB
Riak
Cosmos DB
Jepsen
Podcast Episode
Testing Shared Memories paper
Journal of Systems Research
Kafka
Pulsar
Seastar Framework
CockroachDB
Podcast Episode
TiDB
Calvin Paper
Polyjuice Paper
Parallel Commit
Chaos Testing
Matchmaker Paxos Algorithm
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 2, 2021 • 1h 8min
Building Real-Time Data Platforms For Large Volumes Of Information With Aerospike
Summary
Aerospike is a database engine that is designed to provide millisecond response times for queries across terabytes or petabytes. In this episode Chief Strategy Officer Lenley Hensarling explains how the ability to process these large volumes of information in real time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system. If you need to work with massive volumes of data at high velocity and with millisecond latencies, then Aerospike is definitely worth learning about.
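For readers new to Aerospike, here is a minimal sketch of the key/value data model using the Aerospike Python client, assuming a local node on the default port: keys are (namespace, set, user key) tuples and records are flat maps of bins.

```python
# Minimal Aerospike read/write sketch, assuming a single local node on the
# default port (pip install aerospike). Records are maps of named bins.
import aerospike

config = {"hosts": [("127.0.0.1", 3000)]}
client = aerospike.client(config).connect()

key = ("test", "user_profiles", "user-123")  # (namespace, set, user key)
client.put(key, {"name": "Ada", "visits": 7, "segments": ["vip", "beta"]})

_, meta, bins = client.get(key)  # returns (key, metadata, bins)
print(bins)

client.close()
```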
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold’s proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Your host is Tobias Macey and today I’m interviewing Lenley Hensarling about Aerospike and building real-time data platforms
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Aerospike is and the story behind it?
What are the use cases that it is uniquely well suited for?
What are the use cases that you and the Aerospike team are focusing on and how does that influence your focus on priorities of feature development and user experience?
What are the driving factors for building a real-time data platform?
How is Aerospike being incorporated in application and data architectures?
Can you describe how the Aerospike engine is architected?
How have the design and architecture changed or evolved since it was first created?
How have market forces influenced the product priorities and focus?
What are the challenges that end users face when determining how to model their data given a key/value storage interface?
What are the abstraction layers that you and/or your users build to manage relational or hierarchical data architectures?
What are the operational characteristics of the Aerospike system? (e.g. deployment, scaling, CP vs AP, upgrades, clustering, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen Aerospike used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aerospike?
When is Aerospike the wrong choice?
What do you have planned for the future of Aerospike?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Aerospike
GitHub
EnterpriseDB
"Nobody Expects The Spanish Inquisition"
ARM CPU Architectures
AWS Graviton Processors
The Datacenter Is The Computer (Affiliate link)
Jepsen Tests
Podcast Episode
Cloud Native Computing Foundation
Prometheus
Grafana
OpenTelemetry
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 30, 2021 • 1h 12min
Delivering Your Personal Data Cloud With Prifina
Summary
The promise of online services is that they will make your life easier in exchange for collecting data about you. The reality is that they use more information than you realize for purposes that are not what you intended. There have been many attempts to harness all of the data that you generate for gaining useful insights about yourself, but they are generally difficult to set up and manage or require software development experience. The team at Prifina has built a platform that allows users to create their own personal data cloud and install applications built by developers that power useful experiences while keeping you in full control. In this episode Markus Lampinen shares the goals and vision of the company, the technical aspects of making it a reality, and the future vision for how services can be designed to respect users’ privacy while still providing compelling experiences.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Your host is Tobias Macey and today I’m interviewing Markus Lampinen about Prifina, a platform for building applications powered by personal data that is under the user’s control
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Prifina is and the story behind it?
What are the primary goals of Prifina?
There has been a lot of interest in the "quantified self" and different projects (many of them open source) which aim to aggregate all of a user’s data into a single system for analysis and integration. What was lacking in the ecosystem that makes Prifina necessary/valuable?
What are some of the personalized applications for this data that have been most compelling or that users are most interested in?
What are the sources of complexity that you are facing when managing access/privacy of users’ data?
Can you describe the architecture of the platform that you are building?
What are the technological/social/economic underpinnings that are necessary to make a platform like Prifina possible?
What are the assumptions that you had when you first became involved in the project which have been challenged or invalidated as you worked through the implementation and began engaging with users and developers?
How do you approach schema definition/management for developers to have a stable implementation target?
How has that schema evolved as you introduced new data sources?
What are the barriers that you and your users have to deal with when obtaining copies of their data for use with Prifina?
What are the potential threats that you anticipate for users gaining and maintaining control of their own data?
What are the untapped opportunities?
What are the topics where you have had to invest the most in user education?
What are the most interesting, innovative, or unexpected ways that you have seen Prifina used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Prifina?
When is Prifina the wrong choice?
What do you have planned for the future of Prifina?
Contact Info
LinkedIn
@mmlampinen on Twitter
mmlampinen on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Prifina
Google Takeout
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast