

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Oct 8, 2021 • 44min
Make Your Business Metrics Reusable With Open Source Headless BI Using Metriql
Summary
The key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes. In this episode Burak Kabakcı shares the story behind the project, how you can use it to create your metrics definitions, and the benefits of treating the semantic layer as a dedicated component of your platform.
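To make the idea concrete, here is a minimal sketch of how a downstream consumer might query a Metriql metric from Python over Metriql's Trino-compatible interface, assuming a local `metriql serve` instance; the host, dataset, and measure names (orders, total_orders) are hypothetical placeholders rather than anything taken from the episode.

```python
# Hypothetical sketch: querying a shared metric definition through Metriql's
# Trino-compatible interface. Host, dataset, and measure names are invented.
import trino

conn = trino.dbapi.connect(
    host="localhost",   # assumed address of a local `metriql serve`
    port=8080,
    user="metriql",
    catalog="metriql",
)
cur = conn.cursor()
# Consumers aggregate against the centrally defined metric instead of
# re-implementing the business logic in each tool.
cur.execute("SELECT order_date, total_orders FROM orders GROUP BY 1")
print(cur.fetchall())
```

Because every consumer, from dashboards to reverse ETL jobs, resolves metrics through the same definitions, the numbers stay consistent across tools.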
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often take hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Your host is Tobias Macey and today I’m interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Metriql is and the story behind it?
What are the characteristics and benefits of a "headless BI" system?
What was your motivation to create and open-source Metriql as an independent project outside of your business?
How are you approaching governance and sustainability of the project?
How does Metriql compare to projects such as Airbnb’s Minerva or Transform’s platform?
How does the industry/vertical of a business impact their ability to benefit from a metrics layer/headless BI?
What are the limitations to the logical complexity that can be applied to the calculation of a given metric/set of metrics?
Can you describe how Metriql is implemented?
How have the design and goals of the project changed or evolved since you began working on it?
What are the most complex/difficult engineering elements of building a metrics layer?
Can you describe the workflow of defining metrics?
What have been your guiding principles in defining the user experience for working with Metriql?
What are the opportunities for including business users in the definition of metrics? (e.g. pushing down/generating definitions from a BI layer)
What are the biggest challenges and limitations of creating metrics definitions purely in SQL?
What are the options for exposing metrics back to the warehouse and other operational systems such as reverse ETL vendors?
What are the missing elements in the data ecosystem for taking full advantage of a headless BI/metrics layer?
What are the most interesting, innovative, or unexpected ways that you have seen Metriql used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metriql?
When is Metriql the wrong choice?
What do you have planned for the future of Metriql?
Contact Info
LinkedIn
Website
buremba on GitHub
@bu7emba on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Metriql
Rakam
Hazelcast
Headless BI
Google Data Studio
Superset
Podcast Episode
Podcast.__init__ Episode
Trino
Podcast Episode
Supergrain
The Missing Piece Of The Modern Data Stack article by Benn Stancil
Metabase
Podcast Episode
dbt
Podcast Episode
dbt-metabase
re_data
OpenMetadata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 6, 2021 • 46min
Adding Support For Distributed Transactions To The Redpanda Streaming Engine
Summary
Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems they ensure that a set of messages or transformations are executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.
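Because Redpanda implements the Kafka API, the transaction semantics discussed in the episode are reachable from any Kafka client. Below is a minimal sketch using the confluent-kafka Python client; the broker address, transactional id, and topic names are placeholders.

```python
# Sketch of a transactional producer against a Kafka-API-compatible broker
# such as Redpanda. Broker address, id, and topics are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed local Redpanda broker
    "transactional.id": "demo-transfer-1",  # stable id enables exactly-once
})
producer.init_transactions()

producer.begin_transaction()
try:
    # Both writes become visible atomically: consumers reading with
    # isolation.level=read_committed see either both messages or neither.
    producer.produce("debits", key="acct-1", value=b"-100")
    producer.produce("credits", key="acct-2", value=b"+100")
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()
    raise
```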
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Denis Rystsov about implementing transactions in the Redpanda streaming engine
Interview
Introduction
How did you get involved in the area of data management?
Can you quickly recap what Redpanda is and the goals of the project?
What are the use cases for transactions in a pub/sub messaging system?
What are the elements of streaming systems that make atomic transactions a complex problem?
What was the motivation for starting down the path of adding transactions to the Redpanda engine?
How did the constraint of supporting the Kafka API influence your implementation strategy for transaction semantics?
Can you talk through the details of how you ended up implementing transactions in Redpanda?
What are some of the roadblocks and complexities that you encountered while working through the implementation?
How did you approach the validation and verification of the transactions?
What other features or capabilities are you planning to work on next?
What are the most interesting, innovative, or unexpected ways that you have seen transactions in Redpanda used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on transactions for Redpanda?
When are transactions the wrong choice?
What do you have planned for the future of transaction support in Redpanda?
Contact Info
@rystsov on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Vectorized
Redpanda
Podcast Episode
Redpanda Transactions Post
Yandex
Cassandra
MongoDB
Riak
Cosmos DB
Jepsen
Podcast Episode
Testing Shared Memories paper
Journal of Systems Research
Kafka
Pulsar
Seastar Framework
CockroachDB
Podcast Episode
TiDB
Calvin Paper
Polyjuice Paper
Parallel Commit
Chaos Testing
Matchmaker Paxos Algorithm
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 2, 2021 • 1h 8min
Building Real-Time Data Platforms For Large Volumes Of Information With Aerospike
Summary
Aerospike is a database engine that is designed to provide millisecond response times for queries across terabytes or petabytes. In this episode Chief Strategy Officer Lenley Hensarling explains how the ability to process these large volumes of information in real time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system. If you need to deal with massive data, at high velocities, in milliseconds, then Aerospike is definitely worth learning about.
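For a feel of the key/value data model discussed in the episode, here is a minimal sketch of a write and read with the Aerospike Python client, assuming a local node; the namespace, set, and bin names are placeholders.

```python
# Sketch: single-record operations with the Aerospike Python client.
# Namespace, set, key, and bin names are invented for illustration.
import aerospike

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# Records are addressed by a (namespace, set, primary key) tuple and hold
# named bins, which is the key/value model the episode describes.
key = ("test", "users", "user-42")
client.put(key, {"name": "Ada", "visits": 1})

(_, _, bins) = client.get(key)   # returns (key, metadata, bins)
print(bins)                      # {'name': 'Ada', 'visits': 1}

client.close()
```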
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often take hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold’s proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Your host is Tobias Macey and today I’m interviewing Lenley Hensarling about Aerospike and building real-time data platforms
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Aerospike is and the story behind it?
What are the use cases that it is uniquely well suited for?
What are the use cases that you and the Aerospike team are focusing on and how does that influence your focus on priorities of feature development and user experience?
What are the driving factors for building a real-time data platform?
How is Aerospike being incorporated in application and data architectures?
Can you describe how the Aerospike engine is architected?
How have the design and architecture changed or evolved since it was first created?
How have market forces influenced the product priorities and focus?
What are the challenges that end users face when determining how to model their data given a key/value storage interface?
What are the abstraction layers that you and/or your users build to manage relational or hierarchical data architectures?
What are the operational characteristics of the Aerospike system? (e.g. deployment, scaling, CP vs AP, upgrades, clustering, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen Aerospike used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aerospike?
When is Aerospike the wrong choice?
What do you have planned for the future of Aerospike?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Aerospike
GitHub
EnterpriseDB
"Nobody Expects The Spanish Inquisition"
ARM CPU Architectures
AWS Graviton Processors
The Datacenter Is The Computer (Affiliate link)
Jepsen Tests
Podcast Episode
Cloud Native Computing Foundation
Prometheus
Grafana
OpenTelemetry
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 30, 2021 • 1h 12min
Delivering Your Personal Data Cloud With Prifina
Summary
The promise of online services is that they will make your life easier in exchange for collecting data about you. The reality is that they use more information than you realize for purposes that are not what you intended. There have been many attempts to harness all of the data that you generate for gaining useful insights about yourself, but they are generally difficult to set up and manage or require software development experience. The team at Prifina has built a platform that allows users to create their own personal data cloud and install applications built by developers that power useful experiences while keeping you in full control. In this episode Markus Lampinen shares the goals and vision of the company, the technical aspects of making it a reality, and the future vision for how services can be designed to respect users’ privacy while still providing compelling experiences.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Your host is Tobias Macey and today I’m interviewing Markus Lampinen about Prifina, a platform for building applications powered by personal data that is under the user’s control
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Prifina is and the story behind it?
What are the primary goals of Prifina?
There has been a lot of interest in the "quantified self" and different projects (many of them open source) which aim to aggregate all of a user’s data into a single system for analysis and integration. What was lacking in the ecosystem that makes Prifina necessary/valuable?
What are some of the personalized applications for this data that have been most compelling or that users are most interested in?
What are the sources of complexity that you are facing when managing access/privacy of user’s data?
Can you describe the architecture of the platform that you are building?
What are the technological/social/economic underpinnings that are necessary to make a platform like Prifina possible?
What are the assumptions that you had when you first became involved in the project which have been challenged or invalidated as you worked through the implementation and began engaging with users and developers?
How do you approach schema definition/management for developers to have a stable implementation target?
How has that schema evolved as you introduced new data sources?
What are the barriers that you and your users have to deal with when obtaining copies of their data for use with Prifina?
What are the potential threats that you anticipate for users gaining and maintaining control of their own data?
What are the untapped opportunities?
What are the topics where you have had to invest the most in user education?
What are the most interesting, innovative, or unexpected ways that you have seen Prifina used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Prifina?
When is Prifina the wrong choice?
What do you have planned for the future of Prifina?
Contact Info
LinkedIn
@mmlampinen on Twitter
mmlampinen on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Prifina
Google Takeout
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 26, 2021 • 58min
Digging Into Data Reliability Engineering
Summary
The accuracy and availability of data has become critically important to the day-to-day operation of businesses. Similar to the practice of site reliability engineering as a means of ensuring consistent uptime of web services, there has been a new trend of building data reliability engineering practices in companies that rely heavily on their data. In this episode Egor Gryaznov explains how this practice manifests from a technical and organizational perspective and how you can start adopting it in your own teams.
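By analogy with SRE's uptime checks and error budgets, a data reliability practice typically starts with small automated assertions such as freshness SLOs. The sketch below is purely illustrative, with SQLite standing in for a real warehouse and an invented table and threshold.

```python
# Illustrative freshness-SLO check; SQLite stands in for a warehouse driver,
# and the table, column, and six-hour threshold are invented.
import datetime
import sqlite3

FRESHNESS_SLO = datetime.timedelta(hours=6)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, loaded_at TEXT)")
conn.execute(
    "INSERT INTO orders VALUES (1, ?)",
    (datetime.datetime.utcnow().isoformat(),),
)

# How long ago did the most recent load land?
(latest,) = conn.execute("SELECT MAX(loaded_at) FROM orders").fetchone()
lag = datetime.datetime.utcnow() - datetime.datetime.fromisoformat(latest)

if lag > FRESHNESS_SLO:
    print(f"SLO violated: orders last loaded {lag} ago")  # page on-call here
else:
    print(f"orders is within its freshness SLO (lag {lag})")
```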
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Your host is Tobias Macey and today I’m interviewing Egor Gryaznov, co-founder and CTO of Bigeye, about the ideas and practices of data reliability engineering and how to integrate it into your systems
Interview
Introduction
How did you get involved in the area of data management?
What does the term "Data Reliability Engineering" mean?
What is encompassed under the umbrella of Data Reliability Engineering?
How does it compare to the concepts from site reliability engineering?
Is DRE just a repackaged version of DataOps?
Why is Data Reliability Engineering particularly important now?
Who is responsible for the practice of DRE in an organization?
What are some areas of innovation that teams are focusing on to support a DRE practice?
What are the tools that teams are using to improve the reliability of their data operations?
What are the organizational systems that need to be in place to support a DRE practice?
What are some potential roadblocks that teams might have to address when planning and implementing a DRE strategy?
What are the most interesting, innovative, or unexpected approaches/solutions to DRE that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Reliability Engineering?
Is Data Reliability Engineering ever the wrong choice?
What do you have planned for the future of Bigeye, especially in terms of Data Reliability Engineering?
Contact Info
Find us at bigeye.com or reach out to us at hello@bigeye.com
You can find Egor on LinkedIn or email him at egor@bigeye.com
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Bigeye
Podcast Episode
Vertica
Looker
Podcast Episode
Site Reliability Engineering
Stemma
Podcast Episode
Collibra
Podcast Episode
OpenLineage
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 25, 2021 • 1h 4min
Massively Parallel Data Processing In Python Without The Effort Using Bodo
Summary
Python has become the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information. There have been many projects and strategies for overcoming these challenges, each with their own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any re-work.
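Bodo's public examples center on decorating ordinary pandas code with a JIT compiler rather than porting it to a new API; the sketch below follows that documented pattern, with an invented file name and columns.

```python
# Sketch of Bodo's compilation model: unmodified pandas code is decorated
# and compiled for parallel execution. File and column names are invented.
import bodo
import pandas as pd

@bodo.jit
def daily_totals(path):
    df = pd.read_csv(path)                    # I/O is split across MPI ranks
    return df.groupby("day")["amount"].sum()  # compiler inserts the shuffle

# Run with e.g. `mpiexec -n 8 python totals.py` to scale the same script out
# across cores or nodes without code changes.
print(daily_totals("transactions.csv"))
```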
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Your host is Tobias Macey and today I’m interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing Python code for massively parallel data processing and analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Bodo is and the story behind it?
What are the techniques/technologies that teams might use to optimize or scale out their data processing workflows?
Why have you focused your efforts on the Python language and toolchain?
Do you see any potential for expanding into other language communities?
What are the shortcomings of projects such as Dask and Ray for scaling out Python data projects?
Many people are familiar with the principles of HPC architectures, but can you share an overview of the current state of the art for HPC?
What are the tradeoffs of HPC vs scale-out distributed systems?
Can you describe the technical implementation of the Bodo platform?
What are the aspects of the Python language and package ecosystem that have complicated the work of building an optimizing compiler?
How do you handle compiled extensions? (e.g. C/C++/Fortran)
What are some of the assumptions/expectations that you had when first approaching this project that have been challenged as you progressed through its implementation?
How do you handle data distribution for scale out computation?
What are some software architecture/programming patterns that act as bottlenecks/optimization cliffs for parallelization?
What are some of the educational challenges that you have run into while working with potential and current customers?
What are the most interesting, innovative, or unexpected ways that you have seen Bodo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bodo?
When is Bodo the wrong choice?
What do you have planned for the future of Bodo?
Contact Info
LinkedIn
@EhsanTn on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Bodo
High Performance Computing (HPC)
University of Illinois, Urbana-Champaign
Julia Language
Pandas
Podcast.__init__ Episode
NumPy
Dask
Podcast Episode
Ray
Podcast.__init__ Episode
Numba
LLVM
SPMD
MPI
Elastic Fabric Adapter
Iceberg Table Format
Podcast Episode
IPython Parallel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 19, 2021 • 1h 12min
Declarative Machine Learning Without The Operational Overhead Using Continual
Summary
Building, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add the work of creating the model itself, and it’s not surprising that a majority of companies that could greatly benefit from machine learning have yet to either put it into production or see the value. Tristan Zajonc recognized the complexity that acts as a barrier to adoption and created the Continual platform in response. In this episode he shares his perspective on the benefits of declarative machine learning workflows as a means of accelerating adoption in businesses that don’t have the time, money, or ambition to build everything from scratch. He also discusses the technical underpinnings of what he is building and how using the data warehouse as a shared resource drastically shortens the time required to see value. This is a fascinating episode and Tristan’s work at Continual is likely to be the catalyst for a new stage in the machine learning community.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Tristan Zajonc about Continual, a platform for automating the creation and application of operational AI on top of your data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Continual is and the story behind it?
What is your definition for "operational AI" and how does it differ from other applications of ML/AI?
What are some example use cases for AI in an operational capacity?
What are the barriers to adoption for organizations that want to take advantage of predictive analytics?
Who are the target users of Continual?
Can you describe how the Continual platform is implemented?
How has the design and infrastructure changed or evolved since you first began working on it?
What is the workflow for someone building a model and putting it into production?
Once a model has been deployed, what are the mechanisms that you expose for interacting with it?
How does this differ from in-database ML capabilities such as what is offered by Vertica and BigQuery?
How much understanding of ML/AI principles is necessary for someone to create a model with Continual?
What is your estimation of the impact that Continual can have on the overall productivity of a data team/data scientist?
What are the most interesting, innovative, or unexpected ways that you have seen Continual used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Continual?
When is Continual the wrong choice?
What do you have planned for the future of Continual?
Contact Info
LinkedIn
@tristanzajonc on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Continual
World Bank
SAS
SPSS
Stata
Feature Store
DataRobot
Transfer Learning
dbt
Podcast Episode
Ludwig
Overton (Apple)
Hightouch
Census
Galaxy Schema
In-Database ML Podcast Episode
scikit-learn
Snorkel
Podcast Episode
Materialize
Podcast Episode
Flink SQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 19, 2021 • 55min
An Exploration Of The Data Engineering Requirements For Bioinformatics
Summary
Biology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has grown up that pairs statistics and computational analysis with scientific research, namely bioinformatics. This brings with it a unique set of challenges for data collection, data management, and analytical capabilities. In this episode Jillian Rowe shares her experience of working in the field and supporting teams of scientists and analysts with the data infrastructure that they need to get their work done. This is a fascinating exploration of the collaboration between data professionals and scientists.
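As one concrete example of the domain-specific formats covered later in the episode (FastQ, BAM, VCF), here is a minimal sketch of reading FastQ records with Biopython; the file path is a placeholder.

```python
# Sketch: iterating over sequencing reads in a FastQ file with Biopython.
# The file path is a placeholder.
from Bio import SeqIO

for record in SeqIO.parse("sample.fastq", "fastq"):
    # Each record carries a read id, the base calls, and per-base Phred
    # quality scores that downstream pipelines use for filtering.
    quals = record.letter_annotations["phred_quality"]
    print(record.id, len(record.seq), min(quals))
```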
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Your host is Tobias Macey and today I’m interviewing Jillian Rowe about data engineering practices for bioinformatics projects
Interview
Introduction
How did you get involved in the area of data management?
How did you get into the field of bioinformatics?
Can you describe what is unique about data needs in bioinformatics?
What are some of the problems that you have found yourself regularly solving for your clients?
When building data engineering stacks for bioinformatics, what are the attributes that you are optimizing for? (e.g. speed, UX, scale, correctness, etc.)
Can you describe a typical set of technologies that you implement when working on a new project?
What kinds of systems do you need to integrate with?
What are the data formats that are widely used for bioinformatics?
What are some details that a data engineer would need to know to work effectively with those formats while preparing data for analysis?
What amount of domain expertise is necessary for a data engineer to work in life sciences?
What are the most interesting, innovative, or unexpected solutions that you have seen for manipulating bioinformatics data?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on bioinformatics projects?
What are some of the industry/academic trends or upcoming technologies that you are tracking for bioinformatics?
Contact Info
LinkedIn
jerowe on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Bioinformatics
How Perl Saved The Human Genome Project
Neo4j
AWS Parallel Cluster
Datashader
R Shiny
Plotly Dash
Apache Parquet
Dask
Podcast Episode
HDF5
Spark
Superset
Data Engineering Podcast Episode
Podcast.__init__ Episode
FastQ file format
BAM (Binary Alignment Map) File
Variant Call Format (VCF)
HIPAA
DVC
Podcast Episode
LakeFS
BioThings API
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 12, 2021 • 59min
Setting The Stage For The Next Chapter Of The Cassandra Database
Summary
The Cassandra database is one of the first open source options for globally scalable storage systems. Since its introduction in 2008 it has been powering systems at every scale. The community recently released a new major version that marks a milestone in its maturity and stability as a project and database. In this episode Ben Bromhead, CTO of Instaclustr, shares the challenges that the community has worked through, the work that went into the release, and how the stability and testing improvements are setting the stage for the future of the project.
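For readers who have not touched it before, the sketch below shows the basic shape of working with Cassandra from Python using the DataStax driver, assuming a local node; the keyspace and table names are invented.

```python
# Sketch: connecting to a Cassandra cluster and writing a row with the
# DataStax Python driver. Contact point, keyspace, and table are invented.
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # assumed local node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.events (id uuid PRIMARY KEY, payload text)"
)

# Parameters are bound by the driver, which is topology-aware and routes
# requests to the appropriate replicas.
session.execute(
    "INSERT INTO demo.events (id, payload) VALUES (%s, %s)",
    (uuid.uuid4(), "hello"),
)
cluster.shutdown()
```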
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Ben Bromhead about the recent release of Cassandra version 4 and how it fits in the current landscape of data tools
Interview
Introduction
How did you get involved in the area of data management?
For anyone who isn’t familiar with Cassandra, can you briefly describe what it is and some of the story behind it?
How did you get involved in the Cassandra project and how would you characterize your role?
What are the main use cases and industries where someone is likely to use Cassandra?
What is notable about the version 4 release?
What were some of the factors that contributed to the long delay between versions 3 and 4? (2015 – 2021)
What are your thoughts on the ongoing utility/benefits of projects such as ScyllaDB, particularly in light of the most recent release?
Cassandra is primarily used as a system of record. What are some of the tools and system architectures that users turn to when building analytical workloads for data stored in Cassandra?
The architecture of Cassandra has lent itself well to the cloud native ecosystem that has been growing in recent years. What do you see as the opportunities for Cassandra over the near to medium term as the cloud continues to grow in prominence?
What are some of the challenges that you and the Cassandra community have faced with the flurry of new data storage and processing systems that have popped up over the past few years?
What are the most interesting, innovative, or unexpected ways that you have seen Cassandra used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cassandra?
When is Cassandra the wrong choice?
What is in store for the future of Cassandra?
Contact Info
LinkedIn
@benbromhead on Twitter
benbromhead on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Cassandra
Instaclustr
HBase
Dynamo Whitepaper
Property Based Testing
QuickTheories
Riak
FoundationDB
Podcast Episode
ScyllaDB
Podcast Episode
YugabyteDB
Podcast Episode
Azure Cosmos DB
Amazon Keyspaces
Netty
Kafka
CQRS == Command Query Responsibility Segregation
Elasticsearch
Redis
Memcached
Debezium
Podcast Episode
CDC == Change Data Capture
Podcast Episodes
Bigtable White Paper
CockroachDB
Podcast Episode
Vitess
CAP Theorem
Paxos
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 9, 2021 • 1h 4min
A View From The Round Table Of Gartner's Cool Vendors
Summary
Gartner analysts are tasked with identifying promising companies each year that are making an impact in their respective categories. For businesses that are working in the data management and analytics space they recognized the efforts of Timbr.ai, Soda Data, Nexla, and Tada. In this episode the founders and leaders of each of these organizations share their perspective on the current state of the market, and the challenges facing businesses and data professionals today.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Saket Saurabh, Maarten Masschelein, Akshay Deshpande, and Dan Weitzner about the challenges facing data practitioners today and the solutions that are being brought to market for addressing them, as well as the work they are doing that got them recognized as "cool vendors" by Gartner.
Interview
Introduction
How did you get involved in the area of data management?
Can you each describe what you view as the biggest challenge facing data professionals?
Who are you building your solutions for and what are the most common data management problems you are all solving?
What are the different components of data management and why is it so complex?
What, if anything, will simplify this process?
The report covers a lot of new data management terminology – data governance, data observability, data fabric, data mesh, DataOps, MLOps, AIOps – what does this all mean and why is it important for data engineers?
How has the data management space changed in recent times? Describe the current data management landscape and any key developments.
From your perspective, what are the biggest challenges in the data management space today? What modern data management features are lacking in existing databases?
Gartner imagines a future where data and analytics leaders need to be prepared to rely on data management solutions that make heterogeneous, distributed data appear consolidated, easy to access and business friendly. How does this tally with your vision of the future of data management and what needs to happen to make this a reality?
What are the most interesting, innovative, or unexpected ways that you have seen your respective products used (in isolation or combined)?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on your respective platforms?
What are the upcoming trends and challenges that you are keeping a close eye on?
Contact Info
Saket
LinkedIn
@saketsaurabh on Twitter
Maarten
LinkedIn
@masscheleinm on Twitter
Dan
LinkedIn
Akshay
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Nexla
Soda
Tada
Timbr
Collibra
Podcast Episode
Gartner Cool Vendors
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast