
Data Engineering Podcast
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Latest episodes

Sep 26, 2021 • 58min
Digging Into Data Reliability Engineering
Summary
The accuracy and availability of data have become critically important to the day-to-day operation of businesses. Similar to the practice of site reliability engineering as a means of ensuring consistent uptime of web services, there is a growing trend of building data reliability engineering practices in companies that rely heavily on their data. In this episode Egor Gryaznov explains how this practice manifests from a technical and organizational perspective and how you can start adopting it in your own teams.
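To make the idea concrete, here is a minimal sketch (not from the episode) of the kind of automated freshness and volume check that a data reliability practice builds on; the table name, thresholds, and SQLite connection are hypothetical stand-ins for a real warehouse and alerting stack.

```python
import sqlite3
import time

# Hypothetical thresholds; in practice these would be agreed SLOs.
FRESHNESS_SLO_SECONDS = 2 * 3600  # latest load must be under 2 hours old
MIN_EXPECTED_ROWS = 1000          # alert if the table looks suspiciously small

def check_orders(conn: sqlite3.Connection) -> list[str]:
    """Return a list of SLO violations for a hypothetical `orders` table."""
    failures = []
    # loaded_at is assumed to be stored as a Unix epoch timestamp.
    loaded_at, row_count = conn.execute(
        "SELECT MAX(loaded_at), COUNT(*) FROM orders"
    ).fetchone()
    if loaded_at is None or time.time() - loaded_at > FRESHNESS_SLO_SECONDS:
        failures.append("orders is stale: no load within the freshness SLO")
    if row_count < MIN_EXPECTED_ROWS:
        failures.append(f"orders volume anomaly: only {row_count} rows")
    return failures  # in practice these results would feed alerting/on-call
```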
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Your host is Tobias Macey and today I’m interviewing Egor Gryaznov, co-founder and CTO of Bigeye, about the ideas and practices of data reliability engineering and how to integrate it into your systems
Interview
Introduction
How did you get involved in the area of data management?
What does the term "Data Reliability Engineering" mean?
What is encompassed under the umbrella of Data Reliability Engineering?
How does it compare to the concepts from site reliability engineering?
Is DRE just a repackaged version of DataOps?
Why is Data Reliability Engineering particularly important now?
Who is responsible for the practice of DRE in an organization?
What are some areas of innovation that teams are focusing on to support a DRE practice?
What are the tools that teams are using to improve the reliability of their data operations?
What are the organizational systems that need to be in place to support a DRE practice?
What are some potential roadblocks that teams might have to address when planning and implementing a DRE strategy?
What are the most interesting, innovative, or unexpected approaches/solutions to DRE that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Reliability Engineering?
Is Data Reliability Engineering ever the wrong choice?
What do you have planned for the future of Bigeye, especially in terms of Data Reliability Engineering?
Contact Info
Find us at bigeye.com or reach out to us at hello@bigeye.com
You can find Egor on LinkedIn or email him at egor@bigeye.com
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Bigeye
Podcast Episode
Vertica
Looker
Podcast Episode
Site Reliability Engineering
Stemma
Podcast Episode
Collibra
Podcast Episode
OpenLineage
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 25, 2021 • 1h 4min
Massively Parallel Data Processing In Python Without The Effort Using Bodo
Summary
Python has become the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information. There have been many projects and strategies for overcoming these challenges, each with their own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any re-work.
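For a sense of what "without requiring any re-work" means in practice, Bodo’s documented entry point is a Numba-style JIT decorator applied to ordinary pandas code. The sketch below is illustrative only, assuming the bodo package is installed; the Parquet file and its columns are hypothetical.

```python
import bodo
import pandas as pd

# Bodo compiles the function body and distributes the work across MPI ranks,
# so the same script runs on one core or a whole cluster, e.g.:
#   mpiexec -n 8 python this_script.py
@bodo.jit
def mean_value_by_key(path):
    df = pd.read_parquet(path)  # hypothetical file with "key"/"value" columns
    return df.groupby("key")["value"].mean()

print(mean_value_by_key("sales.parquet"))
```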
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Your host is Tobias Macey and today I’m interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing Python code for massively parallel data processing and analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Bodo is and the story behind it?
What are the techniques/technologies that teams might use to optimize or scale out their data processing workflows?
Why have you focused your efforts on the Python language and toolchain?
Do you see any potential for expanding into other language communities?
What are the shortcomings of projects such as Dask and Ray for scaling out Python data projects?
Many people are familiar with the principles of HPC architectures, but can you share an overview of the current state of the art for HPC?
What are the tradeoffs of HPC vs scale-out distributed systems?
Can you describe the technical implementation of the Bodo platform?
What are the aspects of the Python language and package ecosystem that have complicated the work of building an optimizing compiler?
How do you handle compiled extensions? (e.g. C/C++/Fortran)
What are some of the assumptions/expectations that you had when first approaching this project that have been challenged as you progressed through its implementation?
How do you handle data distribution for scale out computation?
What are some software architecture/programming patterns that act as bottlenecks/optimization cliffs for parallelization?
What are some of the educational challenges that you have run into while working with potential and current customers?
What are the most interesting, innovative, or unexpected ways that you have seen Bodo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bodo?
When is Bodo the wrong choice?
What do you have planned for the future of Bodo?
Contact Info
LinkedIn
@EhsanTn on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Bodo
High Performance Computing (HPC)
University of Illinois, Urbana-Champaign
Julia Language
Pandas
Podcast.__init__ Episode
NumPy
Dask
Podcast Episode
Ray
Podcast.__init__ Episode
Numba
LLVM
SPMD
MPI
Elastic Fabric Adapter
Iceberg Table Format
Podcast Episode
IPython Parallel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 19, 2021 • 1h 12min
Declarative Machine Learning Without The Operational Overhead Using Continual
Summary
Building, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add the work of creating the model itself, and it’s not surprising that a majority of companies that could greatly benefit from machine learning have yet to either put it into production or see the value. Tristan Zajonc recognized the complexity that acts as a barrier to adoption and created the Continual platform in response. In this episode he shares his perspective on the benefits of declarative machine learning workflows as a means of accelerating adoption in businesses that don’t have the time, money, or ambition to build everything from scratch. He also discusses the technical underpinnings of what he is building and how using the data warehouse as a shared resource drastically shortens the time required to see value. This is a fascinating episode and Tristan’s work at Continual is likely to be the catalyst for a new stage in the machine learning community.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Tristan Zajonc about Continual, a platform for automating the creation and application of operational AI on top of your data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Continual is and the story behind it?
What is your definition for "operational AI" and how does it differ from other applications of ML/AI?
What are some example use cases for AI in an operational capacity?
What are the barriers to adoption for organizations that want to take advantage of predictive analytics?
Who are the target users of Continual?
Can you describe how the Continual platform is implemented?
How has the design and infrastructure changed or evolved since you first began working on it?
What is the workflow for someone building a model and putting it into production?
Once a model has been deployed, what are the mechanisms that you expose for interacting with it?
How does this differ from in-database ML capabilities such as what is offered by Vertica and BigQuery?
How much understanding of ML/AI principles is necessary for someone to create a model with Continual?
What is your estimation of the impact that Continual can have on the overall productivity of a data team/data scientist?
What are the most interesting, innovative, or unexpected ways that you have seen Continual used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Continual?
When is Continual the wrong choice?
What do you have planned for the future of Continual?
Contact Info
LinkedIn
@tristanzajonc on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Continual
World Bank
SAS
SPSS
Stata
Feature Store
DataRobot
Transfer Learning
dbt
Podcast Episode
Ludwig
Overton (Apple)
Hightouch
Census
Galaxy Schema
In-Database ML Podcast Episode
scikit-learn
Snorkel
Podcast Episode
Materialize
Podcast Episode
Flink SQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 19, 2021 • 55min
An Exploration Of The Data Engineering Requirements For Bioinformatics
Summary
Biology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has grown up that pairs statistics and computational analysis with scientific research, namely bioinformatics. This brings with it a unique set of challenges for data collection, data management, and analytical capabilities. In this episode Jillian Rowe shares her experience of working in the field and supporting teams of scientists and analysts with the data infrastructure that they need to get their work done. This is a fascinating exploration of the collaboration between data professionals and scientists.
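As a small taste of the format-handling work involved, here is an illustrative sketch, assuming Biopython and pandas are installed and using hypothetical file paths, that parses FASTQ reads and lands them in Parquet for the kinds of analytical tools discussed in the episode.

```python
import pandas as pd
from Bio import SeqIO  # Biopython

# FASTQ is the de facto format for raw sequencing reads; paths are hypothetical.
records = SeqIO.parse("reads.fastq", "fastq")
df = pd.DataFrame(
    {
        "read_id": rec.id,
        "sequence": str(rec.seq),
        # average of the per-base Phred quality scores parsed by Biopython
        "mean_quality": sum(rec.letter_annotations["phred_quality"]) / len(rec.seq),
    }
    for rec in records
)
# A columnar format like Parquet makes the reads queryable with tools
# such as Dask or Spark, both mentioned in the episode links.
df.to_parquet("reads.parquet")
```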
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Your host is Tobias Macey and today I’m interviewing Jillian Rowe about data engineering practices for bioinformatics projects
Interview
Introduction
How did you get involved in the area of data management?
How did you get into the field of bioinformatics?
Can you describe what is unique about data needs in bioinformatics?
What are some of the problems that you have found yourself regularly solving for your clients?
When building data engineering stacks for bioinformatics, what are the attributes that you are optimizing for? (e.g. speed, UX, scale, correctness, etc.)
Can you describe a typical set of technologies that you implement when working on a new project?
What kinds of systems do you need to integrate with?
What are the data formats that are widely used for bioinformatics?
What are some details that a data engineer would need to know to work effectively with those formats while preparing data for analysis?
What amount of domain expertise is necessary for a data engineer to work in life sciences?
What are the most interesting, innovative, or unexpected solutions that you have seen for manipulating bioinformatics data?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on bioinformatics projects?
What are some of the industry/academic trends or upcoming technologies that you are tracking for bioinformatics?
Contact Info
LinkedIn
jerowe on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Bioinformatics
How Perl Saved The Human Genome Project
Neo4J
AWS Parallel Cluster
Datashader
R Shiny
Plotly Dash
Apache Parquet
Dask
Podcast Episode
HDF5
Spark
Superset
Data Engineering Podcast Episode
Podcast.__init__ Episode
FastQ file format
BAM (Binary Alignment Map) File
Variant Call Format (VCF)
HIPAA
DVC
Podcast Episode
LakeFS
BioThings API
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 12, 2021 • 59min
Setting The Stage For The Next Chapter Of The Cassandra Database
Summary
The Cassandra database is one of the first open source options for globally scalable storage systems. Since its introduction in 2008 it has been powering systems at every scale. The community recently released a new major version that marks a milestone in its maturity and stability as a project and database. In this episode Ben Bromhead, CTO of Instaclustr, shares the challenges that the community has worked through, the work that went into the release, and how the stability and testing improvements are setting the stage for the future of the project.
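For readers new to Cassandra, here is an illustrative sketch using the open source Python driver, with hypothetical contact points and schema, showing the tunable consistency that underpins its globally scalable design.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical single-node cluster for illustration.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 3}"
)
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.events ("
    "device_id text, ts timestamp, reading double, "
    "PRIMARY KEY (device_id, ts))"
)
# Tunable consistency: QUORUM waits for a majority of replicas to acknowledge.
insert = SimpleStatement(
    "INSERT INTO demo.events (device_id, ts, reading) "
    "VALUES (%s, toTimestamp(now()), %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, ("sensor-1", 21.5))
```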
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Ben Bromhead about the recent release of Cassandra version 4 and how it fits in the current landscape of data tools
Interview
Introduction
How did you get involved in the area of data management?
For anyone who isn’t familiar with Cassandra, can you briefly describe what it is and some of the story behind it?
How did you get involved in the Cassandra project and how would you characterize your role?
What are the main use cases and industries where someone is likely to use Cassandra?
What is notable about the version 4 release?
What were some of the factors that contributed to the long delay between versions 3 and 4? (2015 – 2021)
What are your thoughts on the ongoing utility/benefits of projects such as ScyllaDB, particularly in light of the most recent release?
Cassandra is primarily used as a system of record. What are some of the tools and system architectures that users turn to when building analytical workloads for data stored in Cassandra?
The architecture of Cassandra has lent itself well to the cloud native ecosystem that has been growing in recent years. What do you see as the opportunities for Cassandra over the near to medium term as the cloud continues to grow in prominence?
What are some of the challenges that you and the Cassandra community have faced with the flurry of new data storage and processing systems that have popped up over the past few years?
What are the most interesting, innovative, or unexpected ways that you have seen Cassandra used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cassandra?
When is Cassandra the wrong choice?
What is in store for the future of Cassandra?
Contact Info
LinkedIn
@benbromhead on Twitter
benbromhead on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Cassandra
Instaclustr
HBase
DynamoDB Whitepaper
Property Based Testing
QuickTheories
Riak
FoundationDB
Podcast Episode
ScyllaDB
Podcast Episode
YugabyteDB
Podcast Episode
Azure Cosmos DB
Amazon Keyspaces
Netty
Kafka
CQRS == Command Query Responsibility Segregation
Elasticsearch
Redis
Memcached
Debezium
Podcast Episode
CDC == Change Data Capture
Podcast Episodes
Bigtable White Paper
CockroachDB
Podcast Episode
Vitess
CAP Theorem
Paxos
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 9, 2021 • 1h 4min
A View From The Round Table Of Gartner's Cool Vendors
Summary
Each year, Gartner analysts are tasked with identifying promising companies that are making an impact in their respective categories. For businesses working in the data management and analytics space, they recognized the efforts of Timbr.ai, Soda Data, Nexla, and Tada. In this episode the founders and leaders of each of these organizations share their perspective on the current state of the market, and the challenges facing businesses and data professionals today.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Saket Saurabh, Maarten Masschelein, Akshay Deshpande, and Dan Weitzner about the challenges facing data practitioners today and the solutions that are being brought to market for addressing them, as well as the work they are doing that got them recognized as "cool vendors" by Gartner.
Interview
Introduction
How did you get involved in the area of data management?
Can you each describe what you view as the biggest challenge facing data professionals?
Who are you building your solutions for, and what are the most common data management problems you are all solving?
What are the different components of data management, and why is it so complex?
What, if anything, will simplify this process?
The report covers a lot of new data management terminology – data governance, data observability, data fabric, data mesh, DataOps, MLOps, AIOps – what does this all mean and why is it important for data engineers?
How has the data management space changed in recent times? Describe the current data management landscape and any key developments.
From your perspective, what are the biggest challenges in the data management space today? What modern data management features are lacking in existing databases?
Gartner imagines a future where data and analytics leaders need to be prepared to rely on data management solutions that make heterogeneous, distributed data appear consolidated, easy to access and business friendly. How does this tally with your vision of the future of data management and what needs to happen to make this a reality?
What are the most interesting, innovative, or unexpected ways that you have seen your respective products used (in isolation or combined)?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on your respective platforms?
What are the upcoming trends and challenges that you are keeping a close eye on?
Contact Info
Saket
LinkedIn
@saketsaurabh on Twitter
Maarten
LinkedIn
@masscheleinm on Twitter
Dan
LinkedIn
Akshay
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Nexla
Soda
Tada
Timbr
Collibra
Podcast Episode
Gartner Cool Vendors
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 4, 2021 • 60min
Designing And Building Data Platforms As A Product
Summary
The term "data platform" gets thrown around a lot, but have you stopped to think about what it actually means for you and your organization? In this episode Lior Gavish, Lior Solomon, and Atul Gupte share their view of what it means to have a data platform, discuss their experiences building them at various companies, and provide advice on how to treat them like a software product. This is a valuable conversation about how to approach the work of selecting the tools that you use to power your data systems and considerations for how they can be woven together for a unified experience across your various stakeholders.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Lior Gavish, Lior Solomon, and Atul Gupte about the technical, social, and architectural aspects of building your data platform as a product for your internal customers
Interview
Introduction
How did you get involved in the area of data management? – all
Can we start by establishing a definition of "data platform" for the purpose of this conversation?
Who are the stakeholders in a data platform?
Where does the responsibility lie for creating and maintaining ("owning") the platform?
What are some of the technical and organizational constraints that are likely to factor into the design and execution of the platform?
What are the minimum set of requirements necessary to qualify as a platform? (as opposed to a collection of discrete components)
What are the additional capabilities that should be in place to simplify the use and maintenance of the platform?
How are data platforms managed? Are they managed by technical teams, product managers, etc.? What is the profile for a data product manager? – Atul G.
How do you set SLIs / SLOs with your data platform team when you don’t have clear metrics you’re tracking? – Lior S.
There has been a lot of conversation recently about different interpretations of the "modern data stack". For a team who is just starting to build out their platform, how much credence should they be giving to those debates?
What are the first steps that you recommend for those practitioners?
If an organization already has infrastructure in place for data/analytics, how might they think about building or buying their way toward a well integrated platform?
Once a platform is established, what are some challenges that teams should anticipate in scaling the platform?
Which axes of scale have you found to be most difficult to manage? (scale of infrastructure capacity, scale of organizational/technical complexity, scale of usage, etc.)
Do we think the "data platform" is a skill set? How do we split up the role of the platform? Is there one for real-time? Is there one for ETLs?
How do you handle the quality and reliability of the data powering your solution?
What are helpful techniques that you have used for collecting, prioritizing, and managing feature requests?
How do you justify the budget and resources for your data platform?
How do you measure the success of a data platform?
What is the relationship between a data platform and data products?
Are there any other companies you admire when it comes to building robust, scalable data architecture?
What are the most interesting, innovative, or unexpected ways that you have seen data platforms used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building and operating a data platform?
When is a data platform the wrong choice? (as opposed to buying an integrated solution, etc.)
What are the industry trends that you are monitoring/excited for in the space of data platforms?
Contact Info
Lior Gavish
LinkedIn
@lgavish on Twitter
Lior Solomon
LinkedIn
@liorsolomon on Twitter
Atul Gupte
LinkedIn
@atulgupte on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Monte Carlo
Vimeo
Facebook
Uber
Zynga
Great Expectations
Podcast Episode
Airflow
Podcast.__init__ Episode
Fivetran
Podcast Episode
dbt
Podcast Episode
Snowflake
Podcast Episode
Looker
Podcast Episode
Modern Data Stack Podcast Episode
Stitch
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 2, 2021 • 1h 1min
Presto Powered Cloud Data Lakes At Speed Made Easy With Ahana
Summary
The Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. In recent months the community has focused their efforts on making it the fastest possible option for running your analytics in the cloud. In this episode Dipti Borkar discusses the work that she and her team are doing at Ahana to simplify running your own PrestoDB environment in the cloud. She explains how they are optimizing the runtime to reduce latency and increase query throughput, the ways that they are contributing back to the open source community, and the exciting improvements that are in the works to make Presto an even more powerful option for all of your analytics.
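To ground what analytics in SQL on the data lake looks like from a client, here is an illustrative sketch using the presto-python-client package; the coordinator host, catalog, and table are hypothetical stand-ins for an Ahana-managed cluster.

```python
import prestodb  # presto-python-client

# Hypothetical coordinator endpoint and Hive-catalog table backed by object storage.
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="web",
)
cur = conn.cursor()
cur.execute(
    "SELECT page, count(*) AS views "
    "FROM pageviews "
    "WHERE dt = DATE '2021-09-01' "
    "GROUP BY page ORDER BY views DESC LIMIT 10"
)
for page, views in cur.fetchall():
    print(page, views)
```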
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Dipti Borkar, co-founder of Ahana, about Presto and Ahana, a managed SaaS service for Presto
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Ahana is and the story behind it?
There has been a lot of recent activity in the Presto community. Can you give an overview of the options that are available for someone wanting to use its SQL engine for querying their data?
What is Ahana’s role in the community/ecosystem?
What are some of the notable differences that have emerged over the past couple of years between the Trino (formerly PrestoSQL) and PrestoDB projects?
Another area that has been seeing a lot of activity is data lakes and projects to make them more manageable and feature complete (e.g. Hudi, Delta Lake, Iceberg, Nessie, LakeFS, etc.). How has that influenced your product focus and capabilities?
How does this activity change the calculus for organizations who are deciding on a lake or warehouse for their data architecture?
Can you describe how the Ahana Cloud platform is architected?
What are the additional systems that you have built to manage deployment, scaling, and multi-tenancy?
Beyond the storage and processing, what are the other notable tools and projects that have become part of the overall stack for supporting open analytics?
What are some areas of ongoing activity that you are keeping an eye on as you build out the Ahana offerings?
What are the most interesting, innovative, or unexpected ways that you have seen Ahana/Presto used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ahana?
When is Ahana the wrong choice?
What do you have planned for the future of Ahana?
Contact Info
LinkedIn
@dborkar on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Ahana
Alluxio
Podcast Episode
Couchbase
Kinetica
Tensorflow
PyTorch
Podcast.__init__ Episode
AWS Athena
AWS Glue
Hive Metastore
Clickhouse
Dremio
Podcast Episode
Apache Drill
Teradata
Snowflake
Podcast Episode
BigQuery
RaptorX
Aria Optimizations for Presto
Apache Ranger
Presto Plugin
Trino
Podcast Episode
Starburst
Podcast Episode
Hive
Iceberg
Podcast Episode
Hudi
Podcast Episode
Delta Lake
Podcast Episode
Superset
Podcast.__init__ Episode
Data Engineering Podcast Episode
Nessie
LakeFS
Amundsen
Podcast Episode
DataHub
Podcast Episode
OtterTune
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 28, 2021 • 51min
Do Away With Data Integration Through A Dataware Architecture With Cinchy
Summary
The reason that so much time and energy is spent on data integration is because of how our applications are designed. By making the software the owner of the data that it generates, we have to go through the trouble of extracting the information to then be used elsewhere. The team at Cinchy is working to bring about a new paradigm of software architecture that puts the data as the central element. In this episode Dan DeMers, Cinchy’s CEO, explains how their concept of a "Dataware" platform eliminates the need for costly and error-prone integration processes and the benefits that it can provide for transactional and analytical application design. This is a fascinating and unconventional approach to working with data, so definitely give this a listen to expand your thinking about how to build your systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Dan DeMers about Cinchy, a dataware platform aiming to simplify the work of data integration by eliminating ETL/ELT
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Cinchy is and the story behind it?
In your experience working in data and building complex enterprise-grade systems, what are the shortcomings and negative externalities of an ETL/ELT approach to data integration?
How is a Dataware platform different from a data lake or a data warehouse? What is it used for?
What is Zero-Copy Integration? How does that work?
Can you describe how customers start their Cinchy journey?
What are the main use case patterns that you’re seeing with Dataware?
Your platform offers unlimited users, including business users. What are some of the challenges that you face in building a user experience that doesn’t become overwhelming as an organization scales the number of data sources and processing flows?
What are the most interesting, innovative, or unexpected ways that you have seen Cinchy used?
When is Cinchy the wrong choice for a customer?
Can you describe the technical architecture of the Cinchy platform?
How do you establish connections/relationships among data from disparate sources?
How do you manage schema evolution in source systems?
What are some of the edge cases that users need to consider as they are designing and building those connections?
What are some of the features or capabilities of Cinchy that you think are overlooked or under-utilized?
How has your understanding of the problem space changed since you started working on Cinchy?
How has the architecture and design of the system evolved to reflect that updated understanding?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cinchy?
What do you have planned for the future of Cinchy?
Contact Info
LinkedIn
@dandemers on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Cinchy
Gordon Everest
Data Collaboration Alliance
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 25, 2021 • 58min
Decoupling Data Operations From Data Infrastructure Using Nexla
Summary
The technological and social ecosystem of data engineering and data management has been reaching a stage of maturity recently. As part of this stage in our collective journey the focus has been shifting toward operation and automation of the infrastructure and workflows that power our analytical workloads. It is an encouraging sign for the industry, but it is still a complex and challenging undertaking. In order to make this world of DataOps more accessible and manageable, the team at Nexla has built a platform that decouples the logical unit of data from the underlying mechanisms so that you can focus on the problems that really matter to your business. In this episode Saket Saurabh (CEO) and Avinash Shahdadpuri (CTO) share the story behind the Nexla platform, discuss the technical underpinnings, and describe how their concept of a Nexset simplifies the work of building data products for sharing within and between organizations.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Saket Saurabh and Avinash Shahdadpuri about Nexla, a platform for powering data operations and sharing within and across businesses
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Nexla is and the story behind it?
What are the major problems that Nexla is aiming to solve?
What are the components of a data platform that Nexla might replace?
What are the use cases and benefits of being able to publish data sets for use outside and across organizations?
What are the different elements involved in implementing DataOps?
How is the Nexla platform implemented?
What have been the most complex engineering challenges?
How has the architecture changed or evolved since you first began working on it?
What are some of the assumptions that you had at the start which have been challenged or invalidated?
What are some of the heuristics that you have found most useful in generating logical units of data in an automated fashion?
Once a Nexset has been created, what are some of the ways that they can be used or further processed?
What are the attributes of a Nexset? (e.g. access control policies, lineage, etc.)
How do you handle storage and sharing of a Nexset?
What are some of your grand hopes and ambitions for the Nexla platform and the potential for data exchanges?
What are the most interesting, innovative, or unexpected ways that you have seen Nexla used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Nexla?
When is Nexla the wrong choice?
What do you have planned for the future of Nexla?
Contact Info
Saket
LinkedIn
@saketsaurabh on Twitter
Avinash
LinkedIn
@avinashpuri on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Nexla
Nexsets
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast