

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes of the tools, techniques, and difficulties that come with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Sep 1, 2020 • 1h 6min
Building A Better Data Warehouse For The Cloud At Firebolt
Summary
Data warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in data warehousing are oriented around cloud native architectures that take advantage of dynamic scaling and the separation of compute and storage. Firebolt is taking that a step further with a core focus on speed and interactivity. In this episode CEO and founder Eldad Farkash explains how the Firebolt platform is architected for high throughput, their simple and transparent pricing model to encourage widespread use, and the use cases that it unlocks through interactive query speeds.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Eldad Farkash about Firebolt, a cloud data warehouse optimized for speed and elasticity on structured and semi-structured data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Firebolt is and your motivation for building it?
How does Firebolt compare to other data warehouse technologies, and what unique features does it provide?
The lines between a data warehouse and a data lake have been blurring in recent years. Where on that continuum does Firebolt lie?
What are the unique use cases that Firebolt allows for?
How do the performance characteristics of Firebolt change the ways that an engineer should think about data modeling?
What technologies might someone replace with Firebolt?
How is Firebolt architected and how has the design evolved since you first began working on it?
What are some of the most challenging aspects of building a data warehouse platform that is optimized for speed?
How do you handle support for nested and semi-structured data?
In what ways have you found it necessary/useful to extend SQL?
Because object storage is immutable, updating or deleting data in a data lake involves reprocessing a potentially large amount of data. How do you approach that in Firebolt with your F3 format?
What have you found to be the most interesting, unexpected, or challenging lessons while building and scaling the Firebolt platform and business?
When is Firebolt the wrong choice?
What do you have planned for the future of Firebolt?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Firebolt
Sisense
SnowflakeDB
Podcast Episode
Redshift
Spark
Podcast Episode
Parquet
Podcast Episode
Hadoop
HDFS
S3
AWS Athena
BigQuery
Data Vault
Podcast Episode
Star Schema
Dimensional Modeling
Slowly Changing Dimensions
JDBC
TPC Benchmarks
DBT
Podcast Episode
Tableau
Looker
Podcast Episode
PrestoSQL
Podcast Episode
PostgreSQL
Podcast Episode
FoundationDB
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 25, 2020 • 51min
Metadata Management And Integration At LinkedIn With DataHub
Summary
In order to scale the use of data across an organization there are a number of challenges related to discovery, governance, and integration that need to be solved. The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. In this episode Mars Lan and Pardhu Gunnam explain how they designed the platform, how it integrates into their data platforms, and how it is being used to power data discovery and analytics at LinkedIn.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about DataHub, LinkedIn’s metadata management and data catalog platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what DataHub is and some of its back story?
What were you using at LinkedIn for metadata management prior to the introduction of DataHub?
What was lacking in the previous solutions that motivated you to create a new platform?
There are a large number of other systems available for building data catalogs and tracking metadata, both open source and proprietary. What are the features of DataHub that would lead someone to use it in place of the other options?
Who is the target audience for DataHub?
How do the needs of those end users influence or constrain your approach to the design and interfaces provided by DataHub?
Can you describe how DataHub is architected?
How has it evolved since you first began working on it?
What was your motivation for releasing DataHub as an open source project?
What have been the benefits of that decision?
What are the challenges that you face in maintaining changes between the public repository and your internally deployed instance?
What is the workflow for populating metadata into DataHub?
What are the challenges that you see in managing the format of metadata and establishing consistent models for the information being stored?
How do you handle discovery of data assets for users of DataHub?
What are the integration and extension points of the platform?
What is involved in deploying and maintaining an instance of the DataHub platform?
What are some of the most interesting or unexpected ways that you have seen DataHub used inside or outside of LinkedIn?
What are some of the most interesting, unexpected, or challenging lessons that you learned while building and working with DataHub?
When is DataHub the wrong choice?
What do you have planned for the future of the project?
Contact Info
Mars
LinkedIn
mars-lan on GitHub
Pardhu
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataHub
Map/Reduce
Apache Flume
LinkedIn Blog Post introducing DataHub
WhereHows
Hive Metastore
Kafka
CDC == Change Data Capture
Podcast Episode
PDL (LinkedIn’s schema definition language)
GraphQL
Elasticsearch
Neo4J
Apache Pinot
Apache Gobblin
Apache Samza
Open Sourcing DataHub Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 17, 2020 • 1h 6min
Exploring The TileDB Universal Data Engine
Summary
Most databases are designed to work with textual data, with some special purpose engines that support domain specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB, Stavros Papadopoulos, shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shift in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB Cloud to embed authorization into the storage engine while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture and how it enables a more flexible way to store and interact with data, powering better data sharing and new opportunities for blending specialized domains.
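To make the array-first model concrete, here is a rough sketch of defining, writing, and slicing a dense array, following the pattern of TileDB’s public Python quickstart (the array name and values are invented for illustration):

```python
import numpy as np
import tiledb  # pip install tiledb

# Define a 4x4 dense integer array with a single attribute "a".
dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(1, 4), tile=2, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(1, 4), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=False, attrs=[tiledb.Attr(name="a", dtype=np.int32)]
)
tiledb.DenseArray.create("quickstart_dense", schema)

# Write a NumPy array into it, then slice a sub-region back out.
with tiledb.DenseArray("quickstart_dense", mode="w") as A:
    A[:] = np.arange(1, 17, dtype=np.int32).reshape(4, 4)

with tiledb.DenseArray("quickstart_dense", mode="r") as A:
    print(A[1:3, 2:4]["a"])  # reads return a dict keyed by attribute name
```

Sparse arrays follow the same schema-first pattern with sparse=True, which is where much of the engineering effort discussed in the interview lies.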
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Stavros Papadopoulos about TileDB, the universal storage engine
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what TileDB is and the problem that you are trying to solve with it?
What was your motivation for building it?
What are the main use cases or problem domains that you are trying to solve for?
What are the shortcomings of existing approaches to database design that prevent them from being useful for these applications?
What are the benefits of using matrices for data processing and domain modeling?
What are the challenges that you have faced in storing and processing sparse matrices efficiently?
How does the usage of matrices as the foundational primitive affect the way that users should think about data modeling?
What are the benefits of unbundling the storage engine from the processing layer?
Can you describe how TileDB embedded is architected?
How has the design evolved since you first began working on it?
What is your approach to integrating with the broader ecosystem of data storage and processing utilities?
What does the workflow look like for someone using TileDB?
What is required to deploy TileDB in a production context?
How is the built in data versioning implemented?
What is the user experience for interacting with different versions of datasets?
How do you manage the lifecycle of versioned data to allow garbage collection?
How are you managing the governance and ongoing sustainability of the open source project, and the commercial offerings that you are building on top of it?
What are the most interesting, unexpected, or innovative ways that you have seen TileDB used?
What have you found to be the most interesting, unexpected, or challenging aspects of building TileDB?
What features or capabilities are you consciously deciding not to implement?
When is TileDB the wrong choice?
What do you have planned for the future of TileDB?
Contact Info
LinkedIn
stavrospapadopoulos on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
TileDB
GitHub
Data Frames
TileDB Cloud
MIT
Intel
Sparse Linear Algebra
Sparse Matrices
HDF5
Dask
Spark
MariaDB
PrestoDB
GDAL
PDAL
Turing Complete
Clustered Index
Parquet File Format
Podcast Episode
Serializability
Delta Lake
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 10, 2020 • 59min
Closing The Loop On Event Data Collection With Iteratively
Summary
Event-based data is a rich source of information for analytics, but only if the event structures are consistent. The team at Iteratively are building a platform to manage the end-to-end flow of collaboration around which events are needed, how to structure their attributes, and how they are captured. In this episode founders Patrick Thompson and Ondrej Hrebicek discuss the problems that they have experienced as a result of inconsistent event schemas, how the Iteratively platform integrates the definition, development, and delivery of event data, and the benefits of elevating the visibility of event data for improving the effectiveness of the resulting analytics. If you are struggling with inconsistent implementations of event data collection, or a lack of clarity on which attributes are needed and how they are being used, then this is definitely a conversation worth following.
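The core idea, enforcing a shared schema on every tracked event, can be illustrated with plain JSON Schema. A minimal sketch in Python using the jsonschema library (the event name and attributes are hypothetical, not Iteratively’s actual format):

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical schema for a "Song Played" analytics event.
SONG_PLAYED = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["event", "properties"],
    "properties": {
        "event": {"const": "Song Played"},
        "properties": {
            "type": "object",
            "required": ["song_id"],
            "properties": {
                "song_id": {"type": "string"},
                "duration_seconds": {"type": "number", "minimum": 0},
            },
            "additionalProperties": False,  # reject attributes nobody agreed on
        },
    },
}

event = {"event": "Song Played",
         "properties": {"song_id": "abc123", "duration_seconds": 211}}
try:
    validate(instance=event, schema=SONG_PLAYED)
except ValidationError as err:
    print(f"Rejected event: {err.message}")
```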
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Patrick Thompson and Ondrej Hrebicek about Iteratively, a platform for enforcing consistent schemas for your event data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Iteratively and your motivation for creating it?
What are some of the ways that you have seen inconsistent message structures cause problems?
What are some of the common anti-patterns that you have seen for managing the structure of event messages?
What are the benefits that Iteratively provides for the different roles in an organization?
Can you describe the workflow for a team using Iteratively?
How is the Iteratively platform architected?
How has the design changed or evolved since you first began working on it?
What are the difficulties that you have faced in building integrations for the Iteratively workflow?
How is schema evolution handled throughout the lifecycle of an event?
What are the challenges that engineers face in building effective integration tests for their event schemas?
What has been your biggest challenge in messaging for your platform and educating potential users of its benefits?
What are some of the most interesting or unexpected ways that you have seen Iteratively used?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Iteratively?
When is Iteratively the wrong choice?
What do you have planned for the future of Iteratively?
Contact Info
Patrick
LinkedIn
@Patrickt010 on Twitter
Website
Ondrej
LinkedIn
@ondrej421 on Twitter
ondrej on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Iteratively
Syncplicity
Locally Optimistic
DBT
Podcast Episode
Snowplow Analytics
Podcast Episode
JSON Schema
Master Data Management
Podcast Episode
SDLC == Software Development Life Cycle
Amplitude
Mixpanel
Mode Analytics
CRUD == Create, Read, Update, Delete
Segment
Podcast Episode
SchemaVer (JSON Schema Versioning Strategy)
Great Expectations
Podcast.init Interview
Data Engineering Podcast Interview
Confluence
Notion
Confluent Schema Registry
Podcast Episode
Snowplow Iglu Schema Registry
Pulsar Schema Registry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 4, 2020 • 1h 1min
A Practical Introduction To Graph Data Applications
Summary
Finding connections between data and the entities that they represent is a complex problem. Graph data models and the applications built on top of them are perfect for representing relationships and finding emergent structures in your information. In this episode Denise Gosnell and Matthias Broecheler discuss their recent book, the Practitioner’s Guide To Graph Data, including the fundamental principles that you need to know about graph structures, the current state of graph support in database engines, tooling, and query languages, as well as useful tips on potential pitfalls when putting them into production. This was an informative and enlightening conversation with two experts on graph data applications that will help you start on the right track in your own projects.
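To make the graph-thinking discussion concrete: in a Gremlin-compatible database (Gremlin being one of the query languages that comes up in the episode), a relationship query is expressed as a traversal rather than a join. A minimal sketch using the gremlinpython driver, assuming a Gremlin Server at a placeholder address and a toy "person knows person" graph:

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Connect to a hypothetical Gremlin Server endpoint.
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Friends-of-friends: who do Alice's acquaintances know? (two-hop traversal)
names = (
    g.V().has("person", "name", "Alice")
     .out("knows")
     .out("knows")
     .dedup()
     .values("name")
     .toList()
)
print(names)
conn.close()
```

The same question in a relational schema would require a self-join per hop, which is exactly the kind of access pattern that motivates a graph model.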
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Denise Gosnell and Matthias Broecheler about the recently published practitioner’s guide to graph data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what your goals are for the Practitioner’s Guide To Graph Data?
What was your motivation for writing a book to address this topic?
What do you see as the driving force behind the growing popularity of graph technologies in recent years?
What are some of the common use cases/applications of graph data and graph traversal algorithms?
What are the core elements of graph thinking that data teams need to be aware of to be effective in identifying those cases in their existing systems?
What are the fundamental principles of graph technologies that data engineers should be familiar with?
What are the core modeling principles that they need to know for designing schemas in a graph database?
Beyond databases, what are some of the other components of the data stack that can or should handle graphs natively?
Do you typically use a graph database as the primary or complementary data store?
What are some of the common challenges that you see when bringing graph applications to production?
What have you found to be some of the common points of confusion or error prone aspects of implementing and maintaining graph oriented applications?
When it comes to the specific technologies of different graph databases, what are some of the edge cases/variances in the interfaces or modeling capabilities that they present?
How does the variation in query languages impact the overall adoption of these technologies?
What are your thoughts on the recent standardization of GQL as an ANSI/ISO specification?
What are some of the scaling challenges that exist for graph data engines?
What are the ongoing developments/improvements/trends in graph technology that you are most excited about?
What are some of the shortcomings in existing technology/ecosystem for graph applications that you would like to see addressed?
What are some of the cases where a graph is the wrong abstraction for a data project?
What are some of the other resources that you recommend for anyone who wants to learn more about the various aspects of graph data?
Contact Info
Denise
LinkedIn
@DeniseKGosnell on Twitter
Matthias
LinkedIn
@MBroecheler on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The Practitioner’s Guide To Graph Data
Datastax
Titan graph database
Goethe
Graph Database
NoSQL
Relational Database
Elasticsearch
Podcast Episode
Associative Array Data Structure
RDF Triple
Datastax Multi-model Graph Database
Semantic Web
Gremlin Graph Query Language
Super Node
Neuromorphic Computing
Datastax Desktop
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 28, 2020 • 50min
Build More Reliable Distributed Systems By Breaking Them With Jepsen
Summary
A majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the Jepsen framework for testing the guarantees of distributed data processing systems and identifying when and why they break. In this episode he shares his approach to testing complex systems, the common challenges that are faced by engineers who build them, and why it is important to understand their limitations. This was a great look at some of the underlying principles that power your mission critical workloads.
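At its heart, a Jepsen analysis records a concurrent history of operations against a system and then checks whether that history satisfies a consistency model such as linearizability. Jepsen itself is written in Clojure and uses far more efficient checkers, but the core idea can be sketched in a few lines of Python: search for a total order of operations that respects real time and register semantics.

```python
from itertools import permutations

# Each operation: (invoke_time, complete_time, kind, value)
# kind "w" writes value; kind "r" is a read that observed value.
history = [
    (0, 3, "w", 1),  # write 1
    (1, 4, "r", 2),  # a concurrent read that saw 2
    (2, 5, "w", 2),  # a concurrent write of 2
]

def respects_real_time(order):
    # If b completed before a was invoked, b must be ordered before a.
    return not any(
        b[1] < a[0] for i, a in enumerate(order) for b in order[i + 1:]
    )

def register_consistent(order, initial=None):
    value = initial
    for _, _, kind, v in order:
        if kind == "w":
            value = v
        elif value != v:  # every read must return the latest write
            return False
    return True

def linearizable(history):
    return any(
        respects_real_time(o) and register_consistent(o)
        for o in permutations(history)
    )

print(linearizable(history))  # True: the order w1, w2, r2 explains this history
```

Real histories contain hundreds of operations, which is why Jepsen’s checkers rely on much smarter search than this factorial brute force.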
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Kyle Kingsbury about his work on the Jepsen testing framework and the failure modes of distributed systems
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the Jepsen project is?
What was your inspiration for starting the project?
What other methods are available for evaluating and stress testing distributed systems?
What are some of the common misconceptions or misunderstanding of distributed systems guarantees and how they impact real world usage of things like databases?
How do you approach the design of a test suite for a new distributed system?
What is your heuristic for determining the completeness of your test suite?
What are some of the common challenges of setting up a representative deployment for testing?
Can you walk through the workflow of setting up, running, and evaluating the output of a Jepsen test?
How is Jepsen implemented?
How has the design evolved since you first began working on it?
What are the pros and cons of using Clojure for building Jepsen?
If you were to start over today on the Jepsen framework what would you do differently?
What are some of the most common failure modes that you have identified in the platforms that you have tested?
What have you found to be the most difficult to resolve distributed systems bugs?
What are some of the interesting developments in distributed systems design that you are keeping an eye on?
How do you perceive the impact that Jepsen has had on modern distributed systems products?
What have you found to be the most interesting, unexpected, or challenging lessons learned while building Jepsen and evaluating mission critical systems?
What do you have planned for the future of the Jepsen framework?
Contact Info
aphyr on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Jepsen
Riak
Distributed Systems
TLA+
Coq
Isabelle
Cassandra DTest
FoundationDB
Podcast Episode
CRDT == Conflict-free Replicated Data-type
Podcast Episode
Riemann
Clojure
JVM == Java Virtual Machine
Kotlin
Haskell
Scala
Groovy
TiDB
YugabyteDB
Podcast Episode
CockroachDB
Podcast Episode
Raft consensus algorithm
Paxos
Leslie Lamport
Calvin
FaunaDB
Podcast Episode
Heidi Howard
CALM Conjecture
Causal Consistency
Hillel Wayne
Christopher Meiklejohn
Distsys Class
Distributed Systems For Fun And Profit by Mikito Takada
Christopher Meiklejohn Reading List
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 21, 2020 • 41min
Making Wind Energy More Efficient With Data At Turbit Systems
Summary
Wind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the overall efficiency of the turbines. Michael Tegtmeier founded Turbit Systems to help operators of wind farms identify and correct problems that contribute to suboptimal power outputs. In this episode he shares the story of how he got started working with wind energy, the system that he has built to collect data from the individual turbines, and how he is using machine learning to provide valuable insights to produce higher energy outputs. This was a great conversation about using data to improve the way the world works.
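A common way to spot turbine underperformance, and a drastically simplified version of what such monitoring involves, is to model the power curve from healthy SCADA history and flag readings that fall well below expectation. A toy sketch with made-up numbers (Turbit’s actual system uses neural networks over many more signals):

```python
import numpy as np

# Healthy SCADA history (invented numbers): wind speed in m/s vs power in kW.
wind_speed = np.array([4.1, 5.3, 6.8, 7.9, 9.2, 10.5])
power_kw = np.array([180, 420, 890, 1350, 1980, 2600])

# Fit a crude cubic power curve to the healthy data.
coeffs = np.polyfit(wind_speed, power_kw, deg=3)

def underperforming(ws, kw, tolerance=0.85):
    """Flag a reading that falls well below the expected power curve."""
    expected = np.polyval(coeffs, ws)
    return kw < tolerance * expected

print(underperforming(8.0, 900))   # far below the curve at 8 m/s -> True
print(underperforming(8.0, 1400))  # near the curve -> False
```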
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Michael Tegtmeier about Turbit, a machine learning powered platform for performance monitoring of wind farms
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Turbit and your motivation for creating the business?
What are the most problematic factors that contribute to low performance in power generation with wind turbines?
What is the current state of the art for accessing and analyzing data for wind farms?
What information are you able to gather from the SCADA systems in the turbine?
How uniform is the availability and formatting of data from different manufacturers?
How are you handling data collection for the individual turbines?
How much information are you processing at the point of collection vs. sending to a centralized data store?
Can you describe the system architecture of Turbit and the lifecycle of turbine data as it propagates from collection to analysis?
How do you incorporate domain knowledge into the identification of useful data and how it is used in the resultant models?
What are some of the most challenging aspects of building an analytics product for the wind energy sector?
What have you found to be the most interesting, unexpected, or challenging aspects of building and growing Turbit?
What do you have planned for the future of the technology and business?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Turbit Systems
LIDAR
Pulse Shaping
Wind Turbine
SCADA
Genetic Algorithm
Bremen Germany
Pitch
Yaw
Nacelle
Anemometer
Neural Network
Swarm64
Podcast Episode
Tensorflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 13, 2020 • 1h 5min
Open Source Production Grade Data Integration With Meltano
Summary
The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use open source option. The Meltano project is aiming to provide a solution to that situation. In this episode, project lead Douwe Maan shares the history of how Meltano got started, the motivation for the recent shift in focus, and how it is implemented. The Singer ecosystem has laid the groundwork for a great option to empower teams of all sizes to unlock the value of their data, and Meltano is building the remaining structure to make it a fully featured contender for proprietary systems.
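Meltano builds on the Singer specification, in which extractors (taps) and loaders (targets) are separate programs exchanging JSON messages over stdout/stdin. A minimal sketch of what a tap emits, per the published Singer spec (the stream name and fields are invented for illustration):

```python
import json
import sys
from datetime import datetime, timezone

def emit(message):
    """Singer messages are newline-delimited JSON on stdout."""
    sys.stdout.write(json.dumps(message) + "\n")

# Describe the stream before sending any records.
emit({
    "type": "SCHEMA",
    "stream": "users",
    "schema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
    },
    "key_properties": ["id"],
})

# Emit a record for that stream.
emit({
    "type": "RECORD",
    "stream": "users",
    "record": {"id": 1, "email": "someone@example.com"},
    "time_extracted": datetime.now(timezone.utc).isoformat(),
})

# Checkpoint progress so an interrupted sync can resume incrementally.
emit({"type": "STATE", "value": {"users": {"last_id": 1}}})
```

A target is simply another program that reads these lines from stdin and writes them to a destination; Meltano manages the configuration, plumbing, and orchestration around such pairs.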
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Douwe Maan about Meltano, an open source platform for building, running & orchestrating ELT pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Meltano is and the story behind it?
Who is the target audience?
How does the focus on small or early stage organizations constrain the architectural decisions that go into Meltano?
What have you found to be the complexities in trying to encapsulate the entirety of the data lifecycle in a single tool or platform?
What are the most painful transitions in that lifecycle and how does that pain manifest?
How and why has the focus of the project shifted from its original vision?
With your current focus on the data integration/data transfer stage of the lifecycle, what are you seeing as the biggest barriers to entry with the current ecosystem?
What are the main elements of your strategy to address these barriers?
How is the Meltano platform in its current incarnation implemented?
How much of the original architecture have you been able to retain, and how have you evolved it to align with your new direction?
What have you found to be the challenges that your users face when going from the easy on-ramp of local execution to then trying to scale and customize their pipelines for production use?
What are the most critical features that you are focusing on building now to make Meltano competitive with managed platforms?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Meltano?
When is Meltano the wrong choice?
What is your broad vision for the future of Meltano?
What are the most immediate needs for contribution that will help you realize that vision?
Contact Info
Website
DouweM on GitLab
DouweM on GitHub
@DouweM on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Meltano
GitLab
Mexico City
Netherlands
Locally Optimistic
Singer
Stitch Data
DBT
ELT
Informatica
Version Control
Code Review
CI/CD
Jupyter Notebook
LookML
Meltano Modeling Syntax
Redash
Metabase
Apache Superset
Apache Airflow
Luigi
Prefect
Dagster
Transferwise
Pipelinewise
12 Factor Application
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 6, 2020 • 46min
DataOps For Streaming Systems With Lenses.io
Summary
There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and governance of your data. That’s what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data requires a different approach than batch oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.
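The appeal of SQL as the common interface is that a query such as SELECT * FROM payments WHERE amount > 100 becomes a continuous computation over an unbounded stream rather than a one-shot scan. This toy Python generator mimics only those semantics; it is not Lenses’ engine, which parses real SQL and runs against Kafka:

```python
import json

# Toy equivalent of the continuous query:
#   SELECT * FROM payments WHERE amount > 100
# A streaming SQL engine compiles the predicate and applies it to every
# record as it arrives, forever, emitting matches downstream.
def streaming_where(records, predicate):
    for raw in records:  # in a real system this iterator never ends
        event = json.loads(raw)
        if predicate(event):
            yield event

payments = ['{"id": 1, "amount": 42}', '{"id": 2, "amount": 250}']
for match in streaming_where(payments, lambda e: e["amount"] > 100):
    print(match)  # {'id': 2, 'amount': 250}
```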
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Andrew Stevenson about Lenses.io, a platform to provide real-time data operations for engineers
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Lenses is and the story behind it?
What is your working definition for what constitutes DataOps?
How does the Lenses platform support the cross-cutting concerns that arise when trying to bridge the different roles in an organization to deliver value with data?
What are the typical barriers to collaboration, and how does Lenses help with that?
Many different systems provide a SQL interface to streaming data on various substrates. What was your reason for building your own SQL engine and what is unique about it?
What are the main challenges that you see engineers facing when working with streaming systems?
What have you found to be the most notable evolutions in the community and ecosystem around Kafka and streaming platforms?
One of the interesting features in the recent release is support for topologies to map out the relations between different producers and consumers across a stream. Why is that a difficult problem and how have you approached it?
On the point of monitoring, what are the foundational challenges that engineers run into when trying to gain visibility into streams of data?
What are some useful strategies for collecting and analyzing traces of data flows?
As with many things in the space of data, local development and pre-production testing and validation are complicated due to the potential scale and variability of a production system. What advice do you have for engineers who are trying to establish a sustainable workflow for streaming applications?
How do you facilitate the CI/CD process for enabling a culture of testing and establishing confidence in the correct functionality of your systems?
How is the Lenses platform implemented and how has its design evolved since you first began working on it?
What are some of the specifics of Kafka that you have had to reconsider or redesign as you began adding support for additional streaming engines (e.g. Redis and Pulsar)?
What are some of the most interesting, unexpected, or innovative ways that you have seen the Lenses platform used?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with Lenses?
When is Lenses the wrong choice?
What do you have planned for the future of the platform?
Contact Info
LinkedIn
@StevensonA_D on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Lenses.io
Babylon Health
DevOps
DataOps
GitOps
Apache Calcite
kSQL
Kafka Connect Query Language
Apache Flink
Podcast Episode
Apache Spark
Podcast Episode
Apache Pulsar
Podcast Episode
StreamNative Episode
Playtika
Riskfuel(?)
JMX Metrics
Amazon MSK (Managed Streaming for Kafka)
Prometheus
Canary Deployment
Kafka on Pulsar
Data Catalog
Data Mesh
Podcast Episode
Dagster
Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 30, 2020 • 57min
Data Collection And Management To Power Sound Recognition At Audio Analytic
Summary
We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges that they face in the collection and labelling of high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.
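The need for custom metadata is concrete: every clip has to carry its label, provenance, and recording characteristics through the pipeline. A rough sketch of such a record using only Python’s standard library wave module (the field names and label taxonomy here are hypothetical, not Audio Analytic’s actual schema):

```python
import wave
from pathlib import Path

def describe_clip(path: Path, label: str, location: str) -> dict:
    """Build a metadata record to accompany one labelled audio sample."""
    with wave.open(str(path), "rb") as clip:
        frames = clip.getnframes()
        rate = clip.getframerate()
        channels = clip.getnchannels()
    return {
        "file": path.name,
        "label": label,                     # e.g. "glass_break" (hypothetical taxonomy)
        "recording_location": location,     # provenance metadata
        "duration_seconds": frames / rate,  # source samples vary in length
        "sample_rate_hz": rate,
        "channels": channels,
    }

record = describe_clip(Path("sample_0001.wav"), "dog_bark", "kitchen, 2m from source")
print(record)
```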
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Audio Analytic?
What was your motivation for building an AI platform for sound recognition?
What are some of the ways that your platform is being used?
What are the unique challenges that you have faced in working with arbitrary sound data?
How do you handle the collection and labelling of the source data that you rely on for building your models?
Beyond just collection and storage, what is your process for defining a taxonomy of the audio data that you are working with?
How has the taxonomy had to evolve, and what assumptions have had to change, as you progressed in building the data set and the resulting models?
What are the challenges of building an embeddable AI model?
How do you manage the update cycle for models deployed on devices?
How do you identify relevant audio and deal with literal noise in the input data?
What rights and ownership challenges do you face in the collection of source data?
What was your design process for constructing a pipeline for the audio data that you need to process?
Can you describe how your overall data management system is architected?
How has that architecture evolved since you first began building and using it?
A majority of data tools are oriented around, and optimized for, collection and processing of textual data. How much off-the-shelf technology have you been able to use for working with audio?
What are some of the assumptions that you made at the start which have been shown to be inaccurate or in need of reconsidering?
How do you address variability in the duration of source samples in the processing pipeline?
How much of an issue do you face as a result of the variable quality of microphones in the embedded devices where the model is being run?
What are the limitations of the model in dealing with complex and layered audio environments?
How has the testing and evaluation of your model fed back into your strategies for collecting source data?
What are some of the weirdest or most unusual sounds that you have worked with?
What have been the most interesting, unexpected, or challenging lessons that you have learned in the process of building the technology and business of Audio Analytic?
What do you have planned for the future of the company?
Contact Info
Chris
LinkedIn
Thomas
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Audio Analytic
Twitter
Anechoic Chamber
EXIF Data
ID3 Tags
Polyphonic Sound Detection Score
GitHub Repository
ICASSP
CES
M0+ ARM Processor
Context Systems Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast