

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Aug 12, 2019 • 45min
Digging Into Data Replication At Fivetran
Summary
The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.
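To make the extract-and-load pattern concrete, here is a minimal sketch of cursor-based incremental replication in Python. This is not Fivetran’s implementation: the table, cursor column, and SQLite connections are hypothetical stand-ins for a real source system and warehouse.
```python
import sqlite3  # stand-in for any DB-API compatible source and destination


def sync_table(source, destination, table, cursor_column, last_cursor):
    """Copy rows newer than the saved cursor value, then advance the cursor."""
    rows = source.execute(
        f"SELECT * FROM {table} WHERE {cursor_column} > ? ORDER BY {cursor_column}",
        (last_cursor,),
    ).fetchall()
    for row in rows:
        placeholders = ", ".join("?" for _ in row)
        destination.execute(f"INSERT INTO {table} VALUES ({placeholders})", row)
    destination.commit()
    # Persist the high-water mark so the next run resumes where this one stopped.
    # For simplicity this assumes cursor_column is the first column in the table.
    return rows[-1][0] if rows else last_cursor


source = sqlite3.connect("source.db")
warehouse = sqlite3.connect("warehouse.db")
cursor = sync_table(source, warehouse, "orders", "updated_at", "2019-01-01")
```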
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing George Fraser about Fivetran, a hosted platform for replicating your data from source to destination
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the problem that Fivetran solves and the story of how it got started?
Integration of multiple data sources (e.g. entity resolution)
How is Fivetran architected and how has the overall system design changed since you first began working on it?
monitoring and alerting
Automated schema normalization. How does it work for customized data sources?
Managing schema drift while avoiding data loss
Change data capture
What have you found to be the most complex or challenging data sources to work with reliably?
Workflow for users getting started with Fivetran
When is Fivetran the wrong choice for collecting and analyzing your data?
What have you found to be the most challenging aspects of working in the space of data integrations?
What have been the most interesting/unexpected/useful lessons that you have learned while building and growing Fivetran?
What do you have planned for the future of Fivetran?
Contact Info
LinkedIn
@frasergeorgew on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Fivetran
Ralph Kimball
DBT (Data Build Tool)
Podcast Interview
Looker
Podcast Interview
Cron
Kubernetes
Postgres
Podcast Episode
Oracle DB
Salesforce
Netsuite
Marketo
Jira
Asana
Cloudwatch
Stackdriver
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 5, 2019 • 52min
Solving Data Discovery At Lyft
Summary
Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources proliferate, it becomes difficult to keep track of everything, particularly for analysts and data scientists who are not involved with the collection and management of that information. Lyft has built the Amundsen platform to address the problem of data discovery, and in this episode Tao Feng and Mark Grover explain how it works, why they built it, and how it has impacted the workflow of data professionals in their organization. If you are struggling to realize the value of your information because you don’t know what you have or where it is, then give this a listen and then try out Amundsen for yourself.
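As a rough illustration of the metadata graph at the core of Amundsen, the sketch below registers a table and its columns in Neo4j, which Amundsen uses as one of its backing stores. The node labels, key scheme, and example table are invented for illustration; the real project populates its graph through a dedicated databuilder library rather than hand-written Cypher.
```python
from neo4j import GraphDatabase  # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))


def register_table(schema, name, description, columns):
    """Record a table and its columns as nodes in the metadata graph."""
    with driver.session() as session:
        # MERGE keeps repeated crawls idempotent instead of duplicating nodes.
        session.run(
            "MERGE (t:Table {key: $key}) SET t.description = $description",
            key=f"{schema}.{name}", description=description,
        )
        for column in columns:
            session.run(
                "MATCH (t:Table {key: $key}) "
                "MERGE (c:Column {key: $ckey}) SET c.name = $name "
                "MERGE (t)-[:COLUMN]->(c)",
                key=f"{schema}.{name}", ckey=f"{schema}.{name}/{column}", name=column,
            )


register_table("core", "rides", "One row per completed ride", ["ride_id", "driver_id"])
```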
Announcements
Welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Finding the data that you need is tricky, and Amundsen will help you solve that problem. And as your data grows in volume and complexity, there are foundational principles that you can follow to keep data workflows streamlined. Mode – the advanced analytics platform that Lyft trusts – has compiled 3 reasons to rethink data discovery. Read them at dataengineeringpodcast.com/mode-lyft.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, the Open Data Science Conference, and Corinium Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Mark Grover and Tao Feng about Amundsen, the data discovery platform and metadata engine that powers self service data access at Lyft
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Amundsen is and the problems that it was designed to address?
What was lacking in the existing projects at the time that led you to building a new platform from the ground up?
How does Amundsen fit in the larger ecosystem of data tools?
How does it compare to what WeWork is building with Marquez?
Can you describe the overall architecture of Amundsen and how it has evolved since you began working on it?
What were the main assumptions that you had going into this project and how have they been challenged or updated in the process of building and using it?
What has been the impact of Amundsen on the workflows of data teams at Lyft?
Can you talk through an example workflow for someone using Amundsen?
Once a dataset has been located, how does Amundsen simplify the process of accessing that data for analysis or further processing?
How does the information in Amundsen get populated and what is the process for keeping it up to date?
What was your motivation for releasing it as open source and how much effort was involved in cleaning up the code for the public?
What are some of the capabilities that you have intentionally decided not to implement yet?
For someone who wants to run their own instance of Amundsen what is involved in getting it deployed and integrated?
What have you found to be the most challenging aspects of building, using and maintaining Amundsen?
What do you have planned for the future of Amundsen?
Contact Info
Tao
LinkedIn
feng-tao on GitHub
Mark
LinkedIn
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Amundsen
Data Council Presentation
Strata Presentation
Blog Post
Lyft
Airflow
Podcast.__init__ Episode
LinkedIn
Slack
Marquez
S3
Hive
Presto
Podcast Episode
Spark
PostgreSQL
Google BigQuery
Neo4J
Apache Atlas
Tableau
Superset
Alation
Cloudera Navigator
DynamoDB
MongoDB
Druid
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 29, 2019 • 54min
Simplifying Data Integration Through Eventual Connectivity
Summary
The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.
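A toy version of the eventual connectivity idea can be built with an in-memory graph: records from each silo become nodes, identifier-like fields become shared key nodes, and connections emerge wherever keys collide. In this sketch the networkx library stands in for a real graph database, and the CRM and ERP records are invented for illustration.
```python
import networkx as nx  # in-memory graph, standing in for a graph database

# Records from independent silos; no cross-system mapping is defined up front.
crm = [{"id": "crm-1", "email": "ada@example.com", "name": "Ada Lovelace"}]
erp = [{"id": "erp-7", "email": "ada@example.com", "account": "42"}]

graph = nx.Graph()
for silo, records in [("crm", crm), ("erp", erp)]:
    for record in records:
        node = f"{silo}:{record['id']}"
        graph.add_node(node, **record)
        # Strategic metadata: treat identifier-like fields as merge keys.
        graph.add_edge(node, f"email:{record['email']}")

# Connectivity emerges as data arrives: any silos sharing a key are now linked.
for entity in nx.connected_components(graph):
    print(entity)
```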
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative businesses, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL
Interview
Introduction
How did you get involved in the area of data management?
Can you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL?
What is eventual connectivity and how does it address the problems with ETL in the current data landscape?
In your white paper you mention the benefits of graph technology and how it solves the problem of data integration. Can you talk through an example use case?
How do different implementations of graph databases impact their viability for this use case?
Can you talk through the overall system architecture and data flow for an example implementation of eventual connectivity?
How much up-front modeling is necessary to make this a viable approach to data integration?
How do the volume and format of the source data impact the technology and architecture decisions that you would make?
What are the limitations or edge cases that you have found when using this pattern?
In modern ETL architectures there has been a lot of time and work put into workflow management systems for orchestrating data flows. Is there still a place for those tools when using the eventual connectivity pattern?
What resources do you recommend for someone who wants to learn more about this approach and start using it in their organization?
Contact Info
Email
LinkedIn
@jerrong on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Eventual Connectivity White Paper
CluedIn
Podcast Episode
Copenhagen
Ewok
Multivariate Testing
CRM
ERP
ETL
ELT
DAG
Graph Database
Apache NiFi
Podcast Episode
Apache Airflow
Podcast.__init__ Episode
BigQuery
RedShift
CosmosDB
SAP HANA
IOT == Internet of Things
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 22, 2019 • 1h 4min
Straining Your Data Lake Through A Data Mesh
Summary
The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information within a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
To grow your professional network and find opportunities with the startups that are changing the world, Angel List is the place to go. Go to dataengineeringpodcast.com/angel to sign up today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management
Interview
Introduction
How did you get involved in the area of data management?
Can you start by providing your definition of a "data lake" and discussing some of the problems and challenges that they pose?
What are some of the organizational and industry trends that tend to lead to this solution?
You have written a detailed post outlining the concept of a "data mesh" as an alternative to data lakes. Can you give a summary of what you mean by that phrase?
In a domain oriented data model, what are some useful methods for determining appropriate boundaries for the various data products?
What are some of the challenges that arise in this data mesh approach and how do they compare to those of a data lake?
One of the primary complications of any data platform, whether distributed or monolithic, is that of discoverability. How do you approach that in a data mesh scenario?
A corollary to the issue of discovery is that of access and governance. What are some strategies to making that scalable and maintainable across different data products within an organization?
Who is responsible for implementing and enforcing compliance regimes?
One of the intended benefits of data lakes is the idea that data integration becomes easier by having everything in one place. What has been your experience in that regard?
How do you approach the challenge of data integration in a domain oriented approach, particularly as it applies to aspects such as data freshness, semantic consistency, and schema evolution?
Has latency of data retrieval proven to be an issue in your work?
When it comes to the actual implementation of a data mesh, can you describe the technical and organizational approach that you recommend?
How do team structures and dynamics shift in this scenario?
What are the necessary skills for each team?
Who is responsible for the overall lifecycle of the data in each domain, including modeling considerations and application design for how the source data is generated and captured?
Is there a general scale of organization or problem domain where this approach would generate too much overhead and maintenance burden?
For an organization that has an existing monolithic architecture, how do you suggest they approach decomposing their data into separately managed domains?
Are there any other architectural considerations that data professionals should be considering that aren’t yet widespread?
Contact Info
LinkedIn
@zhamakd on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Thoughtworks
Technology Radar
Data Lake
Data Warehouse
James Dixon
Azure Data Lake
"Big Ball Of Mud" Anti-Pattern
ETL
ELT
Hadoop
Spark
Kafka
Event Sourcing
Airflow
Podcast.__init__ Episode
Data Engineering Episode
Data Catalog
Master Data Management
Podcast Episode
Polyseme
REST
CNCF (Cloud Native Computing Foundation)
Cloud Events Standard
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 15, 2019 • 58min
Data Labeling That You Can Feel Good About With CloudFactory
Summary
Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides a valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customers’ existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Integrating data across the enterprise has been around for decades – so have the techniques to do it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more about how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Mark Sears about CloudFactory, masters of the art and science of labeling data for Machine Learning and more
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what CloudFactory is and the story behind it?
What are some of the common requirements for feature extraction and data labelling that your customers contact you for?
What integration points do you provide to your customers and what is your strategy for ensuring broad compatibility with their existing tools and workflows?
Can you describe the workflow for a sample request from a customer, how that fans out to your cloud workers, and the interface or platform that they are working with to deliver the labelled data?
What protocols do you have in place to ensure data quality and identify potential sources of bias?
What role do humans play in the lifecycle for AI and ML projects?
I understand that you provide skills development and community building for your cloud workers. Can you talk through your relationship with those employees and how that relates to your business goals?
How do you manage and plan for elasticity in customer needs given the workforce requirements that you are dealing with?
Can you share some stories of cloud workers who have benefited from their experience working with your company?
What are some of the assumptions that you made early in the founding of your business which have been challenged or updated in the process of building and scaling CloudFactory?
What have been some of the most interesting/unexpected ways that you have seen customers using your platform?
What lessons have you learned in the process of building and growing CloudFactory that were most interesting/unexpected/useful?
What are your thoughts on the future of work as AI and other digital technologies continue to disrupt existing industries and jobs?
How does that tie into your plans for CloudFactory in the medium to long term?
Contact Info
@marktsears on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
CloudFactory
Reading, UK
Nepal
Kenya
Ruby on Rails
Kathmandu
Natural Language Processing (NLP)
Computer Vision
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 8, 2019 • 1h 11min
Scale Your Analytics On The Clickhouse Data Warehouse
Summary
The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.
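For a flavor of working with ClickHouse from Python, here is a minimal sketch using the community clickhouse-driver package. The schema and data are invented; MergeTree is ClickHouse’s workhorse table engine, and the ORDER BY clause defines the sort key that keeps data clustered for fast scans.
```python
from datetime import date

from clickhouse_driver import Client  # community ClickHouse client for Python

client = Client("localhost")

# MergeTree stores rows sorted by the ORDER BY key, which is what makes
# range scans and aggregations over the sort columns fast.
client.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        event_date Date,
        user_id UInt64,
        url String
    ) ENGINE = MergeTree()
    ORDER BY (event_date, user_id)
""")

client.execute(
    "INSERT INTO page_views (event_date, user_id, url) VALUES",
    [(date(2019, 7, 8), 1, "/home"), (date(2019, 7, 8), 2, "/pricing")],
)

print(client.execute("SELECT event_date, count() FROM page_views GROUP BY event_date"))
```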
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Integrating data across the enterprise has been around for decades – so have the techniques to do it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more about how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Robert Hodges and Alexander Zaitsev about Clickhouse, an open source, column-oriented database for fast and scalable OLAP queries
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Clickhouse is and how you each got involved with it?
What are the primary use cases that Clickhouse is targeting?
Where does it fit in the database market and how does it compare to other column stores, both open source and commercial?
Can you describe how Clickhouse is architected?
Can you talk through the lifecycle of a given record or set of records from when they first get inserted into Clickhouse, through the engine and storage layer, and then the lookup process at query time?
I noticed that Clickhouse has a feature for implementing data safeguards (deletion protection, etc.). Can you talk through how that factors into different use cases for Clickhouse?
Aside from directly inserting a record via the client APIs can you talk through the options for loading data into Clickhouse?
For the MySQL/Postgres replication functionality how do you maintain schema evolution from the source DB to Clickhouse?
What are some of the advanced capabilities, such as SQL extensions, supported data types, etc. that are unique to Clickhouse?
For someone getting started with Clickhouse can you describe how they should be thinking about data modeling?
Recent entrants to the data warehouse market are encouraging users to insert raw, unprocessed records and then do their transformations with the database engine, as opposed to using a data lake as the staging ground for transformations prior to loading into the warehouse. Where does Clickhouse fall along that spectrum?
How is scaling in Clickhouse implemented and what are the edge cases that users should be aware of?
How is data replication and consistency managed?
What is involved in deploying and maintaining an installation of Clickhouse?
I noticed that Altinity is providing a Kubernetes operator for Clickhouse. What are the opportunities and tradeoffs presented by that platform for Clickhouse?
What are some of the most interesting/unexpected/innovative ways that you have seen Clickhouse used?
What are some of the most challenging aspects of working on Clickhouse itself, and or implementing systems on top of it?
What are the shortcomings of Clickhouse and how do you address them at Altinity?
When is Clickhouse the wrong choice?
Contact Info
Robert
LinkedIn
hodgesrm on GitHub
Alexander
alex-zaitsev on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Clickhouse
Altinity
OLAP
M204
Sybase
MySQL
Vertica
Yandex
Yandex Metrica
Google Analytics
SQL
Greenplum
InfoBright
InfiniDB
MariaDB
Spark
SIMD (Single Instruction, Multiple Data)
Mergesort
ETL
Change Data Capture
MapReduce
KDB
OLTP
Cassandra
InfluxDB
Prometheus
SnowflakeDB
Hive
Hadoop
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 2, 2019 • 38min
Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection
Summary
Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service for benchmarking an application that has real-world utility.
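As a simplified sketch of the scoring side of such a system, the snippet below consumes events from Kafka with the kafka-python client and flags values that fall far outside a rolling window. The topic name, payload shape, and threshold are assumptions, and the design discussed in the episode persists per-key history to Cassandra rather than keeping a single in-process window.
```python
import json
import statistics
from collections import deque

from kafka import KafkaConsumer  # kafka-python client

WINDOW, THRESHOLD, MIN_SAMPLES = 100, 3.0, 10
history = deque(maxlen=WINDOW)  # rolling window of recent metric values

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw),
)

for message in consumer:
    value = message.value["metric"]
    if len(history) >= MIN_SAMPLES:
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        # Flag values more than THRESHOLD standard deviations from the mean.
        if stdev and abs(value - mean) > THRESHOLD * stdev:
            print(f"anomaly: {value} (window mean {mean:.2f})")
    history.append(value)
```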
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Integrating data across the enterprise has been around for decades – so have the techniques to do it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more about how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the problem that you were trying to solve and the requirements that you were aiming for?
What are some example cases where anomaly detection is useful or necessary?
Once you had established the requirements in terms of functionality and data volume, what was your approach for determining the target architecture?
What was your selection criteria for the various components of your system design?
What tools and technologies did you consider in your initial assessment and which did you ultimately converge on?
If you were to start over today would you do any of it differently?
Can you talk through the algorithm that you used for detecting anomalous activity?
What is the size/duration of the window within which you can effectively characterize trends and how do you collapse it down to a tractable search space?
What were you using as a data source, and if it was synthetic how did you handle introducing anomalies in a realistic fashion?
What were the main scalability bottlenecks that you encountered as you began ramping up the volume of data and the number of instances?
How did those bottlenecks differ as you moved through different levels of scale?
What were your assumptions going into this project and how accurate were they as you began testing and scaling the system that you built?
What were some of the most interesting or unexpected lessons that you learned in the process of building this anomaly detection system?
How have those lessons fed back to your work at Instaclustr?
Contact Info
LinkedIn
@paulbrebner_ on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Instaclustr
Kafka
Cassandra
Canberra, Australia
Spark
Anomaly Detection
Kubernetes
Prometheus
OpenTracing
Jaeger
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 25, 2019 • 1h 8min
The Workflow Engine For Data Engineers And Data Scientists
Summary
Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.
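A minimal Prefect flow, written against the functional API from around the time of this episode (the API has evolved since), shows how the engine infers dependencies from data passing between tasks rather than requiring an explicitly wired DAG. The task bodies are placeholders.
```python
from prefect import Flow, task


@task
def extract():
    return [1, 2, 3]


@task
def transform(values):
    return [v * 2 for v in values]


@task
def load(values):
    print(f"loaded {values}")


with Flow("etl") as flow:
    # Calling tasks builds the dependency graph; nothing executes yet.
    load(transform(extract()))

# Retries, state transitions, and failure handling (the "negative engineering")
# are managed by the engine at run time.
flow.run()
```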
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Jeremiah Lowin about Prefect, a workflow platform for data engineering
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Prefect is and your motivation for creating it?
What are the axes along which a workflow engine can differentiate itself, and which of those have you focused on for Prefect?
In some of your blog posts and your PyData presentation you discuss the concept of negative vs. positive engineering. Can you briefly outline what you mean by that and the ways that Prefect handles the negative cases for you?
How is Prefect itself implemented and what tools or systems have you relied on most heavily for inspiration?
How do you manage passing data between stages in a pipeline when they are running across distributed nodes?
What was your decision making process when deciding to use Dask as your supported execution engine?
For tasks that require specific resources or dependencies how do you approach the idea of task affinity?
Does Prefect support managing tasks that bridge network boundaries?
What are some of the features or capabilities of Prefect that are misunderstood or overlooked by users which you think should be exercised more often?
What are the limitations of the open source core as compared to the cloud offering that you are building?
What were your assumptions going into this project and how have they been challenged or updated as you dug deeper into the problem domain and received feedback from users?
What are some of the most interesting/innovative/unexpected ways that you have seen Prefect used?
When is Prefect the wrong choice?
In your experience working on Airflow and Prefect, what are some of the common challenges and anti-patterns that arise in data engineering projects?
What are some best practices and industry trends that you are most excited by?
What do you have planned for the future of the Prefect project and company?
Contact Info
LinkedIn
@jlowin on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Prefect
Airflow
Dask
Podcast Episode
Prefect Blog
PyData Presentation
Tensorflow
Workflow Engine
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 17, 2019 • 51min
Maintaining Your Data Lake At Scale With Spark
Summary
Building and maintaining a data lake is a choose-your-own-adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allows for generating significant value, but it can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at Databricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.
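A minimal PySpark sketch of writing and reading a Delta table follows. It assumes a Spark session launched with the delta-core package on the classpath (for example, pyspark --packages io.delta:delta-core_2.11:0.2.0, where the version is illustrative); the data and path are invented.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Every write goes through Delta Lake's transaction log, which is what
# provides ACID guarantees and lets concurrent readers see consistent data.
df.write.format("delta").save("/tmp/events")

spark.read.format("delta").load("/tmp/events").show()
```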
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Delta Lake is and the motivation for creating it?
What are some of the common antipatterns in data lake implementations and how does Delta Lake address them?
What are the benefits of a data lake over a data warehouse?
How has that equation changed in recent years with the availability of modern cloud data warehouses?
How is Delta lake implemented and how has the design evolved since you first began working on it?
What assumptions did you have going into the project and how have they been challenged as it has gained users?
One of the compelling features is the option for enforcing data quality constraints. Can you talk through how those are defined and tested?
In your experience, how do you manage schema evolution when working with large volumes of data? (e.g. rewriting all of the old files, or just eliding the missing columns/populating default values, etc.)
Can you talk through how Delta Lake manages transactionality and data ownership? (e.g. what if you have other services interacting with the data store)
Are there limits in terms of the volume of data that can be managed within a single transaction?
How does unifying the interface for Spark to interact with batch and streaming data sets simplify the workflow for an end user?
The Lambda architecture was popular in the early days of Hadoop but seems to have fallen out of favor. How does this unified interface resolve the shortcomings and complexities of that approach?
What have been the most difficult/complex/challenging aspects of building Delta Lake?
How is the data versioning in Delta Lake implemented?
Keeping a copy of every iteration of a data set creates the potential for a great deal of additional cost. What are some options for mitigating that impact, either in Delta Lake itself or as a separate mechanism or process?
What are the reasons for standardizing on Parquet as the storage format?
What are some of the cases where that has led to greater complications?
In addition to the transactionality and data validation that Delta Lake provides, can you also explain how indexing is implemented and highlight the challenges of keeping them up to date?
When is Delta Lake the wrong choice?
What problems did you consciously decide not to address?
What is in store for the future of Delta Lake?
Contact Info
LinkedIn
@michaelarmbrust on Twitter
marmbrus on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Delta Lake
Databricks
Spark SQL
Microsoft SQL Server
Databricks Delta
Spark Summit
Apache Spark
Enterprise Data Curation Episode
Data Lake
Data Warehouse
SnowflakeDB
BigQuery
Parquet
Data Serialization Episode
Hive Metastore
Great Expectations
Podcast.__init__ Interview
Optimistic Concurrency/Optimistic Locking
Presto
Starburst Labs
Podcast Interview
Apache NiFi
Podcast Interview
Tensorflow
Tableau
Change Data Capture
Apache Pulsar
Podcast Interview
Pravega
Podcast Interview
Multi-Version Concurrency Control
MLFlow
Avro
ORC
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 10, 2019 • 1h 3min
Managing The Machine Learning Lifecycle
Summary
Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects and the challenges that they bring. He also describes the Hydrosphere platform, and how the different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation to get a better understanding of the unique difficulties that exist for machine learning projects.
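As a simplified stand-in for the kind of data-health check discussed in this episode, the sketch below flags feature drift with a two-sample Kolmogorov-Smirnov test from SciPy. Hydrosphere’s own monitoring is considerably more sophisticated; the samples and significance level here are invented.
```python
from scipy.stats import ks_2samp  # two-sample Kolmogorov-Smirnov test


def drifted(training_sample, production_sample, alpha=0.01):
    """Return True when the production distribution diverges from training."""
    statistic, p_value = ks_2samp(training_sample, production_sample)
    return p_value < alpha


training = [0.1, 0.2, 0.3, 0.25, 0.15] * 20
production = [0.8, 0.9, 0.85, 0.95, 0.7] * 20

if drifted(training, production):
    print("feature drift detected: consider retraining the model")
```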
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for Data Science and Machine Learning Management automation
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Hydrosphere is and share its origin story?
In your experience, what are the most challenging or complicated aspects of managing machine learning models in a production context?
How does it differ from deployment and maintenance of a regular software application?
Can you describe how Hydrosphere is architected and how the different components of the stack fit together?
For someone who is using Hydrosphere in their production workflow, what would that look like?
What is the difference in interaction with Hydrosphere for different roles within a data team?
What are some of the types of metrics that you monitor to determine when and how to retrain deployed models?
Which metrics do you track for testing and verifying the health of the data?
What are the factors that contribute to model degradation in production and how do you incorporate contextual feedback into the training cycle to counteract them?
How has the landscape and sophistication for real world usability of machine learning changed since you first began working on Hydrosphere?
How has that influenced the design and direction of Hydrosphere, both as a project and a business?
How has the design of Hydrosphere evolved since you first began working on it?
What assumptions did you have when you began working on Hydrosphere and how have they been challenged or modified through growing the platform?
What have been some of the most challenging or complex aspects of building and maintaining Hydrosphere?
What do you have in store for the future of Hydrosphere?
Contact Info
LinkedIn
spushkarev on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Hydrosphere
GitHub
Data Engineering Podcast at ODSC
KD Nuggets
Big Data Science: Expectation vs. Reality
The Open Data Science Conference
Scala
InfluxDB
RocksDB
Docker
Kubernetes
Akka
Python Pickle
Protocol Buffers
Kubeflow
MLFlow
TensorFlow Extended
Kubeflow Pipelines
Argo
Airflow
Podcast.__init__ Interview
Envoy
Istio
DVC
Podcast.__init__ Interview
Generative Adversarial Networks
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast


