

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

Sep 18, 2019 • 58min
Navigating Boundless Data Streams With The Swim Kernel
Summary
The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately, this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye-opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch-oriented workflows.
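The stateful, per-entity model the episode describes can be contrasted with batch workflows in a short sketch. The following Python snippet is purely illustrative and is not the SwimOS API; it shows the general idea of a "digital twin" agent that keeps running state for each edge device and updates its insight on every event, rather than accumulating raw data for a later batch job. The class and names are hypothetical.

```python
from collections import defaultdict

class IntersectionTwin:
    """Illustrative stateful agent: one instance per traffic intersection."""
    def __init__(self):
        self.count = 0
        self.mean_wait = 0.0

    def on_event(self, wait_seconds):
        # Update a running mean in O(1) per event -- no raw-event storage,
        # so the insight is available the moment the event arrives.
        self.count += 1
        self.mean_wait += (wait_seconds - self.mean_wait) / self.count

# One twin per device, created lazily as events arrive from the edge.
twins = defaultdict(IntersectionTwin)

def ingest(device_id, wait_seconds):
    twins[device_id].on_event(wait_seconds)
    return twins[device_id].mean_wait

ingest("intersection-12", 30.0)
print(ingest("intersection-12", 10.0))  # running mean: 20.0
```

The contrast with a batch pipeline is that nothing here is persisted for a later analysis pass; each event mutates a small piece of in-memory state and the answer is always current.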
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and rewrite slow queries, and it gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign-up to get a $50 discount!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Swim.ai is and how the project and business got started?
Can you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?
What are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable?
How does Swim help alleviate the challenges of working with sensor oriented applications or edge computing platforms?
Can you describe a typical design for an application or system being built on top of the Swim platform?
What does the developer workflow look like?
What kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?
Can you describe the internal design for the SwimOS and how it has evolved since you first began working on it?
For such widely distributed applications, efficient discovery and communication is essential. How does Swim handle that functionality?
What mechanisms are in place to account for network failures?
Since the application nodes are explicitly stateful, how do you handle scaling as compared to a stateless web application?
Since there is no explicit data layer, how is data redundancy handled by Swim applications?
What are some of the most interesting/unexpected/innovative ways that you have seen the Swim technology used?
What have you found to be the most challenging aspects of building the Swim platform?
What are some of the assumptions that you had going into the creation of SwimOS and how have they been challenged or updated?
What do you have planned for the future of the technical and business aspects of Swim.ai?
Contact Info
LinkedIn
Wikipedia
@simoncrosby on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Swim.ai
Hadoop
Streaming Data
Apache Flink
Podcast Episode
Apache Kafka
Wallaroo
Podcast Episode
Digital Twin
Swim Concepts Documentation
RFID == Radio Frequency IDentification
PCB == Printed Circuit Board
Graal VM
Azure IoT Edge Framework
Azure DLS (Data Lake Storage)
Power BI
WARP Protocol
LightBend
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 10, 2019 • 55min
Building A Reliable And Performant Router For Observability Data
Summary
The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.
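The routing model the summary describes can be sketched in a few lines. This is not Vector's actual implementation (Vector is written in Rust and configured with TOML); it is a hypothetical Python illustration of the source → transform → sink topology that such a router implements, handling both structured and unstructured log lines in one pipeline.

```python
import json

# Illustrative source -> transform -> sink topology (not Vector's API):
# a source yields raw events, transforms reshape them, a sink delivers them.

def file_source(lines):
    # Source: emit one event dict per raw log line.
    for line in lines:
        yield {"message": line.rstrip("\n")}

def json_parser(events):
    # Transform: merge parsed JSON payloads into the event; pass malformed
    # lines through untouched (a real router might instead divert them to
    # a dead letter queue, as discussed in the interview questions below).
    for event in events:
        try:
            yield {**event, **json.loads(event["message"])}
        except json.JSONDecodeError:
            yield event

def console_sink(events):
    # Sink: collect each processed event for delivery downstream.
    return list(events)

raw = ['{"level": "error", "msg": "disk full"}', "not json"]
out = console_sink(json_parser(file_source(raw)))
print(out[0]["level"])  # error
```

Because every stage consumes and produces the same event shape, logs and metrics can share one pipeline, which is the complexity reduction the episode highlights.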
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I’m interviewing Ben Johnson and Luke Steensen about Vector, a high-performance, open-source observability data router
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what the Vector project is and your reason for creating it?
What are some of the comparable tools that are available and what were they lacking that prompted you to start a new project?
What strategy are you using for project governance and sustainability?
What are the main use cases that Vector enables?
Can you explain how Vector is implemented and how the system design has evolved since you began working on it?
How did your experience building the business and products for Timber influence and inform your work on Vector?
When you were planning the implementation, what were your criteria for the runtime implementation and why did you decide to use Rust?
What led you to choose Lua as the embedded scripting environment?
What data format does Vector use internally?
Is there any support for defining and enforcing schemas?
In the event of a malformed message is there any capacity for a dead letter queue?
What are some strategies for formatting source data to improve the effectiveness of the information that is gathered and the ability of Vector to parse it into useful data?
When designing an event flow in Vector what are the available mechanisms for testing the overall delivery and any transformations?
What options are available to operators to support visibility into the running system?
In terms of deployment topologies, what capabilities does Vector have to support high availability and/or data redundancy?
What are some of the other considerations that operators and administrators of Vector should be considering?
You have a fairly well defined roadmap for the different point versions of Vector. How did you determine what the priority ordering was and how quickly are you progressing on your roadmap?
What is the available interface for adding and extending the capabilities of Vector? (source/transform/sink)
What are some of the most interesting/innovative/unexpected ways that you have seen Vector used?
What are some of the challenges that you have faced in building/publicizing Vector?
For someone who is interested in using Vector, how would you characterize the overall maturity of the project currently?
What is missing that you would consider necessary for production readiness?
When is Vector the wrong choice?
Contact Info
Ben
@binarylogic on Twitter
binarylogic on GitHub
Luke
LinkedIn
@lukesteensen on Twitter
lukesteensen on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Vector
GitHub
Timber.io
Observability
SeatGeek
Apache Kafka
StatsD
FluentD
Splunk
Filebeat
Logstash
Fluent Bit
Rust
Tokio Rust library
TOML
Lua
Nginx
HAProxy
Web Assembly (WASM)
Protocol Buffers
Jepsen
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 2, 2019 • 53min
Building A Community For Data Professionals at Data Council
Summary
Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running the Data Council series of conferences and meetups around the world. In this episode Pete discusses his motivation for starting these events, how they serve to bring the data community together, and the observations that he has made about the direction that we are moving. He also shares his experiences as an investor in developer oriented startups and his views on the importance of empowering engineers to launch their own companies.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I’m interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies
Interview
Introduction
How did you get involved in the area of data management?
What was your original reason for focusing your efforts on fostering a community of data engineers?
What was the state of recognition in the industry for that role at the time that you began your efforts?
The current manifestation of your community efforts is in the form of the Data Council conferences and meetups. Previously they were known as Data Eng Conf and before that was Hakka Labs. Can you discuss the evolution of your efforts to grow this community?
How has the community itself changed and grown over the past few years?
Communities form around a huge variety of focal points. What are some of the complexities or challenges in building one based on something as nebulous as data?
Where do you draw inspiration and direction for how to manage such a large and distributed community?
What are some of the most interesting/challenging/unexpected aspects of community management that you have encountered?
What are some ways that you have been surprised or delighted in your interactions with the data community?
How do you approach sustainability of the Data Council community and the organization itself?
The tagline that you have focused on for Data Council events is that they are no fluff, juxtaposing them against larger business oriented events. What are your guidelines for fulfilling that promise and why do you think that is an important distinction?
In addition to your community building you are also an investor. How did you get involved in that side of your business and how does it fit into your overall mission?
You also have a stated mission to help engineers build their own companies. In your opinion, how does an engineer led business differ from one that may be founded or run by a business oriented individual and why do you think that we need more of them?
What are the ways that you typically work to empower engineering founders or encourage them to create their own businesses?
What are some of the challenges that engineering founders face and what are some common difficulties or misunderstandings related to business?
What are your opinions on venture-backed vs. "lifestyle" or bootstrapped businesses?
What are the characteristics of a data business that you look at when evaluating a potential investment?
What are some of the current industry trends that you are most excited by?
What are some that you find concerning?
What are your goals and plans for the future of Data Council?
Contact Info
@petesoder on Twitter
LinkedIn
@petesoder on Medium
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Data Council
Database Design For Mere Mortals
Bloomberg
Garmin
500 Startups
Geeks On A Plane
Data Council NYC 2019 Track Summary
Pete’s Angel List Syndicate
DataOps
Data Kitchen Episode
DataOps Vs DevOps Episode
Great Expectations
Podcast.__init__ Interview
Elementl
Dagster
Data Council Presentation
Data Council Call For Proposals
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 26, 2019 • 48min
Building Tools And Platforms For Data Analytics
Summary
Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has its own set of requirements for the way that they access and interact with those platforms, depending on the insights they are trying to gather. Benn Stancil is the chief analyst at Mode Analytics, and in this episode he explains the set of considerations and requirements that data analysts need in their tools. He also explains useful patterns for collaboration between data engineers and data analysts, and what they can learn from each other.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I’m interviewing Benn Stancil, chief analyst at Mode Analytics, about what data engineers need to know when building tools for analysts
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing some of the main features that you are looking for in the tools that you use?
What are some of the common shortcomings that you have found in out-of-the-box tools that organizations use to build their data stack?
What should data engineers be considering as they design and implement the foundational data platforms that higher order systems are built on, which are ultimately used by analysts and data scientists?
In terms of mindset, what are the ways that data engineers and analysts can align and where are the points of conflict?
In terms of team and organizational structure, what have you found to be useful patterns for reducing friction in the product lifecycle for data tools (internal or external)?
What are some anti-patterns that data engineers can guard against as they are designing their pipelines?
In your experience as an analyst, what have been the characteristics of the most seamless projects that you have been involved with?
How much understanding of analytics is necessary for data engineers to be successful in their projects and careers?
Conversely, how much understanding of data management should analysts have?
What are the industry trends that you are most excited by as an analyst?
Contact Info
LinkedIn
@bennstancil on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Mode Analytics
Data Council Presentation
Yammer
StitchFix Blog Post
SnowflakeDB
Re:Dash
Superset
Marquez
Amundsen
Podcast Episode
Elementl
Dagster
Data Council Presentation
DBT
Podcast Episode
Great Expectations
Podcast.__init__ Episode
Delta Lake
Podcast Episode
Stitch
Fivetran
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 19, 2019 • 1h 14min
A High Performance Platform For The Full Big Data Lifecycle
Summary
Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise-grade analytics, it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well-engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?
What was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?
Can you describe the high level architecture of the HPCC Systems platform and some of the ways that the design has changed over the years that it has been maintained?
Given how long the project has been in use, can you talk about some of the ways that it has had to evolve to accommodate changing trends in usage and technologies for big data and advanced analytics?
For someone who is using HPCC Systems, can you talk through a common workflow and the ways that the data traverses the various components?
How does HPCC Systems manage persistence and scalability?
What are the integration points available for extending and enhancing the HPCC Systems platform?
What is involved in deploying and managing a production installation of HPCC Systems?
The ECL language is an intriguing element of the overall system. What are some of the features that it provides which simplify processing and management of data?
How does the Thor engine manage data transformation and manipulation?
What are some of the unique features of Thor and how does it compare to other approaches for ETL and data integration?
For extraction and analysis of data can you talk through the capabilities of the Roxie engine?
How are you using the HPCC Systems platform in your work at LexisNexis?
Despite being older than the Hadoop platform, it doesn’t seem that HPCC Systems has seen the same level of growth and popularity. Can you share your perspective on the community for HPCC Systems and how it compares to that of Hadoop over the past decade?
How is the HPCC Systems project governed, and what is your approach to sustainability?
What are some of the additional capabilities that are only available in the enterprise distribution?
When is the HPCC Systems platform the wrong choice, and what are some systems that you might use instead?
What have been some of the most interesting/unexpected/novel ways that you have seen HPCC Systems used?
What are some of the challenges that you have faced and lessons that you have learned while building and maintaining the HPCC Systems platform and community?
What do you have planned for the future of HPCC Systems?
Contact Info
LinkedIn
@fvillanustre on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
HPCC Systems
LexisNexis Risk Solutions
Risk Management
Hadoop
MapReduce
Sybase
Oracle DB
AbInitio
Data Lake
SQL
ECL
DataFlow
TensorFlow
ECL IDE
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 12, 2019 • 45min
Digging Into Data Replication At Fivetran
Summary
The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.
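The extract-and-load pattern the summary describes usually hinges on incremental replication: only copying rows that changed since the last sync. The sketch below is not Fivetran's actual implementation, and the table and column names are hypothetical; it just illustrates the cursor-based loop that makes repeated syncs cheap and idempotent.

```python
# Illustrative incremental extract-and-load loop (not Fivetran's internals):
# track a high-water-mark cursor per table so each sync only copies rows
# modified since the previous run, and upsert by primary key so re-running
# a sync is safe.

def sync_table(source_rows, destination, state, table="orders"):
    cursor = state.get(table, 0)  # high-water mark from the last sync
    new_rows = [r for r in source_rows if r["updated_at"] > cursor]
    for row in new_rows:
        destination.setdefault(table, {})[row["id"]] = row  # upsert by key
    if new_rows:
        state[table] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)

source = [
    {"id": 1, "updated_at": 100, "total": 5},
    {"id": 2, "updated_at": 200, "total": 7},
]
dest, state = {}, {}
print(sync_table(source, dest, state))  # first sync copies 2 rows
print(sync_table(source, dest, state))  # second sync copies 0 rows
```

The hard part in practice, as the episode discusses, is not this loop but the per-source edge cases: sources without a reliable updated-at column, deletions, and schema changes in the upstream system.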
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing George Fraser about Fivetran, a hosted platform for replicating your data from source to destination
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the problem that Fivetran solves and the story of how it got started?
Integration of multiple data sources (e.g. entity resolution)
How is Fivetran architected and how has the overall system design changed since you first began working on it?
monitoring and alerting
Automated schema normalization. How does it work for customized data sources?
Managing schema drift while avoiding data loss
Change data capture
What have you found to be the most complex or challenging data sources to work with reliably?
Workflow for users getting started with Fivetran
When is Fivetran the wrong choice for collecting and analyzing your data?
What have you found to be the most challenging aspects of working in the space of data integrations?
What have been the most interesting/unexpected/useful lessons that you have learned while building and growing Fivetran?
What do you have planned for the future of Fivetran?
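The schema drift question above is easier to picture with a toy example: when a record arrives carrying a field the destination has never seen, an additive-only policy widens the destination schema rather than dropping the value. This is a sketch of the general idea only, not Fivetran’s code:

```python
def apply_schema_drift(dest_schema, record):
    """Additive-only schema evolution: any field present in the
    incoming record but missing from the destination schema is
    added with a conservative default type, never dropped."""
    added = [f for f in record if f not in dest_schema]
    for f in added:
        dest_schema[f] = "TEXT"  # conservative default type
    return added

def load_record(dest_rows, dest_schema, record):
    """Widen the schema if needed, then load the record; columns a
    record does not supply load as None (NULL)."""
    apply_schema_drift(dest_schema, record)
    dest_rows.append({col: record.get(col) for col in dest_schema})
```

The key property is that drift never loses data: old rows simply read as NULL in the new column, and new rows carry the new field.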
Contact Info
LinkedIn
@frasergeorgew on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Fivetran
Ralph Kimball
DBT (Data Build Tool)
Podcast Interview
Looker
Podcast Interview
Cron
Kubernetes
Postgres
Podcast Episode
Oracle DB
Salesforce
Netsuite
Marketo
Jira
Asana
Cloudwatch
Stackdriver
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 5, 2019 • 52min
Solving Data Discovery At Lyft
Summary
Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources proliferate it becomes difficult to keep track of everything, particularly for analysts and data scientists who are not involved with the collection and management of that information. Lyft has built the Amundsen platform to address the problem of data discovery, and in this episode Tao Feng and Mark Grover explain how it works, why they built it, and how it has impacted the workflow of data professionals in their organization. If you are struggling to realize the value of your information because you don’t know what you have or where it is, then give this a listen and then try out Amundsen for yourself.
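At its core, the discovery problem Amundsen tackles comes down to indexing metadata about datasets and making it searchable. A toy in-memory version of that idea (this is not Amundsen’s API; every name below is invented for illustration) might look like:

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """Minimal metadata record for one dataset."""
    name: str
    description: str
    owner: str
    tags: list = field(default_factory=list)

class Catalog:
    """Toy metadata index: register tables, then search by keyword
    across names, descriptions, and tags."""
    def __init__(self):
        self._tables = {}

    def register(self, meta):
        self._tables[meta.name] = meta

    def search(self, term):
        term = term.lower()
        return [
            m for m in self._tables.values()
            if term in m.name.lower()
            or term in m.description.lower()
            or any(term in t.lower() for t in m.tags)
        ]
```

A real system adds a proper search engine, a graph of lineage and ownership, and automated ingestion of this metadata from the warehouses themselves, but the register-then-search loop is the essence of discovery.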
Announcements
Welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Finding the data that you need is tricky, and Amundsen will help you solve that problem. And as your data grows in volume and complexity, there are foundational principles that you can follow to keep data workflows streamlined. Mode – the advanced analytics platform that Lyft trusts – has compiled 3 reasons to rethink data discovery. Read them at dataengineeringpodcast.com/mode-lyft.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, the Open Data Science Conference, and Corinium Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Mark Grover and Tao Feng about Amundsen, the data discovery platform and metadata engine that powers self service data access at Lyft
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Amundsen is and the problems that it was designed to address?
What was lacking in the existing projects at the time that led you to building a new platform from the ground up?
How does Amundsen fit in the larger ecosystem of data tools?
How does it compare to what WeWork is building with Marquez?
Can you describe the overall architecture of Amundsen and how it has evolved since you began working on it?
What were the main assumptions that you had going into this project and how have they been challenged or updated in the process of building and using it?
What has been the impact of Amundsen on the workflows of data teams at Lyft?
Can you talk through an example workflow for someone using Amundsen?
Once a dataset has been located, how does Amundsen simplify the process of accessing that data for analysis or further processing?
How does the information in Amundsen get populated and what is the process for keeping it up to date?
What was your motivation for releasing it as open source and how much effort was involved in cleaning up the code for the public?
What are some of the capabilities that you have intentionally decided not to implement yet?
For someone who wants to run their own instance of Amundsen what is involved in getting it deployed and integrated?
What have you found to be the most challenging aspects of building, using and maintaining Amundsen?
What do you have planned for the future of Amundsen?
Contact Info
Tao
LinkedIn
feng-tao on GitHub
Mark
LinkedIn
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Amundsen
Data Council Presentation
Strata Presentation
Blog Post
Lyft
Airflow
Podcast.__init__ Episode
LinkedIn
Slack
Marquez
S3
Hive
Presto
Podcast Episode
Spark
PostgreSQL
Google BigQuery
Neo4J
Apache Atlas
Tableau
Superset
Alation
Cloudera Navigator
DynamoDB
MongoDB
Druid
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 29, 2019 • 54min
Simplifying Data Integration Through Eventual Connectivity
Summary
The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.
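The core mechanic of eventual connectivity can be sketched even without a full graph database: treat every identifier found on a record as a node, link the identifiers that co-occur on one record, and let entities emerge as connected components whenever a later record happens to bridge two silos. The union-find toy below illustrates that idea (the key formats are invented for this example, and a real implementation such as CluedIn’s is far richer):

```python
class EventualConnectivity:
    """Union-find over shared identifiers: records from different
    silos merge into one entity as soon as any key overlaps, with
    no upfront source-to-source mapping."""
    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps lookups near-constant time.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def ingest(self, record_keys):
        """record_keys: identifiers found on one record, e.g.
        ("email:ann@x.com", "crm:42"). Link them together."""
        keys = list(record_keys)
        for k in keys:
            self._find(k)  # ensure every key is registered
        for k in keys[1:]:
            self.parent[self._find(k)] = self._find(keys[0])

    def same_entity(self, a, b):
        return self._find(a) == self._find(b)
```

Note that the CRM and ERP records never reference each other directly; the shared email address that arrives later is what connects them, which is exactly the "eventual" part of the pattern.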
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative businesses, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL
Interview
Introduction
How did you get involved in the area of data management?
Can you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL?
What is eventual connectivity and how does it address the problems with ETL in the current data landscape?
In your white paper you mention the benefits of graph technology and how it solves the problem of data integration. Can you talk through an example use case?
How do different implementations of graph databases impact their viability for this use case?
Can you talk through the overall system architecture and data flow for an example implementation of eventual connectivity?
How much up-front modeling is necessary to make this a viable approach to data integration?
How do the volume and format of the source data impact the technology and architecture decisions that you would make?
What are the limitations or edge cases that you have found when using this pattern?
In modern ETL architectures there has been a lot of time and work put into workflow management systems for orchestrating data flows. Is there still a place for those tools when using the eventual connectivity pattern?
What resources do you recommend for someone who wants to learn more about this approach and start using it in their organization?
Contact Info
Email
LinkedIn
@jerrong on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Eventual Connectivity White Paper
CluedIn
Podcast Episode
Copenhagen
Ewok
Multivariate Testing
CRM
ERP
ETL
ELT
DAG
Graph Database
Apache NiFi
Podcast Episode
Apache Airflow
Podcast.__init__ Episode
BigQuery
RedShift
CosmosDB
SAP HANA
IoT == Internet of Things
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 22, 2019 • 1h 4min
Straining Your Data Lake Through A Data Mesh
Summary
The current trend in data management is to centralize the responsibility for storing and curating the organization’s information within a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to grow your professional network and find opportunities with the startups that are changing the world, Angel List is the place to go. Go to dataengineeringpodcast.com/angel to sign up today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management
Interview
Introduction
How did you get involved in the area of data management?
Can you start by providing your definition of a "data lake" and discussing some of the problems and challenges that they pose?
What are some of the organizational and industry trends that tend to lead to this solution?
You have written a detailed post outlining the concept of a "data mesh" as an alternative to data lakes. Can you give a summary of what you mean by that phrase?
In a domain oriented data model, what are some useful methods for determining appropriate boundaries for the various data products?
What are some of the challenges that arise in this data mesh approach and how do they compare to those of a data lake?
One of the primary complications of any data platform, whether distributed or monolithic, is that of discoverability. How do you approach that in a data mesh scenario?
A corollary to the issue of discovery is that of access and governance. What are some strategies to making that scalable and maintainable across different data products within an organization?
Who is responsible for implementing and enforcing compliance regimes?
One of the intended benefits of data lakes is the idea that data integration becomes easier by having everything in one place. What has been your experience in that regard?
How do you approach the challenge of data integration in a domain oriented approach, particularly as it applies to aspects such as data freshness, semantic consistency, and schema evolution?
Has latency of data retrieval proven to be an issue in your work?
When it comes to the actual implementation of a data mesh, can you describe the technical and organizational approach that you recommend?
How do team structures and dynamics shift in this scenario?
What are the necessary skills for each team?
Who is responsible for the overall lifecycle of the data in each domain, including modeling considerations and application design for how the source data is generated and captured?
Is there a general scale of organization or problem domain where this approach would generate too much overhead and maintenance burden?
For an organization that has an existing monolithic architecture, how do you suggest they approach decomposing their data into separately managed domains?
Are there any other architectural considerations that data professionals should be considering that aren’t yet widespread?
Contact Info
LinkedIn
@zhamakd on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Thoughtworks
Technology Radar
Data Lake
Data Warehouse
James Dixon
Azure Data Lake
"Big Ball Of Mud" Anti-Pattern
ETL
ELT
Hadoop
Spark
Kafka
Event Sourcing
Airflow
Podcast.__init__ Episode
Data Engineering Episode
Data Catalog
Master Data Management
Podcast Episode
Polyseme
REST
CNCF (Cloud Native Computing Foundation)
Cloud Events Standard
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 15, 2019 • 58min
Data Labeling That You Can Feel Good About With CloudFactory
Summary
Successful machine learning and artificial intelligence projects require large volumes of properly labeled data. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides a valuable service to businesses and meaningful work to people in developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customers’ existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.
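One common quality-control idea in scalable labeling, sending each item to several workers and only trusting labels they agree on, can be sketched as a simple majority vote. This illustrates the general technique, not CloudFactory’s actual pipeline:

```python
from collections import Counter

def consensus_label(annotations, min_agreement=2):
    """Majority vote over the labels several workers assigned to one
    item; returns (label, agreed) so that items whose winning label
    lacks enough votes can be routed back for human review."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes >= min_agreement
```

Items that fail the agreement threshold are exactly the ones most likely to carry ambiguity or annotator bias, so escalating them to a reviewer is where the human-in-the-loop earns its keep.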
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Integrating data across the enterprise has been around for decades – so have the techniques to do it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Mark Sears about CloudFactory, masters of the art and science of labeling data for Machine Learning and more
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what CloudFactory is and the story behind it?
What are some of the common requirements for feature extraction and data labelling that your customers contact you for?
What integration points do you provide to your customers and what is your strategy for ensuring broad compatibility with their existing tools and workflows?
Can you describe the workflow for a sample request from a customer, how that fans out to your cloud workers, and the interface or platform that they are working with to deliver the labelled data?
What protocols do you have in place to ensure data quality and identify potential sources of bias?
What role do humans play in the lifecycle for AI and ML projects?
I understand that you provide skills development and community building for your cloud workers. Can you talk through your relationship with those employees and how that relates to your business goals?
How do you manage and plan for elasticity in customer needs given the workforce requirements that you are dealing with?
Can you share some stories of cloud workers who have benefited from their experience working with your company?
What are some of the assumptions that you made early in the founding of your business which have been challenged or updated in the process of building and scaling CloudFactory?
What have been some of the most interesting/unexpected ways that you have seen customers using your platform?
What lessons have you learned in the process of building and growing CloudFactory that were most interesting/unexpected/useful?
What are your thoughts on the future of work as AI and other digital technologies continue to disrupt existing industries and jobs?
How does that tie into your plans for CloudFactory in the medium to long term?
Contact Info
@marktsears on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
CloudFactory
Reading, UK
Nepal
Kenya
Ruby on Rails
Kathmandu
Natural Language Processing (NLP)
Computer Vision
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast