

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

Jul 28, 2020 • 50min
Build More Reliable Distributed Systems By Breaking Them With Jepsen
Summary
A majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the Jepsen framework for testing the guarantees of distributed data processing systems and identifying when and why they break. In this episode he shares his approach to testing complex systems, the common challenges that are faced by engineers who build them, and why it is important to understand their limitations. This was a great look at some of the underlying principles that power your mission critical workloads.
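Jepsen itself is written in Clojure and relies on purpose-built checkers such as Knossos and Elle, but the core idea behind its analysis can be sketched in plain Python: record a concurrent history of operations against a shared register, then search for a linearization that respects both real-time order and register semantics. The toy checker below is illustrative only (brute force, single register) and is not Jepsen's actual algorithm:

```python
from itertools import permutations

def linearizable(history, initial=None):
    """Brute-force linearizability check for a single register.

    Each entry in `history` is (start, end, kind, value), where kind is
    'w' (write) or 'r' (read). Only feasible for tiny histories: the
    search is O(n!), which is why real checkers use smarter algorithms.
    """
    n = len(history)
    for order in permutations(range(n)):
        # Real-time constraint: if op b finished before op a started,
        # b may not appear after a in the linearization.
        ok = all(
            not (history[order[j]][1] < history[order[i]][0])
            for i in range(n) for j in range(i + 1, n)
        )
        if not ok:
            continue
        # Semantic constraint: each read must see the most recent write.
        state = initial
        valid = True
        for idx in order:
            _, _, kind, value = history[idx]
            if kind == 'w':
                state = value
            elif state != value:
                valid = False
                break
        if valid:
            return True
    return False

# A read that overlaps an in-flight write may legally see the new value:
print(linearizable([(0, 10, 'w', 1), (5, 15, 'r', 1)]))  # True
# A stale read after two non-overlapping writes is a consistency bug:
print(linearizable([(0, 1, 'w', 1), (2, 3, 'w', 2), (5, 6, 'r', 1)]))  # False
```

Real test runs produce histories with thousands of concurrent operations, which is exactly why the naive factorial search above has to be replaced with the specialized algorithms that Jepsen's checkers implement.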
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Kyle Kingsbury about his work on the Jepsen testing framework and the failure modes of distributed systems
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the Jepsen project is?
What was your inspiration for starting the project?
What other methods are available for evaluating and stress testing distributed systems?
What are some of the common misconceptions or misunderstandings about distributed systems guarantees, and how do they impact real world usage of things like databases?
How do you approach the design of a test suite for a new distributed system?
What is your heuristic for determining the completeness of your test suite?
What are some of the common challenges of setting up a representative deployment for testing?
Can you walk through the workflow of setting up, running, and evaluating the output of a Jepsen test?
How is Jepsen implemented?
How has the design evolved since you first began working on it?
What are the pros and cons of using Clojure for building Jepsen?
If you were to start over today on the Jepsen framework what would you do differently?
What are some of the most common failure modes that you have identified in the platforms that you have tested?
What have you found to be the most difficult distributed systems bugs to resolve?
What are some of the interesting developments in distributed systems design that you are keeping an eye on?
How do you perceive the impact that Jepsen has had on modern distributed systems products?
What have you found to be the most interesting, unexpected, or challenging lessons learned while building Jepsen and evaluating mission critical systems?
What do you have planned for the future of the Jepsen framework?
Contact Info
aphyr on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Jepsen
Riak
Distributed Systems
TLA+
Coq
Isabelle
Cassandra DTest
FoundationDB
Podcast Episode
CRDT == Conflict-free Replicated Data-type
Podcast Episode
Riemann
Clojure
JVM == Java Virtual Machine
Kotlin
Haskell
Scala
Groovy
TiDB
YugabyteDB
Podcast Episode
CockroachDB
Podcast Episode
Raft consensus algorithm
Paxos
Leslie Lamport
Calvin
FaunaDB
Podcast Episode
Heidi Howard
CALM Conjecture
Causal Consistency
Hillel Wayne
Christopher Meiklejohn
Distsys Class
Distributed Systems For Fun And Profit by Mikito Takada
Christopher Meiklejohn Reading List
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 21, 2020 • 41min
Making Wind Energy More Efficient With Data At Turbit Systems
Summary
Wind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the overall efficiency of the turbines. Michael Tegtmeier founded Turbit Systems to help operators of wind farms identify and correct problems that contribute to suboptimal power outputs. In this episode he shares the story of how he got started working with wind energy, the system that he has built to collect data from the individual turbines, and how he is using machine learning to provide valuable insights to produce higher energy outputs. This was a great conversation about using data to improve the way the world works.
Announcements
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
Your host is Tobias Macey and today I’m interviewing Michael Tegtmeier about Turbit, a machine learning powered platform for performance monitoring of wind farms
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Turbit and your motivation for creating the business?
What are the most problematic factors that contribute to low performance in power generation with wind turbines?
What is the current state of the art for accessing and analyzing data for wind farms?
What information are you able to gather from the SCADA systems in the turbine?
How uniform is the availability and formatting of data from different manufacturers?
How are you handling data collection for the individual turbines?
How much information are you processing at the point of collection vs. sending to a centralized data store?
Can you describe the system architecture of Turbit and the lifecycle of turbine data as it propagates from collection to analysis?
How do you incorporate domain knowledge into the identification of useful data and how it is used in the resultant models?
What are some of the most challenging aspects of building an analytics product for the wind energy sector?
What have you found to be the most interesting, unexpected, or challenging aspects of building and growing Turbit?
What do you have planned for the future of the technology and business?
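Turbit's production models are neural networks trained on SCADA data, but the underlying idea of performance monitoring can be illustrated much more simply: bin historical wind speed against power output to form an empirical power curve, then flag readings that fall well below the curve's expectation. The function names, units, and tolerance below are invented for illustration:

```python
from collections import defaultdict

def fit_power_curve(samples, bin_width=1.0):
    """Empirical power curve: mean power output per wind-speed bin.

    `samples` is a list of (wind_speed_m_s, power_kw) tuples recorded
    during a known-healthy reference period.
    """
    bins = defaultdict(list)
    for speed, power in samples:
        bins[int(speed // bin_width)].append(power)
    return {b: sum(v) / len(v) for b, v in bins.items()}

def underperforming(curve, speed, power, bin_width=1.0, tolerance=0.8):
    """Flag a reading whose power is well below the curve's expectation."""
    expected = curve.get(int(speed // bin_width))
    if expected is None:
        return False  # no reference data for this wind speed
    return power < tolerance * expected

curve = fit_power_curve([(5.1, 100), (5.4, 110), (8.2, 400), (8.9, 420)])
print(underperforming(curve, 5.2, 102))  # False: within expectations
print(underperforming(curve, 8.5, 250))  # True: well below the curve
```

A static threshold like this ignores air density, pitch, yaw misalignment, and sensor drift, which is why a learned model over the full SCADA feature set is the more robust approach discussed in the episode.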
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Turbit Systems
LIDAR
Pulse Shaping
Wind Turbine
SCADA
Genetic Algorithm
Bremen Germany
Pitch
Yaw
Nacelle
Anemometer
Neural Network
Swarm64
Podcast Episode
Tensorflow

Jul 13, 2020 • 1h 5min
Open Source Production Grade Data Integration With Meltano
Summary
The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use open source option. The Meltano project is aiming to provide a solution to that situation. In this episode, project lead Douwe Maan shares the history of how Meltano got started, the motivation for the recent shift in focus, and how it is implemented. The Singer ecosystem has laid the groundwork for a great option to empower teams of all sizes to unlock the value of their data, and Meltano is building the remaining structure to make it a fully featured alternative to proprietary systems.
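Meltano builds on the Singer specification, in which a "tap" writes newline-delimited JSON messages of type SCHEMA, RECORD, and STATE to stdout, and a "target" consumes them. A minimal hand-rolled tap looks roughly like the sketch below; the stream name and fields are invented for illustration:

```python
import json
import sys

def emit(message):
    # Singer messages are newline-delimited JSON on stdout.
    sys.stdout.write(json.dumps(message) + "\n")

def run_tap(rows):
    # Describe the stream once, before any records.
    emit({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    })
    for row in rows:
        emit({"type": "RECORD", "stream": "users", "record": row})
    # A STATE message lets the next run resume incrementally.
    emit({"type": "STATE",
          "value": {"users": {"max_id": max(r["id"] for r in rows)}}})

if __name__ == "__main__":
    run_tap([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])
```

Because the protocol is just JSON over a pipe (`tap | target`), any two conforming programs compose, which is what lets Meltano treat the existing ecosystem of Singer taps and targets as interchangeable building blocks.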
Your host is Tobias Macey and today I’m interviewing Douwe Maan about Meltano, an open source platform for building, running & orchestrating ELT pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Meltano is and the story behind it?
Who is the target audience?
How does the focus on small or early stage organizations constrain the architectural decisions that go into Meltano?
What have you found to be the complexities in trying to encapsulate the entirety of the data lifecycle in a single tool or platform?
What are the most painful transitions in that lifecycle and how does that pain manifest?
How and why has the focus of the project shifted from its original vision?
With your current focus on the data integration/data transfer stage of the lifecycle, what are you seeing as the biggest barriers to entry with the current ecosystem?
What are the main elements of your strategy to address these barriers?
How is the Meltano platform in its current incarnation implemented?
How much of the original architecture have you been able to retain, and how have you evolved it to align with your new direction?
What have you found to be the challenges that your users face when going from the easy on-ramp of local execution to then trying to scale and customize their pipelines for production use?
What are the most critical features that you are focusing on building now to make Meltano competitive with managed platforms?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Meltano?
When is Meltano the wrong choice?
What is your broad vision for the future of Meltano?
What are the most immediate needs for contribution that will help you realize that vision?
Contact Info
Website
DouweM on GitLab
DouweM on GitHub
@DouweM on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Meltano
GitLab
Mexico City
Netherlands
Locally Optimistic
Singer
Stitch Data
DBT
ELT
Informatica
Version Control
Code Review
CI/CD
Jupyter Notebook
LookML
Meltano Modeling Syntax
Redash
Metabase
Apache Superset
Apache Airflow
Luigi
Prefect
Dagster
Transferwise
Pipelinewise
12 Factor Application

Jul 6, 2020 • 46min
DataOps For Streaming Systems With Lenses.io
Summary
There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and governance of your data. That’s what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data requires a different approach than batch oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.
Your host is Tobias Macey and today I’m interviewing Andrew Stevenson about Lenses.io, a platform to provide real-time data operations for engineers
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Lenses is and the story behind it?
What is your working definition for what constitutes DataOps?
How does the Lenses platform support the cross-cutting concerns that arise when trying to bridge the different roles in an organization to deliver value with data?
What are the typical barriers to collaboration, and how does Lenses help with that?
Many different systems provide a SQL interface to streaming data on various substrates. What was your reason for building your own SQL engine and what is unique about it?
What are the main challenges that you see engineers facing when working with streaming systems?
What have you found to be the most notable evolutions in the community and ecosystem around Kafka and streaming platforms?
One of the interesting features in the recent release is support for topologies to map out the relations between different producers and consumers across a stream. Why is that a difficult problem and how have you approached it?
On the point of monitoring, what are the foundational challenges that engineers run into when trying to gain visibility into streams of data?
What are some useful strategies for collecting and analyzing traces of data flows?
As with many things in the space of data, local development and pre-production testing and validation are complicated due to the potential scale and variability of a production system. What advice do you have for engineers who are trying to establish a sustainable workflow for streaming applications?
How do you facilitate the CI/CD process for enabling a culture of testing and establishing confidence in the correct functionality of your systems?
How is the Lenses platform implemented and how has its design evolved since you first began working on it?
What are some of the specifics of Kafka that you have had to reconsider or redesign as you began adding support for additional streaming engines (e.g. Redis and Pulsar)?
What are some of the most interesting, unexpected, or innovative ways that you have seen the Lenses platform used?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with Lenses?
When is Lenses the wrong choice?
What do you have planned for the future of the platform?
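Lenses ships its own SQL engine for streaming data, and while the sketch below is not that engine, it illustrates the kind of operation a streaming SQL query compiles down to: a tumbling-window aggregation that groups events by key within fixed, non-overlapping time windows. The function and field names are invented for illustration:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Count events per key per tumbling window.

    `events` is an iterable of (timestamp_ms, key) pairs. This is the
    rough equivalent of a streaming-SQL query along the lines of
    `SELECT key, COUNT(*) FROM stream GROUP BY key, tumble(window_ms)`.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Each event belongs to exactly one window: the one whose
        # start is the timestamp rounded down to a window boundary.
        window_start = (ts // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)

out = tumbling_window_counts(
    [(10, "a"), (20, "a"), (110, "a"), (30, "b")], window_ms=100)
print(out)  # {(0, 'a'): 2, (100, 'a'): 1, (0, 'b'): 1}
```

A production engine additionally has to handle out-of-order and late-arriving events, watermarks, and incremental emission of partial results, which is where much of the real complexity in streaming SQL lives.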
Contact Info
LinkedIn
@StevensonA_D on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Lenses.io
Babylon Health
DevOps
DataOps
GitOps
Apache Calcite
kSQL
Kafka Connect Query Language
Apache Flink
Podcast Episode
Apache Spark
Podcast Episode
Apache Pulsar
Podcast Episode
StreamNative Episode
Playtika
Riskfuel(?)
JMX Metrics
Amazon MSK (Managed Streaming for Kafka)
Prometheus
Canary Deployment
Kafka on Pulsar
Data Catalog
Data Mesh
Podcast Episode
Dagster
Airflow

Jun 30, 2020 • 57min
Data Collection And Management To Power Sound Recognition At Audio Analytic
Summary
We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges that they are faced with in the collection and labelling of high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.
Your host is Tobias Macey and today I’m interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Audio Analytic?
What was your motivation for building an AI platform for sound recognition?
What are some of the ways that your platform is being used?
What are the unique challenges that you have faced in working with arbitrary sound data?
How do you handle the collection and labelling of the source data that you rely on for building your models?
Beyond just collection and storage, what is your process for defining a taxonomy of the audio data that you are working with?
How has the taxonomy had to evolve, and what assumptions have had to change, as you progressed in building the data set and the resulting models?
What are the challenges of building an embeddable AI model?
How do you manage the update cycle for models deployed on devices?
How difficult is it to identify relevant audio and deal with literal noise in the input data?
What rights and ownership challenges do you face in the collection of source data?
What was your design process for constructing a pipeline for the audio data that you need to process?
Can you describe how your overall data management system is architected?
How has that architecture evolved since you first began building and using it?
A majority of data tools are oriented around, and optimized for, collection and processing of textual data. How much off-the-shelf technology have you been able to use for working with audio?
What are some of the assumptions that you made at the start which have been shown to be inaccurate or in need of reconsidering?
How do you address variability in the duration of source samples in the processing pipeline?
How much of an issue do you face as a result of the variable quality of microphones in the embedded devices where the model is being run?
What are the limitations of the model in dealing with complex and layered audio environments?
How has the testing and evaluation of your model fed back into your strategies for collecting source data?
What are some of the weirdest or most unusual sounds that you have worked with?
What have been the most interesting, unexpected, or challenging lessons that you have learned in the process of building the technology and business of Audio Analytic?
What do you have planned for the future of the company?
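One of the pipeline questions above, handling source samples of varying duration, is commonly addressed by windowing: splitting each labelled clip into fixed-size, optionally overlapping frames that all carry the clip's label. This is a generic technique rather than Audio Analytic's specific method, and the frame sizes and label below are invented for illustration:

```python
def frame_audio(samples, label, frame_len=400, hop=200):
    """Split a variable-length audio clip into fixed-size labelled frames.

    `samples` is a list of amplitude values; real pipelines work on
    16 kHz PCM or spectrogram slices, but the windowing logic is the
    same. Overlapping frames (hop < frame_len) stretch scarce labelled
    data further, and any trailing partial frame is dropped.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append((samples[start:start + frame_len], label))
    return frames

frames = frame_audio(list(range(1000)), "glass_break")
print(len(frames))  # 4 frames: starts at 0, 200, 400, 600
```

Fixed-size frames give the downstream model a uniform input shape regardless of how long the original recording was, at the cost of having to reconcile frame-level predictions back into clip-level events.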
Contact Info
Chris
LinkedIn
Thomas
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Audio Analytic
Twitter
Anechoic Chamber
EXIF Data
ID3 Tags
Polyphonic Sound Detection Score
GitHub Repository
ICASSP
CES
M0+ ARM Processor
Context Systems Blog Post

Jun 23, 2020 • 52min
Bringing Business Analytics To End Users With GoodData
Summary
The majority of analytics platforms are focused on internal use by business stakeholders within an organization. As the availability of data increases and overall literacy in how to interpret it and take action improves, there is a growing need to bring business intelligence use cases to a broader audience. GoodData is a platform focused on simplifying the work of bringing data to employees and end users. In this episode Sheila Jung and Philip Farr discuss how the GoodData platform is being used, how it is architected to provide scalable and performant analytics, and how it integrates into customers’ data platforms. This was an interesting conversation about a different approach to business intelligence and the importance of expanded access to data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
GoodData is revolutionizing the way in which companies provide analytics to their customers and partners. Start now with GoodData Free that makes our self-service analytics platform available to you at no cost. Register today at dataengineeringpodcast.com/gooddata
Your host is Tobias Macey and today I’m interviewing Sheila Jung and Philip Farr about how GoodData is building a platform that lets you share your analytics outside the boundaries of your organization
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at GoodData and some of its origin story?
The business intelligence market has been around for decades now and there are dozens of options with different areas of focus. What are the factors that might motivate me to choose GoodData over the other contenders in the space?
What are the use cases and industries that you focus on supporting with GoodData?
How has the market of business intelligence tools evolved in recent years?
What are the contributing trends in technology and business use cases that are driving that change?
What are some of the ways that your customers are embedding analytics into their own products?
What are the differences in processing and serving capabilities between an internally used business intelligence tool, and one that is used for embedding into externally used systems?
What unique challenges are posed by the embedded analytics use case?
How do you approach topics such as security, access control, and latency in a multitenant analytics platform?
What guidelines have you found to be most useful when addressing the concerns of accuracy and interpretability of the data being presented?
How is the GoodData platform architected?
What are the complexities that you have had to design around in order to provide performant access to your customers’ data sources in an interactive use case?
What are the off-the-shelf components that you have been able to integrate into the platform, and what are the driving factors for solutions that have been built specifically for the GoodData use case?
What is the process for your users to integrate GoodData into their existing data platform?
What is the workflow for someone building a data product in GoodData?
How does GoodData manage the lifecycle of the data that your customers are presenting to their end users?
How does GoodData integrate into the customer development lifecycle?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with GoodData?
Can you give an overview of the MAQL (Multi-Dimension Analytical Query Language) dialect that you use in GoodData and contrast it with SQL?
What are the benefits and additional functionality that MAQL provides?
When is GoodData the wrong choice?
What is on the roadmap for the future of GoodData?
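The question above about security and access control in a multitenant analytics platform comes down to one invariant: every query a tenant runs must be scoped to that tenant's rows by the platform itself. The sketch below is a generic illustration of that idea, not GoodData's implementation; the function and parameter names are invented for the example.

```python
from typing import List, Optional, Tuple

def scoped_query(table: str, columns: List[str], tenant_id: str,
                 extra_where: Optional[str] = None) -> Tuple[str, list]:
    """Build a SELECT that always carries a tenant predicate.

    The tenant filter is appended by the platform, never by the end user,
    so a user-supplied filter cannot widen access to another tenant's rows.
    (A real system would also validate or parse extra_where rather than
    interpolate it; this is a sketch of the scoping rule only.)
    """
    where = "tenant_id = ?"
    params = [tenant_id]
    if extra_where:
        # User filters are ANDed inside the tenant scope, never OR'd around it.
        where = f"({extra_where}) AND {where}"
    sql = f"SELECT {', '.join(columns)} FROM {table} WHERE {where}"
    return sql, params

sql, params = scoped_query("orders", ["sku", "amount"], "acme",
                           extra_where="amount > 100")
# sql == "SELECT sku, amount FROM orders WHERE (amount > 100) AND tenant_id = ?"
```

Because the tenant predicate is bound as a parameter and appended last, no combination of user-chosen columns or filters can escape the tenant's partition of the data.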
Contact Info
Sheila
LinkedIn
Philip
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
GoodData
Teradata
ReactJS
SnowflakeDB
Podcast Episode
Redshift
BigQuery
SOC2
HIPAA
GDPR == General Data Protection Regulation
IoT == Internet of Things
SAML
Ruby
Multi-Dimension Analytical Query Language
Kubernetes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 15, 2020 • 46min
Accelerate Your Machine Learning With The StreamSQL Feature Store
Summary
Machine learning is a process driven by iteration and experimentation, which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of developing and delivering models there has been a recent trend toward building a dedicated feature store. In this episode Simba Khadder discusses his work at StreamSQL building a feature store to make creation, discovery, and monitoring of features fast and easy to manage. He describes the architecture of the system, the benefits of streaming data for machine learning, and how a feature store provides a useful interface between data engineers and machine learning engineers to reduce communication overhead.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Your host is Tobias Macey and today I’m interviewing Simba Khadder about his views on the importance of ML feature stores, and his experience implementing one at StreamSQL
Interview
Introduction
How did you get involved in the areas of machine learning and data management?
What is StreamSQL and what motivated you to start the business?
Can you describe what a machine learning feature is?
What is the difference between generating features for training a model and generating features for serving?
How is feature management typically handled today?
What is a feature store and how is it different from the status quo?
What is the overall lifecycle of identifying useful features, defining and generating them, using them for training, and then serving them in production?
How does the usage of a feature store impact the workflow of ML engineers/data scientists and data engineers?
What are the general requirements of a feature store?
What additional capabilities or tangential services are necessary for providing a pleasant UX for a feature store?
How is discovery and documentation of features handled?
What is the current landscape of feature stores and how does StreamSQL compare?
How is the StreamSQL feature store implemented?
How is the supporting infrastructure architected and how has it evolved since you first began working on it?
Why is streaming data such a focal point of feature stores?
How do you generate features for training?
How do you approach monitoring of features and what does remediation look like for a feature that is no longer valid?
How do you handle versioning and deploying features?
What’s the process for integrating data sources into StreamSQL for processing into features?
How are the features materialized?
What are the most challenging or complex aspects of working on or with a feature store?
When is StreamSQL the wrong choice for a feature store?
What are the most interesting, challenging, or unexpected lessons that you have learned in the process of building StreamSQL?
What do you have planned for the future of the product?
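The distinction raised above between generating features for training and generating them for serving is the crux of a feature store: one definition, two retrieval paths, no train/serve skew. Here is a minimal illustrative sketch of that contract in Python; it is a toy, not StreamSQL's actual API, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class FeatureStore:
    """Toy feature store: one registry serves both training and online lookups."""
    _definitions: Dict[str, Callable[[dict], Any]] = field(default_factory=dict)
    _online: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def register(self, name: str, transform: Callable[[dict], Any]) -> None:
        # A feature is a named transformation over raw entity data.
        self._definitions[name] = transform

    def materialize(self, entity_id: str, raw: dict) -> None:
        # Precompute features for low-latency serving (the "online" store).
        self._online[entity_id] = {
            name: fn(raw) for name, fn in self._definitions.items()
        }

    def training_row(self, raw: dict) -> dict:
        # Training applies the same definitions to historical records,
        # so trained models see exactly what serving will produce.
        return {name: fn(raw) for name, fn in self._definitions.items()}

    def serve(self, entity_id: str) -> dict:
        return self._online[entity_id]

store = FeatureStore()
store.register("purchase_count", lambda r: len(r["purchases"]))
store.register("avg_purchase", lambda r: sum(r["purchases"]) / len(r["purchases"]))

raw = {"purchases": [10.0, 20.0, 30.0]}
store.materialize("user-1", raw)
assert store.serve("user-1") == store.training_row(raw)  # same values, both paths
```

The value of the pattern is the shared registry: a feature is defined once, and both the batch path (training) and the online path (serving) derive from the same transformation.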
Contact Info
LinkedIn
@simba_khadder on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
StreamSQL
Feature Stores for ML
Distributed Systems
Google Cloud Datastore
Triton
Uber Michelangelo
AirBnB Zipline
Lyft Dryft
Apache Flink
Podcast Episode
Apache Kafka
Spark Streaming
Apache Cassandra
Redis
Apache Pulsar
Podcast Episode
StreamNative Episode
TDD == Test Driven Development
Lyft presentation – Bootstrapping Flink
Go-Jek Feast
Hopsworks
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 8, 2020 • 55min
Data Management Trends From An Investor Perspective
Summary
The landscape of data management and processing is rapidly changing and evolving. There are certain foundational elements that have remained steady, but as the industry matures new trends emerge and gain prominence. In this episode Astasia Myers of Redpoint Ventures shares her perspective as an investor on which categories she is paying particular attention to for the near to medium term. She discusses the work being done to address challenges in the areas of data quality, observability, discovery, and streaming. This is a useful conversation to gain a macro perspective on where businesses are looking to improve their capabilities to work with data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar to get you up and running in no time. With simple pricing, fast networking, S3 compatible object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
Your host is Tobias Macey and today I’m interviewing Astasia Myers about the trends in the data industry that she sees as an investor at Redpoint Ventures
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of Redpoint Ventures and your role there?
From an investor perspective, what is most appealing about the category of data-oriented businesses?
What are the main sources of information that you rely on to keep up to date with what is happening in the data industry?
What is your personal heuristic for determining the relevance of any given piece of information to decide whether it is worthy of further investigation?
As someone who works closely with a variety of companies across different industry verticals and different areas of focus, what are some of the common trends that you have identified in the data ecosystem?
In your article covering the trends you are keeping an eye on for 2020 you call out four in particular: data quality, data catalogs, observability of what influences critical business indicators, and streaming data. Taking those in turn:
What are the driving factors that influence data quality, and what elements of that problem space are being addressed by the companies you are watching?
What are the unsolved areas that you see as being viable for newcomers?
What are the challenges faced by businesses in establishing and maintaining data catalogs?
What approaches are being taken by the companies who are trying to solve this problem?
What shortcomings do you see in the available products?
For gaining visibility into the forces that impact the key performance indicators (KPI) of businesses, what is lacking in the current approaches?
What additional information needs to be tracked to provide the needed context for making informed decisions about what actions to take to improve KPIs?
What challenges do businesses in this observability space face to provide useful access and analysis to this collected data?
Streaming is an area that has been growing rapidly over the past few years, with many open source and commercial options. What are the major business opportunities that you see to make streaming more accessible and effective?
What are the main factors that you see as driving this growth in the need for access to streaming data?
With your focus on these trends, how does that influence your investment decisions and where you spend your time?
What are the unaddressed markets or product categories that you see which would be lucrative for new businesses?
In most areas of technology now there is a mix of open source and commercial solutions to any given problem, with varying levels of maturity and polish between them. What are your views on the balance of this relationship in the data ecosystem?
For data in particular, there is a strong potential for vendor lock-in which can cause potential customers to avoid adoption of commercial solutions. What has been your experience in that regard with the companies that you work with?
Contact Info
@AstasiaMyers on Twitter
@astasia on Medium
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Redpoint Ventures
4 Data Trends To Watch in 2020
Seagate
Western Digital
Pure Storage
Cisco
Cohesity
Looker
Podcast Episode
DGraph
Podcast Episode
Dremio
Podcast Episode
SnowflakeDB
Podcast Episode
ThoughtSpot
Tibco
Elastic
Splunk
Informatica
Data Council
DataCoral
Mattermost
Bitwarden
Snowplow
Podcast Interview
Interview About Snowplow Infrastructure
CHAOSSEARCH
Podcast Episode
Kafka Streams
Pulsar
Podcast Interview
Followup Podcast Interview
Soda
Toro
Great Expectations
Alation
Collibra
Amundsen
DataHub
Netflix Metacat
Marquez
Podcast Episode
LDAP == Lightweight Directory Access Protocol
Anodot
Databricks
Flink

Jun 2, 2020 • 56min
Building A Data Lake For The Database Administrator At Upsolver
Summary
Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. In order to bring the DBA into the new era of data management the team at Upsolver added a SQL interface to their data lake platform. In this episode Upsolver CEO Ori Rafael and CTO Yoni Iny describe how they have grown their platform deliberately to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
Your host is Tobias Macey and today I’m interviewing Ori Rafael and Yoni Iny about building a data lake for the DBA at Upsolver
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of what a data lake is and what it is comprised of?
We talked last in November of 2018. How has the landscape of data lake technologies and adoption changed in that time?
How has Upsolver changed or evolved since we last spoke?
How has the evolution of the underlying technologies impacted your implementation and overall product strategy?
What are some of the common challenges that accompany a data lake implementation?
How do those challenges influence the adoption or viability of a data lake?
How does the introduction of a universal SQL layer change the staffing requirements for building and maintaining a data lake?
What are the advantages of a data lake over a data warehouse if everything is being managed via SQL anyway?
What are some of the underlying realities of the data systems that power the lake which will eventually need to be understood by the operators of the platform?
How is the SQL layer in Upsolver implemented?
What are the most challenging or complex aspects of managing the underlying technologies to provide automated partitioning, indexing, etc.?
What are the main concepts that you need to educate your customers on?
What are some of the pitfalls that users should be aware of?
What features of your platform are often overlooked or underutilized which you think should be more widely adopted?
What have you found to be the most interesting, unexpected, or challenging lessons learned while building the technical and business elements of Upsolver?
What do you have planned for the future?
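The automated partitioning mentioned above typically means routing each event into a time-based, Hive-style directory layout on object storage so that query engines can prune whole partitions instead of scanning every file. The sketch below is a generic illustration of that layout, not Upsolver's internals; the prefix and path scheme are assumptions for the example.

```python
from datetime import datetime, timezone

def partition_key(prefix: str, event_time: float) -> str:
    """Map an event timestamp to a Hive-style partition path.

    Engines such as Presto or Spark can skip entire partitions when a
    query filters on dt/hour, which is what keeps lake queries affordable.
    """
    ts = datetime.fromtimestamp(event_time, tz=timezone.utc)
    return f"{prefix}/dt={ts:%Y-%m-%d}/hour={ts:%H}/"

key = partition_key("s3://lake/events", 1591056000)  # 2020-06-02 00:00:00 UTC
# key == "s3://lake/events/dt=2020-06-02/hour=00/"
```

Partitioning by event time rather than arrival time is a common choice so that late-arriving data lands in the partition queries expect, at the cost of occasionally rewriting older partitions.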
Contact Info
Ori
LinkedIn
Yoni
yoniiny on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Upsolver
Podcast Episode
DBA == Database Administrator
IDF == Israel Defense Forces
Data Lake
Eventual Consistency
Apache Spark
Redshift Spectrum
Azure Synapse Analytics
SnowflakeDB
Podcast Episode
BigQuery
Presto
Podcast Episode
Apache Kafka
Cartesian Product
ksqlDB
Podcast Episode
Eventador
Podcast Episode
Materialize
Podcast Episode
Common Table Expressions
Lambda Architecture
Kappa Architecture
Apache Flink
Podcast Episode
Reinforcement Learning
Cloudformation
GDPR
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

May 25, 2020 • 47min
Mapping The Customer Journey For B2B Companies At Dreamdata
Summary
Gaining a complete view of the customer journey is especially difficult in B2B companies. This is due to the number of different individuals involved and the myriad ways that they interface with the business. Dreamdata integrates data from the multitude of platforms that are used by these organizations so that they can get a comprehensive view of their customer lifecycle. In this episode Ole Dallerup explains how Dreamdata was started, how their platform is architected, and the challenges inherent to data management in the B2B space. This conversation is a useful look into how data engineering and analytics can have a direct impact on the success of the business.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
Your host is Tobias Macey and today I’m interviewing Ole Dallerup about Dreamdata, a platform for simplifying data integration for B2B companies
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Dreamdata?
What was your inspiration for starting a company and what keeps you motivated?
How do the data requirements differ between B2C and B2B companies?
What are the challenges that B2B companies face in gaining visibility across the lifecycle of their customers?
How does that lack of visibility impact the viability or growth potential of the business?
What are the factors that contribute to silos in visibility of customer activity within a business?
What are the data sources that you are dealing with to generate meaningful analytics for your customers?
What are some of the challenges that business face in either generating or collecting useful information about their customer interactions?
How is the technical platform of Dreamdata implemented and how has it evolved since you first began working on it?
What are some of the ways that you approach entity resolution across the different channels and data sources?
How do you reconcile the information collected from different sources that might use disparate data formats and representations?
What is the onboarding process for your customers to identify and integrate with all of their systems?
How do you approach the definition of the schema model for the database that your customers implement for storing their footprint?
Do you allow for customization by the customer?
Do you rely on a tool such as DBT for populating the table definitions and transformations from the source data?
How do you approach representation of the analysis and actionable insights to your customers so that they are able to accurately interpret the results?
How have your own experiences at Dreamdata influenced the areas that you invest in for the product?
What are some of the most interesting or surprising insights that you have been able to gain as a result of the unified view that you are building?
What are some of the most challenging, interesting, or unexpected lessons that you have learned from building and growing the technical and business elements of Dreamdata?
When might a user be better served by building their own pipelines or analysis for tracking their customer interactions?
What do you have planned for the future of Dreamdata?
What are some of the industry trends that you are keeping an eye on and what potential impacts to your business do you anticipate?
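The entity-resolution question above is often approached with a union-find over shared identifiers: any two records that share an email, a cookie ID, or a CRM account ID collapse into one entity, and connections are transitive across sources. This is a generic sketch of that technique, not Dreamdata's actual pipeline; the field names are invented for the example.

```python
def resolve_entities(records):
    """Group records that share any identifier value (union-find).

    Each record is a dict mapping identifier type to value
    (e.g. email, cookie, crm_id). Returns a list of groups,
    each a list of record indices belonging to one entity.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    seen = {}  # (identifier, value) -> first record index seen with it
    for idx, rec in enumerate(records):
        for key, value in rec.items():
            if value is None:
                continue
            if (key, value) in seen:
                union(idx, seen[(key, value)])
            else:
                seen[(key, value)] = idx

    groups = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())

records = [
    {"email": "a@x.com", "cookie": "c1"},
    {"email": None, "cookie": "c1"},       # same browser as record 0
    {"email": "a@x.com", "crm_id": "42"},  # same person via email
    {"email": "b@y.com"},                  # unrelated visitor
]
assert sorted(map(sorted, resolve_entities(records))) == [[0, 1, 2], [3]]
```

Note the transitivity: record 1 has no email at all, yet it joins the entity through the shared cookie, which is exactly what makes cross-channel B2B journeys stitchable.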
Contact Info
LinkedIn
@oledallerup on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Dreamdata
Poker Tracker
TrustPilot
Zendesk
Salesforce
Hubspot
Google BigQuery
SnowflakeDB
Podcast Episode
AWS Redshift
Singer
Stitch Data
Dataform
Podcast Episode
DBT
Podcast Episode
Segment
Podcast Episode
Cloud Dataflow
Apache Beam
UTM Parameters
Clearbit
Capterra
G2 Crowd
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast


