

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes of the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Feb 19, 2018 • 29min
Data Teams with Will McGinnis - Episode 19
Summary
The responsibilities of a data scientist and a data engineer often overlap and occasionally work at cross purposes. Despite these challenges, it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis shares the opinions he has formed from experience about how data teams can play to their strengths to the benefit of all.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project, you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA, March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%.
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%.
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% on your tickets, go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists
Interview
Introduction
How did you get involved in the area of data management?
The terms “Data Scientist” and “Data Engineer” are fluid and seem to have a different meaning for everyone who uses them. Can you share how you define those terms?
What parallels do you see between the relationships of data engineers and data scientists and those of developers and systems administrators?
Is there a particular size of organization or problem that serves as a tipping point for when you start to separate the two roles into the responsibilities of more than one person or team?
What are the benefits of splitting the responsibilities of data engineering and data science?
What are the disadvantages?
What are some strategies to ensure successful interaction between data engineers and data scientists?
How do you view these roles evolving as they become more prevalent across companies and industries?
Contact Info
Website
wdm0006 on GitHub
@willmcginniser on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Blog Post: Tendencies of Data Engineers and Data Scientists
Predikto
Categorical Encoders
DevOps
scikit-learn
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Feb 11, 2018 • 1h 3min
TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18
Ajay Kulkarni and Mike Freedman, co-founders of TimescaleDB, discuss the origins and challenges of building a scalable time series database. They explain how TimescaleDB handles out-of-order data and infrequent sensor connections. They also share insights into marketing and business aspects, including the decision to release the code base as open source, future plans for the enterprise version, and the support and investment structure for the open source business model.
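To make the hypertable model concrete, here is a minimal sketch of converting a plain PostgreSQL table into a TimescaleDB hypertable from Python with psycopg2; the connection string and table definition are illustrative assumptions, while create_hypertable is TimescaleDB's documented partitioning function.

```python
import psycopg2

# Hypothetical connection details; adjust for your environment.
conn = psycopg2.connect("dbname=metrics host=localhost user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        timestamptz NOT NULL,
        device_id   text,
        temperature double precision
    );
""")
# Convert the table into a hypertable partitioned into time-based chunks.
# Out-of-order rows are routed to whichever chunk covers their timestamp,
# which is how late-arriving sensor data gets absorbed.
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")
conn.commit()
```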

Feb 4, 2018 • 54min
Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17
Summary
One of the critical components of modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream-oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both models, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.
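For a sense of what working with Pulsar looks like, here is a minimal sketch using the pulsar-client Python package; the service URL, topic, and subscription name are illustrative assumptions rather than values from the episode.

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Producers publish messages to a topic.
producer = client.create_producer('persistent://public/default/my-topic')
producer.send(b'hello pulsar')

# A named subscription tracks acknowledgements independently of other
# subscribers, which is what lets Pulsar serve both queuing and pub-sub patterns.
consumer = client.subscribe('persistent://public/default/my-topic',
                            subscription_name='my-subscription')
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()
```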
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project, you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA, March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%.
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%.
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% on your tickets, go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Pulsar is and what the original inspiration for the project was?
What have been some of the most challenging aspects of building and promoting Pulsar?
For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components?
What are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to?
What projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison?
The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have been developed on top of Kafka?
One of the popular aspects of Kafka is the persistence of the message log, so I’m curious: how does Pulsar manage long-term storage and reprocessing of messages that have already been acknowledged?
When is Pulsar the wrong tool to use?
What are some of the improvements or new features that you have planned for the future of Pulsar?
Contact Info
Matteo
merlimat on GitHub
@merlimat on Twitter
Rajan
@dhabaliaraj on Twitter
rhabalia on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pulsar
Publish-Subscribe
Yahoo
Streamlio
ActiveMQ
Kafka
BookKeeper
SLA (Service Level Agreement)
Write-Ahead Log
Ansible
Zookeeper
Pulsar Deployment Instructions
RabbitMQ
Confluent Schema Registry
Podcast Interview
Kafka Connect
Wallaroo
Podcast Interview
Kinesis
Athenz
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jan 29, 2018 • 1h 3min
Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16
Summary
Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators, the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project, you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks, who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA, March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%.
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%.
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% on your tickets, go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about the Dat Project, a distributed data sharing protocol for building applications of the future
Interview
Introduction
How did you get involved in the area of data management?
What is the Dat project and how did it get started?
How have the grants to the Dat project influenced the focus and pace of development that was possible?
Now that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project?
Can you explain how the Dat protocol is designed and how it has evolved since it was first started?
How does Dat manage conflict resolution and data versioning when replicating between multiple machines?
One of the primary use cases that is mentioned in the documentation and website for Dat is that of hosting and distributing open data sets, with a focus on researchers. How does Dat help with that effort and what improvements does it offer over other existing solutions?
One of the difficult aspects of building a peer-to-peer protocol is that of establishing a critical mass of users to add value to the network. How have you approached that effort and how much progress do you feel that you have made?
How does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via Dat, versus the common three-tier architecture oriented around persistent databases?
What mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default?
For someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network?
What have been the most challenging aspects of building and promoting Dat?
What are some of the most interesting or inspiring uses of the Dat protocol that you are aware of?
Contact Info
Dat
datproject.org
Email
@dat_project on Twitter
Dat Chat
Danielle
Email
@daniellecrobins
Joe
Email
@joeahand on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Dat Project
Code For Science and Society
Neuroscience
Cell Biology
OpenCon
Mozilla Science
Open Education
Open Access
Open Data
Fortune 500
Data Warehouse
Knight Foundation
Alfred P. Sloan Foundation
Gordon and Betty Moore Foundation
Dat In The Lab
Dat in the Lab blog posts
California Digital Library
IPFS
Dat on Open Collective – COMING SOON!
ScienceFair
Stencila
eLIFE
Git
BitTorrent
Dat Whitepaper
Merkle Tree
Certificate Transparency
Dat Protocol Working Group
Dat Multiwriter Development – Hyperdb
Beaker Browser
WebRTC
IndexedDB
Rust
C
Keybase
PGP
Wire
Zenodo
Dryad Data Sharing
Dataverse
RSync
FTP
Globus
Fritter
Fritter Demo
Rotonde how to
Joe’s website on Dat
Dat Tutorial
Data Rescue – NYTimes Coverage
Data.gov
Libraries+ Network
UC Conservation Genomics Consortium
Fair Data principles
hypervision
hypervision in browser
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jan 22, 2018 • 37min
Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15
Summary
The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, and the rest of the knowledge in a company is housed in so-called “dark data” sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value from these sources by leveraging labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can help democratize machine learning by making it feasible for organizations with smaller data sets than those required by most tooling.
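As a rough illustration, here is a dependency-free sketch of the labeling-function idea: domain experts write noisy heuristics that vote on each example, and a generative model (crudely approximated below by a majority vote) denoises those votes into training labels. The label values, heuristics, and example text are illustrative assumptions, not Snorkel's actual API.

```python
# Hypothetical label space for a spam-detection task.
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    # Heuristic: messages with URLs tend to be spam.
    return SPAM if 'http://' in text or 'https://' in text else ABSTAIN

def lf_short_message(text):
    # Heuristic: very short messages tend to be benign.
    return NOT_SPAM if len(text.split()) < 5 else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if 'prize' in text.lower() else ABSTAIN

LFS = [lf_contains_link, lf_short_message, lf_mentions_prize]

def weak_label(text):
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    # Snorkel fits a generative model over the votes to estimate each
    # function's accuracy; majority vote is a simple stand-in here.
    return max(set(votes), key=votes.count)

print(weak_label('Click https://example.com to claim your prize'))  # -> 1 (SPAM)
```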
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project, you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
Your host is Tobias Macey and today I’m interviewing Alex Ratner about Snorkel and Dark Data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of dark data and how Snorkel helps to extract value from it?
What are some of the most challenging aspects of building labeling functions and what tools or techniques are available to verify their validity and effectiveness in producing accurate outcomes?
Can you provide some examples of how Snorkel can be used to build useful models in production contexts for companies or problem domains where data collection is difficult to do at large scale?
For someone who wants to use Snorkel, what are the steps involved in processing the source data and what tooling or systems are necessary to analyze the outputs for generating usable insights?
How is Snorkel architected and how has the design evolved over its lifetime?
What are some situations where Snorkel would be poorly suited for use?
What are some of the most interesting applications of Snorkel that you are aware of?
What are some of the other projects that you and your group are working on that interact with Snorkel?
What are some of the features or improvements that you have planned for future releases of Snorkel?
Contact Info
Website
ajratner on GitHub
@ajratner on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Stanford
DAWN
HazyResearch
Snorkel
Christopher Ré
Dark Data
DARPA
Memex
Training Data
FDA
ImageNet
National Library of Medicine
Empirical Studies of Conflict
Data Augmentation
PyTorch
TensorFlow
Generative Model
Discriminative Model
Weak Supervision
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jan 15, 2018 • 46min
CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14
Summary
As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources, the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other, we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data-driven world.
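To ground the concept, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter); it illustrates the general technique and is not code from the episode.

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot, so
    concurrent updates never conflict and merging never loses data."""

    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes

    def increment(self):
        self.counts[self.node_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # which is what guarantees replicas converge regardless of the
        # order in which they exchange state.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```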
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project, you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
Your host is Tobias Macey and today I’m interviewing Christopher Meiklejohn about establishing consensus in distributed systems
Interview
Introduction
How did you get involved in the area of data management?
You have dealt with CRDTs with your work in industry, as well as in your research. Can you start by explaining what a CRDT is, how you first began working with them, and some of their current manifestations?
Other than CRDTs, what are some of the methods for establishing consensus across nodes in a system and how does increased scale affect their relative effectiveness?
One of the projects that you have been involved in which relies on CRDTs is LASP. Can you describe what LASP is and what your role in the project has been?
Can you provide examples of some production systems or available tools that are leveraging CRDTs?
If someone wants to take advantage of CRDTs in their applications or data processing, what are the available off-the-shelf options, and what would be involved in implementing custom data types?
What areas of research are you most excited about right now?
Given that you are currently working on your PhD, do you have any thoughts on the projects or industries that you would like to be involved in once your degree is completed?
Contact Info
Website
cmeiklejohn on GitHub
Google Scholar Citations
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Basho
Riak
Syncfree
LASP
CRDT
Mesosphere
CAP Theorem
Cassandra
DynamoDB
Bayou System (Xerox PARC)
Multivalue Register
Paxos
Raft
Byzantine Fault Tolerance
Two Phase Commit
Spanner
ReactiveX
TensorFlow
Erlang
Docker
Kubernetes
Erleans
Orleans
Atom Editor
Automerge
Martin Kleppmann
Akka
Delta CRDTs
Antidote DB
Kops
Eventual Consistency
Causal Consistency
ACID Transactions
Joe Hellerstein
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jan 8, 2018 • 47min
Citus Data: Distributed PostgreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13
Ozgun Erdogan and Craig Kerstiens from Citus Data discuss their work on scaling out PostgreSQL, including replication models, distributed backups, and upcoming features for real-time analytics. They also explore the considerations for deploying Citus and compare it to other offerings like Redshift and BigQuery.
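As a concrete illustration, here is a minimal sketch of sharding a table with Citus from Python via psycopg2; the connection string and schema are illustrative assumptions, while create_distributed_table is Citus's documented function for distributing a table across worker nodes.

```python
import psycopg2

# Hypothetical connection to the Citus coordinator node.
conn = psycopg2.connect("dbname=app host=localhost user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS citus;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        tenant_id bigint NOT NULL,
        payload   jsonb,
        created   timestamptz DEFAULT now()
    );
""")
# Shard the table across worker nodes by tenant_id; queries that filter
# on the distribution column can be routed to a single shard.
cur.execute("SELECT create_distributed_table('events', 'tenant_id');")
conn.commit()
```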

Dec 25, 2017 • 59min
Wallaroo with Sean T. Allen - Episode 12
Summary
Data-oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project, you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks, who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
Your host is Tobias Macey and today I’m interviewing Sean T. Allen about Wallaroo, a framework for building and operating stateful data applications at scale
Interview
Introduction
How did you get involved in the area of data engineering?
What is Wallaroo and how did the project get started?
What is the Pony language, and what features does it have that make it well suited for the problem area that you are focusing on?
Why did you choose to focus first on Python as the language for interacting with Wallaroo and how is that integration implemented?
How is Wallaroo architected internally to allow for distributed state management?
Is the state persistent, or is it only maintained long enough to complete the desired computation?
If so, what format do you use for long term storage of the data?
What have been the most challenging aspects of building the Wallaroo platform?
Which axes of the CAP theorem have you optimized for?
For someone who wants to build an application on top of Wallaroo, what is involved in getting started?
Once you have a working application, what resources are necessary for deploying to production and what are the scaling factors?
What are the failure modes that users of Wallaroo need to account for in their application or infrastructure?
What are some situations or problem types for which Wallaroo would be the wrong choice?
What are some of the most interesting or unexpected uses of Wallaroo that you have seen?
What do you have planned for the future of Wallaroo?
Contact Info
IRC
Mailing List
Wallaroo Labs Twitter
Email
Personal Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Wallaroo Labs
Storm Applied
Apache Storm
Risk Analysis
Pony Language
Erlang
Akka
Tail Latency
High Performance Computing
Python
Apache Software Foundation
Life Beyond Distributed Transactions: An Apostate’s Opinion
Consistent Hashing
Jepsen
Lineage Driven Fault Injection
Chaos Engineering
QCon 2016 Talk
Codemesh in London: How did I get here?
CAP Theorem
CRDT
Sync Free Project
Basho
Wallaroo on GitHub
Docker
Puppet
Chef
Ansible
SaltStack
Kafka
TCP
Dask
Data Engineering Episode About Dask
Beowulf Cluster
Redis
Flink
Haskell
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Dec 18, 2017 • 34min
SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11
Summary
Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project, you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks, who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
Your host is Tobias Macey and today I’m interviewing Jeroen van der Heijden about SiriDB, a next generation time series database
Interview
Introduction
How did you get involved in the area of data engineering?
What is SiriDB and how did the project get started?
What was the inspiration for the name?
What was the landscape of time series databases at the time that you first began work on Siri?
How does Siri compare to other time series databases such as InfluxDB, Timescale, KairosDB, etc.?
What do you view as the competition for Siri?
How is the server architected and how has the design evolved over the time that you have been working on it?
Can you describe how the clustering mechanism functions?
Is it possible to create pools with more than two servers?
What are the failure modes for SiriDB and where does it fall on the spectrum for the CAP theorem?
In the documentation it mentions needing to specify the retention period for the shards when creating a database. What is the reasoning for that and what happens to the individual metrics as they age beyond that time horizon?
One of the common difficulties when using a time series database in an operations context is the need for high cardinality of the metrics. How are metrics identified in Siri and is there any support for tagging?
What have been the most challenging aspects of building Siri?
In what situations or environments would you advise against using Siri?
Contact Info
joente on Github
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SiriDB
Oversight
InfluxDB
LevelDB
OpenTSDB
TimescaleDB
KairosDB
Write-Ahead Log
Grafana
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Dec 10, 2017 • 49min
Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10
Summary
To process your data, you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems, it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.
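As a concrete illustration, here is a minimal sketch of registering an Avro schema through the Schema Registry's REST interface using the requests library; the registry URL and subject name are illustrative assumptions.

```python
import json
import requests

REGISTRY = 'http://localhost:8081'  # hypothetical registry address

schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "email", "type": "string"}],
}

# Subjects are conventionally named <topic>-value or <topic>-key; the
# registry checks the new version for compatibility before accepting it.
resp = requests.post(
    f'{REGISTRY}/subjects/users-value/versions',
    headers={'Content-Type': 'application/vnd.schemaregistry.v1+json'},
    data=json.dumps({'schema': json.dumps(schema)}),
)
print(resp.json())  # e.g. {"id": 1} -- the globally unique schema id
```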
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project, you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks, who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
Your host is Tobias Macey and today I’m interviewing Ewen Cheslack-Postava about the Confluent Schema Registry
Interview
Introduction
How did you get involved in the area of data engineering?
What is the schema registry and what was the motivating factor for building it?
If you are using Avro, what benefits does the schema registry provide over and above the capabilities of Avro’s built-in schemas?
How did you settle on Avro as the format to support and what would be involved in expanding that support to other serialization options?
Conversely, what would be involved in using a storage backend other than Kafka?
What are some of the alternative technologies available for people who aren’t using Kafka in their infrastructure?
What are some of the biggest challenges that you faced while designing and building the schema registry?
What is the tipping point in terms of system scale or complexity when it makes sense to invest in a shared schema registry and what are the alternatives for smaller organizations?
What are some of the features or enhancements that you have in mind for future work?
Contact Info
ewencp on GitHub
Website
@ewencp on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Kafka
Confluent
Schema Registry
Second Life
Eve Online
Yes, Virginia, You Really Do Need a Schema Registry
JSON-Schema
Parquet
Avro
Thrift
Protocol Buffers
Zookeeper
Kafka Connect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast