

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Aug 13, 2018 • 48min
Putting Airflow Into Production With James Meickle - Episode 43
Summary
The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.
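For readers unfamiliar with Airflow's programming model, the sketch below shows the general shape of a DAG definition like the ones discussed in the episode. It is a minimal illustration using the Airflow 1.x import paths that were current at the time; the task names, schedule, and callables are hypothetical.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

    def extract():
        # Placeholder for the real extraction logic
        print("pulling source data")

    def load():
        # Placeholder for the real load logic
        print("loading into the warehouse")

    default_args = {
        "owner": "data-platform",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG(
        dag_id="example_pipeline",
        default_args=default_args,
        start_date=datetime(2018, 8, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task  # load runs only after extract succeeds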
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand-new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation
Interview
Introduction
How did you get involved in the area of data management?
What was your initial project requirement?
What tooling did you consider in addition to Airflow?
What aspects of the Airflow platform led you to choose it as your implementation target?
Can you describe your current deployment architecture?
How many engineers are involved in writing tasks for your Airflow installation?
What resources were the most helpful while learning about Airflow design patterns?
How have you architected your DAGs for deployment and extensibility?
What kinds of tests and automation have you put in place to support the ongoing stability of your deployment?
What are some of the dead-ends or other pitfalls that you encountered during the course of this project?
What aspects of Airflow have you found to be lacking that you would like to see improved?
What did you wish someone had told you before you started work on your Airflow installation?
If you were to start over would you make the same choice?
If Airflow wasn’t available what would be your second choice?
What are your next steps for improvements and fixes?
Contact Info
@eronarn on Twitter
Website
eronarn on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Quantopian
Harvard Brain Science Initiative
DevOps Days Boston
Google Maps API
Cron
ETL (Extract, Transform, Load)
Azkaban
Luigi
AWS Glue
Airflow
Pachyderm
Podcast Interview
AirBnB
Python
YAML
Ansible
REST (Representational State Transfer)
SAML (Security Assertion Markup Language)
RBAC (Role-Based Access Control)
Maxime Beauchemin
Medium Blog
Celery
Dask
Podcast Interview
PostgreSQL
Podcast Interview
Redis
Cloudformation
Jupyter Notebook
Qubole
Astronomer
Podcast Interview
Gunicorn
Kubernetes
Airflow Improvement Proposals
Python Enhancement Proposals (PEP)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Aug 6, 2018 • 56min
Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42
Jonathan Katz, an expert in PostgreSQL and its extensibility, gives a comprehensive overview of PostgreSQL. He discusses its history, highlighting its adaptability and longevity. Katz also talks about the significance of logical replication, leveraging Postgres features for application development, and upcoming projects and improvements in version 12. The conversation covers topics like security, authentication methods, access control, and the importance of education in data management.
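Logical replication, one of the features highlighted in the conversation, shipped in PostgreSQL 10. As a rough sketch of how it is wired up with the psycopg2 driver (the host names, credentials, and table are all illustrative):

    import psycopg2

    # On the publishing (source) server: publish changes to one table.
    pub = psycopg2.connect("host=source-db dbname=app user=admin")
    pub.autocommit = True  # DDL here is simplest outside an explicit transaction
    with pub.cursor() as cur:
        cur.execute("CREATE PUBLICATION app_pub FOR TABLE users;")

    # On the subscribing (replica) server: subscribe to that publication.
    sub = psycopg2.connect("host=replica-db dbname=app user=admin")
    sub.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block
    with sub.cursor() as cur:
        cur.execute(
            "CREATE SUBSCRIPTION app_sub "
            "CONNECTION 'host=source-db dbname=app user=replicator' "
            "PUBLICATION app_pub;"
        )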

Jul 30, 2018 • 29min
Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41
Summary
With the attention being paid to the systems that power large volumes of high-velocity data, it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand-new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy
Interview
Introduction
How did you get involved in the area of data management?
What is Ona and how did the company get started?
What are some examples of the types of customers that you work with?
What types of data do you support in your collection platform?
What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?
Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization?
What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?
Can you describe the flow of the data from collection through to analysis?
To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?
What are the architectural considerations that you factored in when designing it?
What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?
What are your plans for the future of Ona and Canopy?
Contact Info
Email
pld on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
OpenSRP
Ona
Canopy
Open Data Kit
Earth Institute at Columbia University
Sustainable Engineering Lab
WHO
Bill and Melinda Gates Foundation
XLSForms
PostGIS
Kafka
Druid
Superset
Postgres
Ansible
Docker
Terraform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jul 16, 2018 • 49min
Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40
Summary
When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements.
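Because Ceph exposes an S3-compatible object interface through the RADOS Gateway, any standard S3 client can talk to a cluster. A minimal sketch with boto3, assuming a gateway on the default port 7480 and made-up credentials:

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://ceph-rgw.example.com:7480",  # hypothetical RGW endpoint
        aws_access_key_id="CEPH_ACCESS_KEY",
        aws_secret_access_key="CEPH_SECRET_KEY",
    )

    # Buckets and objects behave just like they would against AWS S3.
    s3.create_bucket(Bucket="analytics-data")
    s3.put_object(Bucket="analytics-data", Key="events/2018-07-16.json", Body=b"{}")
    print(s3.list_objects_v2(Bucket="analytics-data")["KeyCount"])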
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand-new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Sage Weil about Ceph, an open source distributed file system that supports block storage, object storage, and a file system interface.
Interview
Introduction
How did you get involved in the area of data management?
Can you start with an overview of what Ceph is?
What was the motivation for starting the project?
What are some of the most common use cases for Ceph?
There are a large variety of distributed file systems. How would you characterize Ceph as it compares to other options (e.g. HDFS, GlusterFS, LionFS, SeaweedFS, etc.)?
Given that there is no single point of failure, what mechanisms do you use to mitigate the impact of network partitions?
What mechanisms are available to ensure data integrity across the cluster?
How is Ceph implemented and how has the design evolved over time?
What is required to deploy and manage a Ceph cluster?
What are the scaling factors for a cluster?
What are the limitations?
How does Ceph handle mixed write workloads with either a high volume of small files or a smaller volume of larger files?
In services such as S3 the data is segregated from block storage options like EBS or EFS. Since Ceph provides all of those interfaces in one project is it possible to use each of those interfaces to the same data objects in a Ceph cluster?
In what situations would you advise someone against using Ceph?
What are some of the most interesting, unexpected, or challenging aspects of working with Ceph and the community?
What are some of the plans that you have for the future of Ceph?
Contact Info
Email
@liewegas on Twitter
liewegas on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Ceph
Red Hat
DreamHost
UC Santa Cruz
Los Alamos National Labs
Dream Objects
OpenStack
Proxmox
POSIX
GlusterFS
Hadoop
Ceph Architecture
Paxos
relatime
Prometheus
Zabbix
Kubernetes
NVMe
DNS-SD
Consul
EtcD
DNS SRV Record
Zeroconf
Bluestore
XFS
Erasure Coding
NFS
Seastar
Rook
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jul 8, 2018 • 1h 4min
Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39
Summary
Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explain how it fits into the broader landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.
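Although flows are authored in the GUI, everything the GUI does is backed by NiFi's REST API, so installations can also be inspected and automated programmatically. A small sketch, assuming a NiFi 1.x instance on localhost:8080; the endpoint and field names follow the 1.x API documentation, so check the docs for your version:

    import requests

    # Ask a running NiFi instance for its system diagnostics.
    resp = requests.get("http://localhost:8080/nifi-api/system-diagnostics")
    resp.raise_for_status()
    snapshot = resp.json()["systemDiagnostics"]["aggregateSnapshot"]
    print("Heap in use:", snapshot["usedHeap"])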
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand-new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what NiFi is?
What is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code?
How did you get involved with the project?
Where does it sit in the broader landscape of data tools?
Does the data that is processed by NiFi flow through the servers that it is running on (à la Spark/Flink/Kafka), or does it orchestrate actions on other systems (à la Airflow/Oozie)?
How do you manage versioning and backup of data flows, as well as promoting them between environments?
One of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed?
What types of reporting are available across this information?
What are some of the use cases or requirements that lend themselves well to being solved by NiFi?
When is NiFi the wrong choice?
What is involved in deploying and scaling a NiFi installation?
What are some of the system/network parameters that should be considered?
What are the scaling limitations?
What have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community?
What do you have planned for the future of NiFi?
Contact Info
Kevin Doran
@kevdoran on Twitter
Email
Andy LoPresto
@yolopey on Twitter
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
NiFi
HortonWorks DataFlow
HortonWorks
Apache Software Foundation
Apple
CSV
XML
JSON
Perl
Python
Internet Scale
Asset Management
Documentum
DataFlow
NSA (National Security Agency)
24 (TV Show)
Technology Transfer Program
Agile Software Development
Waterfall
Spark
Flink
Kafka
Oozie
Luigi
Airflow
FluentD
ETL (Extract, Transform, and Load)
ESB (Enterprise Service Bus)
MiNiFi
Java
C++
Provenance
Kubernetes
Apache Atlas
Data Governance
Kibana
K-Nearest Neighbors
DevOps
DSL (Domain Specific Language)
NiFi Registry
Artifact Repository
Nexus
NiFi CLI
Maven Archetype
IoT
Docker
Backpressure
NiFi Wiki
TLS (Transport Layer Security)
Mozilla TLS Observatory
NiFi Flow Design System
Data Lineage
GDPR (General Data Protection Regulation)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jul 2, 2018 • 46min
Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38
Summary
Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor.
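One simple, widely used technique for validating human-provided labels, shown here as a generic illustration rather than as Alegion's own process, is to collect redundant labels per item and keep only those with sufficient agreement:

    from collections import Counter

    def consensus_label(labels, min_agreement=0.6):
        """Return the majority label if enough annotators agree, else None."""
        (label, votes), = Counter(labels).most_common(1)
        return label if votes / len(labels) >= min_agreement else None

    print(consensus_label(["cat", "cat", "dog"]))   # cat  (2/3 agree)
    print(consensus_label(["cat", "dog", "bird"]))  # None (no consensus)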
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand-new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Cheryl Martin, chief data scientist at Alegion, about data labelling at scale
Interview
Introduction
How did you get involved in the area of data management?
To start, can you explain the problem space that Alegion is targeting and how you operate?
When is it necessary to include human intelligence as part of the data lifecycle for ML/AI projects?
What are some of the biggest challenges associated with managing human input to data sets intended for machine usage?
For someone who is acting as human-intelligence provider as part of the workforce, what does their workflow look like?
What tools and processes do you have in place to ensure the accuracy of their inputs?
How do you prevent bad actors from contributing data that would compromise the trained model?
What are the limitations of crowd-sourced data labels?
When is it beneficial to incorporate domain experts in the process?
When doing data collection from various sources, how do you ensure that intellectual property rights are respected?
How do you determine the taxonomies to be used for structuring data sets that are collected, labeled or enriched for your customers?
What kinds of metadata do you track and how is that recorded/transmitted?
Do you think that human intelligence will be a necessary piece of ML/AI forever?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Alegion
University of Texas at Austin
Cognitive Science
Labeled Data
Mechanical Turk
Computer Vision
Sentiment Analysis
Speech Recognition
Taxonomy
Feature Engineering
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jun 25, 2018 • 42min
Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37
Summary
Collaboration, distribution, and installation of software projects are largely solved problems, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to make data as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.
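To make the workflow concrete, here is a sketch using the quilt 2.x Python API that was current when this episode aired (later major versions changed the interface substantially); the package is a public example from Quilt's documentation:

    import quilt

    quilt.install("uciml/iris")            # fetch a versioned data package
    from quilt.data.uciml import iris      # import the package like a Python module

    df = iris.tables.iris()                # materialize a node as a pandas DataFrame
    print(df.head())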
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand-new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data
Interview
Introduction
How did you get involved in the area of data management?
What is the intended use case for Quilt and how did the project get started?
Can you step through a typical workflow of someone using Quilt?
How does that change as you go from a single user to a team of data engineers and data scientists?
Can you describe the elements of what a data package consists of?
What was your criteria for the file formats that you chose?
How is Quilt architected and what have been the most significant changes or evolutions since you first started?
How is the data registry implemented?
What are the limitations or edge cases that you have run into?
What optimizations have you made to accelerate synchronization of the data to and from the repository?
What are the limitations in terms of data volume, format, or usage?
What is your goal with the business that you have built around the project?
What are your plans for the future of Quilt?
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Quilt Data
GitHub
Jobs
Reproducible Data Dependencies in Jupyter
Reproducible Machine Learning with Jupyter and Quilt
Allen Institute: Programmatic Data Access with Quilt
Quilt Example: MissingNo
Oracle
Pandas
Jupyter
Ycombinator
Data.World
Podcast Episode with CTO Bryon Jacob
Kaggle
Parquet
HDF5
Arrow
PySpark
Excel
Scala
Binder
Merkle Tree
Allen Institute for Cell Science
Flask
PostgreSQL
Docker
Airflow
Quilt Teams
Hive
Hive Metastore
PrestoDB
Podcast Episode
Netflix Iceberg
Kubernetes
Helm
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jun 17, 2018 • 45min
User Analytics In Depth At Heap with Dan Robinson - Episode 36
Summary
Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is realizing that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect the data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.
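The retroactive-definition idea can be pictured as a predicate evaluated over the full history of autocaptured events. This toy sketch is a conceptual illustration only, not Heap's actual implementation:

    # Every raw interaction is captured up front...
    raw_events = [
        {"type": "click", "selector": "#signup-btn", "ts": 1},
        {"type": "click", "selector": "#pricing-link", "ts": 2},
        {"type": "click", "selector": "#signup-btn", "ts": 3},
    ]

    # ...so an event definition created after the fact still matches
    # interactions that happened before the definition existed.
    signup_clicks = [e for e in raw_events if e["selector"] == "#signup-btn"]
    print(len(signup_clicks))  # 2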
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand-new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving a brief overview of Heap?
One of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data?
Can you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there?
Data collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. How do you ensure the integrity and accuracy of that information?
What are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage?
What is your approach for merging and enriching event data with the information that you retrieve from your supported integrations?
What challenges does that pose in your processing architecture?
What are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data?
How has that architecture changed or evolved over the life of the company?
What are some changes that you are anticipating in the near future?
Can you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails?
What are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap?
What changes have been necessary as a result of GDPR?
What are your plans for the future of Heap?
Contact Info
@danlovesproofs on Twitter
dan@drob.us
@drob on GitHub
heapanalytics.com / @heap on Twitter
https://heapanalytics.com/blog/category/engineering
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Heap
Palantir
User Analytics
Google Analytics
Piwik
Mixpanel
Hubspot
Jepsen
Chaos Engineering
Node.js
Kafka
Scala
Citus
React
MobX
Redshift
Heap SQL
BigQuery
Webhooks
Drip
Data Virtualization
DNS
PII
SOC2
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jun 11, 2018 • 44min
CockroachDB In Depth with Peter Mattis - Episode 35
Summary
With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud-era databases, the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in CockroachDB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.
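Since CockroachDB speaks the PostgreSQL wire protocol, existing Postgres drivers work unmodified. A minimal sketch with psycopg2 against a local, insecure single-node cluster; the connection details and database name are illustrative:

    import psycopg2

    conn = psycopg2.connect(
        host="localhost", port=26257, user="root",
        dbname="defaultdb", sslmode="disable",
    )
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance INT)"
        )
        cur.execute("UPSERT INTO accounts VALUES (1, 100)")  # a CockroachDB SQL extension
        cur.execute("SELECT balance FROM accounts WHERE id = 1")
        print(cur.fetchone()[0])  # 100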
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand-new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services
Interview
Introduction
How did you get involved in the area of data management?
What was the motivation for creating CockroachDB and building a business around it?
Can you describe the architecture of CockroachDB and how it supports distributed ACID transactions?
What are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions?
What are some of the problems that you have had to work around in the Raft protocol to provide reliable operation of the clustering mechanism?
Go is an unconventional language for building a database. What are the pros and cons of that choice?
What are some of the common points of confusion that users of CockroachDB have when operating or interacting with it?
What are the edge cases and failure modes that users should be aware of?
I know that your SQL syntax is PostgreSQL-compatible, so is it possible to use existing ORMs unmodified with CockroachDB?
What are some examples of extensions that are specific to CockroachDB?
What are some of the most interesting uses of CockroachDB that you have seen?
When is CockroachDB the wrong choice?
What do you have planned for the future of CockroachDB?
Contact Info
Peter
LinkedIn
petermattis on GitHub
@petermattis on Twitter
Cockroach Labs
@CockroachDB on Twitter
Website
cockroachdb on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
CockroachDB
Cockroach Labs
SQL
Google Bigtable
Spanner
NoSQL
RDBMS (Relational Database Management System)
“Big Iron” (colloquial term for mainframe computers)
RAFT Consensus Algorithm
Consensus
MVCC (Multiversion Concurrency Control)
Isolation
Etcd
GDPR
Golang
C++
Garbage Collection
Metaprogramming
Rust
Static Linking
Docker
Kubernetes
CAP Theorem
PostgreSQL
ORM (Object Relational Mapping)
Information Schema
PG Catalog
Interleaved Tables
Vertica
Spark
Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jun 4, 2018 • 40min
ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steemann and Jan Stücke - Episode 34
Summary
Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports document, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steemann and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.
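As a taste of the multi-model approach, the sketch below stores a document and queries it with AQL through a current python-arango driver; the host, credentials, and data are illustrative:

    from arango import ArangoClient

    client = ArangoClient(hosts="http://localhost:8529")
    db = client.db("_system", username="root", password="")

    # Store a plain document...
    if not db.has_collection("users"):
        db.create_collection("users")
    db.collection("users").insert({"_key": "alice", "active": True})

    # ...and reach the same data through AQL, the unified query language.
    cursor = db.aql.execute("FOR u IN users FILTER u.active == true RETURN u._key")
    print(list(cursor))  # ['alice']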
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand-new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steemann about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.
Interview
Introduction
How did you get involved in the area of data management?
Can you give a high level description of what ArangoDB is and the motivation for creating it?
What is the story behind the name?
How is ArangoDB constructed?
How does the underlying engine store the data to allow for the different ways of viewing it?
What are some of the benefits of multi-model data storage?
When does it become problematic?
For users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango?
How does it compare to OrientDB?
What are the options for scaling a running system?
What are the limitations in terms of network architecture or data volumes?
One of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. What benefits does that provide over a three-tier architecture?
What mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code?
What are some of the most interesting or surprising uses of this functionality that you have seen?
What are some of the most challenging technical and business aspects of building and promoting ArangoDB?
What do you have planned for the future of ArangoDB?
Contact Info
Jan Steemann
jsteemann on GitHub
@steemann on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
ArangoDB
Köln
Multi-model Database
Graph Algorithms
Apache 2
C++
ArangoDB Foxx
Raft Protocol
Target Partners
RocksDB
AQL (ArangoDB Query Language)
OrientDB
PostgreSQL
OrientDB Studio
Google Spanner
3-Tier Architecture
Thomson-Reuters
Arango Search
Dell EMC
Google S2 Index
ArangoDB Geographic Functionality
JSON Schema
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


