

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Sep 17, 2018 • 48min
Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48
Summary
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.
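For a sense of what instrumenting a site with Snowplow looks like, here is a minimal sketch using the open source snowplow-tracker Python library; the collector endpoint and application identifier are placeholders, and the exact tracker API may vary between versions:

```python
# Minimal sketch of sending events to a Snowplow collector from Python.
# The collector hostname and app_id below are placeholders for your own pipeline.
from snowplow_tracker import Emitter, Tracker

emitter = Emitter("collector.example.com")  # your Snowplow event collector
tracker = Tracker(emitters=emitter, namespace="web", app_id="my-web-app")

# Track a page view and a custom structured event
tracker.track_page_view(page_url="https://example.com/pricing", page_title="Pricing")
tracker.track_struct_event(
    category="checkout",
    action="add-to-cart",
    label="sku-1234",
    value=49.99,
)
```

Each event is validated against a JSON-Schema (resolved through Iglu, both linked below) before it lands in your warehouse, which is a large part of how the pipeline maintains data quality.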
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics
Interview
Introductions
How did you get involved in the area of data engineering and data management?
What is Snowplow Analytics and what problem were you trying to solve when you started the company?
What is unique about customer event data from an ingestion and processing perspective?
Challenges with properly matching up data between sources
Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?
Cleanliness/accuracy
What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?
Can you describe the overall architecture of the ingest pipeline that Snowplow provides?
How has that architecture evolved from when you first started?
What would you do differently if you were to start over today?
Ensuring appropriate use of enrichment sources
What have been some of the biggest challenges encountered while building and evolving Snowplow?
What are some of the most interesting uses of your platform that you are aware of?
Keep In Touch
Alex
@alexcrdean on Twitter
LinkedIn
Snowplow
@snowplowdata on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Snowplow
GitHub
Deloitte Consulting
OpenX
Hadoop
AWS
EMR (Elastic Map-Reduce)
Business Intelligence
Data Warehousing
Google Analytics
CRM (Customer Relationship Management)
S3
GDPR (General Data Protection Regulation)
Kinesis
Kafka
Google Cloud Pub-Sub
JSON-Schema
Iglu
IAB Bots And Spiders List
Heap Analytics
Podcast Interview
Redshift
SnowflakeDB
Snowplow Insights
Google Cloud Platform
Azure
GitLab
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sep 10, 2018 • 48min
Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47
Summary
Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time-oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data in S3 and still query it, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for your serverless data analysis.
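Because the platform implements the Elasticsearch API over the data indexed in S3, the standard Elasticsearch clients should in principle work against it. Here is a hypothetical sketch with the elasticsearch Python client; the endpoint URL is a placeholder, and the actual connection details for a Chaos Search account may differ:

```python
# Hypothetical sketch: querying an Elasticsearch-compatible endpoint that serves
# data indexed in S3. The host below is a placeholder, not a real Chaos Search URL.
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://your-chaossearch-endpoint.example.com"])

# Search a year of application logs without running an Elasticsearch cluster
result = es.search(
    index="app-logs-2018-*",
    body={
        "query": {"match": {"level": "ERROR"}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 10,
    },
)
for hit in result["hits"]["hits"]:
    print(hit["_source"].get("message"))
```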
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?
What types of data are you focused on supporting?
What are the challenges inherent to scaling an Elasticsearch infrastructure to large volumes of log or metric data?
Is there any need for an Elasticsearch cluster in addition to Chaos Search?
For someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3?
What are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL?
Given that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS?
What mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster?
What is the system architecture that you have built to allow for querying terabytes of data in S3?
What are the biggest contributors to query latency and what have you done to mitigate them?
What are the options for access control when running queries against the data stored in S3?
What are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen?
What are your plans for the future of Chaos Search?
Contact Info
Pete Cheslock
@petecheslock on Twitter
Website
Thomas Hazel
@thomashazel on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Chaos Search
AWS S3
Cassandra
Elasticsearch
Podcast Interview
PostgreSQL
Distributed Systems
Information Theory
Lucene
Inverted Index
Kibana
Logstash
NVMe
AWS KMS
Kinesis
FluentD
Parquet
Athena
Presto
Drill
Backblaze
OpenStack Swift
Minio
EMR
DataDog
NewRelic
Elastic Beats
Metricbeat
Graphite
Snappy
Scala
Akka
Elastalert
Tensorflow
X-Pack
Data Lake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sep 3, 2018 • 47min
An Agile Approach To Master Data Management with Mark Marinelli - Episode 46
Summary
With the proliferation of data sources that give a more comprehensive view of the information critical to your business, it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
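To make the matching problem concrete, the sketch below shows a toy version of the record matching at the heart of master data management, using only the Python standard library. It is purely illustrative; platforms like Tamr combine many signals with machine learning and human curation rather than a single string similarity score.

```python
# Toy illustration of entity matching for master data management: decide whether
# two customer records from different systems describe the same real-world entity.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

erp_record = {"id": 342, "name": "Robert Smith", "email": "bob.smith@example.com"}
twitter_record = {"handle": "@bobsmith", "name": "Bob Smith", "email": "bob.smith@example.com"}

name_score = similarity(erp_record["name"], twitter_record["name"])
same_email = erp_record["email"] == twitter_record["email"]

# Naive decision rule: an exact email match, or a very similar name
if same_email or name_score > 0.85:
    print(f"Likely the same entity (name similarity {name_score:.2f})")
```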
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms
Interview
Introduction
How did you get involved in the area of data management?
Can you start by establishing a definition of data mastering that we can work from?
How does the master data set get used within the overall analytical and processing systems of an organization?
What is the traditional workflow for creating a master data set?
What has changed in the current landscape of businesses and technology platforms that makes that approach impractical?
What are the steps that an organization can take to evolve toward an agile approach to data mastering?
At what scale of company or project does it make sense to start building a master data set?
What are the limitations of using ML/AI to merge data sets?
What are the limitations of a golden master data set in practice?
Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them?
Are there specific problem domains that are more likely to benefit from a master data set?
Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data)
What storage mechanisms are typically used for managing a master data set?
Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that go beyond the rest of their data infrastructure?
How do you manage latency issues when trying to reference the same entities from multiple disparate systems?
What have you found to be the most common stumbling blocks for a group that is implementing a master data platform?
What suggestions do you have to help prevent such a project from being derailed?
What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering for their organization?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Tamr
Multi-Dimensional Database
Master Data Management
ETL
EDW (Enterprise Data Warehouse)
Waterfall Development Method
Agile Development Method
DataOps
Feature Engineering
Tableau
Qlik
Data Catalog
PowerBI
RDBMS (Relational Database Management System)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Aug 27, 2018 • 25min
Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45
Summary
There are myriad reasons why data should be protected, and just as many ways to secure it in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.
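To illustrate the underlying idea of computing on data without decrypting it, here is a small demonstration of an additively homomorphic scheme using the open source python-paillier library (pip install phe). This shows the general concept only; it has no connection to Enveil's proprietary implementation.

```python
# Demonstrates the homomorphic property: arithmetic on ciphertexts produces the
# encryption of the same arithmetic on the plaintexts. Uses the open source
# python-paillier library; this illustrates the concept, not Enveil's product.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# An untrusted server can sum the encrypted values without ever seeing them
encrypted_sales = [public_key.encrypt(x) for x in [120, 75, 300]]
encrypted_total = encrypted_sales[0] + encrypted_sales[1] + encrypted_sales[2]

# Only the holder of the private key can recover the result
assert private_key.decrypt(encrypted_total) == 495
```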
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting Data in Use
Interview
Introduction
How did you get involved in the area of data security?
Can you start by explaining what your mission is with Enveil and how the company got started?
One of the core aspects of your platform is the principal of homomorphic encryption. Can you explain what that is and how you are using it?
What are some of the challenges associated with scaling homomorphic encryption?
What are some difficulties associated with working on encrypted data sets?
Can you describe the underlying architecture for your data platform?
How has that architecture evolved from when you first began building it?
What are some use cases that are unlocked by having a fully encrypted data platform?
For someone using the Enveil platform, what does their workflow look like?
A major reason for never decrypting data is to protect it from attackers and unauthorized access. What are some of the remaining attack vectors?
What are some aspects of the data being protected that still require additional consideration to prevent leaking information? (e.g. identifying individuals based on geographic data, or purchase patterns)
What do you have planned for the future of Enveil?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data security today?
Links
Enveil
NSA
GDPR
Intellectual Property
Zero Trust
Homomorphic Encryption
Ciphertext
Hadoop
PII (Personally Identifiable Information)
TLS (Transport Layer Security)
Spark
Elasticsearch
Side-channel attacks
Spectre and Meltdown
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Aug 20, 2018 • 43min
Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44
Manish Jain, Creator of DGraph, discusses the benefits of storing and querying data as a graph, how DGraph overcomes limitations, building a distributed, consistent database, and the use case of integrating 51 data silos into a single database cluster.

Aug 13, 2018 • 48min
Putting Airflow Into Production With James Meickle - Episode 43
Summary
The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.
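For readers who have not used the tool, a minimal Airflow DAG from this era looks roughly like the sketch below; the module paths reflect Airflow 1.x, which was current at the time of the episode, and the task bodies are placeholders.

```python
# A minimal Airflow 1.x DAG: two Python tasks with a linear dependency.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="example_pipeline",
    default_args={"owner": "data-eng", "retries": 1,
                  "retry_delay": timedelta(minutes=5)},
    start_date=datetime(2018, 8, 1),
    schedule_interval="@daily",
)

def extract(**context):
    print("extracting data for", context["ds"])  # ds = the execution date

def load(**context):
    print("loading data for", context["ds"])

extract_task = PythonOperator(task_id="extract", python_callable=extract,
                              provide_context=True, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load,
                           provide_context=True, dag=dag)

extract_task >> load_task  # extract must finish before load starts
```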
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation
Interview
Introduction
How did you get involved in the area of data management?
What was your initial project requirement?
What tooling did you consider in addition to Airflow?
What aspects of the Airflow platform led you to choose it as your implementation target?
Can you describe your current deployment architecture?
How many engineers are involved in writing tasks for your Airflow installation?
What resources were the most helpful while learning about Airflow design patterns?
How have you architected your DAGs for deployment and extensibility?
What kinds of tests and automation have you put in place to support the ongoing stability of your deployment?
What are some of the dead-ends or other pitfalls that you encountered during the course of this project?
What aspects of Airflow have you found to be lacking that you would like to see improved?
What did you wish someone had told you before you started work on your Airflow installation?
If you were to start over would you make the same choice?
If Airflow wasn’t available what would be your second choice?
What are your next steps for improvements and fixes?
Contact Info
@eronarn on Twitter
Website
eronarn on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Quantopian
Harvard Brain Science Initiative
DevOps Days Boston
Google Maps API
Cron
ETL (Extract, Transform, Load)
Azkaban
Luigi
AWS Glue
Airflow
Pachyderm
Podcast Interview
AirBnB
Python
YAML
Ansible
REST (Representational State Transfer)
SAML (Security Assertion Markup Language)
RBAC (Role-Based Access Control)
Maxime Beauchemin
Medium Blog
Celery
Dask
Podcast Interview
PostgreSQL
Podcast Interview
Redis
Cloudformation
Jupyter Notebook
Qubole
Astronomer
Podcast Interview
Gunicorn
Kubernetes
Airflow Improvement Proposals
Python Enhancement Proposals (PEP)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Aug 6, 2018 • 56min
Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42
Jonathan Katz, expert in PostgreSQL and its extensibility, gives a comprehensive overview of PostgreSQL. He discusses its history, highlighting its adaptability and longevity. Katz also talks about the significance of logical replication, leveraging Postgres features for application development, and upcoming projects and improvements in version 12. The conversation covers topics like security, authentication methods, access control, and the importance of education in data management.

Jul 30, 2018 • 29min
Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41
Summary
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy
Interview
Introduction
How did you get involved in the area of data management?
What is Ona and how did the company get started?
What are some examples of the types of customers that you work with?
What types of data do you support in your collection platform?
What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?
Does your mobile collection platform allow anyone to submit data without having to be associated with a given account or organization?
What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?
Can you describe the flow of the data from collection through to analysis?
To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?
What are the architectural considerations that you factored in when designing it?
What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?
What are your plans for the future of Ona and Canopy?
Contact Info
Email
pld on Github
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
OpenSRP
Ona
Canopy
Open Data Kit
Earth Institute at Columbia University
Sustainable Engineering Lab
WHO
Bill and Melinda Gates Foundation
XLSForms
PostGIS
Kafka
Druid
Superset
Postgres
Ansible
Docker
Terraform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jul 16, 2018 • 49min
Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40
Summary
When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements.
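Because the RADOS Gateway component of Ceph exposes an S3-compatible API, the object storage interface can be exercised with ordinary S3 tooling. Here is a sketch using boto3, where the endpoint and credentials are placeholders for your own gateway (port 7480 is the gateway's common default):

```python
# Sketch: talking to Ceph's S3-compatible RADOS Gateway with boto3 by pointing
# the client at the gateway endpoint. Endpoint and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.example.com:7480",  # your RADOS Gateway
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3.create_bucket(Bucket="analytics-data")
s3.put_object(Bucket="analytics-data",
              Key="events/2018-07-16.json",
              Body=b'{"event": "page_view"}')

for obj in s3.list_objects_v2(Bucket="analytics-data").get("Contents", []):
    print(obj["Key"], obj["Size"])
```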
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Sage Weil about Ceph, an open source distributed file system that supports block storage, object storage, and a file system interface.
Interview
Introduction
How did you get involved in the area of data management?
Can you start with an overview of what Ceph is?
What was the motivation for starting the project?
What are some of the most common use cases for Ceph?
There are a large variety of distributed file systems. How would you characterize Ceph as it compares to other options (e.g. HDFS, GlusterFS, LionFS, SeaweedFS, etc.)?
Given that there is no single point of failure, what mechanisms do you use to mitigate the impact of network partitions?
What mechanisms are available to ensure data integrity across the cluster?
How is Ceph implemented and how has the design evolved over time?
What is required to deploy and manage a Ceph cluster?
What are the scaling factors for a cluster?
What are the limitations?
How does Ceph handle mixed write workloads with either a high volume of small files or a smaller volume of larger files?
In services such as S3 the data is segregated from block storage options like EBS or EFS. Since Ceph provides all of those interfaces in one project is it possible to use each of those interfaces to the same data objects in a Ceph cluster?
In what situations would you advise someone against using Ceph?
What are some of the most interesting, unexpected, or challenging aspects of working with Ceph and the community?
What are some of the plans that you have for the future of Ceph?
Contact Info
Email
@liewegas on Twitter
liewegas on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Ceph
Red Hat
DreamHost
UC Santa Cruz
Los Alamos National Labs
Dream Objects
OpenStack
Proxmox
POSIX
GlusterFS
Hadoop
Ceph Architecture
Paxos
relatime
Prometheus
Zabbix
Kubernetes
NVMe
DNS-SD
Consul
EtcD
DNS SRV Record
Zeroconf
Bluestore
XFS
Erasure Coding
NFS
Seastar
Rook
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jul 8, 2018 • 1h 4min
Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39
Summary
Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explain how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what NiFi is?
What is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code?
How did you get involved with the project?
Where does it sit in the broader landscape of data tools?
Does the data that is processed by NiFi flow through the servers that it is running on (à la Spark/Flink/Kafka), or does it orchestrate actions on other systems (à la Airflow/Oozie)?
How do you manage versioning and backup of data flows, as well as promoting them between environments?
One of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed?
What types of reporting are available across this information?
What are some of the use cases or requirements that lend themselves well to being solved by NiFi?
When is NiFi the wrong choice?
What is involved in deploying and scaling a NiFi installation?
What are some of the system/network parameters that should be considered?
What are the scaling limitations?
What have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community?
What do you have planned for the future of NiFi?
Contact Info
Kevin Doran
@kevdoran on Twitter
Email
Andy LoPresto
@yolopey on Twitter
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
NiFi
HortonWorks DataFlow
HortonWorks
Apache Software Foundation
Apple
CSV
XML
JSON
Perl
Python
Internet Scale
Asset Management
Documentum
DataFlow
NSA (National Security Agency)
24 (TV Show)
Technology Transfer Program
Agile Software Development
Waterfall
Spark
Flink
Kafka
Oozie
Luigi
Airflow
FluentD
ETL (Extract, Transform, and Load)
ESB (Enterprise Service Bus)
MiNiFi
Java
C++
Provenance
Kubernetes
Apache Atlas
Data Governance
Kibana
K-Nearest Neighbors
DevOps
DSL (Domain Specific Language)
NiFi Registry
Artifact Repository
Nexus
NiFi CLI
Maven Archetype
IoT
Docker
Backpressure
NiFi Wiki
TLS (Transport Layer Security)
Mozilla TLS Observatory
NiFi Flow Design System
Data Lineage
GDPR (General Data Protection Regulation)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA