

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

May 28, 2018 • 48min
The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33
Summary
Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.
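The Code Engine mentioned later in the interview lets customers run their own Python against events in flight. As a rough, hypothetical sketch of what such a transform can look like (the transform(event) signature and field names here are illustrative assumptions, not Alooma’s documented contract):

    def transform(event):
        # Drop internal test traffic before it reaches the warehouse.
        if event.get('user_email', '').endswith('@example.com'):
            return None  # returning None filters the event out
        # Rename a field so the target schema stays consistent.
        if 'created' in event:
            event['created_at'] = event.pop('created')
        return event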
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service
Interview
Introduction
How did you get involved in the area of data management?
What is Alooma and what is the origin story?
How is the Alooma platform architected?
I want to go into stream vs. batch processing here
What are the most challenging components to scale?
How do you manage the underlying infrastructure to support your SLA of 5 nines?
What are some of the complexities introduced by processing data from multiple customers with various compliance requirements?
How do you sandbox users’ processing code to avoid security exploits?
What are some of the potential pitfalls for automatic schema management in the target database?
Given the large number of integrations, how do you maintain the
What are some of the challenges when creating integrations? Isn’t it simply a matter of conforming to an external API?
For someone getting started with Alooma what does the workflow look like?
What are some of the most challenging aspects of building and maintaining Alooma?
What are your plans for the future of Alooma?
Contact Info
LinkedIn
@yairwein on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Alooma
Convert Media
Data Integration
ESB (Enterprise Service Bus)
Tibco
Mulesoft
ETL (Extract, Transform, Load)
Informatica
Microsoft SSIS
OLAP Cube
S3
Azure Cloud Storage
Snowflake DB
Redshift
BigQuery
Salesforce
Hubspot
Zendesk
Spark
The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps
RDBMS (Relational Database Management System)
SaaS (Software as a Service)
Change Data Capture
Kafka
Storm
Google Cloud PubSub
Amazon Kinesis
Alooma Code Engine
Zookeeper
Idempotence
Kafka Streams
Kubernetes
SOC2
Jython
Docker
Python
Javascript
Ruby
Scala
PII (Personally Identifiable Information)
GDPR (General Data Protection Regulation)
Amazon EMR (Elastic Map Reduce)
Sequoia Capital
Lightspeed Investors
Redis
Aerospike
Cassandra
MongoDB
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

May 21, 2018 • 42min
PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32
Summary
Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.
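Presto exposes each backing system as a catalog, so a single SQL statement can join tables that live in different stores. A minimal sketch using the presto-python-client package (the host, catalogs, and table names below are placeholders, not anything from the episode):

    import prestodb

    conn = prestodb.dbapi.connect(
        host='presto.example.com', port=8080,
        user='analyst', catalog='hive', schema='default',
    )
    cur = conn.cursor()
    # Join a Hive table against a PostgreSQL table in one statement;
    # Presto federates the two sources without copying data first.
    cur.execute("""
        SELECT o.order_id, c.name
        FROM hive.sales.orders AS o
        JOIN postgresql.public.customers AS c
          ON o.customer_id = c.id
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)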
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Presto is?
What are some of the common use cases and deployment patterns for Presto?
How does Presto compare to Drill or Impala?
What is it about Presto that led you to building a business around it?
What are some of the most challenging aspects of running and scaling Presto?
For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?
How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?
What are some cases in which Presto is not the right solution?
What types of support have you found to be the most commonly requested?
What are some of the types of tooling or improvements that you have made to Presto in your distribution?
What are some of the notable changes that your team has contributed upstream to Presto?
Contact Info
Website
E-mail
Twitter – @starburstdata
Twitter – @prestodb
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Starburst Data
Presto
Hadapt
Hadoop
Hive
Teradata
PrestoCare
Cost Based Optimizer
ANSI SQL
Spill To Disk
Tempto
Benchto
Geospatial Functions
Cassandra
Accumulo
Kafka
Redis
PostgreSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

May 14, 2018 • 26min
Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31
Summary
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He describes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to allow for fast and scalable operation.
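To make the graph-versus-relational contrast concrete: a question like "who is within two hops of this user" is one traversal in a graph model, but a self-join per hop in SQL. A database-agnostic Python sketch with invented data:

    from collections import deque

    follows = {  # invented adjacency data for illustration
        'alice': ['bob', 'carol'],
        'bob': ['dave'],
        'carol': ['dave', 'erin'],
        'dave': [], 'erin': [],
    }

    def within_hops(start, max_hops):
        seen, frontier = {start}, deque([(start, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for neighbor in follows.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
        return seen - {start}

    print(within_hops('alice', 2))  # {'bob', 'carol', 'dave', 'erin'}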
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and last week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. In this second part you will hear from Andy Eschbacher of Carto about the challenges of managing geospatial data, as well as Todd Blaschka of TigerGraph about graph databases and how his company has managed to build a fast and scalable platform for graph storage and traversal.
Interview
Andy Eschbacher From Carto
What are the challenges associated with storing geospatial data?
What are some of the common misconceptions that people have about working with geospatial data?
Contact Info
andy-esch on GitHub
@MrEPhysics on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Carto
Geospatial Analysis
GeoJSON
Todd Blaschka From TigerGraph
What are graph databases and how do they differ from relational engines?
What are some of the common difficulties that people have when dealing with graph algorithms?
How does data modeling for graph databases differ from relational stores?
Contact Info
LinkedIn
@toddblaschka on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
TigerGraph
Graph Databases
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

May 7, 2018 • 33min
Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30
Summary
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart, about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io, about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
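As an illustration of the kind of metric tracking Stepan describes, one common approach is to compare the distribution of live model outputs against a training-time baseline and flag drift. The sketch below uses the population stability index with a conventional threshold; both are generic choices for illustration, not Hydrosphere.io’s specific method:

    import numpy as np

    def population_stability_index(expected, actual, bins=10):
        cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
        cuts[0], cuts[-1] = -np.inf, np.inf
        e = np.histogram(expected, cuts)[0] / len(expected)
        a = np.histogram(actual, cuts)[0] / len(actual)
        e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
        return float(np.sum((a - e) * np.log(a / e)))

    baseline = np.random.normal(0.0, 1.0, 10_000)  # training-time scores
    live = np.random.normal(0.4, 1.2, 10_000)      # production scores
    psi = population_stability_index(baseline, live)
    if psi > 0.2:  # a commonly cited "significant drift" threshold
        print(f"PSI={psi:.3f}: sample recent inputs and consider retraining")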
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart, about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io, about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
Interview
Alan Anders from Applecart
What are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities?
What are the biggest technical hurdles at Applecart?
Contact Info
@alanjanders on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Spark
DataBricks
DataBricks Delta
Applecart
Stepan Pushkarev from Hydrosphere.io
What is Hydrosphere.io?
What metrics do you track to determine when a machine learning model is not producing an appropriate output?
How do you determine which data points to sample for retraining the model?
How does the role of a machine learning engineer differ from data engineers and data scientists?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Hydrosphere
Machine Learning Engineer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Apr 30, 2018 • 45min
Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29
Summary
Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organization’s data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Sameer Al-Sakran about Metabase, a free and open source tool for self service business intelligence
Interview
Introduction
How did you get involved in the area of data management?
The current goal for most companies is to be “data driven”. How would you define that concept?
How does Metabase assist in that endeavor?
What is the ratio of users that take advantage of the GUI query builder as opposed to writing raw SQL?
What level of complexity is possible with the query builder?
What have you found to be the typical use cases for Metabase in the context of an organization?
How do you manage scaling for large or complex queries?
What was the motivation for using Clojure as the language for implementing Metabase?
What is involved in adding support for a new data source?
What are the differentiating features of Metabase that would lead someone to choose it for their organization?
What have been the most challenging aspects of building and growing Metabase, both from a technical and business perspective?
What do you have planned for the future of Metabase?
Contact Info
Sameer
salsakran on GitHub
@sameer_alsakran on Twitter
LinkedIn
Metabase
Website
@metabase on Twitter
metabase on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Expa
Metabase
Blackjet
Hadoop
Imeem
Maslow’s Hierarchy of Data Needs
2 Sided Marketplace
Honeycomb Interview
Excel
Tableau
Go-JEK
Clojure
React
Python
Scala
JVM
Redash
How To Lie With Data
Stripe
Braintree Payments
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Apr 23, 2018 • 40min
Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28
Summary
The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Amnon Drori about OctopAI and the benefits of metadata management
Interview
Introduction
How did you get involved in the area of data management?
What is OctopAI and what was your motivation for founding it?
What are some of the types of information that you classify and collect as metadata?
Can you talk through the architecture of your platform?
What are some of the challenges that are typically faced by metadata management systems?
What is involved in deploying your metadata collection agents?
Once the metadata has been collected what are some of the ways in which it can be used?
What mechanisms do you use to ensure that customer data is segregated?
How do you identify and handle sensitive information during the collection step?
What are some of the most challenging aspects of your technical and business platforms that you have faced?
What are some of the plans that you have for OctopAI going forward?
Contact Info
Amnon
LinkedIn
@octopai_amnon on Twitter
OctopAI
@OctopaiBI on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
OctopAI
Metadata
Metadata Management
Data Integrity
CRM (Customer Relationship Management)
ERP (Enterprise Resource Planning)
Business Intelligence
ETL (Extract, Transform, Load)
Informatica
SAP
Data Governance
SSIS (SQL Server Integration Services)
Vertica
Airflow
Luigi
Oozie
GDPR (General Data Privacy Regulation)
Root Cause Analysis
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Apr 15, 2018 • 44min
Data Engineering Weekly with Joe Crobak - Episode 27
Summary
The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed systems and big data management for his work he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsletter, and the insights that he has gleaned from it.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter, and the challenges of keeping up with the data engineering industry.
Interview
Introduction
How did you get involved in the area of data management?
What are some of the projects that you have been involved in that were most personally fulfilling?
As an engineer at the USDS working on the healthcare.gov and medicare systems, what were some of the approaches that you used to manage sensitive data?
Healthcare.gov has a storied history. How did the systems for processing and managing the data get architected to handle the amount of load that it was subjected to?
What was your motivation for starting a newsletter about the Hadoop space?
Can you speak to your reasoning for the recent rebranding of the newsletter?
How much of the content that you surface in your newsletter is found during your day-to-day work, versus explicitly searching for it?
After over 5 years of following the trends in data analytics and data infrastructure what are some of the most interesting or surprising developments?
What have you found to be the fundamental skills or areas of experience that have maintained relevance as new technologies in data engineering have emerged?
What is your workflow for finding and curating the content that goes into your newsletter?
What is your personal algorithm for filtering which articles, tools, or commentary gets added to the final newsletter?
How has your experience managing the newsletter influenced your areas of focus in your work and vice-versa?
What are your plans going forward?
Contact Info
Data Eng Weekly
Email
Twitter – @joecrobak
Twitter – @dataengweekly
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
USDS
National Labs
Cray
Amazon EMR (Elastic Map-Reduce)
Recommendation Engine
Netflix Prize
Hadoop
Cloudera
Puppet
healthcare.gov
Medicare
Quality Payment Program
HIPAA
NIST (National Institute of Standards and Technology)
PII (Personally Identifiable Information)
Threat Modeling
Apache JBoss
Apache Web Server
MarkLogic
JMS (Java Message Service)
Load Balancer
COBOL
Hadoop Weekly
Data Engineering Weekly
Foursquare
NiFi
Kubernetes
Spark
Flink
Stream Processing
DataStax
RSS
The Flavors of Data Science and Engineering
CQRS
Change Data Capture
Jay Kreps
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Apr 8, 2018 • 55min
Defining DataOps with Chris Bergh - Episode 26
Summary
Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing and agile software development, and the cross-functional collaboration, feedback loops, and focus on automation in the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to your workflow to deliver results with confidence and consistency.
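One concrete practice that comes up in the conversation is testing data as it flows through a system, the same way code is tested in CI. A minimal sketch of such checks (the table shape and thresholds are invented for illustration):

    def check_orders(rows):
        failures = []
        if not rows:
            failures.append('orders extract is empty')
        null_ids = sum(1 for r in rows if r.get('order_id') is None)
        if null_ids:
            failures.append(f'{null_ids} rows missing order_id')
        negatives = sum(1 for r in rows if r.get('total', 0) < 0)
        if negatives:
            failures.append(f'{negatives} rows with negative totals')
        return failures

    rows = [{'order_id': 1, 'total': 9.99}, {'order_id': None, 'total': -1}]
    for failure in check_orders(rows):
        print('DATA TEST FAILED:', failure)  # fail the pipeline run here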
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Christopher Bergh about DataKitchen and the rise of DataOps
Interview
Introduction
How did you get involved in the area of data management?
How do you define DataOps?
How does it compare to the practices encouraged by the DevOps movement?
How does it relate to or influence the role of a data engineer?
How does a DataOps oriented workflow differ from other existing approaches for building data platforms?
One of the aspects of DataOps that you call out is the practice of providing multiple environments to provide a platform for testing the various aspects of the analytics workflow in a non-production context. What are some of the techniques that are available for managing data in appropriate volumes across those deployments?
The practice of testing logic as code is fairly well understood and has a large set of existing tools. What have you found to be some of the most effective methods for testing data as it flows through a system?
One of the practices of DevOps is to create feedback loops that can be used to ensure that business needs are being met. What are the metrics that you track in your platform to define the value that is being created and how the various steps in the workflow are proceeding toward that goal?
In order to keep feedback loops fast it is necessary for tests to run quickly. How do you balance the need for larger quantities of data to be used for verifying scalability/performance against optimizing for cost and speed in non-production environments?
How does the DataKitchen platform simplify the process of operationalizing a data analytics workflow?
As the need for rapid iteration and deployment of systems to capture, store, process, and analyze data becomes more prevalent how do you foresee that feeding back into the ways that the landscape of data tools are designed and developed?
Contact Info
LinkedIn
@ChrisBergh on Twitter
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
DataOps Manifesto
DataKitchen
2017: The Year Of DataOps
Air Traffic Control
Chief Data Officer (CDO)
Gartner
W. Edwards Deming
DevOps
Total Quality Management (TQM)
Informatica
Talend
Agile Development
Cattle Not Pets
IDE (Integrated Development Environment)
Tableau
Delphix
Dremio
Pachyderm
Continuous Delivery by Jez Humble and Dave Farley
SLAs (Service Level Agreements)
XKCD Image Recognition Comic
Airflow
Luigi
DataKitchen Documentation
Continuous Integration
Continuous Delivery
Docker
Version Control
Git
Looker
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Apr 1, 2018 • 52min
ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25
Summary
Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate and monitors for unexpected anomalies in behavior that would indicate a breach and notifies you in near-realtime. In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.
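A toy sketch of the flavor of behavioral monitoring described here: flag a host whose event rate deviates sharply from its own baseline. The z-score rule and threshold are purely illustrative, not ThreatStack’s actual detection logic:

    import statistics

    def is_anomalous(baseline_counts, current_count, threshold=3.0):
        mean = statistics.mean(baseline_counts)
        stdev = statistics.stdev(baseline_counts) or 1.0  # avoid divide-by-zero
        return abs(current_count - mean) / stdev > threshold

    # Per-minute counts of outbound connections for one host.
    baseline = [12, 15, 11, 14, 13, 12, 16, 14]
    print(is_anomalous(baseline, 90))  # True: worth an alert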
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Pat Cable about the data infrastructure and security controls at ThreatStack
Interview
Introduction
How did you get involved in the area of data management?
Why don’t you start by explaining what ThreatStack does?
What was lacking in the existing options (services and self-hosted/open source) that ThreatStack solves for?
Can you describe the type(s) of data that you collect and how it is structured?
What is the high level data infrastructure that you use for ingesting, storing, and analyzing your customer data?
How do you ensure a consistent format of the information that you receive?
How do you ensure that the various pieces of your platform are deployed using the proper configurations and operating as intended?
How much configuration do you provide to the end user in terms of the captured data, such as sampling rate or additional context?
I understand that your original architecture used RabbitMQ as your ingest mechanism, which you then migrated to Kafka. What was your initial motivation for that change?
How much of a benefit has that been in terms of overall complexity and cost (both time and infrastructure)?
How do you ensure the security and provenance of the data that you collect as it traverses your infrastructure?
What are some of the most common vulnerabilities that you detect in your client’s infrastructure?
For someone who wants to start using ThreatStack, what does the setup process look like?
What have you found to be the most challenging aspects of building and managing the data processes in your environment?
What are some of the projects that you have planned to improve the capacity or capabilities of your infrastructure?
Contact Info
Pete Cheslock
@petecheslock on Twitter
Website
petecheslock on GitHub
Patrick Cable
@patcable on Twitter
Website
patcable on GitHub
ThreatStack
Website
@threatstack on Twitter
threatstack on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
ThreatStack
SecDevOps
Sonian
EC2
Snort
Snorby
Suricata
Tripwire
Syscall (System Call)
AuditD
CloudTrail
Naxsi
Cloud Native
File Integrity Monitoring (FIM)
Amazon Web Services (AWS)
RabbitMQ
ZeroMQ
Kafka
Spark
Slack
PagerDuty
JSON
Microservices
Cassandra
ElasticSearch
Sensu
Service Discovery
Honeypot
Kubernetes
PostgreSQL
Druid
Flink
Launch Darkly
Chef
Consul
Terraform
CloudFormation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Mar 25, 2018 • 33min
MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24
Summary
The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable, the team at Alpaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alpaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operates, and how it has helped to simplify their development workflows.
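MarketStore ships with a Python client, pymarketstore, that queries buckets keyed by symbol, timeframe, and attribute group. A minimal sketch (the endpoint, symbol, and limit are placeholders; check the project README for the current API):

    import pymarketstore as pymkts

    client = pymkts.Client(endpoint='http://localhost:5993/rpc')
    # Buckets are addressed as symbol/timeframe/attribute-group, matching
    # the time-oriented, multidimensional layout discussed in the episode.
    params = pymkts.Params('AAPL', '1Min', 'OHLCV', limit=10)
    reply = client.query(params)
    print(reply.first().df())  # last ten 1-minute OHLCV bars as a DataFrame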
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Christopher Ryan and Hitoshi Harada about MarketStore, a storage server for large volumes of financial timeseries data
Interview
Introduction
How did you get involved in the area of data management?
What was your motivation for creating MarketStore?
What are the characteristics of financial time series data that make it challenging to manage?
What are some of the workflows that MarketStore is used for at Alpaca and how were they managed before it was available?
With MarketStore’s data coming from multiple third party services, how are you managing to keep the DB up-to-date and in sync with those services?
What is the worst case scenario if there is a total failure in the data store?
What guards have you built to prevent such a situation from occurring?
Since MarketStore is used for querying and analyzing data having to do with financial markets and there are potentially large quantities of money being staked on the results of that analysis, how do you ensure that the operations being performed in MarketStore are accurate and repeatable?
What were the most challenging aspects of building MarketStore and integrating it into the rest of your systems?
What was your motivation for open sourcing the code?
What is the next planned major feature for MarketStore, and what use-case is it aiming to support?
Contact Info
Christopher
Email
Hitoshi
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
MarketStore
GitHub
Release Announcement
Alpaca
IBM
DB2
GreenPlum
Algorithmic Trading
Backtesting
OHLC (Open-High-Low-Close)
HDF5
Golang
C++
Timeseries Database List
InfluxDB
JSONRPC
Slait
CircleCI
GDAX
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


