Data Engineering Podcast

Tobias Macey
undefined
Jul 2, 2018 • 46min

Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38

Summary Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Cheryl Martin, chief data scientist at Alegion, about data labelling at scale Interview Introduction How did you get involved in the area of data management? To start, can you explain the problem space that Alegion is targeting and how you operate? When is it necessary to include human intelligence as part of the data lifecycle for ML/AI projects? What are some of the biggest challenges associated with managing human input to data sets intended for machine usage? For someone who is acting as human-intelligence provider as part of the workforce, what does their workflow look like? What tools and processes do you have in place to ensure the accuracy of their inputs? How do you prevent bad actors from contributing data that would compromise the trained model? What are the limitations of crowd-sourced data labels? When is it beneficial to incorporate domain experts in the process? When doing data collection from various sources, how do you ensure that intellectual property rights are respected? How do you determine the taxonomies to be used for structuring data sets that are collected, labeled or enriched for your customers? What kinds of metadata do you track and how is that recorded/transmitted? Do you think that human intelligence will be a necessary piece of ML/AI forever? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Alegion University of Texas at Austin Cognitive Science Labeled Data Mechanical Turk Computer Vision Sentiment Analysis Speech Recognition Taxonomy Feature Engineering The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
undefined
Jun 25, 2018 • 42min

Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37

Summary Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data Interview Introduction How did you get involved in the area of data management? What is the intended use case for Quilt and how did the project get started? Can you step through a typical workflow of someone using Quilt? How does that change as you go from a single user to a team of data engineers and data scientists? Can you describe the elements of what a data package consists of? What was your criteria for the file formats that you chose? How is Quilt architected and what have been the most significant changes or evolutions since you first started? How is the data registry implemented? What are the limitations or edge cases that you have run into? What optimizations have you made to accelerate synchronization of the data to and from the repository? What are the limitations in terms of data volume, format, or usage? What is your goal with the business that you have built around the project? What are your plans for the future of Quilt? Contact Info Email LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Quilt Data GitHub Jobs Reproducible Data Dependencies in Jupyter Reproducible Machine Learning with Jupyter and Quilt Allen Institute: Programmatic Data Access with Quilt Quilt Example: MissingNo Oracle Pandas Jupyter Ycombinator Data.World Podcast Episode with CTO Bryon Jacob Kaggle Parquet HDF5 Arrow PySpark Excel Scala Binder Merkle Tree Allen Institute for Cell Science Flask PostGreSQL Docker Airflow Quilt Teams Hive Hive Metastore PrestoDB Podcast Episode Netflix Iceberg Kubernetes Helm The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
undefined
Jun 17, 2018 • 45min

User Analytics In Depth At Heap with Dan Robinson - Episode 36

Summary Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data Interview Introduction How did you get involved in the area of data management? Can you start by giving a brief overview of Heap? One of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data? Can you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there? Data collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. How do you ensure the integrity and accuracy of that information? What are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage? What is your approach for merging and enriching event data with the information that you retrieve from your supported integrations? What challenges does that pose in your processing architecture? What are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data? How has that architecture changed or evolved over the life of the company? What are some changes that you are anticipating in the near future? Can you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails? What are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap? What changes have been necessary as a result of GDPR? What are your plans for the future of Heap? Contact Info @danlovesproofs on twitter dan@drob.us @drob on github heapanalytics.com / @heap on twitter https://heapanalytics.com/blog/category/engineering?utm_source=rss&utm_medium=rss Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Heap Palantir User Analytics Google Analytics Piwik Mixpanel Hubspot Jepsen Chaos Engineering Node.js Kafka Scala Citus React MobX Redshift Heap SQL BigQuery Webhooks Drip Data Virtualization DNS PII SOC2 The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
undefined
Jun 11, 2018 • 44min

CockroachDB In Depth with Peter Mattis - Episode 35

Summary With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in Cockroach DB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services Interview Introduction How did you get involved in the area of data management? What was the motivation for creating CockroachDB and building a business around it? Can you describe the architecture of CockroachDB and how it supports distributed ACID transactions? What are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions? What are some of the problems that you have had to work around in the RAFT protocol to provide reliable operation of the clustering mechanism? Go is an unconventional language for building a database. What are the pros and cons of that choice? What are some of the common points of confusion that users of CockroachDB have when operating or interacting with it? What are the edge cases and failure modes that users should be aware of? I know that your SQL syntax is PostGreSQL compatible, so is it possible to use existing ORMs unmodified with CockroachDB? What are some examples of extensions that are specific to CockroachDB? What are some of the most interesting uses of CockroachDB that you have seen? When is CockroachDB the wrong choice? What do you have planned for the future of CockroachDB? Contact Info Peter LinkedIn petermattis on GitHub @petermattis on Twitter Cockroach Labs @CockroackDB on Twitter Website cockroachdb on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links CockroachDB Cockroach Labs SQL Google Bigtable Spanner NoSQL RDBMS (Relational Database Management System) “Big Iron” (colloquial term for mainframe computers) RAFT Consensus Algorithm Consensus MVCC (Multiversion Concurrency Control) Isolation Etcd GDPR Golang C++ Garbage Collection Metaprogramming Rust Static Linking Docker Kubernetes CAP Theorem PostGreSQL ORM (Object Relational Mapping) Information Schema PG Catalog Interleaved Tables Vertica Spark Change Data Capture The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
undefined
Jun 4, 2018 • 40min

ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34

Summary Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, dey/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steeman about ArangoDB, a multi-model distributed database for graph, document, and key/value storage. Interview Introduction How did you get involved in the area of data management? Can you give a high level description of what ArangoDB is and the motivation for creating it? What is the story behind the name? How is ArangoDB constructed? How does the underlying engine store the data to allow for the different ways of viewing it? What are some of the benefits of multi-model data storage? When does it become problematic? For users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango? How does it compare to OrientDB? What are the options for scaling a running system? What are the limitations in terms of network architecture or data volumes? One of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. What benefits does that provide over a three tier architecture? What mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code? What are some of the most interesting or surprising uses of this functionality that you have seen? What are some of the most challenging technical and business aspects of building and promoting ArangoDB? What do you have planned for the future of ArangoDB? Contact Info Jan Steemann jsteemann on GitHub @steemann on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links ArangoDB Köln Multi-model Database Graph Algorithms Apache 2 C++ ArangoDB Foxx Raft Protocol Target Partners RocksDB AQL (ArangoDB Query Language) OrientDB PostGreSQL OrientDB Studio Google Spanner 3-Tier Architecture Thomson-Reuters Arango Search Dell EMC Google S2 Index ArangoDB Geographic Functionality JSON Schema The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
undefined
May 28, 2018 • 48min

The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33

Summary Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service Interview Introduction How did you get involved in the area of data management? What is Alooma and what is the origin story? How is the Alooma platform architected? I want to go into stream VS batch here What are the most challenging components to scale? How do you manage the underlying infrastructure to support your SLA of 5 nines? What are some of the complexities introduced by processing data from multiple customers with various compliance requirements? How do you sandbox user’s processing code to avoid security exploits? What are some of the potential pitfalls for automatic schema management in the target database? Given the large number of integrations, how do you maintain the What are some challenges when creating integrations, isn’t it simply conforming with an external API? For someone getting started with Alooma what does the workflow look like? What are some of the most challenging aspects of building and maintaining Alooma? What are your plans for the future of Alooma? Contact Info LinkedIn @yairwein on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Alooma Convert Media Data Integration ESB (Enterprise Service Bus) Tibco Mulesoft ETL (Extract, Transform, Load) Informatica Microsoft SSIS OLAP Cube S3 Azure Cloud Storage Snowflake DB Redshift BigQuery Salesforce Hubspot Zendesk Spark The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps RDBMS (Relational Database Management System) SaaS (Software as a Service) Change Data Capture Kafka Storm Google Cloud PubSub Amazon Kinesis Alooma Code Engine Zookeeper Idempotence Kafka Streams Kubernetes SOC2 Jython Docker Python Javascript Ruby Scala PII (Personally Identifiable Information) GDPR (General Data Protection Regulation) Amazon EMR (Elastic Map Reduce) Sequoia Capital Lightspeed Investors Redis Aerospike Cassandra MongoDB The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
undefined
May 21, 2018 • 42min

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

Summary Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data Interview Introduction How did you get involved in the area of data management? Can you start by explaining what Presto is? What are some of the common use cases and deployment patterns for Presto? How does Presto compare to Drill or Impala? What is it about Presto that led you to building a business around it? What are some of the most challenging aspects of running and scaling Presto? For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries? How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with? What are some cases in which Presto is not the right solution? What types of support have you found to be the most commonly requested? What are some of the types of tooling or improvements that you have made to Presto in your distribution? What are some of the notable changes that your team has contributed upstream to Presto? Contact Info Website E-mail Twitter – @starburstdata Twitter – @prestodb Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Starburst Data Presto Hadapt Hadoop Hive Teradata PrestoCare Cost Based Optimizer ANSI SQL Spill To Disk Tempto Benchto Geospatial Functions Cassandra Accumulo Kafka Redis PostGreSQL The intro and outro music is from The Hug by The Freak Fandango Orchestra / {CC BY-SA](http://creativecommons.org/licenses/by-sa/3.0/)?utm_source=rss&utm_medium=rssSupport Data Engineering Podcast
undefined
May 14, 2018 • 26min

Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31

Summary The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He dscribes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to alow for fast and scalable operation. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and last week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. In this second part you will hear from Andy Eschbacher of Carto about the challenges of managing geospatial data, as well as Todd Blaschka of TigerGraph about graph databases and how his company has managed to build a fast and scalable platform for graph storage and traversal. Interview Andy Eschbacher From Carto What are the challenges associated with storing geospatial data? What are some of the common misconceptions that people have about working with geospatial data? Contact Info andy-esch on GitHub @MrEPhysics on Twitter Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Carto Geospatial Analysis GeoJSON Todd Blaschka From TigerGraph What are graph databases and how do they differ from relational engines? What are some of the common difficulties that people have when deling with graph algorithms? How does data modeling for graph databases differ from relational stores? Contact Info LinkedIn @toddblaschka on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links TigerGraph Graph Databases The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
undefined
May 7, 2018 • 33min

Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30

Summary The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation. Interview Alan Anders from Applecart What are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities? What are the biggest technical hurdles at Applecart? Contact Info @alanjanders on Twitter LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Spark DataBricks DataBricks Delta Applecart Stepan Pushkarev from Hydrosphere.io What is Hydropshere.io? What metrics do you track to determine when a machine learning model is not producing an appropriate output? How do you determine which data points to sample for retraining the model? How does the role of a machine learning engineer differ from data engineers and data scientists? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Hydrosphere Machine Learning Engineer The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
undefined
Apr 30, 2018 • 45min

Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29

Summary Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organizations data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Sameer Al-Sakran about Metabase, a free and open source tool for self service business intelligence Interview Introduction How did you get involved in the area of data management? The current goal for most companies is to be “data driven”. How would you define that concept? How does Metabase assist in that endeavor? What is the ratio of users that take advantage of the GUI query builder as opposed to writing raw SQL? What level of complexity is possible with the query builder? What have you found to be the typical use cases for Metabase in the context of an organization? How do you manage scaling for large or complex queries? What was the motivation for using Clojure as the language for implementing Metabase? What is involved in adding support for a new data source? What are the differentiating features of Metabase that would lead someone to choose it for their organization? What have been the most challenging aspects of building and growing Metabase, both from a technical and business perspective? What do you have planned for the future of Metabase? Contact Info Sameer salsakran on GitHub @sameer_alsakran on Twitter LinkedIn Metabase Website @metabase on Twitter metabase on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Expa Metabase Blackjet Hadoop Imeem Maslow’s Hierarchy of Data Needs 2 Sided Marketplace Honeycomb Interview Excel Tableau Go-JEK Clojure React Python Scala JVM Redash How To Lie With Data Stripe Braintree Payments The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app