

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

Nov 26, 2018 • 39min
Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58
Summary
When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is to craft pipelines that extract the data from source systems and load it into a data lake or data warehouse. To make this situation more manageable, and to allow everyone in the business to gain value from the data, the folks at Dremio built a self-service data platform. In this episode Tomer Shiran, CEO and co-founder of Dremio, explains how it fits into the modern data landscape, how it works under the hood, and how you can start using it today to make your life easier.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tomer Shiran about Dremio, the open source data as a service platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Dremio is and how the project and business got started?
What was the motivation for keeping your primary product open source?
What is the governance model for the project?
How does Dremio fit in the current landscape of data tools?
What are some use cases that Dremio is uniquely equipped to support?
Do you think that Dremio obviates the need for a data warehouse or large scale data lake?
How is Dremio architected internally?
How has that architecture evolved from when it was first built?
There is a large array of components (e.g. governance, lineage, catalog) built into Dremio that are often found in dedicated products. What are some of the strategies that you have as a business and development team to manage and integrate the complexity of the product?
What are the benefits of integrating all of those capabilities into a single system?
What are the drawbacks?
One of the useful features of Dremio is the granular access controls. Can you discuss how those are implemented and controlled?
For someone who is interested in deploying Dremio to their environment what is involved in getting it installed?
What are the scaling factors?
What are some of the most exciting features that have been added in recent releases?
When is Dremio the wrong choice?
What have been some of the most challenging aspects of building, maintaining, and growing the technical and business platform of Dremio?
What do you have planned for the future of Dremio?
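The idea of querying data where it lives, without first loading it into a warehouse, can be sketched with a toy federated join. This is plain Python for illustration only; the source names and fields are made up, and a real engine like Dremio pushes work down to each source and maintains accelerated copies, which this omits:

```python
def federated_join(orders_source, customers_source, key="customer_id"):
    """Join rows pulled live from two independent sources on a shared key.

    A toy sketch of querying across systems in place (no warehouse load
    step), not Dremio's actual execution model.
    """
    # Pull the smaller source into memory and index it by the join key.
    customers = {row[key]: row for row in customers_source()}
    # Stream the other source and merge matching rows.
    return [
        {**order, **customers[order[key]]}
        for order in orders_source()
        if order[key] in customers
    ]

# Hypothetical "sources": callables that fetch rows from two systems.
def orders_source():
    return [{"customer_id": 1, "total": 10}, {"customer_id": 2, "total": 5}]

def customers_source():
    return [{"customer_id": 1, "name": "Ada"}]
```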
Contact Info
Tomer
@tshiran on Twitter
LinkedIn
Dremio
Website
@dremio on Twitter
dremio on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Dremio
MapR
Presto
Business Intelligence
Arrow
Tableau
Power BI
Jupyter
OLAP Cube
Apache Foundation
Hadoop
Nikon DSLR
Spark
ETL (Extract, Transform, Load)
Parquet
Avro
K8s
Helm
Yarn
Gandiva Initiative for Apache Arrow
LLVM
TLS
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Nov 19, 2018 • 48min
Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57
Summary
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Flink is and how the project got started?
What are some of the primary ways that Flink is used?
How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?
What are some use cases that Flink is uniquely qualified to handle?
Where does Flink fit into the current data landscape?
How is Flink architected?
How has that architecture evolved?
Are there any aspects of the current design that you would do differently if you started over today?
How does scaling work in a Flink deployment?
What are the scaling limits?
What are some of the failure modes that users should be aware of?
How is the statefulness of a cluster managed?
What are the mechanisms for managing conflicts?
What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
Can state be shared across processes or tasks within a Flink cluster?
What are the comparative challenges of working with bounded vs unbounded streams of data?
How do you handle out of order events in Flink, especially as the delay for a given event increases?
For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
What are some of the most challenging or complicated aspects of building and maintaining Flink?
What are some of the most interesting or unexpected ways that you have seen Flink used?
What are some of the improvements or new features that are planned for the future of Flink?
What are some features or use cases that you are explicitly not planning to support?
For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?
What do they find most interesting or exciting?
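A recurring theme in the questions above is event time versus arrival time. As a rough sketch (plain Python, not Flink's actual API), a watermark that trails the largest event time seen lets an operator decide when a window can safely fire and when an event is too late:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size, max_delay):
    """Count events per event-time window, tolerating late arrivals.

    events: iterable of (event_time, value) pairs in arrival order,
            possibly out of event-time order.
    The watermark trails the highest event time seen by max_delay; a
    window is emitted once the watermark passes its end. A toy model of
    the idea, not Flink's watermark mechanism.
    """
    open_windows = defaultdict(int)  # window start -> event count
    results = {}
    watermark = float("-inf")
    for event_time, _value in events:
        if event_time <= watermark:
            continue  # too late: dropped (Flink can route these to a side output)
        start = (event_time // window_size) * window_size
        open_windows[start] += 1
        watermark = max(watermark, event_time - max_delay)
        # Fire every window whose end has been passed by the watermark.
        for s in [s for s in open_windows if s + window_size <= watermark]:
            results[s] = open_windows.pop(s)
    # End of stream: flush whatever is still open.
    results.update(open_windows)
    return results
```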
Contact Info
LinkedIn
@fhueske on Twitter
fhueske on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Flink
Data Artisans
IBM
DB2
Technische Universität Berlin
Hadoop
Relational Database
Google Cloud Dataflow
Spark
Cascading
Java
RocksDB
Flink Checkpoints
Flink Savepoints
Kafka
Pulsar
Storm
Scala
LINQ (Language INtegrated Query)
SQL
Backpressure
Watermarks
HDFS
S3
Avro
JSON
Hive Metastore
Dell EMC
Pravega
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Nov 11, 2018 • 52min
How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56
Summary
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Upsolver is and how it got started?
What are your goals for the platform?
There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?
What are the shortcomings of a data lake architecture?
How is Upsolver architected?
How has that architecture changed over time?
How do you manage schema validation for incoming data?
What would you do differently if you were to start over today?
What are the biggest challenges at each of the major stages of the data lake?
What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
When is Upsolver the wrong choice for an organization considering implementation of a data platform?
Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?
What features or improvements do you have planned for the future of Upsolver?
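On the schema-validation question, one simple approach (a toy sketch in plain Python, not Upsolver's actual algorithm) is to infer each field's type from the first event that carries it and quarantine later events that conflict:

```python
def validate(events):
    """Infer field types from a stream of dict events and flag conflicts.

    A field's type is fixed by the first event that carries it; later
    events with a different type for that field are rejected. This is a
    toy illustration of schema-on-read validation.
    """
    schema = {}          # field name -> type
    valid, rejected = [], []
    for event in events:
        conflict = any(
            field in schema and schema[field] is not type(value)
            for field, value in event.items()
        )
        if conflict:
            rejected.append(event)  # quarantine for inspection/reprocessing
            continue
        for field, value in event.items():
            schema.setdefault(field, type(value))
        valid.append(event)
    return schema, valid, rejected
```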
Contact Info
Yoni
yoniiny on GitHub
LinkedIn
Upsolver
Website
@upsolver on Twitter
LinkedIn
Facebook
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Upsolver
Data Lake
Israeli Army
Data Warehouse
Data Engineering Podcast Episode About Data Curation
Three Vs
Kafka
Spark
Presto
Drill
Spot Instances
Object Storage
Cassandra
Redis
Latency
Avro
Parquet
ORC
Data Engineering Podcast Episode About Data Serialization Formats
SSTables
Run Length Encoding
CSV (Comma Separated Values)
Protocol Buffers
Kinesis
ETL
DevOps
Prometheus
Cloudwatch
DataDog
InfluxDB
SQL
Pandas
Confluent
KSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Nov 5, 2018 • 58min
Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55
Summary
Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Looker is and the problem that it is aiming to solve?
How do you define business intelligence?
How is Looker unique from other approaches to business intelligence in the enterprise?
How does it compare to open source platforms for BI?
Can you describe the technical infrastructure that supports Looker?
Given that you are connecting to the customer’s data store, how do you ensure sufficient security?
For someone who is using Looker, what does their workflow look like?
How does that change for different user roles (e.g. data engineer vs sales management)
What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency?
What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?
What are the portions of the Looker architecture that you would do differently if you were to start over today?
What are some of the most interesting or unusual uses of Looker that you have seen?
What is in store for the future of Looker?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Looker
Upworthy
MoveOn.org
LookML
SQL
Business Intelligence
Data Warehouse
Linux
Hadoop
BigQuery
Snowflake
Redshift
DB2
PostGres
ETL (Extract, Transform, Load)
ELT (Extract, Load, Transform)
Airflow
Luigi
NiFi
Data Curation Episode
Presto
Hive
Athena
DRY (Don’t Repeat Yourself)
Looker Action Hub
Salesforce
Marketo
Twilio
Netscape Navigator
Dynamic Pricing
Survival Analysis
DevOps
BigQuery ML
Snowflake Data Sharehouse
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Oct 29, 2018 • 41min
Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54
Summary
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
Where are you using notebooks and where are you not?
What is the technical infrastructure that you have built to support that design choice?
Which team was driving the effort?
Was it difficult to get buy in across teams?
How much shared code have you been able to consolidate or reuse across teams/roles?
Have you investigated the use of any of the other notebook platforms for similar workflows?
What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?
What are some of the limitations of the notebook environment for the work that you are doing?
What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?
What are some of the projects that are ongoing or planned for the future that you are most excited by?
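The core idea of running notebooks directly in production, as tools like Papermill do, is to inject parameters and then execute the code cells top to bottom. A minimal sketch, illustrative only and not the real nbformat or Papermill API:

```python
def execute_notebook(notebook, parameters=None):
    """Run the code cells of an nbformat-style dict in one namespace.

    Mimics the idea behind parameterized notebook execution: seed the
    namespace with injected parameters, then run each code cell in
    order. A toy model; real tools parse .ipynb files and record outputs.
    """
    namespace = dict(parameters or {})  # injected parameters
    for cell in notebook["cells"]:
        if cell.get("cell_type") == "code":
            exec(cell["source"], namespace)  # illustration only
    return namespace

# A hypothetical notebook: one markdown cell and two code cells.
nb = {"cells": [
    {"cell_type": "markdown", "source": "# Daily report"},
    {"cell_type": "code", "source": "rows = [day] * 3"},
    {"cell_type": "code", "source": "report = ','.join(rows)"},
]}
```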
Contact Info
Matthew Seal
Email
LinkedIn
@codeseal on Twitter
MSeal on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Netflix Notebook Blog Posts
Nteract Tooling
OpenGov
Project Jupyter
Zeppelin Notebooks
Papermill
Titus
Commuter
Scala
Python
R
Emacs
NBDime
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Oct 22, 2018 • 46min
Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53
Summary
As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
This is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle for data oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com
Interview
Introductions
How did you get introduced to Python?
Can you start by describing what Deon is and your motivation for creating it?
Why a checklist, specifically? What’s the advantage of this over an oath, for example?
What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
What is the typical workflow for a team that is using Deon in their projects?
Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?
Have you received pushback on any of the default items?
How does Deon simplify communication around ethics across team boundaries?
What are some of the most often overlooked items?
What are some of the most difficult ethical concerns to comply with for a typical data science project?
How has Deon helped you at Driven Data?
What are the customer facing impacts of embedding a discussion of ethics in the product development process?
Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
What are your hopes for the future of the Deon project?
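The mechanics of a checklist generator are simple enough to sketch. This toy renderer (plain Python; the stages and items shown are illustrative, not Deon's actual defaults) turns a mapping of lifecycle stages into a markdown checklist:

```python
def render_checklist(stages):
    """Render lifecycle stages and their items as a markdown checklist.

    A toy version of what a tool like Deon produces from its checklist
    definition; the real tool reads the items from a YAML file and
    supports multiple output formats.
    """
    lines = ["# Data Ethics Checklist", ""]
    for letter, (stage, items) in enumerate(stages.items(), start=ord("A")):
        lines.append(f"## {chr(letter)}. {stage}")          # e.g. "## A. ..."
        lines.extend(f" - [ ] {item}" for item in items)    # unchecked boxes
        lines.append("")
    return "\n".join(lines)

# Hypothetical stages and items, for illustration only.
stages = {
    "Data Collection": ["Informed consent", "Collection bias"],
    "Deployment": ["Monitoring and rollback plan"],
}
```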
Keep In Touch
Emily
LinkedIn
ejm714 on GitHub
Peter
LinkedIn
@pjbull on Twitter
pjbull on GitHub
Driven Data
@drivendataorg on Twitter
drivendataorg on GitHub
Website
Picks
Tobias
Richard Bond Glass Art
Emily
Tandem Coffee in Portland, Maine
Peter
The Model Bakery in Saint Helena and Napa, California
Links
Deon
Driven Data
International Development
Brookings Institution
Stata
Econometrics
Metis Bootcamp
Pandas
Podcast Episode
C#
.NET
Podcast.__init__ Episode On Software Ethics
Jupyter Notebook
Podcast Episode
Word2Vec
cookiecutter data science
Logistic Regression
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Oct 15, 2018 • 54min
Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52
Summary
With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built on the assumption of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Iceberg is and the motivation for creating it?
Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?
How has the use of Iceberg simplified your work at Netflix?
How is the reference implementation architected and how has it evolved since you first began work on it?
What is involved in deploying it to a user’s environment?
For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?
Is there a migration path for pre-existing tables into the Iceberg format?
How is schema evolution managed at the file level?
How do you handle files on disk that don’t contain all of the fields specified in a table definition?
One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard?
What are the unique challenges posed by using S3 as the basis for a data lake?
What are the benefits that outweigh the difficulties?
What have been some of the most challenging or contentious details of the specification to define?
What are some things that you have explicitly left out of the specification?
What are your long-term goals for the Iceberg specification?
Do you anticipate the reference implementation continuing to be used and maintained?
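The central idea discussed in the episode, replacing directory listings with explicit snapshot metadata, can be sketched in a few lines. This is a toy model, not the actual Iceberg specification or API:

```python
class Table:
    """Toy illustration of a snapshot-based table format.

    Instead of listing directories (slow and historically eventually
    consistent on S3), readers resolve a snapshot that names every data
    file explicitly. Commits publish a new snapshot; old snapshots stay
    readable, which also gives time travel.
    """
    def __init__(self):
        self.snapshots = [()]  # snapshot 0: empty table

    def append(self, *new_files):
        # A commit writes new files, then publishes a snapshot that
        # references the previous file list plus the new files.
        current = self.snapshots[-1]
        self.snapshots.append(current + tuple(new_files))
        return len(self.snapshots) - 1  # new snapshot id

    def files(self, snapshot_id=None):
        # Readers pick a snapshot (latest by default) and get an exact,
        # consistent file list with no directory listing involved.
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]
```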
Contact Info
rdblue on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Iceberg Reference Implementation
Iceberg Table Specification
Netflix
Hadoop
Cloudera
Avro
Parquet
Spark
S3
HDFS
Hive
ORC
S3mper
Git
Metacat
Presto
Pig
DDL (Data Definition Language)
Cost-Based Optimization
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Oct 9, 2018 • 57min
Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51
Summary
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application-oriented workloads and analytical, high-volume workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a NewSQL database built for simultaneous transactional and analytic workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what MemSQL is and how the product and business first got started?
What are the typical use cases for customers running MemSQL?
What are the benefits of integrating the ingestion pipeline with the database engine?
What are some typical ways that the ingest capability is leveraged by customers?
How is MemSQL architected and how has the internal design evolved from when you first started working on it?
Where does it fall on the axes of the CAP theorem?
How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?
Can you describe the lifecycle of a write transaction?
Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?
How do you mitigate the impact of network latency throughout the cluster during query planning and execution?
How much of the implementation of MemSQL is using custom built code vs. open source projects?
What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?
What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?
When is MemSQL the wrong choice for a data platform?
What do you have planned for the future of MemSQL?
Contact Info
@nikitashamgunov on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
MemSQL
NewSQL
Microsoft SQL Server
St. Petersburg University of Fine Mechanics And Optics
C
C++
In-Memory Database
RAM (Random Access Memory)
Flash Storage
Oracle DB
PostgreSQL
Podcast Episode
Kafka
Kinesis
Wealth Management
Data Warehouse
ODBC
S3
HDFS
Avro
Parquet
Data Serialization Podcast Episode
Broadcast Join
Shuffle Join
CAP Theorem
Apache Arrow
LZ4
S2 Geospatial Library
Sybase
SAP Hana
Kubernetes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
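On the column-to-row conversion question, the work involved can be illustrated with a toy example (plain Python, not MemSQL's actual storage format): run-length-encoded columnar segments are decoded and then pivoted into row tuples:

```python
def decode_rle(column):
    """Expand a run-length-encoded column given as [(value, run_length), ...]."""
    return [value for value, run in column for _ in range(run)]

def columns_to_rows(encoded_columns):
    """Materialize row tuples from run-length-encoded columns.

    A toy illustration of the column-to-row pivot a hybrid engine
    performs when serving row-oriented reads from columnar segments.
    """
    decoded = [decode_rle(col) for col in encoded_columns]
    # Zip position i of every column into row i.
    return list(zip(*decoded))
```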

Oct 1, 2018 • 53min
Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50
Summary
There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph from these public sources for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph
Interview
Introduction
How did you get involved in the area of data management?
Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?
How do you define the concept of a knowledge graph?
What are the processes involved in constructing a knowledge graph?
Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?
What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?
How do you manage the software lifecycle for your ETL code?
What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?
What are the current challenges that you are facing in building and scaling your data infrastructure?
How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?
What techniques are you using to manage accuracy and consistency in the data that you ingest?
Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?
What are the weak spots in your platform that you are planning to address in upcoming projects?
If you were to start from scratch today, what would you have done differently?
What are some of the most interesting or unexpected uses of your product that you have seen?
What is in store for the future of Enigma?
Contact Info
Email
Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Enigma
Chicago Tribune
NPR
Quartz
CSVKit
Agate
Knowledge Graph
Taxonomy
Concourse
Airflow
Docker
S3
Data Lake
Parquet
Podcast Episode
Spark
AWS Neptune
AWS Batch
Money Laundering
Jupyter Notebook
Papermill
Jupytext
Cauldron: The Un-Notebook
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 24, 2018 • 50min
A Primer On Enterprise Data Curation with Todd Walter - Episode 49
Summary
As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence
Interview
Introduction
How did you get involved in the area of data management?
How do you define data curation?
What are some of the high level concerns that are encapsulated in that effort?
How does the size and maturity of a company affect the ways that they architect and interact with their data systems?
Can you walk through the stages of an ideal lifecycle for data within the context of an organization's uses for it?
What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure?
What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space?
As “big data” became more widely discussed, the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry reaches a greater degree of maturity and more regulations are implemented, there has been a shift toward being more considerate about what information gets stored and for how long. What are your views on that evolution, and what is your litmus test for determining which data to keep?
In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?
What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?
Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure?
ETL has long been the default approach for building and enforcing data architecture, but there have been significant shifts in recent years due to the emergence of streaming systems and ELT approaches in new data warehouses. What are your thoughts on the landscape for managing data flows and migration and when to use which approach?
What are some of the areas of data architecture and curation that are most often forgotten or ignored?
What resources do you recommend for anyone who is interested in learning more about the landscape of data architecture and curation?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Teradata
Data Architecture
Data Curation
Data Warehouse
Chief Data Officer
ETL (Extract, Transform, Load)
Data Lake
Metadata
Data Lineage
Data Provenance
Strata Conference
ELT (Extract, Load, Transform)
Map-Reduce
Hive
Pig
Spark
Data Governance
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast