

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes of the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Dec 31, 2018 • 45min
Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63
Summary
As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different from those of batch-oriented workloads. To address this shortcoming, the team at Dell EMC created the open source Pravega project. In this episode Tom Kaitchuck explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly-once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the host’s mind is blown by the possibilities of treating everything, including schema information, as a stream.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Pravega is and the story behind it?
What are the use cases for Pravega and how does it fit into the data ecosystem?
How does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data?
How do you represent a stream on-disk?
What are the benefits of using this format for persisted streams?
One of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides?
I am also intrigued by the automatic tiering of the persisted storage. How does that work and what options exist for managing the lifecycle of the data in the cluster?
For someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward?
What are some of the unique system design patterns that are made possible by Pravega?
How is Pravega architected internally?
What is involved in integrating engines such as Spark, Flink, or Storm with Pravega?
A common challenge for streaming systems is exactly once semantics. How does Pravega approach that problem?
Does it have any special capabilities for simplifying processing of out-of-order events?
For someone planning a deployment of Pravega, what is involved in building and scaling a cluster?
What are some of the operational edge cases that users should be aware of?
What are some of the most interesting, useful, or challenging experiences that you have had while building Pravega?
What are some cases where you would recommend against using Pravega?
What is in store for the future of Pravega?
Contact Info
tkaitchuk on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pravega
Amazon SQS (Simple Queue Service)
Amazon Simple Workflow Service (SWF)
Azure
EMC
Zookeeper
Podcast Episode
Bookkeeper
Kafka
Pulsar
Podcast Episode
RocksDB
Flink
Podcast Episode
Spark
Podcast Episode
Heron
Lambda Architecture
Kappa Architecture
Erasure Code
Flink Forward Conference
CAP Theorem
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 24, 2018 • 1h 4min
Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62
Summary
Processing high-velocity time-series data in real time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.
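For a flavor of the continuous query model discussed in the episode, here is a minimal sketch that creates a stream and a continuous view and writes a few events to it from Python. It uses the pre-1.0 CREATE STREAM / CREATE CONTINUOUS VIEW syntax and the psycopg2 driver; the connection string, stream, and column names are placeholders, and releases that ship PipelineDB as a PostgreSQL extension use slightly different DDL.

```python
# Minimal sketch of PipelineDB's continuous-query model (pre-1.0 syntax).
# Connection details and object names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=pipeline user=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

# A stream is a write-only relation; raw events are not persisted to disk.
cur.execute("CREATE STREAM page_views (url text, latency_ms int)")

# A continuous view incrementally maintains an aggregate over the stream.
cur.execute("""
    CREATE CONTINUOUS VIEW view_counts AS
    SELECT url, count(*) AS views, avg(latency_ms) AS avg_latency
    FROM page_views
    GROUP BY url
""")

# Writes to the stream update the continuous view in real time.
cur.execute("INSERT INTO page_views (url, latency_ms) VALUES (%s, %s)", ("/home", 42))

# Reading the view returns the current aggregate state.
cur.execute("SELECT * FROM view_counts")
print(cur.fetchall())
```

Because only the aggregate state is stored, the view stays small no matter how many raw events flow through the stream.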
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what PipelineDB is and the motivation for creating it?
What are the major use cases that it enables?
What are some example applications that are uniquely well suited to the capabilities of PipelineDB?
What are the major concepts and components that users of PipelineDB should be familiar with?
Given the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?
What are some of the common patterns for populating data streams?
What are the options for scaling PipelineDB systems, both vertically and horizontally?
How much elasticity does the system support in terms of changing volumes of inbound data?
What are some of the limitations or edge cases that users should be aware of?
Given that inbound data is not persisted to disk, how do you guard against data loss?
Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?
Can a separate table be used as an input stream?
Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?
What are some of the features that you have found to be the most useful which users might initially overlook?
What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?
What are some of the most challenging aspects of building continuous aggregates on unbounded data?
What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?
What are some of the most interesting or unexpected ways that you have seen PipelineDB used?
When is PipelineDB the wrong choice?
What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?
Contact Info
Derek
derekjn on GitHub
LinkedIn
Usman
@usmanm on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
PipelineDB
Stride
PostgreSQL
Podcast Episode
AdRoll
Probabilistic Data Structures
TimescaleDB
Podcast Episode
Hive
Redshift
Kafka
Kinesis
ZeroMQ
Nanomsg
HyperLogLog
Bloom Filter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 17, 2018 • 39min
Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61
Summary
Every business needs a pipeline for its critical data, even if that pipeline is just pasting data into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines is used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of a data pipeline?
At what point in the life of a project or organization should you start thinking about building a pipeline?
In the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?
What metrics/use cases should you be optimizing for at this point?
What are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?
How do the design requirements for a data pipeline change as you reach this stage?
What are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?
What are some of the changes that are necessary as you move to a large scale data pipeline?
At each level of scale it is important to minimize the impact of the ETL process on the source systems. What are some strategies that you have employed to avoid degrading the performance of the application systems?
In recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?
When performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?
Transformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?
What are your selection criteria when determining what workflow or ETL engines to use in your pipeline?
How has your preference of build vs buy changed at different scales of operation and as new/different projects become available?
What are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?
What are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?
What are your plans for improving your current pipeline at Grubhub?
What are some references that you recommend for anyone who is designing a new data platform?
Contact Info
@sirchristian on Twitter
Blog
sirchristian on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Scaling ETL blog post
GrubHub
Data Warehouse
Redshift
Spark
Spark In Action Podcast Episode
Hive
Amazon EMR
Looker
Podcast Episode
Redash
Metabase
Podcast Episode
A Primer on Enterprise Data Curation
Pub/Sub (Publish-Subscribe Pattern)
Change Data Capture
Jenkins
Python
Azkaban
Luigi
Zendesk
Data Lineage
AirBnB Engineering Blog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 10, 2018 • 51min
Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60
Summary
Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With the large array of capabilities and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book to help data engineers hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.
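As a taste of the programming model the book covers, here is a minimal PySpark sketch that loads a CSV into a DataFrame and runs an aggregation; the file path and column names are placeholders, and it assumes a local installation with the pyspark package available.

```python
# Minimal PySpark sketch: load a CSV into a DataFrame and aggregate it.
# The file path and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("spark-in-action-sketch")
         .master("local[*]")  # run locally; on a cluster this comes from spark-submit
         .getOrCreate())

orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("orders.csv"))

# Transformations are built lazily into a DAG and only run when an action is called.
totals = (orders
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_spent"))
          .orderBy(F.desc("total_spent")))

totals.show(10)
spark.stop()
```

The same code runs unchanged on a cluster; only the master configuration and how you submit the job change.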
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Spark is?
What are some of the main use cases for Spark?
What are some of the problems that Spark is uniquely suited to address?
Who uses Spark?
What are the tools offered to Spark users?
How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
For someone building on top of Spark what are the main software design paradigms?
How does the design of an application change as you go from a local development environment to a production cluster?
Once your application is written, what is involved in deploying it to a production environment?
What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
What are the limitations of the Spark programming model?
What are the cases where Spark is the wrong choice?
What was your motivation for writing a book about Spark?
Who is the target audience?
What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
What advice do you have for anyone who is considering or currently using Spark?
Contact Info
@jgperrin on Twitter
Blog
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Book Discount
Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com
Links
Apache Spark
Spark In Action
Book code examples in GitHub
Informix
International Informix Users Group
MySQL
Microsoft SQL Server
ETL (Extract, Transform, Load)
Spark SQL and Spark In Action’s chapter 11
Spark ML and Spark In Action’s chapter 18
Spark Streaming (structured) and Spark In Action’s chapter 10
Spark GraphX
Hadoop
Jupyter
Podcast Interview
Zeppelin
Databricks
IBM Watson Studio
Kafka
Flink
Podcast Episode
AWS Kinesis
Yarn
HDFS
Hive
Scala
PySpark
DAG
Spark Catalyst
Spark Tungsten
Spark UDF
AWS EMR
Mesos
DC/OS
Kubernetes
Dataframes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 3, 2018 • 54min
Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59
Summary
Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and etcd, and what is in store for the future.
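To make the coordination primitives concrete, here is a short sketch using the kazoo Python client (not covered in the episode) to register an ephemeral node, watch for membership changes, and take a distributed lock; the host address and znode paths are placeholders.

```python
# Sketch of two common Zookeeper primitives via the kazoo client:
# ephemeral nodes for service registration and a distributed lock.
# Host and znode paths are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Ephemeral znodes disappear automatically when this session ends,
# which is the basis for liveness and membership tracking.
zk.ensure_path("/services/reporting")
zk.create("/services/reporting/worker-", b"10.0.0.5:8080",
          ephemeral=True, sequence=True)

# Watches let clients react to membership changes.
children = zk.get_children("/services/reporting",
                           watch=lambda event: print("membership changed:", event))
print("current workers:", children)

# A higher-level recipe: a distributed lock built on sequential ephemeral nodes.
lock = zk.Lock("/locks/nightly-job", "worker-1")
with lock:  # blocks until the lock is acquired
    print("holding the lock, doing exclusive work")

zk.stop()
```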
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Zookeeper is and how the project got started?
What are the main motivations for using a centralized coordination service for distributed systems?
What are the distributed systems primitives that are built into Zookeeper?
What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?
What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?
Can you discuss how Zookeeper is architected and how that design has evolved over time?
What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
What are the scaling factors for Zookeeper?
What are the edge cases that users should be aware of?
Where does it fall on the axes of the CAP theorem?
What are the main failure modes for Zookeeper?
How much of the recovery logic is left up to the end user of the Zookeeper cluster?
Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
In recent years we have seen projects such as etcd (used by Kubernetes) and Consul. How does Zookeeper compare with those projects?
What are some of the cases where Zookeeper is the wrong choice?
How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
If you were to start the project over today, what would you do differently?
Would you still use Java?
What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
What do you have planned for the future of Zookeeper?
Contact Info
@phunt on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Zookeeper
Cloudera
Google Chubby
Sourceforge
HBase
High Availability
Fallacies of distributed computing
Falsehoods programmers believe about networking
Consul
etcd
Apache Curator
Raft Consensus Algorithm
Zookeeper Atomic Broadcast
SSD Write Cliff
Apache Kafka
Apache Flink
Podcast Episode
HDFS
Kubernetes
Netty
Protocol Buffers
Avro
Rust
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 26, 2018 • 39min
Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58
Summary
When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this situation more manageable, and to allow everyone in the business to gain value from the data, the folks at Dremio built a self-service data platform. In this episode Tomer Shiran, CEO and co-founder of Dremio, explains how it fits into the modern data landscape, how it works under the hood, and how you can start using it today to make your life easier.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tomer Shiran about Dremio, the open source data as a service platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Dremio is and how the project and business got started?
What was the motivation for keeping your primary product open source?
What is the governance model for the project?
How does Dremio fit in the current landscape of data tools?
What are some use cases that Dremio is uniquely equipped to support?
Do you think that Dremio obviates the need for a data warehouse or large scale data lake?
How is Dremio architected internally?
How has that architecture evolved from when it was first built?
There are a large array of components (e.g. governance, lineage, catalog) built into Dremio that are often found in dedicated products. What are some of the strategies that you have as a business and development team to manage and integrate the complexity of the product?
What are the benefits of integrating all of those capabilities into a single system?
What are the drawbacks?
One of the useful features of Dremio is the granular access controls. Can you discuss how those are implemented and controlled?
For someone who is interested in deploying Dremio to their environment what is involved in getting it installed?
What are the scaling factors?
What are some of the most exciting features that have been added in recent releases?
When is Dremio the wrong choice?
What have been some of the most challenging aspects of building, maintaining, and growing the technical and business platform of Dremio?
What do you have planned for the future of Dremio?
Contact Info
Tomer
@tshiran on Twitter
LinkedIn
Dremio
Website
@dremio on Twitter
dremio on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Dremio
MapR
Presto
Business Intelligence
Arrow
Tableau
Power BI
Jupyter
OLAP Cube
Apache Foundation
Hadoop
Nikon DSLR
Spark
ETL (Extract, Transform, Load)
Parquet
Avro
K8s
Helm
Yarn
Gandiva Initiative for Apache Arrow
LLVM
TLS
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 19, 2018 • 48min
Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57
Summary
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
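Flink’s primary APIs are Java and Scala, but as a rough sketch of the keyed processing model Fabian describes, here is a word-count style job written against the PyFlink DataStream API (which arrived in Flink releases newer than the one discussed here); the in-memory source and element values are placeholders.

```python
# Sketch of Flink's keyed streaming model using the PyFlink DataStream API.
# The in-memory source and element values are placeholders; a real job would
# read from Kafka, Pravega, files, etc.
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

events = env.from_collection(
    [("user_a", 1), ("user_b", 1), ("user_a", 1)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]))

# key_by partitions the stream; reduce maintains per-key state that Flink
# checkpoints for fault tolerance.
counts = (events
          .key_by(lambda e: e[0])
          .reduce(lambda a, b: (a[0], a[1] + b[1])))

counts.print()
env.execute("keyed-count-sketch")
```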
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Flink is and how the project got started?
What are some of the primary ways that Flink is used?
How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?
What are some use cases that Flink is uniquely qualified to handle?
Where does Flink fit into the current data landscape?
How is Flink architected?
How has that architecture evolved?
Are there any aspects of the current design that you would do differently if you started over today?
How does scaling work in a Flink deployment?
What are the scaling limits?
What are some of the failure modes that users should be aware of?
How is the statefulness of a cluster managed?
What are the mechanisms for managing conflicts?
What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
Can state be shared across processes or tasks within a Flink cluster?
What are the comparative challenges of working with bounded vs unbounded streams of data?
How do you handle out of order events in Flink, especially as the delay for a given event increases?
For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
What are some of the most challenging or complicated aspects of building and maintaining Flink?
What are some of the most interesting or unexpected ways that you have seen Flink used?
What are some of the improvements or new features that are planned for the future of Flink?
What are some features or use cases that you are explicitly not planning to support?
For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?
What do they find most interesting or exciting?
Contact Info
LinkedIn
@fhueske on Twitter
fhueske on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Flink
Data Artisans
IBM
DB2
Technische Universität Berlin
Hadoop
Relational Database
Google Cloud Dataflow
Spark
Cascading
Java
RocksDB
Flink Checkpoints
Flink Savepoints
Kafka
Pulsar
Storm
Scala
LINQ (Language INtegrated Query)
SQL
Backpressure
Watermarks
HDFS
S3
Avro
JSON
Hive Metastore
Dell EMC
Pravega
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 11, 2018 • 52min
How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56
Summary
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Upsolver is and how it got started?
What are your goals for the platform?
There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?
What are the shortcomings of a data lake architecture?
How is Upsolver architected?
How has that architecture changed over time?
How do you manage schema validation for incoming data?
What would you do differently if you were to start over today?
What are the biggest challenges at each of the major stages of the data lake?
What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
When is Upsolver the wrong choice for an organization considering implementation of a data platform?
Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?
What features or improvements do you have planned for the future of Upsolver?
Contact Info
Yoni
yoniiny on GitHub
LinkedIn
Upsolver
Website
@upsolver on Twitter
LinkedIn
Facebook
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Upsolver
Data Lake
Israeli Army
Data Warehouse
Data Engineering Podcast Episode About Data Curation
Three Vs
Kafka
Spark
Presto
Drill
Spot Instances
Object Storage
Cassandra
Redis
Latency
Avro
Parquet
ORC
Data Engineering Podcast Episode About Data Serialization Formats
SSTables
Run Length Encoding
CSV (Comma Separated Values)
Protocol Buffers
Kinesis
ETL
DevOps
Prometheus
Cloudwatch
DataDog
InfluxDB
SQL
Pandas
Confluent
KSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 5, 2018 • 58min
Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55
Summary
Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Looker is and the problem that it is aiming to solve?
How do you define business intelligence?
How is Looker unique from other approaches to business intelligence in the enterprise?
How does it compare to open source platforms for BI?
Can you describe the technical infrastructure that supports Looker?
Given that you are connecting to the customer’s data store, how do you ensure sufficient security?
For someone who is using Looker, what does their workflow look like?
How does that change for different user roles (e.g. data engineer vs. sales management)?
What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency?
What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?
What are the portions of the Looker architecture that you would do differently if you were to start over today?
What are some of the most interesting or unusual uses of Looker that you have seen?
What is in store for the future of Looker?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Looker
Upworthy
MoveOn.org
LookML
SQL
Business Intelligence
Data Warehouse
Linux
Hadoop
BigQuery
Snowflake
Redshift
DB2
PostGres
ETL (Extract, Transform, Load)
ELT (Extract, Load, Transform)
Airflow
Luigi
NiFi
Data Curation Episode
Presto
Hive
Athena
DRY (Don’t Repeat Yourself)
Looker Action Hub
Salesforce
Marketo
Twilio
Netscape Navigator
Dynamic Pricing
Survival Analysis
DevOps
BigQuery ML
Snowflake Data Sharehouse
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 29, 2018 • 41min
Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54
Summary
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.
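The parameterized execution described in this episode is what the open source Papermill library provides. Here is a minimal sketch of invoking a notebook the way a scheduler might; the notebook paths and parameter values are placeholders for whatever your notebook’s parameters cell expects.

```python
# Minimal sketch of running a parameterized Jupyter notebook with Papermill.
# Paths and parameters are placeholders for whatever your notebook expects.
import papermill as pm

pm.execute_notebook(
    "templates/daily_report.ipynb",         # source notebook with a "parameters" cell
    "runs/daily_report_2018-12-01.ipynb",   # output notebook, kept as a run record
    parameters={
        "run_date": "2018-12-01",
        "region": "us-east-1",
    },
)
```

Every run writes a new output notebook, so the executed notebook doubles as a record of exactly what the job did.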
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
Where are you using notebooks and where are you not?
What is the technical infrastructure that you have built to support that design choice?
Which team was driving the effort?
Was it difficult to get buy in across teams?
How much shared code have you been able to consolidate or reuse across teams/roles?
Have you investigated the use of any of the other notebook platforms for similar workflows?
What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?
What are some of the limitations of the notebook environment for the work that you are doing?
What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?
What are some of the projects that are ongoing or planned for the future that you are most excited by?
Contact Info
Matthew Seal
Email
LinkedIn
@codeseal on Twitter
MSeal on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Netflix Notebook Blog Posts
Nteract Tooling
OpenGov
Project Jupyter
Zeppelin Notebooks
Papermill
Titus
Commuter
Scala
Python
R
Emacs
NBDime
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast


