

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

Dec 30, 2019 • 46min
Building The DataDog Platform For Processing Timeseries Data At Massive Scale
Summary
DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.
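To make the scale discussion more concrete, here is a minimal sketch of how a custom timeseries metric reaches DataDog from a host running the agent, using the DogStatsD client in the datadog Python package. The metric names, tags, and values are illustrative placeholders, not anything referenced in the episode.

```python
# Minimal sketch, assuming the Datadog agent is running locally and listening
# for DogStatsD traffic on the default UDP port. Metric names and tags are
# illustrative placeholders, not values mentioned in the episode.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

# A gauge records the current value of a measurement at a point in time.
statsd.gauge("example.queue.depth", 42, tags=["env:dev", "service:ingest"])

# A counter tracks how many times something happened since the last flush.
statsd.increment("example.events.processed", tags=["env:dev"])
```

Every data point submitted this way becomes part of the timeseries stream that the pipelines discussed in this episode have to ingest, aggregate, and retain.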
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Vadim Semenov about how data engineers work at DataDog
Interview
Introduction
How did you get involved in the area of data management?
For anyone who isn’t familiar with DataDog, can you start by describing the types and volumes of data that you’re dealing with?
What are the main components of your platform for managing that information?
How are the data teams at DataDog organized and what are your primary responsibilities in the organization?
What are some of the complexities and challenges that you face in your work as a result of the volume of data that you are processing?
What are some of the strategies which have proven to be most useful in overcoming those challenges?
Who are the main consumers of your work and how do you build in feedback cycles to ensure that their needs are being met?
Given that the majority of the data being ingested by DataDog is timeseries, what are your lifecycle and retention policies for that information?
Most of the data that you are working with is customer generated from your deployed agents and API integrations. How do you manage cleanliness and schema enforcement for the events as they are being delivered?
What are some of the projects that you have planned for the upcoming months and years?
What are some of the technologies, patterns, or practices that you are hoping to adopt?
Contact Info
LinkedIn
@databuryat on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataDog
Hadoop
Hive
Yarn
Chef
SRE == Site Reliability Engineer
Application Performance Management (APM)
Apache Kafka
RocksDB
Cassandra
Apache Parquet data serialization format
SLA == Service Level Agreement
WatchDog
Apache Spark
Podcast Episode
Apache Pig
Databricks
JVM == Java Virtual Machine
Kubernetes
SSIS (SQL Server Integration Services)
Pentaho
JasperSoft
Apache Airflow
Podcast.__init__ Episode
Apache NiFi
Podcast Episode
Luigi
Dagster
Podcast Episode
Prefect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 23, 2019 • 48min
Building The Materialize Engine For Interactive Streaming Analytics In SQL
Summary
Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.
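For readers who want a feel for the workflow described above, the sketch below assumes a Materialize instance that accepts connections over the Postgres wire protocol and a Kafka topic of change events; the source and view names are made up, and the exact SQL syntax should be checked against the Materialize documentation, since the product was still pre-release at the time of this conversation.

```python
# Hypothetical sketch of defining and querying an incrementally maintained view.
# Assumes a Materialize instance reachable over the Postgres wire protocol and a
# Kafka topic of order events; hosts, names, and exact SQL syntax are placeholders
# to be checked against the Materialize documentation.
import psycopg2

conn = psycopg2.connect(host="localhost", port=6875, user="materialize", dbname="materialize")
conn.autocommit = True
cur = conn.cursor()

# Attach a stream of change events as a source.
cur.execute("""
    CREATE SOURCE orders
    FROM KAFKA BROKER 'localhost:9092' TOPIC 'orders'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://localhost:8081'
""")

# Define a view that Materialize keeps up to date as new events arrive.
cur.execute("""
    CREATE MATERIALIZED VIEW revenue_by_region AS
    SELECT region, sum(amount) AS revenue
    FROM orders
    GROUP BY region
""")

# Reads return the current contents of the continually updated view.
cur.execute("SELECT region, revenue FROM revenue_by_region ORDER BY revenue DESC")
print(cur.fetchall())
```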
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Materialize is and the problems that you are aiming to solve with it?
What was your motivation for creating it?
What use cases does Materialize enable?
What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize?
How does it fit into the broader ecosystem of data tools and platforms?
What are some of the use cases that Materialize is uniquely able to support?
How is Materialize architected and how has the design evolved since you first began working on it?
Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?
What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?
In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?
A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or planned additions for Materialize?
Can you talk through the lifecycle of data as it flows from the source database and through the Materialize engine?
What are the considerations and constraints on maintaining the full history of the source data within Materialize?
For someone who wants to use Materialize, what is involved in getting it set up and integrated with their data sources?
What is the workflow for defining and maintaining a set of views?
What are some of the complexities that users might face in ensuring the ongoing functionality of those views?
For someone who is unfamiliar with the semantics of streaming SQL, what are some of the conceptual shifts that they should be aware of?
The Materialize product is currently pre-release. What are the remaining steps before launching it?
What do you have planned for the future of the product and company?
Contact Info
frankmcsherry on GitHub
@frankmcsherry on Twitter
Blog
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Materialize
Timely Dataflow
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Naiad: A Timely Dataflow System
Differential Privacy
PageRank
Data Council Presentation on Materialize
Change Data Capture
Debezium
Apache Spark
Podcast Episode
Flink
Podcast Episode
Go language
Rust
Haskell
Rust Borrow Checker
GDB (GNU Debugger)
Avro
Apache Calcite
ANSI SQL 92
Correlated Subqueries
OOM (Out Of Memory) Killer
Log-Structured Merge Tree
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 16, 2019 • 1h 2min
Solving Data Lineage Tracking And Data Discovery At WeWork
Summary
Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.
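As a hypothetical illustration of what feeding a metadata repository looks like, the sketch below registers a namespace and a dataset with a Marquez server over HTTP; the base URL, endpoint paths, and payload fields are assumptions made for illustration and should be verified against the documented REST API of the version you deploy.

```python
# Hypothetical sketch: register a namespace and a dataset with a Marquez server.
# The base URL, endpoint paths, and payload fields are assumptions made for
# illustration; check the Marquez API documentation for the exact contract.
import requests

BASE = "http://localhost:5000/api/v1"

# Namespaces group the datasets and jobs owned by a team or environment.
requests.put(f"{BASE}/namespaces/analytics", json={"ownerName": "data-eng"})

# Registering a dataset records where it lives and what fields it exposes,
# which is what later powers discovery and lineage queries.
requests.put(
    f"{BASE}/namespaces/analytics/datasets/daily_signups",
    json={
        "type": "DB_TABLE",
        "physicalName": "public.daily_signups",
        "sourceName": "analytics-warehouse",
        "fields": [
            {"name": "signup_date", "type": "DATE"},
            {"name": "signups", "type": "INTEGER"},
        ],
        "description": "Daily count of new user signups",
    },
)
```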
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Marquez is?
What was missing in existing metadata management platforms that necessitated the creation of Marquez?
How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?
How does it compare to the Amundsen platform that Lyft recently released?
What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?
What are some of the capabilities that are unique to Marquez and how are you using them at WeWork?
What are the primary resource types that you support in Marquez?
What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?
Can you explain how Marquez is architected and how the design has evolved since you first began working on it?
Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?
What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?
How is the metadata itself stored and managed in Marquez?
How much up-front data modeling is necessary and what types of schema representations are supported?
Can you talk through the overall workflow of someone using Marquez in their environment?
What is involved in registering and updating datasets?
How do you define and track the health of a given dataset?
What are some of the interesting questions that can be answered from the information stored in Marquez?
What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?
For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it?
What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?
When is Marquez the wrong choice for a metadata repository?
What do you have planned for the future of Marquez?
Contact Info
Julien Le Dem
@J_ on Twitter
Email
julienledem on GitHub
Willy
LinkedIn
@wslulciuc on Twitter
wslulciuc on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Marquez
DataEngConf Presentation
WeWork
Canary
Yahoo
Dremio
Hadoop
Pig
Parquet
Podcast Episode
Airflow
Apache Atlas
Amundsen
Podcast Episode
Uber DataBook
LinkedIn DataHub
Iceberg Table Format
Podcast Episode
Delta Lake
Podcast Episode
Great Expectations data pipeline unit testing framework
Podcast.__init__ Episode
Redshift
SnowflakeDB
Podcast Episode
Apache Kafka Schema Registry
Podcast Episode
Open Tracing
Jaeger
Zipkin
DropWizard Java framework
Marquez UI
Cayley Graph Database
Kubernetes
Marquez Helm Chart
Marquez Docker Container
Dagster
Podcast Episode
Luigi
DBT
Podcast Episode
Thrift
Protocol Buffers
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 9, 2019 • 59min
SnowflakeDB: The Data Warehouse Built For The Cloud
Summary
Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.
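To illustrate the ETL-to-ELT shift mentioned above, here is a minimal sketch using the snowflake-connector-python package: raw JSON files are loaded into a staging table with COPY INTO, and the reshaping happens afterwards with SQL inside the warehouse. The connection parameters, stage, and table names are placeholders, and the raw table is assumed to have a single VARIANT column named payload.

```python
# Minimal ELT sketch against Snowflake: load raw data first, transform in-warehouse.
# Account, credentials, warehouse, stage, and table names are placeholders, and
# raw_events is assumed to be a table with a single VARIANT column named payload.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# Load: copy staged JSON files into the raw table with no reshaping.
cur.execute("""
    COPY INTO raw_events
    FROM @events_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")

# Transform: reshape the semi-structured data using the warehouse's own compute
# rather than an external ETL engine.
cur.execute("""
    CREATE OR REPLACE TABLE analytics.events_by_day AS
    SELECT payload:event_type::string AS event_type,
           to_date(payload:ts::timestamp_ntz) AS event_date,
           count(*) AS event_count
    FROM raw_events
    GROUP BY 1, 2
""")
```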
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?
How does it compare to the other available platforms for data warehousing?
How does it differ from traditional data warehouses?
How does the performance and flexibility affect the data modeling requirements?
Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces?
Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?
What are some of the current limitations that you are struggling with?
For someone getting started with Snowflake what is involved with loading data into the platform?
What is their workflow for allocating and scaling compute capacity and running analyses?
One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen?
What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about?
When is SnowflakeDB the wrong choice?
What are some of the plans for the future of SnowflakeDB?
Contact Info
LinkedIn
Website
@KentGraziano on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SnowflakeDB
Free Trial
Stack Overflow
Data Warehouse
Oracle DB
MPP == Massively Parallel Processing
Shared Nothing Architecture
Multi-Cluster Shared Data Architecture
Google BigQuery
AWS Redshift
AWS Redshift Spectrum
Presto
Podcast Episode
SnowflakeDB Semi-Structured Data Types
Hive
ACID == Atomicity, Consistency, Isolation, Durability
3rd Normal Form
Data Vault Modeling
Dimensional Modeling
JSON
AVRO
Parquet
SnowflakeDB Virtual Warehouses
CRM == Customer Relationship Management
Master Data Management
Podcast Episode
FoundationDB
Podcast Episode
Apache Spark
Podcast Episode
SSIS == SQL Server Integration Services
Talend
Informatica
Fivetran
Podcast Episode
Matillion
Apache Kafka
Snowpipe
Snowflake Data Exchange
OLTP == Online Transaction Processing
GeoJSON
Snowflake Documentation
SnowAlert
Splunk
Data Catalog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 3, 2019 • 46min
Organizing And Empowering Data Engineers At Citadel
Summary
The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their experiences managing and leading the data engineering teams that power the business. They shared helpful insights into some of the challenges associated with working in a regulated industry, organizing teams to deliver value rapidly and reliably, and how they approach career development for data engineers. This was a great conversation for an inside look at how to build and maintain a data driven culture.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Michael Watson and Robert Krzyzanowski about the technical and organizational challenges that they and their teams are working on at Citadel
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the size and structure of the data engineering teams at Citadel?
How have the scope and nature of responsibilities for data engineers evolved over the past few years at Citadel as more and better tools and platforms have been made available in the space and machine learning techniques have grown more sophisticated?
Can you describe the types of data that you are working with at Citadel?
What is the process for identifying, evaluating, and ingesting new sources of data?
What are some of the common core aspects of your data infrastructure?
What are some of the ways that it differs across teams or projects?
How involved are data engineers in the overall product design and delivery lifecycle?
For someone who joins your team as a data engineer, what are some of the options available to them for a career path?
What are some of the challenges that you are currently facing in managing the data lifecycle for projects at Citadel?
What are some tools or practices that you are excited to try out?
Contact Info
Michael
LinkedIn
@detroitcoder on Twitter
detroitcoder on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Citadel
Python
Hedge Fund
Quantitative Trading
Citadel Securities
Apache Airflow
Jupyter Hub
Alembic database migrations for SQLAlchemy
Terraform
DQM == Data Quality Management
Great Expectations
Podcast.__init__ Episode
Nomad
RStudio
Active Directory
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 26, 2019 • 1h 1min
Building A Real Time Event Data Warehouse For Sentry
Summary
The team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers and data they began running into the limitations of their initial architecture. To address the needs of their business and continue to improve their capabilities they settled on Clickhouse as the new storage and query layer to power their business. In this episode James Cunningham and Ted Kaemming describe the process of rearchitecting a production system, what they learned in the process, and some useful tips for anyone else evaluating Clickhouse.
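For a sense of what storing and querying event data in Clickhouse looks like in practice, here is a small sketch using the clickhouse-driver Python package; the table layout is an invented example for illustration and does not reflect Snuba's actual schema.

```python
# Small sketch of writing and aggregating event data in ClickHouse using the
# clickhouse-driver package. The table definition is an invented example and
# does not reflect Snuba's real schema.
from datetime import datetime
from clickhouse_driver import Client

client = Client(host="localhost")

# MergeTree tables ordered by (project_id, timestamp) make the common
# "events for one project over a time range" query cheap.
client.execute("""
    CREATE TABLE IF NOT EXISTS errors (
        project_id UInt64,
        timestamp  DateTime,
        group_id   UInt64,
        message    String
    ) ENGINE = MergeTree()
    ORDER BY (project_id, timestamp)
""")

client.execute(
    "INSERT INTO errors (project_id, timestamp, group_id, message) VALUES",
    [(1, datetime.utcnow(), 42, "NullPointerException in checkout")],
)

# Aggregations over large volumes of events are where ClickHouse shines.
rows = client.execute("""
    SELECT group_id, count() AS occurrences
    FROM errors
    WHERE project_id = 1 AND timestamp > now() - INTERVAL 1 DAY
    GROUP BY group_id
    ORDER BY occurrences DESC
    LIMIT 10
""")
print(rows)
```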
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Ted Kaemming and James Cunningham about Snuba, the new open source search service at Sentry implemented on top of Clickhouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the internal and user-facing issues that you were facing at Sentry with the existing search capabilities?
What did the previous system look like?
What was your design criteria for building a new platform?
What was your initial list of possible system components and what was your evaluation process that resulted in your selection of Clickhouse?
Can you describe the system architecture of Snuba and some of the ways that it differs from your initial ideas of how it would work?
What have been some of the sharp edges of Clickhouse that you have had to engineer around?
How have you found the operational aspects of Clickhouse?
How did you manage the introduction of this new piece of infrastructure to a business that was already handling massive amounts of real-time data?
What are some of the downstream benefits of using Clickhouse for managing event data at Sentry?
For someone who is interested in using Snuba for their own purposes, how flexible is it for different domain contexts?
What are some of the other data challenges that you are currently facing at Sentry?
What is your next highest priority for evolving or rebuilding to address technical or business challenges?
Contact Info
James
@JTCunning on Twitter
JTCunning on GitHub
Ted
tkaemming on GitHub
Website
@tkaemming on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Sentry
Podcast.__init__ Episode
Snuba
Blog Post
Clickhouse
Podcast Episode
Disqus
Urban Airship
HBase
Google Bigtable
PostgreSQL
Redis
HyperLogLog
Riak
Celery
RabbitMQ
Apache Spark
Presto
Cassandra
Apache Kudu
Apache Pinot
Apache Druid
Flask
Apache Kafka
Cassandra Tombstone
Sentry Blog
XML
Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 18, 2019 • 56min
Escaping Analysis Paralysis For Your Data Platform With Data Virtualization
Summary
With the constant evolution of technology for data management it can seem impossible to make an informed decision about whether to build a data warehouse, or a data lake, or just leave your data wherever it currently rests. What’s worse is that any time you have to migrate to a new architecture, all of your analytical code has to change too. Thankfully it’s possible to add an abstraction layer to eliminate the churn in your client code, allowing you to evolve your data platform without disrupting your downstream data users. In this episode AtScale co-founder and CTO Matthew Baird describes how the data virtualization and data engineering automation capabilities that are built into the platform free up your engineers to focus on your business needs without having to waste cycles on premature optimization. This was a great conversation about the power of abstractions and appreciating the value of increasing the efficiency of your data team.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Matt Baird about AtScale, a platform that provides data virtualization and data engineering automation on top of your existing data stores
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the AtScale platform and how it fits in the ecosystem of data tools?
What was your motivation for building the platform and what were some of the early challenges that you faced in achieving your current level of success?
How is the AtScale platform architected and what have been some of the main areas of evolution and change since you first began building it?
How has the surrounding data ecosystem changed since AtScale was founded?
How are current industry trends influencing your product focus?
Can you talk through the workflow for someone implementing AtScale?
What are some of the main use cases that benefit from data virtualization capabilities?
How does it influence the relevancy of data warehouses or data lakes?
What are some of the types of tools or patterns that AtScale replaces in a data platform?
What are some of the most interesting or unexpected ways that you have seen AtScale used?
What have been some of the most challenging aspects of building and growing the platform?
When is AtScale the wrong choice?
What do you have planned for the future of the platform and business?
Contact Info
LinkedIn
@zetty on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
AtScale
PeopleSoft
Oracle
Hadoop
PrestoDB
Impala
Apache Kylin
Apache Druid
Go Language
Scala
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 11, 2019 • 51min
Designing For Data Protection
Summary
The practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA it is necessary to consider how to implement data protection and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what is encompassed by the idea of data protection?
What regulations control the enforcement of data protection requirements, and how can we determine whether we are subject to their rules?
What are some of the conflicts and constraints that act against our efforts to implement data protection?
How much of data protection is handled through technical implementation as compared to organizational policies and reporting requirements?
Can you give some examples of the types of information that are subject to data protection?
One of the challenges in data management generally is tracking the presence and usage of any given information. What are some strategies that you have found effective for auditing the usage of protected information?
A corollary to tracking and auditing of protected data in the GDPR is the need to allow for deletion of an individual’s information. How can we ensure effective deletion of these records when dealing with multiple storage systems?
What are some of the system components that are most helpful in implementing and maintaining technical and policy controls for data protection?
How do data protection regulations impact or restrict the technology choices that are viable for the data preparation layer?
Who in the organization is responsible for the proper compliance to GDPR and other data protection regimes?
Downstream from the storage and management platforms that we build as data engineers are data scientists and analysts who might request access to protected information. How do the regulations impact the types of analytics that they can use?
Contact Info
Karen
Email
Website
Mark
Email
Website
GDPR Now Podcast
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Data Protection
GDPR
This Is DPO
Intellectual Property
European Convention Of Human Rights
CCPA == California Consumer Privacy Act
PII == Personally Identifiable Information
Privacy By Design
US Privacy Shield
Principle of Least Privilege
International Association of Privacy Professionals
Privacy Technology Vendor Report
Data Provenance
Chief Data Officer
UK ICO (Information Commissioner’s Office)
AI Audit Framework
Data Council
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 4, 2019 • 49min
Automating Your Production Dataflows On Spark
Summary
As data engineers the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean Knapp founded Ascend to address the operational challenges of running a production grade and scalable Spark infrastructure, allowing data engineers to focus on the problems that power their business. In this episode he explains the technical implementation of the Ascend platform, the challenges that he has faced in the process, and how you can use it to simplify your dataflow automation. This is a great conversation to get an understanding of all of the incidental engineering that is necessary to make your data reliable.
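One of the themes in this conversation is keeping transformations free of side effects so that a platform can safely cache, retry, and re-run them. The sketch below is a generic PySpark illustration of that pattern, not Ascend's actual programming interface: the transform depends only on its input dataframe and returns a new one instead of writing to external systems.

```python
# Illustrative PySpark sketch of a side-effect-free transform: the function
# depends only on its input dataframe and returns a new one, which makes it
# safe for a platform to cache, retry, or re-run. This is a generic pattern,
# not Ascend's actual programming interface.
from pyspark.sql import SparkSession, functions as F

def sessions_per_user(events_df):
    """Pure transform: aggregate raw events into per-user session counts."""
    return (
        events_df
        .where(F.col("event_type") == "session_start")
        .groupBy("user_id")
        .agg(F.count("*").alias("session_count"))
    )

if __name__ == "__main__":
    spark = SparkSession.builder.appName("pure-transform-example").getOrCreate()
    events = spark.createDataFrame(
        [("u1", "session_start"), ("u1", "click"), ("u2", "session_start")],
        ["user_id", "event_type"],
    )
    sessions_per_user(events).show()
```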
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Sean Knapp about Ascend, which he is billing as an autonomous dataflow service
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what the Ascend platform is?
What was your inspiration for creating it and what keeps you motivated?
What was your criteria for determining the best execution substrate for the Ascend platform?
Can you describe any limitations that are imposed by your selection of Spark as the processing engine?
If you were to rewrite Spark from scratch today to fit your particular requirements, what would you change about it?
Can you describe the technical implementation of Ascend?
How has the system design evolved since you first began working on it?
What are some of the assumptions that you had at the beginning of your work on Ascend that have been challenged or updated as a result of working with the technology and your customers?
How does the programming interface for Ascend differ from that of a vanilla Spark deployment?
What are the main benefits that a data engineer would get from using Ascend in place of running their own Spark deployment?
How do you enforce the lack of side effects in the transforms that comprise the dataflow?
Can you describe the pipeline orchestration system that you have built into Ascend and the benefits that it provides to data engineers?
What are some of the most challenging aspects of building and launching Ascend that you have dealt with?
What are some of the most interesting or unexpected lessons learned or edge cases that you have encountered?
What are some of the capabilities that you are most proud of and which have gained the greatest adoption?
What are some of the sharp edges that remain in the platform?
When is Ascend the wrong choice?
What do you have planned for the future of Ascend?
Contact Info
LinkedIn
@seanknapp on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Ascend
Kubernetes
BigQuery
Apache Spark
Apache Beam
Go Language
SHA Hashes
PySpark
Delta Lake
Podcast Episode
DAG == Directed Acyclic Graph
PrestoDB
MinIO
Podcast Episode
Parquet
Snappy Compression
Tensorflow
Kafka
Druid
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 28, 2019 • 1h 8min
Build Maintainable And Testable Data Applications With Dagster
Summary
Despite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for building data applications Nick Schrock created Dagster. In this episode he explains his motivation for creating a product for data management, how the programming model simplifies the work of building testable and maintainable pipelines, and his vision for the future of data programming. If you are building dataflows then Dagster is definitely worth exploring.
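For a flavor of the programming model covered in the interview, here is a small sketch using the solid and pipeline decorators from the pre-1.0 Dagster API that was current when this episode aired; the solid names are illustrative, and later Dagster releases renamed these concepts to ops and jobs.

```python
# Small sketch of the pre-1.0 Dagster programming model discussed in this episode:
# solids are typed, testable units of computation composed into a pipeline.
# Names are illustrative; newer Dagster releases use ops and jobs instead.
from dagster import execute_pipeline, pipeline, solid

@solid
def extract_numbers(context):
    context.log.info("Extracting raw values")
    return [1, 2, 3, 4]

@solid
def total(context, numbers):
    result = sum(numbers)
    context.log.info(f"Total is {result}")
    return result

@pipeline
def toy_pipeline():
    total(extract_numbers())

if __name__ == "__main__":
    # Executes the DAG in-process; Dagit provides a UI over the same definitions.
    execute_pipeline(toy_pipeline)
```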
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source system for building modern data applications
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Dagster is and the origin story for the project?
In the tagline for Dagster you describe it as "a system for building modern data applications". There are a lot of contending terms that one might use in this context, such as ETL, data pipelines, etc. Can you describe your thinking as to what the term "data application" means, and the types of use cases that Dagster is well suited for?
Can you talk through how Dagster is architected and some of the ways that it has evolved since you first began working on it?
What do you see as the current industry trends that are leading us away from full stack frameworks such as Airflow and Oozie for ETL and into an abstracted programming environment that is composable with different execution contexts?
What are some of the initial assumptions that you had which have been challenged or updated in the process of working with users of Dagster?
For someone who wants to extend Dagster, or integrate it with other components of their data infrastructure, such as a metadata engine, what interfaces do you provide for extensibility?
For someone who wants to get started with Dagster can you describe a typical workflow for writing a data pipeline?
Once they have something working, what is involved in deploying it?
One of the things that stands out about Dagster is the strong contracts that it enforces between computation nodes, or "solids". Why do you feel that those contracts are necessary, and what benefits do they provide during the full lifecycle of a data application?
Another difficult aspect of data applications is testing, both before and after deploying it to a production environment. How does Dagster help in that regard?
It is also challenging to keep track of the entirety of a DAG for a given workflow. How does Dagit keep track of the task dependencies, and what are the limitations of that tool?
Can you give an overview of where you see Dagster fitting in the overall ecosystem of data tools?
What are some of the features or capabilities of Dagster which are often overlooked that you would like to highlight for the listeners?
Your recent release of Dagster includes a built-in scheduler, as well as a built-in deployment capability. Why did you feel that those were necessary capabilities to incorporate, rather than continuing to leave that as end-user considerations?
You have built a new company around Dagster in the form of Elementl. How are you approaching sustainability and governance of Dagster, and what is your path to sustainability for the business?
What should listeners be keeping an eye out for in the near to medium future from Elementl and Dagster?
What is on your roadmap that you consider necessary before creating a 1.0 release?
Contact Info
@schrockn on Twitter
schrockn on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Dagster
Elementl
ETL
GraphQL
React
Matei Zaharia
DataOps Episode
Kafka
Fivetran
Podcast Episode
Spark
Supervised Learning
DevOps
Luigi
Airflow
Dask
Podcast Episode
Kubernetes
Ray
Maxime Beauchemin
Podcast Interview
Dagster Testing Guide
Great Expectations
Podcast.__init__ Interview
Papermill
Notebooks At Netflix Episode
DBT
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast


