

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

Feb 3, 2020 • 57min
The Benefits And Challenges Of Building A Data Trust
Summary
Every business collects data in some fashion, but sometimes the true value of the collected information only emerges when it is combined with other data sources. Data trusts are a legal framework for allowing businesses to collaboratively pool their data. This allows the members of the trust to increase the value of their individual repositories and gain new insights that would otherwise require substantial effort to duplicate the data owned by their peers. In this episode Tom Plagge and Greg Mundy explain how the BrightHive platform serves to establish and maintain data trusts, the technical and organizational challenges they face, and the outcomes that they have witnessed. If you are curious about data sharing strategies or data collaboratives, then listen now to learn more!
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tom Plagge and Gregory Mundy about BrightHive, a platform for building data trusts
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what a data trust is?
Why might an organization want to build one?
What is BrightHive and what is its origin story?
Beyond having a storage location with access controls, what are the components of a data trust that are necessary for it to be viable?
What are some of the challenges that are common in establishing an agreement among organizations who are participating in a data trust?
What are the responsibilities of each of the participants in a data trust?
For an individual or organization who wants to participate in an existing trust, what is involved in gaining access?
How does BrightHive support the process of building a data trust?
How is ownership of derivative data sets/data products and associated intellectual property handled in the context of a trust?
How is the technical architecture of BrightHive implemented and how has it evolved since it first started?
What are some of the ways that you approach the challenge of data privacy in these sharing agreements?
What are some legal and technical guards that you implement to encourage ethical uses of the data contained in a trust?
What is the motivation for releasing the technical elements of BrightHive as open source?
What are some of the most interesting, innovative, or inspirational ways that you have seen BrightHive used?
Being a shared platform for empowering other organizations to collaborate I imagine there is a strong focus on long-term sustainability. How are you approaching that problem and what is the business model for BrightHive?
What have you found to be the most interesting/unexpected/challenging aspects of building and growing the technical and business infrastructure of BrightHive?
What do you have planned for the future of BrightHive?
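One common technical guard in data sharing agreements like the ones discussed above is joining records on keyed hashes of identifiers rather than exchanging raw PII. This is a generic sketch of that idea, not BrightHive's implementation; the shared salt, field names, and records are all made up for illustration:

```python
import hashlib
import hmac

# Illustrative only: a real trust would manage this secret carefully,
# e.g. via a key agreed on in the data sharing agreement.
SHARED_SALT = b"agreed-by-the-trust"

def pseudonymize(identifier: str) -> str:
    """Keyed hash so members can join records without sharing raw identifiers."""
    return hmac.new(SHARED_SALT, identifier.encode(), hashlib.sha256).hexdigest()

# Each organization pseudonymizes its own records before contributing them.
org_a = {pseudonymize("alice@example.com"): {"wage_band": "B"}}
org_b = {pseudonymize("alice@example.com"): {"completed_training": True}}

# The trust can join on the pseudonyms without ever seeing the email address.
joined = {k: {**org_a[k], **org_b[k]} for k in org_a.keys() & org_b.keys()}
print(joined)
```

Schemes like this still leak membership information, which is why the episode also points to stronger tools such as secure multi-party computation.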
Contact Info
Tom
LinkedIn
tplagge on GitHub
Gregory
LinkedIn
gregmundy on GitHub
@graygoree on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
BrightHive
Data Science For Social Good
Workforce Data Initiative
NASA
NOAA
Data Trust
Data Collaborative
Public Benefit Corporation
Terraform
Airflow
Podcast.__init__ Episode
Dagster
Podcast Episode
Secure Multi-Party Computation
Public Key Encryption
AWS Macie
Blockchain
Smart Contracts
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 27, 2020 • 47min
Pay Down Technical Debt In Your Data Pipeline With Great Expectations
Summary
Data pipelines are complex, business-critical pieces of technical infrastructure. Unfortunately they are also difficult to test, leading to a significant amount of technical debt which contributes to slower iteration cycles. In this episode James Campbell describes how he helped create the Great Expectations framework to help you gain control and confidence in your data delivery workflows, the challenges of validating and monitoring the quality and accuracy of your data, and how you can use it in your own environments to improve your ability to move fast.
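The core idea in Great Expectations is declaring assertions ("expectations") about your data and validating each batch against them, with results reporting how many values failed rather than simply raising. As a rough plain-Python sketch of that idea (this is an illustration of the concept, not the actual Great Expectations API):

```python
# Plain-Python sketch of the "expectation" pattern: each check returns a
# structured result instead of raising, so a suite can report all failures.

def expect_column_values_to_not_be_null(rows, column):
    failures = [r for r in rows if r.get(column) is None]
    return {"success": not failures, "unexpected_count": len(failures)}

def expect_column_values_to_be_between(rows, column, min_value, max_value):
    # Nulls are skipped here; the null check above is responsible for them.
    failures = [
        r for r in rows
        if r.get(column) is not None and not (min_value <= r[column] <= max_value)
    ]
    return {"success": not failures, "unexpected_count": len(failures)}

orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": 13.5},
    {"order_id": 3, "amount": None},
]

results = [
    expect_column_values_to_not_be_null(orders, "order_id"),
    expect_column_values_to_not_be_null(orders, "amount"),
    expect_column_values_to_be_between(orders, "amount", 0, 10_000),
]
print([r["success"] for r in results])  # -> [True, False, True]
```

The real framework adds batch management, data docs, and backends for Pandas, SQL, and Spark on top of this basic shape.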
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing James Campbell about Great Expectations, the open source test framework for your data pipelines which helps you continually monitor and validate the integrity and quality of your data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Great Expectations is and the origin of the project?
What has changed in the implementation and focus of Great Expectations since we last spoke on Podcast.__init__ 2 years ago?
Prior to your introduction of Great Expectations what was the state of the industry with regards to testing, monitoring, or validation of the health and quality of data and the platforms operating on them?
What are some of the types of checks and assertions that can be made about a pipeline using Great Expectations?
What are some of the non-obvious use cases for Great Expectations?
What aspects of a data pipeline, or the context that it operates in, cannot be tested or validated in a programmatic fashion?
Can you describe how Great Expectations is implemented?
For anyone interested in using Great Expectations, what is the workflow for incorporating it into their environments?
What are some of the test cases that are often overlooked which data engineers and pipeline operators should be considering?
Can you talk through some of the ways that Great Expectations can be extended?
What are some notable extensions or integrations of Great Expectations?
Beyond testing and validating data as it is being processed, you have also included features that support documentation of, and collaboration around, the data lifecycle. What are some of the ways that those features can benefit a team working with Great Expectations?
What are some of the most interesting/innovative/unexpected ways that you have seen Great Expectations used?
What are the limitations of Great Expectations?
What are some cases where Great Expectations would be the wrong choice?
What do you have planned for the future of Great Expectations?
Contact Info
LinkedIn
@jpcampbell42 on Twitter
jcampbell on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Great Expectations
GitHub
Twitter
Podcast.__init__ Interview on Great Expectations
Superconductive Health
Abe Gong
Pandas
Podcast.__init__ Interview
SQLAlchemy
PostgreSQL
Podcast Episode
RedShift
BigQuery
Spark
Cloudera
DataBricks
Great Expectations Data Docs
Great Expectations Data Profiling
Apache NiFi
Amazon Deequ
Tensorflow Data Validation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 20, 2020 • 39min
Replatforming Production Dataflows
Summary
Building a reliable data platform is a never-ending task. Even if you have a process that works for you and your business, there can be unexpected events that require a change in your platform architecture. In this episode the head of data for Mayvenn shares their experience migrating an existing set of streaming workflows onto the Ascend platform after their previous vendor was acquired and changed their offering. This is an interesting discussion about the ongoing maintenance and decision making required to keep your business data up to date and accurate.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Sheel Choksi and Sean Knapp about Mayvenn’s experience migrating their dataflows onto the Ascend platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start off by describing what Mayvenn is and give a sense of how you are using data?
What are the sources of data that you are working with?
What are the biggest challenges you are facing in collecting, processing, and analyzing your data?
Before adopting Ascend, what did your overall platform for data management look like?
What were the pain points that you were facing which led you to seek a new solution?
What were the selection criteria that you set forth for addressing your needs at the time?
What were the aspects of Ascend which were most appealing?
What are some of the edge cases that you have dealt with in the Ascend platform?
Now that you have been using Ascend for a while, what components of your previous architecture have you been able to retire?
Can you talk through the migration process of incorporating Ascend into your platform and any validation that you used to ensure that your data operations remained accurate and consistent?
How has the migration to Ascend impacted your overall capacity for processing data or integrating new sources into your analytics?
What are your future plans for how to use data across your organization?
Contact Info
Sheel
LinkedIn
sheelc on GitHub
Sean
LinkedIn
@seanknapp on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Mayvenn
Ascend
Podcast Episode
Google Sawzall
Clickstream
Apache Kafka
Alooma
Podcast Episode
Amazon Redshift
ELT == Extract, Load, Transform
DBT
Podcast Episode
Amazon Data Pipeline
Upsolver
Pentaho
Stitch Data
Fivetran
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jan 13, 2020 • 1h 1min
Planet Scale SQL For The New Generation Of Applications With YugabyteDB
Summary
The modern era of software development is defined by ubiquitous access to elastic infrastructure for computation and easy automation of deployment. This has led to a class of applications that can quickly scale to serve users worldwide. This requires a new class of data storage which can accommodate that demand without having to rearchitect your system at each level of growth. YugabyteDB is an open source database designed to support planet scale workloads with high data density and full ACID compliance. In this episode Karthik Ranganathan explains how Yugabyte is architected, their motivations for being fully open source, and how they simplify the process of scaling your application from greenfield to global.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Karthik Ranganathan about YugabyteDB, the open source, high-performance distributed SQL database for global, internet-scale apps.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what YugabyteDB is and its origin story?
A growing trend in database engines (e.g. FaunaDB, CockroachDB) has been an out of the box focus on global distribution. Why is that important and how does it work in Yugabyte? What are the caveats?
What are the most notable features of YugabyteDB that would lead someone to choose it over any of the myriad other options? What are the use cases that it is uniquely suited to?
What are some of the systems or architecture patterns that can be replaced with Yugabyte?
How does the design of Yugabyte or the different ways it is being used influence the way that users should think about modeling their data?
Yugabyte is an impressive piece of engineering. Can you talk through the major design elements and how it is implemented?
Easy scaling and failover is a feature that many database engines would like to be able to claim. What are the difficult elements that prevent them from implementing that capability as a standard practice? What do you have to sacrifice in order to support the level of scale and fault tolerance that you provide?
Speaking of scaling, there are many ways to define that term, from vertical scaling of storage or compute, to horizontal scaling of compute, to scaling of reads and writes. What are the primary scaling factors that you focus on in Yugabyte?
How do you approach testing and validation of the code given the complexity of the system that you are building?
In terms of the query API you have support for a Postgres compatible SQL dialect as well as a Cassandra based syntax. What are the benefits of targeting compatibility with those platforms? What are the challenges and benefits of maintaining compatibility with those other platforms?
Can you describe how the storage layer is implemented and the division between the different query formats?
What are the operational characteristics of YugabyteDB? What are the complexities or edge cases that users should be aware of when planning a deployment?
One of the challenges of working with large volumes of data is creating and maintaining backups. How does Yugabyte handle that problem?
Most open source infrastructure projects that are backed by a business withhold various "enterprise" features such as backups and change data capture as a means of driving revenue. Can you talk through your motivation for releasing those capabilities as open source?
What is the business model that you are using for YugabyteDB and how does it differ from the tribal knowledge of how open source companies generally work?
What are some of the most interesting, innovative, or unexpected ways that you have seen Yugabyte used?
When is Yugabyte the wrong choice?
What do you have planned for the future of the technical and business aspects of Yugabyte?
Contact Info
@karthikr on Twitter
LinkedIn
rkarthik007 on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
YugabyteDB
GitHub
Nutanix
Facebook Engineering
Apache Cassandra
Apache HBase
Delphi
FaunaDB
Podcast Episode
CockroachDB
Podcast Episode
HA == High Availability
Oracle
Microsoft SQL Server
PostgreSQL
Podcast Episode
MongoDB
Amazon Aurora
PGCrypto
PostGIS
pl/pgsql
Foreign Data Wrappers
PipelineDB
Podcast Episode
Citus
Podcast Episode
Jepsen Testing
Yugabyte Jepsen Test Results
OLTP == Online Transaction Processing
OLAP == Online Analytical Processing
DocDB
Google Spanner
Google BigTable
Spot Instances
Kubernetes
Cloudformation
Terraform
Prometheus
Debezium
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast
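A core design element the interview touches on is how Yugabyte distributes rows: tables are split into tablets, and by default rows are assigned to tablets by hashing the primary key. The general hash-partitioning idea can be sketched as follows (the tablet count, key format, and hashing details here are illustrative, not Yugabyte's actual implementation):

```python
import hashlib

NUM_TABLETS = 16  # illustrative; real clusters choose tablet counts per table

def tablet_for(primary_key: str) -> int:
    """Map a primary key to a tablet via a stable hash (sketch of hash sharding)."""
    digest = hashlib.sha256(primary_key.encode()).digest()
    # Use the first two bytes as a 16-bit hash slot, then fold into tablets.
    slot = int.from_bytes(digest[:2], "big")
    return slot % NUM_TABLETS

keys = ["user:1", "user:2", "user:3"]
assignment = {k: tablet_for(k) for k in keys}
print(assignment)
```

Because the mapping is deterministic, any node can route a request for a key to the right tablet without a central lookup, which is part of what makes horizontal scaling of reads and writes tractable.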

Jan 6, 2020 • 53min
Change Data Capture For All Of Your Databases With Debezium
Summary
Databases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a way to track changes as they happen. Debezium is an open source platform for reliable change data capture that you can use to build supplemental systems for everything from maintaining audit trails to real-time updates of your data warehouse. In this episode Gunnar Morling and Randall Hauch explain why the project got started, how it works, and some of the myriad ways that you can use it. If you have ever struggled with implementing your own change data capture pipeline, or understanding when it would be useful, then this episode is for you.
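Debezium emits each row change as an event whose payload carries `before` and `after` images of the row plus an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read). A minimal sketch of folding such events into a downstream replica looks like this; the envelope below is simplified (the real one also carries schema metadata and a richer `source` block), and the handler is illustrative:

```python
import json

# Simplified Debezium-style change event for an update to a customers row.
event_json = """
{
  "payload": {
    "before": {"id": 42, "email": "old@example.com"},
    "after":  {"id": 42, "email": "new@example.com"},
    "source": {"table": "customers", "ts_ms": 1578000000000},
    "op": "u"
  }
}
"""

def apply_change(state, payload):
    """Fold one change event into an in-memory replica keyed by id."""
    op = payload["op"]
    if op in ("c", "r", "u"):          # create, snapshot read, update
        row = payload["after"]
        state[row["id"]] = row
    elif op == "d":                    # delete
        state.pop(payload["before"]["id"], None)
    return state

replica = {42: {"id": 42, "email": "old@example.com"}}
payload = json.loads(event_json)["payload"]
apply_change(replica, payload)
print(replica[42]["email"])  # -> new@example.com
```

In practice these events arrive on Kafka topics (one per table) and the same fold logic powers consumers such as search indexes, caches, and warehouse loaders.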
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Randall Hauch and Gunnar Morling about Debezium, an open source distributed platform for change data capture
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Change Data Capture is and some of the ways that it can be used?
What is Debezium and what problems does it solve?
What was your motivation for creating it?
What are some of the use cases that it enables?
What are some of the other options on the market for handling change data capture?
Can you describe the systems architecture of Debezium and how it has evolved since it was first created?
How has the tight coupling with Kafka impacted the direction and capabilities of Debezium?
What, if any, other substrates does Debezium support (e.g. Pulsar, Bookkeeper, Pravega)?
What are the data sources that are supported by Debezium?
Given that you have branched into non-relational stores, how have you approached organization of the code to allow for handling the specifics of those engines while retaining a common core set of functionality?
What is involved in deploying, integrating, and maintaining an installation of Debezium?
What are the scaling factors?
What are some of the edge cases that users and operators should be aware of?
Debezium handles the ingestion and distribution of database changesets. What are the downstream challenges or complications that application designers or systems architects have to deal with to make use of that information?
What are some of the design tensions that exist in the Debezium community between acting as a simple pipe vs. adding functionality for interpreting/aggregating/formatting the information contained in the changesets?
What are some of the common downstream systems that consume the outputs of Debezium?
What challenges or complexities are involved in building clients that can consume the changesets from the different engines that you support?
What are some of the most interesting, unexpected, or innovative ways that you have seen Debezium used?
What have you found to be the most challenging, complex, or complicated aspects of building, maintaining, and growing Debezium?
What is in store for the future of Debezium?
Contact Info
Randall
LinkedIn
@rhauch on Twitter
rhauch on GitHub
Gunnar
gunnarmorling on GitHub
@gunnarmorling on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Debezium
Confluent
Kafka Connect
RedHat
Bean Validation
Change Data Capture
DBMS == DataBase Management System
Apache Kafka
Apache Flink
Podcast Episode
Yugabyte DB
PostgreSQL
Podcast Episode
MySQL
Microsoft SQL Server
Apache Pulsar
Podcast Episode
Pravega
Podcast Episode
NATS
Amazon Kinesis
Pulsar IO
WePay
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 30, 2019 • 46min
Building The DataDog Platform For Processing Timeseries Data At Massive Scale
Summary
DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.
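Managing timeseries at this scale generally relies on rolling raw points up into coarser aggregates as they age, so that old data costs less to store and query. The episode doesn't walk through DataDog's internals, but the basic downsampling idea can be sketched as (bucket width and metric values are made up):

```python
from collections import defaultdict

def rollup(points, bucket_seconds):
    """Downsample (timestamp, value) points into fixed-width buckets,
    keeping count/sum/min/max so averages can still be derived later."""
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0,
                                   "min": float("inf"), "max": float("-inf")})
    for ts, value in points:
        b = buckets[ts - ts % bucket_seconds]  # floor to the bucket start
        b["count"] += 1
        b["sum"] += value
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
    return dict(buckets)

raw = [(0, 1.0), (10, 3.0), (70, 5.0)]      # (seconds, metric value)
minute = rollup(raw, 60)
print(minute[0]["sum"], minute[60]["max"])  # -> 4.0 5.0
```

Keeping count and sum rather than a precomputed average is the key trick: rollups of rollups stay exact, because the partial aggregates can be merged again at coarser resolutions.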
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Vadim Semenov about how data engineers work at DataDog
Interview
Introduction
How did you get involved in the area of data management?
For anyone who isn’t familiar with DataDog, can you start by describing the types and volumes of data that you’re dealing with?
What are the main components of your platform for managing that information?
How are the data teams at DataDog organized and what are your primary responsibilities in the organization?
What are some of the complexities and challenges that you face in your work as a result of the volume of data that you are processing?
What are some of the strategies which have proven to be most useful in overcoming those challenges?
Who are the main consumers of your work and how do you build in feedback cycles to ensure that their needs are being met?
Given that the majority of the data being ingested by DataDog is timeseries, what are your lifecycle and retention policies for that information?
Most of the data that you are working with is customer generated from your deployed agents and API integrations. How do you manage cleanliness and schema enforcement for the events as they are being delivered?
What are some of the projects that you have planned for the upcoming months and years?
What are some of the technologies, patterns, or practices that you are hoping to adopt?
Contact Info
LinkedIn
@databuryat on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataDog
Hadoop
Hive
Yarn
Chef
SRE == Site Reliability Engineer
Application Performance Management (APM)
Apache Kafka
RocksDB
Cassandra
Apache Parquet data serialization format
SLA == Service Level Agreement
WatchDog
Apache Spark
Podcast Episode
Apache Pig
Databricks
JVM == Java Virtual Machine
Kubernetes
SSIS (SQL Server Integration Services)
Pentaho
JasperSoft
Apache Airflow
Podcast.__init__ Episode
Apache NiFi
Podcast Episode
Luigi
Dagster
Podcast Episode
Prefect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 23, 2019 • 48min
Building The Materialize Engine For Interactive Streaming Analytics In SQL
Summary
Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Materialize is and the problems that you are aiming to solve with it?
What was your motivation for creating it?
What use cases does Materialize enable?
What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize?
How does it fit into the broader ecosystem of data tools and platforms?
What are some of the use cases that Materialize is uniquely able to support?
How is Materialize architected and how has the design evolved since you first began working on it?
Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?
What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?
In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?
A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or planned additions for Materialize?
Can you talk through the lifecycle of data as it flows from the source database and through the Materialize engine?
What are the considerations and constraints on maintaining the full history of the source data within Materialize?
For someone who wants to use Materialize, what is involved in getting it set up and integrated with their data sources?
What is the workflow for defining and maintaining a set of views?
What are some of the complexities that users might face in ensuring the ongoing functionality of those views?
For someone who is unfamiliar with the semantics of streaming SQL, what are some of the conceptual shifts that they should be aware of?
The Materialize product is currently pre-release. What are the remaining steps before launching it?
What do you have planned for the future of the product and company?
Contact Info
frankmcsherry on GitHub
@frankmcsherry on Twitter
Blog
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Materialize
Timely Dataflow
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Naiad: A Timely Dataflow System
Differential Privacy
PageRank
Data Council Presentation on Materialize
Change Data Capture
Debezium
Apache Spark
Podcast Episode
Flink
Podcast Episode
Go language
Rust
Haskell
Rust Borrow Checker
GDB (GNU Debugger)
Avro
Apache Calcite
ANSI SQL 92
Correlated Subqueries
OOM (Out Of Memory) Killer
Log-Structured Merge Tree
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 16, 2019 • 1h 2min
Solving Data Lineage Tracking And Data Discovery At WeWork
Summary
Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web-based transformation tool with built-in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built-in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities, it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Marquez is?
What was missing in existing metadata management platforms that necessitated the creation of Marquez?
How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?
How does it compare to the Amundsen platform that Lyft recently released?
What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?
What are some of the capabilities that are unique to Marquez and how are you using them at WeWork?
What are the primary resource types that you support in Marquez?
What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?
Can you explain how Marquez is architected and how the design has evolved since you first began working on it?
Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?
What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?
How is the metadata itself stored and managed in Marquez?
How much up-front data modeling is necessary and what types of schema representations are supported?
Can you talk through the overall workflow of someone using Marquez in their environment?
What is involved in registering and updating datasets?
How do you define and track the health of a given dataset?
What are some of the interesting questions that can be answered from the information stored in Marquez?
What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?
For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it?
What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?
When is Marquez the wrong choice for a metadata repository?
What do you have planned for the future of Marquez?
Contact Info
Julien Le Dem
@J_ on Twitter
Email
julienledem on GitHub
Willy
LinkedIn
@wslulciuc on Twitter
wslulciuc on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Marquez
DataEngConf Presentation
WeWork
Canary
Yahoo
Dremio
Hadoop
Pig
Parquet
Podcast Episode
Airflow
Apache Atlas
Amundsen
Podcast Episode
Uber DataBook
LinkedIn DataHub
Iceberg Table Format
Podcast Episode
Delta Lake
Podcast Episode
Great Expectations data pipeline unit testing framework
Podcast.__init__ Episode
Redshift
SnowflakeDB
Podcast Episode
Apache Kafka Schema Registry
Podcast Episode
Open Tracing
Jaeger
Zipkin
DropWizard Java framework
Marquez UI
Cayley Graph Database
Kubernetes
Marquez Helm Chart
Marquez Docker Container
Dagster
Podcast Episode
Luigi
DBT
Podcast Episode
Thrift
Protocol Buffers
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 9, 2019 • 59min
SnowflakeDB: The Data Warehouse Built For The Cloud
Summary
Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?
How does it compare to the other available platforms for data warehousing?
How does it differ from traditional data warehouses?
How does the performance and flexibility affect the data modeling requirements?
Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces?
Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?
What are some of the current limitations that you are struggling with?
For someone getting started with Snowflake what is involved with loading data into the platform?
What is their workflow for allocating and scaling compute capacity and running analyses?
One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen?
What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about?
When is SnowflakeDB the wrong choice?
What are some of the plans for the future of SnowflakeDB?
Contact Info
LinkedIn
Website
@KentGraziano on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SnowflakeDB
Free Trial
Stack Overflow
Data Warehouse
Oracle DB
MPP == Massively Parallel Processing
Shared Nothing Architecture
Multi-Cluster Shared Data Architecture
Google BigQuery
AWS Redshift
AWS Redshift Spectrum
Presto
Podcast Episode
SnowflakeDB Semi-Structured Data Types
Hive
ACID == Atomicity, Consistency, Isolation, Durability
3rd Normal Form
Data Vault Modeling
Dimensional Modeling
JSON
AVRO
Parquet
SnowflakeDB Virtual Warehouses
CRM == Customer Relationship Management
Master Data Management
Podcast Episode
FoundationDB
Podcast Episode
Apache Spark
Podcast Episode
SSIS == SQL Server Integration Services
Talend
Informatica
Fivetran
Podcast Episode
Matillion
Apache Kafka
Snowpipe
Snowflake Data Exchange
OLTP == Online Transaction Processing
GeoJSON
Snowflake Documentation
SnowAlert
Splunk
Data Catalog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 3, 2019 • 46min
Organizing And Empowering Data Engineers At Citadel
Summary
The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their experiences managing and leading the data engineering teams that power the business. They share helpful insights into some of the challenges associated with working in a regulated industry, organizing teams to deliver value rapidly and reliably, and how they approach career development for data engineers. This was a great conversation for an inside look at how to build and maintain a data-driven culture.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Michael Watson and Robert Krzyzanowski about the technical and organizational challenges that they and their teams are working on at Citadel
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the size and structure of the data engineering teams at Citadel?
How have the scope and nature of responsibilities for data engineers evolved over the past few years at Citadel as more and better tools and platforms have been made available in the space and machine learning techniques have grown more sophisticated?
Can you describe the types of data that you are working with at Citadel?
What is the process for identifying, evaluating, and ingesting new sources of data?
What are some of the common core aspects of your data infrastructure?
What are some of the ways that it differs across teams or projects?
How involved are data engineers in the overall product design and delivery lifecycle?
For someone who joins your team as a data engineer, what are some of the options available to them for a career path?
What are some of the challenges that you are currently facing in managing the data lifecycle for projects at Citadel?
What are some tools or practices that you are excited to try out?
Contact Info
Michael
LinkedIn
@detroitcoder on Twitter
detroitcoder on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Citadel
Python
Hedge Fund
Quantitative Trading
Citadel Securities
Apache Airflow
Jupyter Hub
Alembic database migrations for SQLAlchemy
Terraform
DQM == Data Quality Management
Great Expectations
Podcast.__init__ Episode
Nomad
RStudio
Active Directory
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast